Data Descriptors
Data descriptors (DataSpecs) define what market data your Datafye environment needs. They specify which datasets, symbols, live data streams, and historical retention your algo requires — without dealing with infrastructure details.
Purpose
A DataSpec is a declarative blueprint that tells the Datafye Data Cloud:
Which datasets you need (SIP, PrecisionAlpha, TotalView)
Which symbols within those datasets
Whether to include reference data for context
Which live tick streams to subscribe to
Which aggregate analytics to consume in real-time
How long to retain historical data per schema for replay, backtesting, or audits
Based on your DataSpec, Datafye provisions the Data Cloud with appropriate data sources, subscriptions, and storage — translating your functional requirements into running infrastructure.
When You Need Data Descriptors
Data descriptors are required for all Datafye scenarios:
Foundry: Data Cloud Only — Specifies what data to make available via APIs
Foundry: Full Stack — Defines data for backtesting and algo development
Trading: Data Cloud + Broker — Configures real-time feeds for paper/live trading
Trading: Full Stack — Provides data for both development and execution
Structure
A DataSpec is a YAML or JSON document with a Kubernetes-style structure:
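Putting the pieces described below together, a complete spec looks roughly like this sketch. Field names not spelled out on this page (for example symbols.tickers, symbols.universes, and referenceData) are illustrative assumptions; see the Data Descriptor Reference for the authoritative schema.

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: my-algo-data
  description: Market data for an example algo
  requestedBy:
    actorType: user
    actorId: jane@example.com        # placeholder email
mode: paper
datasets:
  - name: SIP                        # which dataset this entry configures (assumed list form)
    symbols:
      tickers: ["AAPL", "MSFT"]      # assumed field name for explicit tickers
      universes: ["SP500"]           # assumed field name for pre-defined universes
    referenceData: true              # assumed field name for the reference-data flag
    live:
      ticks: "trades,quotes"         # comma-separated tick schema IDs
      aggregates: "ohlc-1m,vwap-1m"  # comma-separated aggregate schema IDs
    history:
      ticks:
        - schema: trades
          duration: 7d
      aggregates:
        - schema: ohlc-1m
          duration: 1y
      reference:
        duration: 1y
```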
API Version and Kind
These fields identify the descriptor format:
Fixed values:
apiVersion: datafye.io/v1 — Current DataSpec version
kind: DataSpec — Identifies this as a data descriptor
Metadata
Human-readable information for identification and traceability:
Fields:
name — Unique identifier for this spec (lowercase, hyphens)
description — Optional human-readable description
requestedBy.actorType — user or algo (who requested this data)
requestedBy.actorId — Email (for user) or algo ID
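A metadata block might look like this sketch (all values are placeholders):

```yaml
metadata:
  name: momentum-algo-data
  description: SIP trades and 1-minute bars for the momentum algo
  requestedBy:
    actorType: user
    actorId: jane@example.com
```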
Mode
Specifies how the data will be used:
Values:
live — Real-time trading with live market data
paper — Paper trading with live data, simulated execution
backtest — Historical data only for backtesting
Mode affects:
Data latency requirements
Subscription behavior
Storage and retention policies
Datasets
Datasets are the core building blocks. Each dataset represents a packaged data product:
SIP — Securities Information Processor (US equities via Polygon or other providers)
TotalView — Level 2 market depth data
PrecisionAlpha — Alternative data and pre-computed signals
Each dataset configuration includes:
Symbols — Which instruments to include
Reference data — Static metadata (boolean flag)
Live data — Real-time ticks and aggregates to subscribe to
History — Retention periods for each schema
Basic Dataset Structure
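A single dataset entry combining those four parts might look like the following sketch (the symbols and referenceData field names are assumptions):

```yaml
datasets:
  - name: SIP
    symbols:
      tickers: ["AAPL", "MSFT"]
    referenceData: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m"
    history:
      ticks:
        - schema: trades
          duration: 7d
      aggregates:
        - schema: ohlc-1m
          duration: 1y
      reference:
        duration: 1y
```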
Symbols
Define which instruments you want data for:
Ticker Lists
Explicit list of symbols:
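For example, assuming a symbols.tickers field:

```yaml
symbols:
  tickers: ["AAPL", "MSFT", "NVDA", "BRK.B"]
```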
Wildcard support:
"NV*"— Matches all tickers starting with NV (NVDA, NVAX, etc.)"*"— Matches all symbols in the dataset"BRK.B"— Special characters supported
Universes
Pre-defined symbol sets:
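For example, assuming a symbols.universes field:

```yaml
symbols:
  universes: ["SP500", "NDX100"]
  tickers: ["TSLA"]   # explicit tickers are unioned with universe membership
```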
Available universes:
SP500 — S&P 500 constituents
NDX100 — NASDAQ-100 constituents
RUSSELL2000 — Russell 2000 constituents
Universe behavior:
Membership is dynamic (updates as constituents change)
Combines with explicit tickers (union)
Can specify multiple universes
All Symbols
To get all symbols in a dataset:
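Using the wildcard ticker (same assumed field name as above):

```yaml
symbols:
  tickers: ["*"]
```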
Or omit universes to use the dataset's full scope.
Reference Data
Boolean flag indicating whether to include static metadata:
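Assuming the referenceData field name used in the sketches above:

```yaml
referenceData: true
```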
Reference data includes:
Security master information
Corporate actions (splits, dividends)
Symbol metadata (name, type, exchange)
Calendar information (market hours, holidays)
When to enable:
Almost always true for trading algos (need corporate action adjustments)
Can be false for pure signal algos that don't need context
Historical reference data retained per history.reference.duration
Live Data
Defines which real-time data streams to subscribe to:
Ticks
Raw market data streams:
Format: Comma-separated list of schema IDs
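For example, to subscribe to SIP trades and quotes:

```yaml
live:
  ticks: "trades,quotes"   # comma-separated tick schema IDs; "none", "all", or "*" also accepted
```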
SIP tick schemas:
trades — Trade executions (price, size, timestamp)
quotes — Top-of-book bid/ask quotes
TotalView tick schemas:
trades — Trade executions
quotes-1 — Top-of-book quotes
quotes-2 — Level 2 market depth (10 levels)
Crypto tick schemas:
trades — Trade executions
quotes-1 — Top-of-book quotes
quotes-10 — 10-level order book depth
PrecisionAlpha tick schemas:
pa-1s — 1-second PrecisionAlpha signals
pa-1m — 1-minute PrecisionAlpha signals
pa-1d — Daily PrecisionAlpha signals
Reserved words:
none — No tick data
all or * — All available tick schemas for this dataset
Aggregates
Pre-computed analytics delivered in real-time:
Format: Comma-separated list of schema IDs
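For example, to receive 1-minute bars plus a couple of pre-computed indicators:

```yaml
live:
  aggregates: "ohlc-1m,ema-1m-20,vwap-1m"   # comma-separated aggregate schema IDs, or "none"
```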
OHLC (Bar) schemas:
ohlc-1s — 1-second bars
ohlc-1m — 1-minute bars
ohlc-1h — 1-hour bars
ohlc-1d — Daily bars
Technical indicator schemas:
ema-1m-20 — 20-period EMA on 1-minute bars
ema-1m-50 — 50-period EMA on 1-minute bars
sma-1m-20 — 20-period SMA on 1-minute bars
vwap-1m — VWAP on 1-minute bars
Signal schemas:
signal-1m-12-26-9 — MACD signal (12, 26, 9 periods on 1-minute bars)
signal-1d-12-26-9 — MACD signal on daily bars
Reserved words:
none — No aggregates
Why aggregates matter:
Pre-computed by Data Cloud (no need to calculate yourself)
Delivered in real-time as market data updates
Consistent calculations across all algos
Lower latency than computing yourself
History
Defines retention periods for historical data:
Tick History
Retention for tick-level data:
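For example, assuming history.ticks takes a list of schema/duration entries:

```yaml
history:
  ticks:
    - schema: trades
      duration: 30d
    - schema: quotes
      duration: 7d
```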
Each entry specifies:
schema — Tick schema ID (must match a schema from live.ticks)
duration — How long to retain this data
Aggregate History
Retention for pre-computed analytics:
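For example:

```yaml
history:
  aggregates:
    - schema: ohlc-1m
      duration: 6m
    - schema: ohlc-1d
      duration: 1y
```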
Each entry specifies:
schema — Aggregate schema ID (must match a schema from live.aggregates)
duration — How long to retain this data
Reference History
Retention for reference data:
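For example, using the documented history.reference.duration field:

```yaml
history:
  reference:
    duration: 1y
```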
Use cases:
Backtesting with historical corporate actions
Replaying with accurate symbol metadata
Auditing historical reference data changes
Duration Format
Simple format:
7d — 7 days
30d — 30 days
6m — 6 months
1y — 1 year
0d — No retention (no historical data)
ISO-8601 format:
P90D — 90 days
P6M — 6 months
P1Y — 1 year
Considerations:
Longer retention = more storage cost
Different schemas can have different retention
Tick data typically warrants shorter retention (highest volume)
Aggregates typically warrant longer retention (much smaller footprint)
Balance backtest needs vs. cost
Available Datasets
SIP (Securities Information Processor)
US equities market data via SIP feeds:
Characteristics:
US equities only
Trade and quote data
Multiple aggregate options
Provider: Currently Polygon (more providers coming)
Tick schemas:
trades — Every trade execution
quotes — Top-of-book bid/ask updates
Aggregate schemas:
OHLC bars: ohlc-1s, ohlc-1m, ohlc-1h, ohlc-1d
Technical indicators: ema-1m-20, ema-1m-50, sma-1m-20, vwap-1m
Signals: signal-1m-12-26-9
PrecisionAlpha
Alternative data and pre-computed signals:
Characteristics:
Proprietary alternative data
Pre-computed signals
Multiple timeframes
Typically used alongside SIP data
Tick schemas:
pa-1s — 1-second PrecisionAlpha updates
pa-1m — 1-minute PrecisionAlpha updates
pa-1d — Daily PrecisionAlpha updates
Aggregate schemas:
signal-1m-12-26-9 — MACD-style signal line
Additional signal schemas available
TotalView
Level 2 market depth data (coming soon):
Characteristics:
Full market depth (10 levels)
Trade and multi-level quote data
Higher latency requirements
More expensive than SIP
Tick schemas:
trades — Trade executions
quotes-1 — Top-of-book (level 1)
quotes-2 — 10-level market depth (level 2)
Complete Examples
Example 1: Simple Equity Strategy
Minimal spec for basic stock trading:
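A sketch of such a spec, using the same field-name assumptions as the examples above:

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: simple-equity-dev
  description: Trades and quotes for a handful of large caps
  requestedBy:
    actorType: user
    actorId: jane@example.com
mode: paper
datasets:
  - name: SIP
    symbols:
      tickers: ["AAPL", "MSFT", "GOOGL"]
    referenceData: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m"
    history:
      ticks:
        - schema: trades
          duration: 7d
      aggregates:
        - schema: ohlc-1m
          duration: 6m
      reference:
        duration: 6m
```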
Example 2: Technical Analysis Strategy
Using pre-computed indicators:
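A sketch that leans on pre-computed aggregate schemas instead of raw ticks (same field-name assumptions):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: technical-analysis-data
  description: Pre-computed indicators on 1-minute bars
  requestedBy:
    actorType: algo
    actorId: ta-momentum-v2          # placeholder algo ID
mode: live
datasets:
  - name: SIP
    symbols:
      universes: ["NDX100"]
    referenceData: true
    live:
      ticks: "trades"
      aggregates: "ohlc-1m,ema-1m-20,ema-1m-50,vwap-1m,signal-1m-12-26-9"
    history:
      aggregates:
        - schema: ohlc-1m
          duration: 1y
        - schema: signal-1m-12-26-9
          duration: 1y
      reference:
        duration: 1y
```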
Example 3: Multi-Dataset Strategy
Combining SIP and PrecisionAlpha:
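A sketch with two dataset entries (same field-name assumptions):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: multi-dataset-strategy
  description: SIP market data combined with PrecisionAlpha signals
  requestedBy:
    actorType: user
    actorId: jane@example.com
mode: paper
datasets:
  - name: SIP
    symbols:
      universes: ["SP500"]
    referenceData: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m"
    history:
      aggregates:
        - schema: ohlc-1m
          duration: 1y
      reference:
        duration: 1y
  - name: PrecisionAlpha
    symbols:
      universes: ["SP500"]
    referenceData: false
    live:
      ticks: "pa-1m"
      aggregates: "signal-1m-12-26-9"
    history:
      ticks:
        - schema: pa-1m
          duration: 6m
      aggregates:
        - schema: signal-1m-12-26-9
          duration: 1y
```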
Example 4: Backtest-Only Configuration
Historical data for backtesting:
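A sketch with live subscriptions disabled and history only (same field-name assumptions):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: backtest-history-only
  description: Historical bars and reference data, no live subscriptions
  requestedBy:
    actorType: user
    actorId: jane@example.com
mode: backtest
datasets:
  - name: SIP
    symbols:
      universes: ["SP500"]
    referenceData: true
    live:
      ticks: "none"
      aggregates: "none"
    history:
      ticks:
        - schema: trades
          duration: 30d
      aggregates:
        - schema: ohlc-1m
          duration: 1y
        - schema: ohlc-1d
          duration: 1y
      reference:
        duration: 1y
```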
Best Practices
Start Small, Scale Up
Begin with minimal data and expand as needed:
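For example, a dev spec might start with two tickers and one bar schema, then grow into a universe with richer aggregates once the algo works (illustrative fragments, same assumptions as above):

```yaml
# Starting point: a few explicit tickers, one aggregate schema
symbols:
  tickers: ["AAPL", "MSFT"]
live:
  aggregates: "ohlc-1m"

# Later: widen to a universe and add the indicators the algo actually uses
symbols:
  universes: ["SP500"]
live:
  aggregates: "ohlc-1m,ema-1m-20,vwap-1m"
```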
Match Retention to Usage
Different schemas need different retention:
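For example:

```yaml
history:
  ticks:
    - schema: trades          # high volume: keep only what replay actually needs
      duration: 7d
  aggregates:
    - schema: ohlc-1m
      duration: 6m
    - schema: ohlc-1d         # small footprint: cheap to keep for long backtests
      duration: 1y
```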
Use Aggregates When Possible
Let the Data Cloud compute indicators:
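For example, subscribing to the ema-1m-20 schema instead of computing the EMA in your algo:

```yaml
live:
  aggregates: "ohlc-1m,ema-1m-20"
```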
Why:
Lower latency (already computed)
Consistent calculations
Less compute in your algo
Tested and validated by Datafye
Leverage Universes
Use universes for dynamic symbol sets:
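For example:

```yaml
symbols:
  universes: ["SP500"]   # membership tracks index changes automatically
```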
Benefits:
No manual symbol list maintenance
Survivorship-bias-free
Automatically handles index rebalancing
Separate Dev and Prod Specs
Maintain different specs for different environments:
dev-dataspec.yaml:
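A sketch (same field-name assumptions as above):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: myalgo-dev
mode: paper
datasets:
  - name: SIP
    symbols:
      tickers: ["AAPL", "MSFT"]   # a handful of tickers keeps dev costs low
    referenceData: true
    live:
      ticks: "trades"
      aggregates: "ohlc-1m"
    history:
      aggregates:
        - schema: ohlc-1m
          duration: 30d
```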
prod-dataspec.yaml:
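A sketch:

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: myalgo-prod
mode: live
datasets:
  - name: SIP
    symbols:
      universes: ["SP500"]        # full production universe
    referenceData: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m,ema-1m-20,vwap-1m"
    history:
      ticks:
        - schema: trades
          duration: 30d
      aggregates:
        - schema: ohlc-1m
          duration: 1y
      reference:
        duration: 1y
```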
Validation
The CLI validates your DataSpec before provisioning:
Schema Validation
Required fields present (apiVersion, kind, metadata.name, mode, datasets)
Valid enum values (mode, dataset names, schema IDs)
Correct data types (strings, booleans, arrays)
Valid duration formats
Semantic Validation
Dataset names are supported (SIP, PrecisionAlpha, TotalView)
Schema IDs exist for the specified dataset
History schemas match live schemas
Ticker formats are valid
Universe names are recognized
Logical Validation
Mode is appropriate for usage (can't use backtest mode with live subscriptions)
History retention is reasonable (not excessive)
Symbol wildcards are valid
No conflicting configurations
Common validation errors:
Unknown dataset name
Invalid schema ID for dataset
History schema not in live schemas
Invalid duration format
Invalid mode value
Missing required fields
Version Control
Store DataSpecs in git with your algo code:
Best practices:
Use meaningful names (myalgo-dev.yaml, not data.yaml)
Include description in metadata
Document why specific schemas/retention are chosen
Tag specs with algo versions
Review diffs carefully (data changes affect algo behavior)
Cost Optimization
DataSpec choices affect costs:
Data Volume
More symbols = more data = higher cost
Use specific tickers over "*" when possible
Universe memberships are dynamic (SP500 is ~500 symbols)
Retention Duration
Longer retention = more storage = higher cost
Match retention to actual backtest needs
Tick data is highest volume (use shorter retention)
Daily aggregates are lowest volume (can retain longer)
Live Subscriptions
Each live schema = real-time subscription cost
Subscribe only to schemas you actually use
Aggregates generally lower cost than computing from ticks
Cost optimization example:
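An illustrative before/after fragment (same field-name assumptions as above):

```yaml
# Before: everything, retained for a year
symbols:
  tickers: ["*"]
live:
  ticks: "all"
history:
  ticks:
    - schema: trades
      duration: 1y

# After: only what the algo consumes, with tick retention trimmed
symbols:
  universes: ["SP500"]
live:
  ticks: "trades"
  aggregates: "ohlc-1m"
history:
  ticks:
    - schema: trades
      duration: 7d
  aggregates:
    - schema: ohlc-1m
      duration: 1y
```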
Next Steps
See complete reference — Data Descriptor Reference
Learn about algos — Algo Descriptors
Understand brokers — Broker Descriptors
Start building — Foundry: Data Cloud Only
Last updated: 2025-10-11