Data Descriptors

Data descriptors (DataSpecs) define what market data your Datafye environment needs. They specify which datasets, symbols, live data streams, and historical retention your algo requires — without dealing with infrastructure details.

Prerequisite: Before reading this page, make sure you understand what datasets are and how they're structured.

Purpose

A DataSpec is a declarative blueprint that tells the Datafye Data Cloud:

Which datasets you need (SIP, PrecisionAlpha, TotalView)
Which symbols within those datasets
Whether to include reference data for context
Which live tick streams to subscribe to
Which aggregate analytics to consume in real-time
How long to retain historical data per schema for replay, backtesting, or audits

Based on your DataSpec, Datafye provisions the Data Cloud with appropriate data sources, subscriptions, and storage — translating your functional requirements into running infrastructure.

When You Need Data Descriptors

Data descriptors are required for all Datafye scenarios:

Foundry: Data Cloud Only — Specifies what data to make available via APIs
Foundry: Full Stack — Defines data for backtesting and algo development
Trading: Data Cloud + Broker — Configures real-time feeds for paper/live trading
Trading: Full Stack — Provides data for both development and execution

Structure

A DataSpec is a YAML or JSON document with a Kubernetes-style structure:

apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: <string>
  description: <string>
  requestedBy:
    actorType: user | algo
    actorId: <string>
mode: live | paper | backtest
datasets:
  - dataset: <dataset-name>
    provider: <provider-name>
    symbols:
      tickers: [<list>]
      universes: [<list>]
    reference: true | false
    live:
      ticks: <schema-list>
      aggregates: <schema-list>
    history:
      ticks:
        - { schema: <name>, duration: <period> }
      aggregates:
        - { schema: <name>, duration: <period> }
      reference:
        duration: <period>

API Version and Kind

These fields identify the descriptor format:

apiVersion: datafye.io/v1
kind: DataSpec

Fixed values:

apiVersion: datafye.io/v1 — Current DataSpec version
kind: DataSpec — Identifies this as a data descriptor

Metadata

Human-readable information for identification and traceability:

metadata:
  name: us-core-demo
  description: "Demo of SIP data with OHLC aggregates"
  requestedBy:
    actorType: user
    actorId: [email protected]

Fields:

name — Unique identifier for this spec (lowercase, hyphens)
description — Optional human-readable description
requestedBy.actorType — user or algo (who requested this data)
requestedBy.actorId — Email (for user) or algo ID

Mode

Specifies how the data will be used:

mode: live

Values:

live — Real-time trading with live market data
paper — Paper trading with live data, simulated execution
backtest — Historical data only for backtesting

Mode affects:

Data latency requirements
Subscription behavior
Storage and retention policies

Datasets

Datasets are the core building blocks. Each dataset represents a packaged data product:

SIP — Securities Information Processor (US equities via Polygon or other providers)
TotalView — Level 2 market depth data
PrecisionAlpha — Alternative data and pre-computed signals

Each dataset configuration includes:

Symbols — Which instruments to include
Reference data — Static metadata (boolean flag)
Live data — Real-time ticks and aggregates to subscribe to
History — Retention periods for each schema

Basic Dataset Structure

datasets:
  - dataset: SIP
    provider: Polygon
    symbols:
      tickers: ["AAPL", "MSFT"]
      universes: ["SP500"]
    reference: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m,ema-1m-20"
    history:
      ticks:
        - { schema: trades, duration: 30d }
        - { schema: quotes, duration: 7d }
      aggregates:
        - { schema: ohlc-1m, duration: 180d }
        - { schema: ema-1m-20, duration: 180d }
      reference:
        duration: 365d

Symbols

Define which instruments you want data for:

symbols:
  tickers: ["AAPL", "MSFT", "NV*", "BRK.B"]
  universes: ["SP500", "NDX100"]

Ticker Lists

Explicit list of symbols:

symbols:
  tickers: ["AAPL", "GOOGL", "MSFT"]

Wildcard support:

"NV*" — Matches all tickers starting with NV (NVDA, NVAX, etc.)
"*" — Matches all symbols in the dataset
"BRK.B" — Special characters supported

Universes

Pre-defined symbol sets:

symbols:
  tickers: ["AAPL"]  # Explicit tickers
  universes: ["SP500", "NDX100"]  # Plus universe members

Available universes:

SP500 — S&P 500 constituents
NDX100 — NASDAQ-100 constituents
RUSSELL2000 — Russell 2000 constituents

Universe behavior:

Membership is dynamic (updates as constituents change)
Combines with explicit tickers (union)
Can specify multiple universes

All Symbols

To get all symbols in a dataset:

symbols:
  tickers: "*"

Or omit universes to use the dataset's full scope.

Reference Data

Boolean flag indicating whether to include static metadata:

reference: true   # Include reference data
reference: false  # Exclude reference data

Reference data includes:

Security master information
Corporate actions (splits, dividends)
Symbol metadata (name, type, exchange)
Calendar information (market hours, holidays)

When to enable:

Almost always true for trading algos (need corporate action adjustments)
Can be false for pure signal algos that don't need context
Historical reference data retained per history.reference.duration

Live Data

Defines which real-time data streams to subscribe to:

live:
  ticks: "trades,quotes"
  aggregates: "ohlc-1m,ema-1m-20,vwap-1m"

Ticks

Raw market data streams:

live:
  ticks: "trades,quotes"

Format: Comma-separated list of schema IDs

SIP tick schemas:

trades — Trade executions (price, size, timestamp)
quotes — Top-of-book bid/ask quotes

TotalView tick schemas:

trades — Trade executions
quotes-1 — Top-of-book quotes
quotes-2 — Level 2 market depth (10 levels)

Crypto tick schemas:

trades — Trade executions
quotes-1 — Top-of-book quotes
quotes-10 — 10-level order book depth

PrecisionAlpha tick schemas:

pa-1s — 1-second PrecisionAlpha signals
pa-1m — 1-minute PrecisionAlpha signals
pa-1d — Daily PrecisionAlpha signals

Reserved words:

none — No tick data
all or * — All available tick schemas for this dataset

Aggregates

Pre-computed analytics delivered in real-time:

live:
  aggregates: "ohlc-1m,ema-1m-20,signal-1m-12-26-9,vwap-1m"

Format: Comma-separated list of schema IDs

OHLC (Bar) schemas:

ohlc-1s — 1-second bars
ohlc-1m — 1-minute bars
ohlc-1h — 1-hour bars
ohlc-1d — Daily bars

Technical indicator schemas:

ema-1m-20 — 20-period EMA on 1-minute bars
ema-1m-50 — 50-period EMA on 1-minute bars
sma-1m-20 — 20-period SMA on 1-minute bars
vwap-1m — VWAP on 1-minute bars

Signal schemas:

signal-1m-12-26-9 — MACD signal (12, 26, 9 periods on 1-minute bars)
signal-1d-12-26-9 — MACD signal on daily bars

Reserved words:

none — No aggregates

Why aggregates matter:

Pre-computed by Data Cloud (no need to calculate yourself)
Delivered in real-time as market data updates
Consistent calculations across all algos
Lower latency than computing yourself

History

Defines retention periods for historical data:

history:
  ticks:
    - { schema: trades, duration: 30d }
    - { schema: quotes, duration: 7d }
  aggregates:
    - { schema: ohlc-1m, duration: 180d }
    - { schema: ema-1m-20, duration: 180d }
  reference:
    duration: 365d

Tick History

Retention for tick-level data:

history:
  ticks:
    - { schema: trades, duration: 30d }
    - { schema: quotes, duration: 7d }

Each entry specifies:

schema — Tick schema ID (must match a schema from live.ticks)
duration — How long to retain this data

Aggregate History

Retention for pre-computed analytics:

history:
  aggregates:
    - { schema: ohlc-1m, duration: 180d }
    - { schema: ema-1m-20, duration: 180d }
    - { schema: vwap-1m, duration: 90d }

Each entry specifies:

schema — Aggregate schema ID (must match a schema from live.aggregates)
duration — How long to retain this data

Reference History

Retention for reference data:

history:
  reference:
    duration: 365d

Use cases:

Backtesting with historical corporate actions
Replaying with accurate symbol metadata
Auditing historical reference data changes

Duration Format

Simple format:

7d — 7 days
30d — 30 days
6m — 6 months
1y — 1 year
0d — No retention (no historical data)

ISO-8601 format:

P90D — 90 days
P6M — 6 months
P1Y — 1 year

Considerations:

Longer retention = more storage cost
Different schemas can have different retention
Tick data typically shorter retention (volume)
Aggregates typically longer retention (smaller)
Balance backtest needs vs. cost

Available Datasets

SIP (Securities Information Processor)

US equities market data via SIP feeds:

datasets:
  - dataset: SIP
    provider: Polygon
    symbols:
      tickers: ["AAPL", "MSFT"]
      universes: ["SP500"]
    reference: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m,ohlc-1d,ema-1m-20,vwap-1m"
    history:
      ticks:
        - { schema: trades, duration: 30d }
        - { schema: quotes, duration: 7d }
      aggregates:
        - { schema: ohlc-1m, duration: 180d }
        - { schema: ohlc-1d, duration: 365d }
        - { schema: ema-1m-20, duration: 180d }
      reference:
        duration: 365d

Characteristics:

US equities only
Trade and quote data
Multiple aggregate options
Provider: Currently Polygon (more providers coming)

Tick schemas:

trades — Every trade execution
quotes — Top-of-book bid/ask updates

Aggregate schemas:

OHLC bars: ohlc-1s, ohlc-1m, ohlc-1h, ohlc-1d
Technical indicators: ema-1m-20, ema-1m-50, sma-1m-20, vwap-1m
Signals: signal-1m-12-26-9

PrecisionAlpha

Alternative data and pre-computed signals:

datasets:
  - dataset: PrecisionAlpha
    symbols:
      tickers: "*"  # Typically all symbols
    reference: false  # Usually no reference needed
    live:
      ticks: "pa-1m"
      aggregates: "signal-1m-12-26-9"
    history:
      ticks:
        - { schema: pa-1m, duration: 90d }
      aggregates:
        - { schema: signal-1m-12-26-9, duration: 90d }
      reference:
        duration: 0d

Characteristics:

Proprietary alternative data
Pre-computed signals
Multiple timeframes
Typically used alongside SIP data

Tick schemas:

pa-1s — 1-second PrecisionAlpha updates
pa-1m — 1-minute PrecisionAlpha updates
pa-1d — Daily PrecisionAlpha updates

Aggregate schemas:

signal-1m-12-26-9 — MACD-style signal line
Additional signal schemas available

TotalView

Level 2 market depth data (coming soon):

datasets:
  - dataset: TotalView
    provider: NASDAQ
    symbols:
      tickers: ["AAPL", "GOOGL"]
    reference: true
    live:
      ticks: "trades,quotes-1,quotes-2"
      aggregates: "ohlc-1m,vwap-1m"
    history:
      ticks:
        - { schema: trades, duration: 30d }
        - { schema: quotes-1, duration: 7d }
        - { schema: quotes-2, duration: 7d }
      aggregates:
        - { schema: ohlc-1m, duration: 180d }
      reference:
        duration: 365d

Characteristics:

Full market depth (10 levels)
Trade and multi-level quote data
Higher latency requirements
More expensive than SIP

Tick schemas:

trades — Trade executions
quotes-1 — Top-of-book (level 1)
quotes-2 — 10-level market depth (level 2)

Complete Examples

Example 1: Simple Equity Strategy

Minimal spec for basic stock trading:

apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: simple-equity
  description: "Basic equity trading with OHLC bars"
mode: live

datasets:
  - dataset: SIP
    provider: Polygon
    symbols:
      tickers: ["AAPL", "GOOGL", "MSFT"]
    reference: true
    live:
      ticks: "trades"
      aggregates: "ohlc-1m"
    history:
      ticks:
        - { schema: trades, duration: 7d }
      aggregates:
        - { schema: ohlc-1m, duration: 30d }
      reference:
        duration: 90d

Example 2: Technical Analysis Strategy

Using pre-computed indicators:

apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: tech-analysis
  description: "EMA and VWAP based strategy"
mode: live

datasets:
  - dataset: SIP
    provider: Polygon
    symbols:
      universes: ["SP500"]
    reference: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m,ema-1m-20,ema-1m-50,vwap-1m"
    history:
      ticks:
        - { schema: trades, duration: 30d }
      aggregates:
        - { schema: ohlc-1m, duration: 180d }
        - { schema: ema-1m-20, duration: 180d }
        - { schema: ema-1m-50, duration: 180d }
        - { schema: vwap-1m, duration: 180d }
      reference:
        duration: 365d

Example 3: Multi-Dataset Strategy

Combining SIP and PrecisionAlpha:

apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: sip-pa-combo
  description: "SIP data with PrecisionAlpha signals"
mode: live

datasets:
  - dataset: SIP
    provider: Polygon
    symbols:
      tickers: ["AAPL", "MSFT"]
      universes: ["SP500"]
    reference: true
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1s,ohlc-1m,ema-1m-20,vwap-1m"
    history:
      ticks:
        - { schema: trades, duration: 30d }
        - { schema: quotes, duration: 7d }
      aggregates:
        - { schema: ohlc-1s, duration: 30d }
        - { schema: ohlc-1m, duration: 180d }
        - { schema: ema-1m-20, duration: 180d }
      reference:
        duration: 365d

  - dataset: PrecisionAlpha
    symbols:
      tickers: "*"
    reference: false
    live:
      ticks: "pa-1s,pa-1m"
      aggregates: "signal-1m-12-26-9"
    history:
      ticks:
        - { schema: pa-1m, duration: 90d }
      aggregates:
        - { schema: signal-1m-12-26-9, duration: 90d }
      reference:
        duration: 0d

Example 4: Backtest-Only Configuration

Historical data for backtesting:

apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: backtest-data
  description: "Historical data for backtesting"
mode: backtest

datasets:
  - dataset: SIP
    provider: Polygon
    symbols:
      universes: ["SP500"]
    reference: true
    live:
      ticks: none  # No live subscription for backtest
      aggregates: none
    history:
      ticks:
        - { schema: trades, duration: 365d }  # 1 year of trades
      aggregates:
        - { schema: ohlc-1m, duration: 365d }
        - { schema: ohlc-1d, duration: 1825d }  # 5 years of daily bars
      reference:
        duration: 1825d  # 5 years of reference data

Best Practices

Start Small, Scale Up

Begin with minimal data and expand as needed:

# Development
datasets:
  - dataset: SIP
    symbols:
      tickers: ["AAPL"]  # Single symbol
    live:
      ticks: "trades"
      aggregates: "ohlc-1m"
    history:
      ticks:
        - { schema: trades, duration: 7d }  # Short retention
      aggregates:
        - { schema: ohlc-1m, duration: 30d }

# Production
datasets:
  - dataset: SIP
    symbols:
      universes: ["SP500"]  # Full universe
    live:
      ticks: "trades,quotes"
      aggregates: "ohlc-1m,ema-1m-20,vwap-1m"
    history:
      ticks:
        - { schema: trades, duration: 30d }
      aggregates:
        - { schema: ohlc-1m, duration: 180d }
        - { schema: ema-1m-20, duration: 180d }

Match Retention to Usage

Different schemas need different retention:

history:
  ticks:
    - { schema: trades, duration: 30d }     # Tick data = short retention
    - { schema: quotes, duration: 7d }      # Quotes even shorter
  aggregates:
    - { schema: ohlc-1m, duration: 180d }   # Minute bars = medium retention
    - { schema: ohlc-1d, duration: 365d }   # Daily bars = long retention
    - { schema: ema-1m-20, duration: 180d } # Indicators match their timeframe
  reference:
    duration: 365d                           # Reference data = long retention

Use Aggregates When Possible

Let the Data Cloud compute indicators:

# Better: Use pre-computed aggregates
live:
  aggregates: "ohlc-1m,ema-1m-20,vwap-1m"

# Avoid: Computing from ticks yourself (unless needed)
live:
  ticks: "trades"  # Then compute OHLC, EMA, VWAP in your algo

Why:

Lower latency (already computed)
Consistent calculations
Less compute in your algo
Tested and validated by Datafye

Leverage Universes

Use universes for dynamic symbol sets:

symbols:
  universes: ["SP500"]  # Automatically updates as constituents change

Benefits:

No manual symbol list maintenance
Surviv

orship-bias-free

Automatically handles index rebalancing

Separate Dev and Prod Specs

Maintain different specs for different environments:

dev-dataspec.yaml:

metadata:
  name: myalgo-dev
mode: paper
datasets:
  - dataset: SIP
    symbols:
      tickers: ["AAPL", "MSFT"]  # Limited symbols
    history:
      aggregates:
        - { schema: ohlc-1m, duration: 30d }  # Short retention

prod-dataspec.yaml:

metadata:
  name: myalgo-prod
mode: live
datasets:
  - dataset: SIP
    symbols:
      universes: ["SP500"]  # Full universe
    history:
      aggregates:
        - { schema: ohlc-1m, duration: 180d }  # Longer retention

Validation

The CLI validates your DataSpec before provisioning:

Schema Validation

Required fields present (apiVersion, kind, metadata.name, mode, datasets)
Valid enum values (mode, dataset names, schema IDs)
Correct data types (strings, booleans, arrays)
Valid duration formats

Semantic Validation

Dataset names are supported (SIP, PrecisionAlpha, TotalView)
Schema IDs exist for the specified dataset
History schemas match live schemas
Ticker formats are valid
Universe names are recognized

Logical Validation

Mode is appropriate for usage (can't use backtest mode with live subscriptions)
History retention is reasonable (not excessive)
Symbol wildcards are valid
No conflicting configurations

Common validation errors:

Unknown dataset name
Invalid schema ID for dataset
History schema not in live schemas
Invalid duration format
Invalid mode value
Missing required fields

Version Control

Store DataSpecs in git with your algo code:

my-algo/
├── src/
├── dataspec-dev.yaml
├── dataspec-prod.yaml
└── README.md

Best practices:

Use meaningful names (myalgo-dev.yaml, not data.yaml)
Include description in metadata
Document why specific schemas/retention are chosen
Tag specs with algo versions
Review diffs carefully (data changes affect algo behavior)

Cost Optimization

DataSpec choices affect costs:

Data Volume

More symbols = more data = higher cost
Use specific tickers over "*" when possible
Universe memberships are dynamic (SP500 ~= 500 symbols)

Retention Duration

Longer retention = more storage = higher cost
Match retention to actual backtest needs
Tick data is highest volume (use shorter retention)
Daily aggregates are lowest volume (can retain longer)

Live Subscriptions

Each live schema = real-time subscription cost
Subscribe only to schemas you actually use
Aggregates generally lower cost than computing from ticks

Cost optimization example:

# Higher cost
history:
  ticks:
    - { schema: trades, duration: 365d }   # 1 year of tick data = expensive
    - { schema: quotes, duration: 365d }

# Lower cost (same functionality)
history:
  ticks:
    - { schema: trades, duration: 30d }    # Short tick retention
  aggregates:
    - { schema: ohlc-1m, duration: 365d }  # Long aggregate retention (smaller)
    - { schema: ohlc-1d, duration: 1825d } # 5 years daily (very small)

Next Steps

See complete reference — Data Descriptor Reference
Learn about algos — Algo Descriptors
Understand brokers — Broker Descriptors
Start building — Foundry: Data Cloud Only

Last updated: 2025-10-11

PreviousData APIs and Datasets NextThe Algo Container

Last updated 3 months ago

hashtagPurpose

hashtagWhen You Need Data Descriptors

hashtagStructure

hashtagAPI Version and Kind

hashtagMetadata

hashtagMode

hashtagDatasets

hashtagBasic Dataset Structure

hashtagSymbols

hashtagTicker Lists

hashtagUniverses

hashtagAll Symbols

hashtagReference Data

hashtagLive Data

hashtagTicks

hashtagAggregates

hashtagHistory

hashtagTick History

hashtagAggregate History

hashtagReference History

hashtagDuration Format

hashtagAvailable Datasets

hashtagSIP (Securities Information Processor)

hashtagPrecisionAlpha

hashtagTotalView

hashtagComplete Examples

hashtagExample 1: Simple Equity Strategy

hashtagExample 2: Technical Analysis Strategy

hashtagExample 3: Multi-Dataset Strategy

hashtagExample 4: Backtest-Only Configuration

hashtagBest Practices

hashtagStart Small, Scale Up

hashtagMatch Retention to Usage

hashtagUse Aggregates When Possible

hashtagLeverage Universes

hashtagSeparate Dev and Prod Specs

hashtagValidation

hashtagSchema Validation

hashtagSemantic Validation

hashtagLogical Validation

hashtagVersion Control

hashtagCost Optimization

hashtagData Volume

hashtagRetention Duration

hashtagLive Subscriptions

hashtagNext Steps

Purpose

When You Need Data Descriptors

Structure

API Version and Kind

Metadata

Mode

Datasets

Basic Dataset Structure

Symbols

Ticker Lists

Universes

All Symbols

Reference Data

Live Data

Ticks

Aggregates

History

Tick History

Aggregate History

Reference History

Duration Format

Available Datasets

SIP (Securities Information Processor)

PrecisionAlpha

TotalView

Complete Examples

Example 1: Simple Equity Strategy

Example 2: Technical Analysis Strategy

Example 3: Multi-Dataset Strategy

Example 4: Backtest-Only Configuration

Best Practices

Start Small, Scale Up

Match Retention to Usage

Use Aggregates When Possible

Leverage Universes

Separate Dev and Prod Specs

Validation

Schema Validation

Semantic Validation

Logical Validation

Version Control

Cost Optimization

Data Volume

Retention Duration

Live Subscriptions

Next Steps