Data Descriptors

Data descriptors (DataSpecs) define what market data your Datafye environment needs. They specify which datasets, symbols, and live data streams your algo requires, and how long history should be retained — all without dealing with infrastructure details.

Prerequisite: Before reading this page, make sure you understand what datasets are and how they're structured.

Purpose

A DataSpec is a declarative blueprint that tells the Datafye Data Cloud:

  • Which datasets you need (SIP, PrecisionAlpha, TotalView)

  • Which symbols within those datasets

  • Whether to include reference data for context

  • Which live tick streams to subscribe to

  • Which aggregate analytics to consume in real-time

  • How long to retain historical data per schema for replay, backtesting, or audits

Based on your DataSpec, Datafye provisions the Data Cloud with appropriate data sources, subscriptions, and storage — translating your functional requirements into running infrastructure.

When You Need Data Descriptors

Data descriptors are required for all Datafye scenarios:

  • Foundry: Data Cloud Only — Specifies what data to make available via APIs

  • Foundry: Full Stack — Defines data for backtesting and algo development

  • Trading: Data Cloud + Broker — Configures real-time feeds for paper/live trading

  • Trading: Full Stack — Provides data for both development and execution

Structure

A DataSpec is a YAML or JSON document with a Kubernetes-style structure:
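
At the top level, a spec looks like the sketch below. The field names follow the sections described on this page, but treat the exact nesting as illustrative; consult the schema reference for your Datafye version:

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: my-algo-data
mode: paper
datasets:
  - name: SIP
    # symbols, reference data, live streams, and history
    # are covered section by section below
```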

API Version and Kind

These fields identify the descriptor format:

Fixed values:

  • apiVersion: datafye.io/v1 — Current DataSpec version

  • kind: DataSpec — Identifies this as a data descriptor

Metadata

Human-readable information for identification and traceability:
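
For example (all values illustrative):

```yaml
metadata:
  name: momentum-algo-dev
  description: SIP trades and minute bars for the momentum algo
  requestedBy:
    actorType: user
    actorId: jane.doe@example.com
```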

Fields:

  • name — Unique identifier for this spec (lowercase, hyphens)

  • description — Optional human-readable description

  • requestedBy.actorType — user or algo (who requested this data)

  • requestedBy.actorId — Email (for user) or algo ID

Mode

Specifies how the data will be used:
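
In YAML this is a single top-level field:

```yaml
mode: paper   # one of: live, paper, backtest
```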

Values:

  • live — Real-time trading with live market data

  • paper — Paper trading with live data, simulated execution

  • backtest — Historical data only for backtesting

Mode affects:

  • Data latency requirements

  • Subscription behavior

  • Storage and retention policies

Datasets

Datasets are the core building blocks. Each dataset represents a packaged data product:

  • SIP — Securities Information Processor (US equities via Polygon or other providers)

  • TotalView — Level 2 market depth data

  • PrecisionAlpha — Alternative data and pre-computed signals

Each dataset configuration includes:

  1. Symbols — Which instruments to include

  2. Reference data — Static metadata (boolean flag)

  3. Live data — Real-time ticks and aggregates to subscribe to

  4. History — Retention periods for each schema

Basic Dataset Structure
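
A single dataset entry combines the four parts above. A sketch, using the key names assumed throughout this page:

```yaml
datasets:
  - name: SIP
    symbols:                       # 1. which instruments
      universes: [SP500]
    reference: true                # 2. static metadata (key name assumed)
    live:                          # 3. real-time subscriptions
      ticks: trades,quotes
      aggregates: ohlc-1m
    history:                       # 4. retention per schema
      ticks:
        - schema: trades
          duration: 7d
```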

Symbols

Define which instruments you want data for:

Ticker Lists

Explicit list of symbols:
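
For example:

```yaml
symbols:
  tickers: ["AAPL", "MSFT", "NV*", "BRK.B"]
```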

Wildcard support:

  • "NV*" — Matches all tickers starting with NV (NVDA, NVAX, etc.)

  • "*" — Matches all symbols in the dataset

  • "BRK.B" — Special characters supported

Universes

Pre-defined symbol sets:
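
For example:

```yaml
symbols:
  universes: [SP500, NDX100]
  tickers: ["TSLA"]   # merged (union) with universe members
```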

Available universes:

  • SP500 — S&P 500 constituents

  • NDX100 — NASDAQ-100 constituents

  • RUSSELL2000 — Russell 2000 constituents

Universe behavior:

  • Membership is dynamic (updates as constituents change)

  • Combines with explicit tickers (union)

  • Can specify multiple universes

All Symbols

To get all symbols in a dataset:
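
Use the wildcard ticker:

```yaml
symbols:
  tickers: ["*"]
```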

Or omit universes to use the dataset's full scope.

Reference Data

Boolean flag indicating whether to include static metadata:
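
For example (the exact key name is an assumption; check the schema reference):

```yaml
reference: true
```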

Reference data includes:

  • Security master information

  • Corporate actions (splits, dividends)

  • Symbol metadata (name, type, exchange)

  • Calendar information (market hours, holidays)

When to enable:

  • Almost always true for trading algos (need corporate action adjustments)

  • Can be false for pure signal algos that don't need context

  • Historical reference data retained per history.reference.duration

Live Data

Defines which real-time data streams to subscribe to:

Ticks

Raw market data streams:

Format: Comma-separated list of schema IDs
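
For example, subscribing to both SIP streams:

```yaml
live:
  ticks: trades,quotes
```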

SIP tick schemas:

  • trades — Trade executions (price, size, timestamp)

  • quotes — Top-of-book bid/ask quotes

TotalView tick schemas:

  • trades — Trade executions

  • quotes-1 — Top-of-book quotes

  • quotes-2 — Level 2 market depth (10 levels)

Crypto tick schemas:

  • trades — Trade executions

  • quotes-1 — Top-of-book quotes

  • quotes-10 — 10-level order book depth

PrecisionAlpha tick schemas:

  • pa-1s — 1-second PrecisionAlpha signals

  • pa-1m — 1-minute PrecisionAlpha signals

  • pa-1d — Daily PrecisionAlpha signals

Reserved words:

  • none — No tick data

  • all or * — All available tick schemas for this dataset

Aggregates

Pre-computed analytics delivered in real-time:

Format: Comma-separated list of schema IDs
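
For example:

```yaml
live:
  aggregates: ohlc-1m,ema-1m-20,vwap-1m
```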

OHLC (Bar) schemas:

  • ohlc-1s — 1-second bars

  • ohlc-1m — 1-minute bars

  • ohlc-1h — 1-hour bars

  • ohlc-1d — Daily bars

Technical indicator schemas:

  • ema-1m-20 — 20-period EMA on 1-minute bars

  • ema-1m-50 — 50-period EMA on 1-minute bars

  • sma-1m-20 — 20-period SMA on 1-minute bars

  • vwap-1m — VWAP on 1-minute bars

Signal schemas:

  • signal-1m-12-26-9 — MACD signal (12, 26, 9 periods on 1-minute bars)

  • signal-1d-12-26-9 — MACD signal on daily bars

Reserved words:

  • none — No aggregates

Why aggregates matter:

  • Pre-computed by Data Cloud (no need to calculate yourself)

  • Delivered in real-time as market data updates

  • Consistent calculations across all algos

  • Lower latency than computing yourself

History

Defines retention periods for historical data:

Tick History

Retention for tick-level data:
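
A sketch, using the layout assumed earlier:

```yaml
history:
  ticks:
    - schema: trades
      duration: 7d
    - schema: quotes
      duration: 7d
```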

Each entry specifies:

  • schema — Tick schema ID (must match a schema from live.ticks)

  • duration — How long to retain this data

Aggregate History

Retention for pre-computed analytics:
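
For example:

```yaml
history:
  aggregates:
    - schema: ohlc-1m
      duration: 6m
    - schema: ohlc-1d
      duration: 1y
```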

Each entry specifies:

  • schema — Aggregate schema ID (must match a schema from live.aggregates)

  • duration — How long to retain this data

Reference History

Retention for reference data:
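
For example:

```yaml
history:
  reference:
    duration: 1y
```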

Use cases:

  • Backtesting with historical corporate actions

  • Replaying with accurate symbol metadata

  • Auditing historical reference data changes

Duration Format

Simple format:

  • 7d — 7 days

  • 30d — 30 days

  • 6m — 6 months

  • 1y — 1 year

  • 0d — No retention (no historical data)

ISO-8601 format:

  • P90D — 90 days

  • P6M — 6 months

  • P1Y — 1 year

Considerations:

  • Longer retention = more storage cost

  • Different schemas can have different retention

  • Tick data typically gets shorter retention (high volume)

  • Aggregates typically get longer retention (much smaller footprint)

  • Balance backtest needs vs. cost

Available Datasets

SIP (Securities Information Processor)

US equities market data via SIP feeds:

Characteristics:

  • US equities only

  • Trade and quote data

  • Multiple aggregate options

  • Provider: Currently Polygon (more providers coming)

Tick schemas:

  • trades — Every trade execution

  • quotes — Top-of-book bid/ask updates

Aggregate schemas:

  • OHLC bars: ohlc-1s, ohlc-1m, ohlc-1h, ohlc-1d

  • Technical indicators: ema-1m-20, ema-1m-50, sma-1m-20, vwap-1m

  • Signals: signal-1m-12-26-9

PrecisionAlpha

Alternative data and pre-computed signals:

Characteristics:

  • Proprietary alternative data

  • Pre-computed signals

  • Multiple timeframes

  • Typically used alongside SIP data

Tick schemas:

  • pa-1s — 1-second PrecisionAlpha updates

  • pa-1m — 1-minute PrecisionAlpha updates

  • pa-1d — Daily PrecisionAlpha updates

Aggregate schemas:

  • signal-1m-12-26-9 — MACD-style signal line

  • Additional signal schemas available

TotalView

Level 2 market depth data (coming soon):

Characteristics:

  • Full market depth (10 levels)

  • Trade and multi-level quote data

  • Higher latency requirements

  • More expensive than SIP

Tick schemas:

  • trades — Trade executions

  • quotes-1 — Top-of-book (level 1)

  • quotes-2 — 10-level market depth (level 2)

Complete Examples

Example 1: Simple Equity Strategy

Minimal spec for basic stock trading:
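
A sketch of such a spec, reusing the field layout assumed in the Structure section (values illustrative):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: simple-equity-dev
  description: Live trades and quotes for a small watchlist
  requestedBy:
    actorType: user
    actorId: jane.doe@example.com
mode: paper
datasets:
  - name: SIP
    symbols:
      tickers: ["AAPL", "MSFT", "NVDA"]
    reference: true
    live:
      ticks: trades,quotes
      aggregates: none
    history:
      ticks:
        - schema: trades
          duration: 7d
```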

Example 2: Technical Analysis Strategy

Using pre-computed indicators:
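
A possible spec that consumes bars and indicators instead of raw ticks:

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: tech-analysis-paper
mode: paper
datasets:
  - name: SIP
    symbols:
      universes: [NDX100]
    reference: true
    live:
      ticks: none
      aggregates: ohlc-1m,ema-1m-20,ema-1m-50,signal-1m-12-26-9
    history:
      aggregates:
        - schema: ohlc-1m
          duration: 6m
        - schema: signal-1m-12-26-9
          duration: 30d
```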

Example 3: Multi-Dataset Strategy

Combining SIP and PrecisionAlpha:
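
One way to combine the two datasets in a single spec (layout assumed as before):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: multi-dataset-live
mode: live
datasets:
  - name: SIP
    symbols:
      universes: [SP500]
    reference: true
    live:
      ticks: trades,quotes
      aggregates: ohlc-1m
    history:
      ticks:
        - schema: trades
          duration: 7d
  - name: PrecisionAlpha
    symbols:
      universes: [SP500]
    reference: false
    live:
      ticks: pa-1m
      aggregates: none
    history:
      ticks:
        - schema: pa-1m
          duration: 30d
```

Each dataset carries its own symbols, subscriptions, and retention.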

Example 4: Backtest-Only Configuration

Historical data for backtesting:
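
A backtest-only sketch: no live subscriptions, longer retention:

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: backtest-sp500
mode: backtest
datasets:
  - name: SIP
    symbols:
      universes: [SP500]
    reference: true
    live:
      ticks: none
      aggregates: none
    history:
      aggregates:
        - schema: ohlc-1d
          duration: P1Y
      reference:
        duration: P1Y
```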

Best Practices

Start Small, Scale Up

Begin with minimal data and expand as needed:
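
For instance, a first iteration might look like this, growing only when the algo demonstrably needs more:

```yaml
datasets:
  - name: SIP
    symbols:
      tickers: ["AAPL", "MSFT"]   # a short watchlist, not a whole universe
    live:
      ticks: trades
      aggregates: none
    history:
      ticks:
        - schema: trades
          duration: 7d            # extend later if backtests need more
```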

Match Retention to Usage

Different schemas need different retention:
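
For example, short retention for high-volume ticks, longer for compact daily bars:

```yaml
history:
  ticks:
    - schema: trades
      duration: 7d     # high volume: keep short
  aggregates:
    - schema: ohlc-1m
      duration: 6m
    - schema: ohlc-1d
      duration: 1y     # low volume: retain longer
```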

Use Aggregates When Possible

Let the Data Cloud compute indicators:
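
For example, subscribe to pre-computed bars and EMAs rather than deriving them from raw ticks (see the reasons below):

```yaml
live:
  ticks: none                       # skip raw ticks entirely
  aggregates: ohlc-1m,ema-1m-20     # consume pre-computed schemas instead
```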

Why:

  • Lower latency (already computed)

  • Consistent calculations

  • Less compute in your algo

  • Tested and validated by Datafye

Leverage Universes

Use universes for dynamic symbol sets:
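
For example:

```yaml
symbols:
  universes: [SP500]   # membership tracked by the Data Cloud
```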

Benefits:

  • No manual symbol list maintenance

  • Survivorship-bias-free

  • Automatically handles index rebalancing

Separate Dev and Prod Specs

Maintain different specs for different environments:

dev-dataspec.yaml:
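
A lightweight dev sketch (illustrative):

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: myalgo-dev
mode: paper
datasets:
  - name: SIP
    symbols:
      tickers: ["AAPL", "MSFT"]
    reference: true
    live:
      ticks: trades
      aggregates: ohlc-1m
    history:
      aggregates:
        - schema: ohlc-1m
          duration: 30d
```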

prod-dataspec.yaml:
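
And a fuller production counterpart:

```yaml
apiVersion: datafye.io/v1
kind: DataSpec
metadata:
  name: myalgo-prod
mode: live
datasets:
  - name: SIP
    symbols:
      universes: [SP500]
    reference: true
    live:
      ticks: trades,quotes
      aggregates: ohlc-1m,vwap-1m
    history:
      ticks:
        - schema: trades
          duration: 30d
      aggregates:
        - schema: ohlc-1m
          duration: 1y
```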

Validation

The CLI validates your DataSpec before provisioning. Validation happens in three passes:

Schema Validation

  • Required fields present (apiVersion, kind, metadata.name, mode, datasets)

  • Valid enum values (mode, dataset names, schema IDs)

  • Correct data types (strings, booleans, arrays)

  • Valid duration formats

Semantic Validation

  • Dataset names are supported (SIP, PrecisionAlpha, TotalView)

  • Schema IDs exist for the specified dataset

  • History schemas match live schemas

  • Ticker formats are valid

  • Universe names are recognized

Logical Validation

  • Mode is appropriate for usage (can't use backtest mode with live subscriptions)

  • History retention is reasonable (not excessive)

  • Symbol wildcards are valid

  • No conflicting configurations

Common validation errors:

  • Unknown dataset name

  • Invalid schema ID for dataset

  • History schema not in live schemas

  • Invalid duration format

  • Invalid mode value

  • Missing required fields

Version Control

Store DataSpecs in git alongside your algo code.

Best practices:

  • Use meaningful names (myalgo-dev.yaml, not data.yaml)

  • Include description in metadata

  • Document why specific schemas/retention are chosen

  • Tag specs with algo versions

  • Review diffs carefully (data changes affect algo behavior)

Cost Optimization

DataSpec choices affect costs:

Data Volume

  • More symbols = more data = higher cost

  • Use specific tickers over "*" when possible

  • Universe memberships are dynamic (SP500 ~= 500 symbols)

Retention Duration

  • Longer retention = more storage = higher cost

  • Match retention to actual backtest needs

  • Tick data is highest volume (use shorter retention)

  • Daily aggregates are lowest volume (can retain longer)

Live Subscriptions

  • Each live schema = real-time subscription cost

  • Subscribe only to schemas you actually use

  • Aggregates generally lower cost than computing from ticks

Cost optimization example:
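
For instance, keeping tick retention short while retaining compact daily aggregates (durations illustrative):

```yaml
history:
  ticks:
    - schema: trades
      duration: 7d     # highest-volume data: the biggest storage saving
  aggregates:
    - schema: ohlc-1d
      duration: 1y     # daily bars are cheap to keep
```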


Last updated: 2025-10-11
