Datasets

Datasets are the foundational building blocks of a Datafye Data Cloud. Understanding what datasets are and how they're structured is essential for configuring your deployment and accessing market data.

What is a Dataset?

A dataset is a collection of market data that provides a specific view of the financial markets. Each dataset represents a particular data product, such as:

  • SIP (Securities Information Processor) - Consolidated US equities data

  • Nasdaq TotalView - Level 2 market depth data with full order book

  • PrecisionAlpha - Alternative data and proprietary signals

  • Alpaca - Market data optimized for retail algorithmic trading

Dataset Structure

Every dataset in Datafye contains four types of services that work together to provide comprehensive market data access:

1. Reference

The reference service provides security master data and static metadata:

  • Symbol definitions and identifiers

  • Corporate actions (splits, dividends)

  • Trading venue information

  • Instrument classifications

This data changes infrequently and provides the foundation for understanding what instruments are available.

2. Live Ticks

The live ticks service delivers real-time tick-level market data:

  • Trades - Every executed transaction with price, size, and timestamp

  • Quotes - Bid and ask prices with depth information

  • Level 1 - Top of book quotes (best bid/ask)

  • Level 2 - Full order book depth (for datasets that support it)

This is the raw, granular market data stream as events occur.
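To make the tick types above concrete, here is a minimal sketch of a trade record as it might arrive on the stream. The field names are illustrative; the actual schemas are defined per asset class and dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TradeTick:
    """One executed transaction: price, size, and timestamp (illustrative fields)."""
    symbol: str
    price: float
    size: int
    timestamp: datetime

tick = TradeTick(
    symbol="AAPL",
    price=189.25,
    size=100,
    timestamp=datetime(2025, 10, 14, 14, 30, 0, tzinfo=timezone.utc),
)
print(tick.symbol, tick.price, tick.size)
```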

3. Live Aggregates

The live aggregates service provides real-time pre-computed analytics:

  • OHLC Bars - Open, High, Low, Close bars at various intervals (1s, 1m, 1h, 1d)

  • Technical Indicators - Moving averages, RSI, MACD, etc.

  • Volume Profiles - Trading volume analysis across price levels

  • Market Statistics - Real-time market breadth, advances/declines, etc.

These aggregates are computed in real time from the tick data, saving you the overhead of computing them yourself.
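As a sketch of what the aggregates service saves you from doing yourself, an OHLC bar reduces to a simple fold over the trade prices observed in one interval. This illustrates the concept only; it is not Datafye's implementation.

```python
def ohlc(prices):
    """Fold the trade prices of one interval into an (open, high, low, close) bar."""
    if not prices:
        raise ValueError("cannot build a bar from zero trades")
    return prices[0], max(prices), min(prices), prices[-1]

# Trades observed during a single 1s interval:
bar = ohlc([100.0, 100.5, 99.8, 100.2])
print(bar)  # (100.0, 100.5, 99.8, 100.2)
```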

4. Historical

The historical service stores and retrieves historical market data:

  • Historical Ticks - Complete tick history for backtesting and analysis

  • Historical Aggregates - Pre-computed historical bars and indicators

  • Reference Snapshots - Historical security master data for point-in-time analysis

Historical data is essential for backtesting strategies, validating ideas, and understanding long-term market patterns.

Asset Classes

Each dataset belongs to a specific asset class that defines the type of financial instrument it covers:

  • stocks - Equity securities (US equities, international equities)

  • crypto - Cryptocurrencies and digital assets

  • options - Listed options contracts

  • futures - Futures contracts

  • forex - Foreign exchange pairs

The asset class determines:

  • The structure of the data (schemas)

  • Available symbols and universes

  • Trading hours and market sessions

  • Supported data providers

Providers

A single dataset can be sourced from multiple providers. For example, SIP data is available from:

  • Polygon.io - Real-time and historical US equities data

  • Alpaca - Commission-free trading with integrated market data

  • IEX Cloud - Exchange data with transparent pricing

Each provider may offer different features:

  • Latency - How quickly data arrives after market events

  • Coverage - Which symbols and exchanges are included

  • History Depth - How far back historical data extends

  • Cost - Pricing model and subscription tiers

One Provider Per Dataset

While a dataset may have multiple provider options, each deployment uses exactly one provider per dataset. When you provision your Datafye environment, you specify which provider to use for each dataset in your Data Descriptor.

For example, you might choose Polygon as your SIP provider.
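A hypothetical Data Descriptor entry pinning the SIP provider might look like this (the schema and field names are illustrative assumptions, not Datafye's actual format):

```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon   # one provider per dataset per deployment
```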

If you later want to switch from Polygon to Alpaca for SIP data, you update your Data Descriptor and reprovision.

Non-Overlapping Datasets

A Datafye deployment consists of a non-overlapping set of datasets. This means:

  • Each dataset serves a distinct purpose

  • No two datasets in your deployment provide the same data

  • You can deploy multiple datasets that complement each other

Example: Complementary Datasets
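A hypothetical descriptor fragment for such a pairing (field names are illustrative assumptions, not the actual schema):

```yaml
datasets:
  - name: sip              # standard market data: trades, quotes, OHLC
    asset_class: stocks
    provider: polygon
  - name: precisionalpha   # proprietary signals and alternative data
    asset_class: stocks
    provider: precisionalpha
```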

In this deployment:

  • SIP provides standard market data (trades, quotes, OHLC)

  • PrecisionAlpha provides proprietary signals and alternative data

  • These datasets complement each other without overlap

Why Non-Overlapping?

This design ensures:

  • Clear Data Lineage - You always know which dataset a piece of data came from

  • No Conflicts - Different providers can't give conflicting data for the same instrument

  • Simplified Management - Each dataset is managed independently with its own configuration

Symbol Universes

Within each dataset, you specify which symbols you want to access. Datafye supports two ways to define symbol coverage:

Explicit Tickers

List specific symbols you need:
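For illustration, an explicit ticker list in a Data Descriptor might look like this (field names are assumptions):

```yaml
symbols:
  tickers: [AAPL, MSFT, NVDA]
```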

This is useful when you have a focused strategy that trades a small set of instruments.

Symbol Universes

Reference predefined symbol groups:
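A hypothetical universe reference (both the universe names and the field names are illustrative):

```yaml
symbols:
  universes: [sp500, nasdaq100]
```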

Universes automatically include all symbols that belong to that group. As symbols are added or removed from the universe (e.g., index rebalancing), your deployment is updated automatically.

How Datasets Relate to Your Deployment

When you provision a Datafye environment, you specify your dataset requirements in a Data Descriptor. Based on this descriptor:

  1. Datafye provisions the Data Cloud with the appropriate datasets

  2. Services are deployed for each dataset (Reference, Live Ticks, Live Aggregates, Historical)

  3. Data connections are established to the specified providers using your credentials

  4. APIs become available for accessing the data via REST and WebSocket

The REST and WebSocket APIs are organized by asset class (stocks, crypto, etc.) and accept a dataset parameter to route requests to the appropriate dataset service. See Datasets and APIs for details on how this routing works.
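The routing described above can be sketched by building a request URL: the path is organized by asset class, and the dataset parameter selects which dataset service handles the request. The endpoint shape and parameter names below are assumptions for illustration, not the documented API.

```python
from urllib.parse import urlencode

def bars_url(base, asset_class, symbol, dataset, timeframe="1m"):
    """Build a hypothetical REST URL: path by asset class, dataset as a query parameter."""
    query = urlencode({"dataset": dataset, "timeframe": timeframe})
    return f"{base}/{asset_class}/bars/{symbol}?{query}"

url = bars_url("https://api.example.com/v1", "stocks", "AAPL", dataset="sip")
print(url)  # https://api.example.com/v1/stocks/bars/AAPL?dataset=sip&timeframe=1m
```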

Dataset Lifecycle

Provisioning

When you first provision your deployment with a Data Descriptor:

  • Dataset services are created

  • Provider connections are established

  • Historical data backfill begins (if specified)

  • Live data subscriptions start

Updates

You can update your dataset configuration by modifying your Data Descriptor:

  • Add or remove symbols

  • Change history retention periods

  • Enable or disable specific data streams

Changes are applied during the next provisioning operation.

Deprovisioning

When you deprovision a deployment:

  • Live data subscriptions stop

  • Services are shut down

  • Historical data may be retained or deleted based on your configuration

Best Practices

Start with One Dataset

If you're new to Datafye, start with a single dataset to understand how it works:
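For example, a minimal starter descriptor might cover one dataset and a handful of symbols (illustrative schema):

```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon
    symbols:
      tickers: [AAPL, MSFT]   # start small; expand later
```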

Once comfortable, expand to more symbols and additional datasets.

Match Dataset to Strategy Requirements

Choose datasets based on what your strategy needs:

  • High-frequency strategies → TotalView (full order book depth)

  • Daily/swing trading → SIP (consolidated Level 1 data)

  • Alternative signals → PrecisionAlpha (proprietary indicators)

Consider Cost vs. Value

More comprehensive datasets typically cost more. Evaluate:

  • Do you need Level 2 depth, or is Level 1 sufficient?

  • Is real-time data required, or is 15-minute delayed acceptable?

  • How much historical data do you actually need?

Document Dataset Dependencies

Make it clear which datasets your algo requires:
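One lightweight convention (an assumption, not a Datafye feature) is to keep a dependency manifest next to your algo's code:

```yaml
# my-algo/data-requirements.yaml (hypothetical convention)
requires:
  - dataset: sip            # Level 1 trades/quotes and OHLC bars
  - dataset: precisionalpha # proprietary signals
```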

This helps others understand your algo's data requirements.

Next Steps

Now that you understand datasets, see Datasets and APIs to learn how the REST and WebSocket APIs route requests to each dataset's services.


Last updated: 2025-10-14