Datasets

Datasets are the foundational building blocks of a Datafye Data Cloud. Understanding what datasets are and how they're structured is essential for configuring your deployment and accessing market data.

What is a Dataset?

A dataset is a collection of market data that provides a specific view of the financial markets. Each dataset represents a particular data product, such as:

  • SIP (Securities Information Processor) - Consolidated US equities data

  • Nasdaq TotalView - Level 2 market depth data with full order book

  • PrecisionAlpha - Alternative data and proprietary signals

  • Alpaca - Market data optimized for retail algorithmic trading

Dataset Structure

Every dataset in Datafye contains four types of services that work together to provide comprehensive market data access:

1. Reference

The reference service provides security master data and static metadata:

  • Symbol definitions and identifiers

  • Corporate actions (splits, dividends)

  • Trading venue information

  • Instrument classifications

This data changes infrequently and provides the foundation for understanding what instruments are available.

2. Live Ticks

The live ticks service delivers real-time tick-level market data:

  • Trades - Every executed transaction with price, size, and timestamp

  • Quotes - Bid and ask prices with depth information

  • Level 1 - Top of book quotes (best bid/ask)

  • Level 2 - Full order book depth (for datasets that support it)

This is the raw, granular market data stream as events occur.
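To make the tick types above concrete, here is a minimal sketch of a trade record as it might arrive on the stream. The field names are illustrative; the actual schemas are defined per asset class and dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class TradeTick:
    """One executed transaction: price, size, and timestamp (illustrative fields)."""
    symbol: str
    price: float
    size: int
    timestamp: datetime

tick = TradeTick(
    symbol="AAPL",
    price=189.25,
    size=100,
    timestamp=datetime(2025, 10, 14, 14, 30, 0, tzinfo=timezone.utc),
)
print(tick.symbol, tick.price, tick.size)
```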

3. Live Aggregates

The live aggregates service provides real-time pre-computed analytics:

  • OHLC Bars - Open, High, Low, Close bars at various intervals (1s, 1m, 1h, 1d)

  • Technical Indicators - Moving averages, RSI, MACD, etc.

  • Volume Profiles - Trading volume analysis across price levels

  • Market Statistics - Real-time market breadth, advances/declines, etc.

These aggregates are computed in real time from the tick data, saving you the overhead of computing them yourself.
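As a sketch of what the aggregates service saves you from doing yourself, an OHLC bar reduces to a simple fold over the trade prices observed in one interval. This illustrates the concept only; it is not Datafye's implementation.

```python
def ohlc(prices):
    """Fold the trade prices of one interval into an (open, high, low, close) bar."""
    if not prices:
        raise ValueError("cannot build a bar from zero trades")
    return prices[0], max(prices), min(prices), prices[-1]

# Trades observed during a single 1s interval:
bar = ohlc([100.0, 100.5, 99.8, 100.2])
print(bar)  # (100.0, 100.5, 99.8, 100.2)
```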

4. Historical

The historical service stores and retrieves historical market data:

  • Historical Ticks - Complete tick history for backtesting and analysis

  • Historical Aggregates - Pre-computed historical bars and indicators

  • Reference Snapshots - Historical security master data for point-in-time analysis

Historical data is essential for backtesting strategies, validating ideas, and understanding long-term market patterns.

Asset Classes

Each dataset belongs to a specific asset class that defines the type of financial instrument it covers:

  • stocks - Equity securities (US equities, international equities)

  • crypto - Cryptocurrencies and digital assets

  • options - Listed options contracts

  • futures - Futures contracts

  • forex - Foreign exchange pairs

The asset class determines:

  • The structure of the data (schemas)

  • Available symbols and universes

  • Trading hours and market sessions

  • Supported data providers

Providers

A single dataset can be sourced from multiple providers. For example, SIP data is available from:

  • Polygon.io - Real-time and historical US equities data

  • Alpaca - Commission-free trading with integrated market data

  • IEX Cloud - Exchange data with transparent pricing

Each provider may offer different features:

  • Latency - How quickly data arrives after market events

  • Coverage - Which symbols and exchanges are included

  • History Depth - How far back historical data extends

  • Cost - Pricing model and subscription tiers

One Provider Per Dataset

While a dataset may have multiple provider options, each deployment uses exactly one provider per dataset. When you provision your Datafye environment, you specify which provider to use for each dataset in your Data Descriptor.

For example, you might choose Polygon as your SIP provider.
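A hypothetical Data Descriptor entry pinning the SIP provider might look like this (the schema and field names are illustrative assumptions, not Datafye's actual format):

```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon   # one provider per dataset per deployment
```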

If you later want to switch from Polygon to Alpaca for SIP data, you update your Data Descriptor and reprovision.

Non-Overlapping Datasets

A Datafye deployment consists of a non-overlapping set of datasets. This means:

  • Each dataset serves a distinct purpose

  • No two datasets in your deployment provide the same data

  • You can deploy multiple datasets that complement each other

Example: Complementary Datasets
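A hypothetical descriptor fragment for such a pairing (field names are illustrative assumptions, not the actual schema):

```yaml
datasets:
  - name: sip              # standard market data: trades, quotes, OHLC
    asset_class: stocks
    provider: polygon
  - name: precisionalpha   # proprietary signals and alternative data
    asset_class: stocks
    provider: precisionalpha
```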

In this deployment:

  • SIP provides standard market data (trades, quotes, OHLC)

  • PrecisionAlpha provides proprietary signals and alternative data

  • These datasets complement each other without overlap

Why Non-Overlapping?

This design ensures:

  • Clear Data Lineage - You always know which dataset a piece of data came from

  • No Conflicts - Different providers can't give conflicting data for the same instrument

  • Simplified Management - Each dataset is managed independently with its own configuration

Symbol Universes

Within each dataset, you specify which symbols you want to access. Datafye supports two ways to define symbol coverage:

Explicit Tickers

List specific symbols you need:
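For illustration, an explicit ticker list in a Data Descriptor might look like this (field names are assumptions):

```yaml
symbols:
  tickers: [AAPL, MSFT, NVDA]
```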

This is useful when you have a focused strategy that trades a small set of instruments.

Symbol Universes

Reference predefined symbol groups:
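A hypothetical universe reference (both the universe names and the field names are illustrative):

```yaml
symbols:
  universes: [sp500, nasdaq100]
```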

Universes automatically include all symbols that belong to that group. As symbols are added or removed from the universe (e.g., index rebalancing), your deployment is updated automatically.

How Datasets Relate to Your Deployment

When you provision a Datafye environment, you specify your dataset requirements in a Data Descriptor. Based on this descriptor:

  1. Datafye provisions the Data Cloud with the appropriate datasets

  2. Services are deployed for each dataset (Reference, Live Ticks, Live Aggregates, Historical)

  3. Data connections are established to the specified providers using your credentials

  4. APIs become available for accessing the data via REST and WebSocket

The REST and WebSocket APIs are organized by asset class (stocks, crypto, etc.) and accept a dataset parameter to route requests to the appropriate dataset service. See Datasets and APIs for details on how this routing works.
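The routing described above can be sketched by building a request URL: the path is organized by asset class, and the dataset parameter selects which dataset service handles the request. The endpoint shape and parameter names below are assumptions for illustration, not the documented API.

```python
from urllib.parse import urlencode

def bars_url(base, asset_class, symbol, dataset, timeframe="1m"):
    """Build a hypothetical REST URL: path by asset class, dataset as a query parameter."""
    query = urlencode({"dataset": dataset, "timeframe": timeframe})
    return f"{base}/{asset_class}/bars/{symbol}?{query}"

url = bars_url("https://api.example.com/v1", "stocks", "AAPL", dataset="sip")
print(url)  # https://api.example.com/v1/stocks/bars/AAPL?dataset=sip&timeframe=1m
```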

Dataset Lifecycle

Provisioning

When you first provision your deployment with a Data Descriptor:

  • Dataset services are created

  • Provider connections are established

  • Historical data backfill begins (if specified)

  • Live data subscriptions start

Updates

You can update your dataset configuration by modifying your Data Descriptor:

  • Add or remove symbols

  • Change history retention periods

  • Enable or disable specific data streams

Changes are applied during the next provisioning operation.

Deprovisioning

When you deprovision a deployment:

  • Live data subscriptions stop

  • Services are shut down

  • Historical data may be retained or deleted based on your configuration

Best Practices

Start with One Dataset

If you're new to Datafye, start with a single dataset to understand how it works:
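For example, a minimal starter descriptor might cover one dataset and a handful of symbols (illustrative schema):

```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon
    symbols:
      tickers: [AAPL, MSFT]   # start small; expand later
```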

Once comfortable, expand to more symbols and additional datasets.

Match Dataset to Strategy Requirements

Choose datasets based on what your strategy needs:

  • High-frequency strategies → TotalView (full order book depth)

  • Daily/swing trading → SIP (consolidated Level 1 data)

  • Alternative signals → PrecisionAlpha (proprietary indicators)

Consider Cost vs. Value

More comprehensive datasets typically cost more. Evaluate:

  • Do you need Level 2 depth, or is Level 1 sufficient?

  • Is real-time data required, or is 15-minute delayed acceptable?

  • How much historical data do you actually need?

Document Dataset Dependencies

Make it clear which datasets your algo requires:
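One lightweight convention (an assumption, not a Datafye feature) is to keep a dependency manifest next to your algo's code:

```yaml
# my-algo/data-requirements.yaml (hypothetical convention)
requires:
  - dataset: sip            # Level 1 trades/quotes and OHLC bars
  - dataset: precisionalpha # proprietary signals
```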

This helps others understand your algo's data requirements.

Next Steps

Now that you understand datasets, see Datasets and APIs to learn how the REST and WebSocket APIs route requests to each dataset's services.


Last updated: 2025-10-14