Datasets
Datasets are the foundational building blocks of a Datafye Data Cloud. Understanding what datasets are and how they're structured is essential for configuring your deployment and accessing market data.
What is a Dataset?
A dataset is a collection of market data that provides a specific view of the financial markets. Each dataset represents a particular data product, such as:
SIP (Securities Information Processor) - Consolidated US equities data
Nasdaq TotalView - Level 2 market depth data with full order book
PrecisionAlpha - Alternative data and proprietary signals
Alpaca - Market data optimized for retail algorithmic trading
Dataset Structure
Every dataset in Datafye contains four types of services that work together to provide comprehensive market data access:
1. Reference
The reference service provides security master data and static metadata:
Symbol definitions and identifiers
Corporate actions (splits, dividends)
Trading venue information
Instrument classifications
This data changes infrequently and provides the foundation for understanding what instruments are available.
2. Live Ticks
The live ticks service delivers real-time tick-level market data:
Trades - Every executed transaction with price, size, and timestamp
Quotes - Bid and ask prices with depth information
Level 1 - Top of book quotes (best bid/ask)
Level 2 - Full order book depth (for datasets that support it)
This is the raw, granular market data stream as events occur.
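For a rough sense of the shape of this stream, here is a hypothetical trade tick (field names are assumptions, not the actual schema):

```yaml
# Hypothetical trade tick; field names are illustrative, not the actual wire format
type: trade
symbol: AAPL
price: 187.42                              # execution price
size: 100                                  # shares traded
timestamp: "2025-10-14T14:30:00.123456Z"   # event time
```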
3. Live Aggregates
The live aggregates service provides real-time pre-computed analytics:
OHLC Bars - Open, High, Low, Close bars at various intervals (1s, 1m, 1h, 1d)
Technical Indicators - Moving averages, RSI, MACD, etc.
Volume Profiles - Trading volume analysis across price levels
Market Statistics - Real-time market breadth, advances/declines, etc.
These aggregates are computed in real time from the tick data, saving you the overhead of computing them yourself.
4. Historical
The historical service stores and retrieves historical market data:
Historical Ticks - Complete tick history for backtesting and analysis
Historical Aggregates - Pre-computed historical bars and indicators
Reference Snapshots - Historical security master data for point-in-time analysis
Historical data is essential for backtesting strategies, validating ideas, and understanding long-term market patterns.
Asset Classes
Each dataset belongs to a specific asset class that defines the type of financial instrument it covers:
stocks - Equity securities (US equities, international equities)
crypto - Cryptocurrencies and digital assets
options - Listed options contracts
futures - Futures contracts
forex - Foreign exchange pairs
The asset class determines:
The structure of the data (schemas)
Available symbols and universes
Trading hours and market sessions
Supported data providers
Providers
A single dataset can be sourced from multiple providers. For example, SIP data is available from:
Polygon.io - Real-time and historical US equities data
Alpaca - Commission-free trading with integrated market data
IEX Cloud - Exchange data with transparent pricing
Each provider may offer different features:
Latency - How quickly data arrives after market events
Coverage - Which symbols and exchanges are included
History Depth - How far back historical data extends
Cost - Pricing model and subscription tiers
One Provider Per Dataset
While a dataset may have multiple provider options, each deployment can only use one provider per dataset. When you provision your Datafye environment, you specify which provider to use for each dataset in your Data Descriptor.
For example, you might choose Polygon as your SIP provider. A minimal sketch of how that could look in a Data Descriptor (field names are illustrative):
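```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon   # could also be alpaca or iex
```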
If you later want to switch from Polygon to Alpaca for SIP data, you update your Data Descriptor and reprovision.
Non-Overlapping Datasets
A Datafye deployment consists of a non-overlapping set of datasets. This means:
Each dataset serves a distinct purpose
No two datasets in your deployment provide the same data
You can deploy multiple datasets that complement each other
Example: Complementary Datasets
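A sketch of a Data Descriptor combining two non-overlapping datasets (schema is illustrative):

```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon
  - name: precision_alpha
    asset_class: stocks
    provider: precision_alpha   # provider name is an assumption
```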
In this deployment:
SIP provides standard market data (trades, quotes, OHLC)
PrecisionAlpha provides proprietary signals and alternative data
These datasets complement each other without overlap
Why Non-Overlapping?
This design ensures:
Clear Data Lineage - You always know which dataset a piece of data came from
No Conflicts - Different providers can't give conflicting data for the same instrument
Simplified Management - Each dataset is managed independently with its own configuration
Symbol Universes
Within each dataset, you specify which symbols you want to access. Datafye supports two ways to define symbol coverage:
Explicit Tickers
List the specific symbols you need. For example (field names in this sketch are illustrative):
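```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon
    symbols:
      tickers: [AAPL, MSFT, GOOGL, AMZN]
```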
This is useful when you have a focused strategy that trades a small set of instruments.
Symbol Universes
Reference predefined symbol groups. For example (universe identifiers in this sketch are illustrative):
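```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon
    symbols:
      universes: [sp500, nasdaq100]
```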
Universes automatically include all symbols that belong to that group. As symbols are added or removed from the universe (e.g., index rebalancing), your deployment is updated automatically.
How Datasets Relate to Your Deployment
When you provision a Datafye environment, you specify your dataset requirements in a Data Descriptor. Based on this descriptor:
Datafye provisions the Data Cloud with the appropriate datasets
Services are deployed for each dataset (Reference, Live Ticks, Live Aggregates, Historical)
Data connections are established to the specified providers using your credentials
APIs become available for accessing the data via REST and WebSocket
The REST and WebSocket APIs are organized by asset class (stocks, crypto, etc.) and accept a dataset parameter to route requests to the appropriate dataset service. See Data APIs and Datasets for details on how this routing works.
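For illustration, the same stocks endpoint might serve two different datasets purely via the dataset parameter (these paths are hypothetical):

```
GET /v1/stocks/trades?dataset=sip&symbols=AAPL
GET /v1/stocks/trades?dataset=totalview&symbols=AAPL
```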
Dataset Lifecycle
Provisioning
When you first provision your deployment with a Data Descriptor:
Dataset services are created
Provider connections are established
Historical data backfill begins (if specified)
Live data subscriptions start
Updates
You can update your dataset configuration by modifying your Data Descriptor:
Add or remove symbols
Change history retention periods
Enable or disable specific data streams
Changes are applied during the next provisioning operation.
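For example, extending the retention period for historical data might be a one-line change in the descriptor (fields are illustrative):

```yaml
datasets:
  - name: sip
    provider: polygon
    historical:
      retention: 5y   # e.g., raised from 1y
```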
Deprovisioning
When you deprovision a deployment:
Live data subscriptions stop
Services are shut down
Historical data may be retained or deleted based on your configuration
Best Practices
Start with One Dataset
If you're new to Datafye, start with a single dataset and a small set of symbols to understand how it works. A minimal sketch (schema is illustrative):
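```yaml
datasets:
  - name: sip
    asset_class: stocks
    provider: polygon
    symbols:
      tickers: [AAPL, SPY]
```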
Once comfortable, expand to more symbols and additional datasets.
Match Dataset to Strategy Requirements
Choose datasets based on what your strategy needs:
High-frequency strategies → TotalView (full order book depth)
Daily/swing trading → SIP (consolidated Level 1 data)
Alternative signals → PrecisionAlpha (proprietary indicators)
Consider Cost vs. Value
More comprehensive datasets typically cost more. Evaluate:
Do you need Level 2 depth, or is Level 1 sufficient?
Is real-time data required, or is 15-minute delayed data acceptable?
How much historical data do you actually need?
Document Dataset Dependencies
Make it clear which datasets your algo requires, for example in its configuration or README (the format below is illustrative):
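```yaml
# my-algo: data requirements (illustrative format; service names are assumptions)
requires:
  - dataset: sip
    services: [live_ticks, historical]
  - dataset: precision_alpha
    services: [live_aggregates]
```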
This helps others understand your algo's data requirements.
Next Steps
Now that you understand datasets, learn about:
Data Access Modes - The two API mechanisms (REST/WebSocket vs SDK) and data delivery modes
Data APIs and Datasets - How REST/WebSocket APIs relate to datasets
Data Descriptors - How to configure datasets in your deployment
Last updated: 2025-10-14