Data APIs and Datasets

Understanding how datasets relate to the REST and WebSocket APIs is key to effectively using Datafye in own-container scenarios. This page explains how the API structure (organized by asset class) works with your deployment structure (organized by datasets).

Prerequisites: Before reading this page, make sure you understand datasets and their four service types (see Datasets) and the own-container deployment model.

This page focuses specifically on REST and WebSocket APIs used in own-container scenarios.

Note on API URLs: Examples on this page use api.rumi.local for brevity, but actual URLs vary by deployment model (localhost:8080, api.rumi.local, or <user>-<type>-<env>-api.datafye.io). See API Reference for complete details.

The Orthogonal Design

The Datafye REST and WebSocket APIs are organized by asset class (stocks, crypto, etc.), while your deployment contains specific datasets (SIP, TotalView, PrecisionAlpha, etc.). These two structures work together but are orthogonal to each other — meaning they're independent organizational systems that intersect through dataset routing.

API Structure (Asset Class Based)

APIs are organized by asset class and data category:

http://api.rumi.local/datafye-api/v1/<assetClass>/<category>/<path>

Examples:

  • /stocks/live/trades/lasttrade - Get last trade for stocks

  • /stocks/reference/securities - Get security master for stocks

  • /stocks/history/ohlcs - Get historical OHLC bars for stocks

  • /crypto/live/trades/lasttrade - Get last trade for crypto

This structure is consistent regardless of which datasets you've deployed. The API surface remains the same whether you have one dataset or ten.
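Because the path structure is fixed, it can be captured in a small helper. The following is an illustrative sketch, not part of any Datafye SDK; the base URL and endpoint paths come from the examples above.

```python
def build_url(asset_class: str, category: str, path: str,
              base: str = "http://api.rumi.local/datafye-api/v1") -> str:
    """Compose a Datafye REST URL from its three path components."""
    return f"{base}/{asset_class}/{category}/{path}"

# The same pattern covers every example above:
stocks_last_trade = build_url("stocks", "live", "trades/lasttrade")
crypto_last_trade = build_url("crypto", "live", "trades/lasttrade")
```

Note that nothing here mentions datasets: the path identifies *what* you want, and the dataset parameter (covered below) identifies *where* it comes from.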

Deployment Structure (Dataset Based)

Your deployment consists of one or more datasets, where each dataset contains four types of services (see Datasets for details):

  • Reference - Security master and static metadata

  • Live Ticks - Real-time tick-level market data

  • Live Aggregates - Real-time pre-computed analytics

  • Historical - Historical data storage and retrieval

For example, if you deploy the SIP and TotalView datasets for stocks, you have two independent sets of these four services (eight services in total), all served through the same stocks API.

How They Work Together

The API uses the asset class to determine which endpoints are available, and the dataset parameter to route requests to the appropriate dataset service within that asset class.

Dataset Routing

Most API endpoints accept a dataset parameter that bridges these two structures. The API routes your request to the appropriate service within the specified dataset.

Example: Multiple Datasets

Let's say your deployment has both the SIP and Nasdaq TotalView datasets running.

The same endpoint (/stocks/live/trades/lasttrade) can serve data from different datasets. The API internally routes to:

  • SIP's live ticks service when dataset=SIP

  • TotalView's live ticks service when dataset=TotalView
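In practice, the two requests differ only in a query parameter. A minimal sketch assuming the dataset is passed as a `dataset` query parameter as described above; the `symbol` parameter name is an assumption for illustration.

```python
from urllib.parse import urlencode

ENDPOINT = "http://api.rumi.local/datafye-api/v1/stocks/live/trades/lasttrade"

def last_trade_url(symbol: str, dataset: str) -> str:
    """Same endpoint for every dataset; only the dataset parameter changes."""
    return f"{ENDPOINT}?{urlencode({'symbol': symbol, 'dataset': dataset})}"

sip_url = last_trade_url("AAPL", "SIP")        # routed to SIP's live ticks service
tv_url = last_trade_url("AAPL", "TotalView")   # routed to TotalView's live ticks service
```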

Example: Different Categories

The same routing applies across all API categories: reference, live, and historical endpoints all accept the dataset parameter.

Why This Design?

This orthogonal design provides several benefits:

1. Consistent API Surface

The API endpoints remain the same regardless of which datasets you've deployed. Your code structure doesn't change when you:

  • Add new datasets to your deployment

  • Switch providers for a dataset (e.g., Polygon → Alpaca for SIP)

  • Deploy different dataset combinations for dev vs prod

2. Dataset Flexibility

You can switch between datasets without restructuring your code; only the dataset parameter in your requests changes.
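One way to make that concrete is a thin wrapper where the dataset is supplied once, at construction time. This is an illustrative sketch, not a Datafye client library; the `symbol` query parameter is an assumption.

```python
class MarketData:
    """Thin illustrative wrapper: the dataset is the only thing that varies."""

    def __init__(self, dataset: str,
                 base: str = "http://api.rumi.local/datafye-api/v1"):
        self.dataset = dataset
        self.base = base

    def last_trade_url(self, symbol: str) -> str:
        return (f"{self.base}/stocks/live/trades/lasttrade"
                f"?symbol={symbol}&dataset={self.dataset}")

sip = MarketData("SIP")
totalview = MarketData("TotalView")
# Identical call sites everywhere else; only the constructor argument differs.
```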

3. Multi-Dataset Support

Query different datasets for the same symbol to:

  • Compare data quality across providers

  • Implement failover logic (primary dataset → backup dataset)

  • Use specialized datasets for specific symbols (e.g., TotalView for high-frequency, SIP for others)
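The failover pattern in particular is easy to express once datasets are just strings. A sketch with the HTTP call injected as a callable so the routine stays transport-agnostic; `fetch` and its signature are our own convention, not a Datafye API.

```python
def last_trade_with_failover(symbol, datasets, fetch):
    """Try each dataset in priority order. `fetch(symbol, dataset)` is any
    callable that returns a trade record or raises on failure."""
    last_err = None
    for ds in datasets:
        try:
            return ds, fetch(symbol, ds)
        except Exception as err:  # in real code, catch your client's error type
            last_err = err
    raise RuntimeError(f"all datasets failed for {symbol!r}") from last_err
```

For example, `last_trade_with_failover("AAPL", ["SIP", "TotalView"], fetch)` tries SIP first and falls back to TotalView only if SIP raises.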

4. Provider Independence

The API abstracts away which provider serves a dataset. You specify:

  • In your DataSpec: SIP dataset provided by Polygon

  • In your API calls: dataset=SIP

Your code doesn't need to know that SIP comes from Polygon. If you later switch to Alpaca for SIP, your API calls remain unchanged.

Default Dataset Behavior

If you omit the dataset parameter, the API applies default logic:

Single Dataset Deployed

If only one dataset for that asset class is deployed, it's used automatically.

Multiple Datasets Deployed

If multiple datasets are deployed for that asset class, the API returns an error asking you to specify which dataset to use. You must pass the dataset parameter explicitly.
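The default logic described above can be summarized in a few lines. This is an illustrative mirror of the behavior, not the actual server implementation.

```python
def resolve_dataset(requested, deployed):
    """Mimic the documented default-dataset logic for one asset class.

    `requested` is the dataset parameter (or None if omitted);
    `deployed` is the list of datasets deployed for the asset class.
    """
    if requested is not None:
        if requested not in deployed:
            raise LookupError(f"dataset {requested!r} is not deployed")
        return requested
    if len(deployed) == 1:
        return deployed[0]  # single dataset: used automatically
    raise ValueError("multiple datasets deployed; specify the dataset explicitly")
```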

Best Practices

1. Always Specify Dataset in Multi-Dataset Deployments

Even if you primarily use one dataset, specify it explicitly; the implicit default breaks as soon as a second dataset is added to the deployment.

2. Parameterize Dataset Selection

Make dataset selection configurable rather than hard-coding dataset names throughout your code.
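One common approach is to read the dataset name from the environment. A sketch under our own naming convention (`DATAFYE_<ASSETCLASS>_DATASET` is an assumption, not a Datafye standard):

```python
import os
from typing import Optional

def dataset_from_env(asset_class: str, default: Optional[str] = None) -> str:
    """Read the dataset for an asset class from an environment variable,
    e.g. DATAFYE_STOCKS_DATASET=SIP."""
    name = f"DATAFYE_{asset_class.upper()}_DATASET"
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"set {name} or pass a default")
    return value
```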

3. Document Dataset Dependencies

Document which datasets your algo requires so the deployment can be provisioned to match.
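Beyond prose documentation, the requirement can also be declared in code and checked at startup. A sketch using dataset names from this page; the declaration format is our own, not a Datafye convention.

```python
# Declared once, next to the algo's entry point.
REQUIRED_DATASETS = {
    "stocks": ["SIP", "TotalView"],
}

def check_datasets(required: dict, deployed: dict) -> None:
    """Fail fast at startup if the deployment lacks a required dataset."""
    missing = [(asset, ds)
               for asset, names in required.items()
               for ds in names
               if ds not in deployed.get(asset, [])]
    if missing:
        raise RuntimeError(f"deployment is missing datasets: {missing}")
```

Failing fast at startup is much easier to diagnose than a routing error on the first API call.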

4. Handle Dataset-Specific Schemas

Different datasets may have different schemas for the same data type, so don't assume field names are identical across datasets.
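A common way to cope is to normalize each dataset's records into one shape at the boundary. The field names below are hypothetical; check the actual schemas of the datasets in your deployment.

```python
# Hypothetical per-dataset field names for a trade record.
FIELD_MAP = {
    "SIP":       {"price": "price", "size": "size"},
    "TotalView": {"price": "px",    "size": "qty"},
}

def normalize_trade(raw: dict, dataset: str) -> dict:
    """Map a dataset-specific trade record onto one common shape."""
    fields = FIELD_MAP[dataset]
    return {common: raw[specific] for common, specific in fields.items()}
```

With this in place, the rest of your algo only ever sees the normalized shape.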

Real-World Scenarios

Scenario 1: Development vs Production

Use different datasets for different environments.
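For instance, a consolidated feed may be enough while developing, with the full-depth feed reserved for production. The mapping below is purely illustrative; which dataset suits which environment depends on your deployment and costs.

```python
# Illustrative environment-to-dataset mapping.
DATASET_BY_ENV = {
    "dev":  "SIP",        # consolidated feed while developing
    "prod": "TotalView",  # full-depth feed in production
}

def dataset_for(env: str) -> str:
    """Pick the dataset for the current environment."""
    return DATASET_BY_ENV[env]
```

The rest of the code stays identical across environments; only the configured dataset name changes.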

Scenario 2: Dataset Comparison

Compare data quality for the same symbols across datasets.
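Since both datasets answer the same endpoint, a comparison is just two calls with different dataset parameters. A sketch with the HTTP call injected as a callable; `fetch` and the `price` field name are assumptions for illustration.

```python
def compare_last_trades(symbol, datasets, fetch):
    """Fetch the same symbol from every dataset and report the price spread.
    `fetch(symbol, dataset)` is any callable returning a dict with 'price'."""
    prices = {ds: fetch(symbol, ds)["price"] for ds in datasets}
    spread = max(prices.values()) - min(prices.values())
    return prices, spread
```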

Scenario 3: Specialized Dataset Usage

Use different datasets for different symbols.
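Per-symbol routing reduces to a lookup that picks the dataset name before the request is built. The symbol buckets below are hypothetical; in practice this would come from configuration.

```python
# Hypothetical set of symbols that warrant the full-depth TotalView feed.
HIGH_FREQUENCY_SYMBOLS = {"AAPL", "TSLA"}

def dataset_for_symbol(symbol: str) -> str:
    """Route depth-sensitive symbols to TotalView, everything else to SIP."""
    return "TotalView" if symbol in HIGH_FREQUENCY_SYMBOLS else "SIP"
```

For example, `dataset_for_symbol("AAPL")` selects TotalView while less active names fall back to SIP, without any change to the endpoint paths involved.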


Last updated: 2025-10-14
