Data APIs and Datasets

Understanding how datasets relate to the REST and WebSocket APIs is key to effectively using Datafye in own-container scenarios. This page explains how the API structure (organized by asset class) works with your deployment structure (organized by datasets).

Prerequisites: Before reading this page, make sure you understand datasets and their four service types (see Datasets) and the own-container deployment model.

This page focuses specifically on REST and WebSocket APIs used in own-container scenarios.

Note on API URLs: Examples on this page use api.rumi.local for brevity, but actual URLs vary by deployment model (localhost:8080, api.rumi.local, or <user>-<type>-<env>-api.datafye.io). See API Reference for complete details.

The Orthogonal Design

The Datafye REST and WebSocket APIs are organized by asset class (stocks, crypto, etc.), while your deployment contains specific datasets (SIP, TotalView, PrecisionAlpha, etc.). These two structures work together but are orthogonal to each other — meaning they're independent organizational systems that intersect through dataset routing.

API Structure (Asset Class Based)

APIs are organized by asset class and data category:

http://api.rumi.local/datafye-api/v1/<assetClass>/<category>/<path>

Examples:

  • /stocks/live/trades/lasttrade - Get last trade for stocks

  • /stocks/reference/securities - Get security master for stocks

  • /stocks/history/ohlcs - Get historical OHLC bars for stocks

  • /crypto/live/trades/lasttrade - Get last trade for crypto

This structure is consistent regardless of which datasets you've deployed. The API surface remains the same whether you have one dataset or ten.
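Because the path structure is fixed, it can be captured in a small helper. The following is an illustrative sketch, not part of any Datafye SDK; the base URL and endpoint paths come from the examples above.

```python
def build_url(asset_class: str, category: str, path: str,
              base: str = "http://api.rumi.local/datafye-api/v1") -> str:
    """Compose a Datafye REST URL from its three path components."""
    return f"{base}/{asset_class}/{category}/{path}"

# The same pattern covers every example above:
stocks_last_trade = build_url("stocks", "live", "trades/lasttrade")
crypto_last_trade = build_url("crypto", "live", "trades/lasttrade")
```

Note that nothing here mentions datasets: the path identifies *what* you want, and the dataset parameter (covered below) identifies *where* it comes from.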

Deployment Structure (Dataset Based)

Your deployment consists of one or more datasets, where each dataset contains four types of services (see Datasets for details):

  • Reference - Security master and static metadata

  • Live Ticks - Real-time tick-level market data

  • Live Aggregates - Real-time pre-computed analytics

  • Historical - Historical data storage and retrieval

For example, if you deploy the SIP and TotalView datasets for stocks, you have two independent sets of these four services (eight services in total), all served through the same stocks API.

How They Work Together

The API uses the asset class to determine which endpoints are available, and the dataset parameter to route requests to the appropriate dataset service within that asset class.

Dataset Routing

Most API endpoints accept a dataset parameter that bridges these two structures. The API routes your request to the appropriate service within the specified dataset.

Example: Multiple Datasets

Let's say your deployment has both the SIP and Nasdaq TotalView datasets running.

The same endpoint (/stocks/live/trades/lasttrade) can serve data from different datasets. The API internally routes to:

  • SIP's live ticks service when dataset=SIP

  • TotalView's live ticks service when dataset=TotalView
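In practice, the two requests differ only in a query parameter. A minimal sketch assuming the dataset is passed as a `dataset` query parameter as described above; the `symbol` parameter name is an assumption for illustration.

```python
from urllib.parse import urlencode

ENDPOINT = "http://api.rumi.local/datafye-api/v1/stocks/live/trades/lasttrade"

def last_trade_url(symbol: str, dataset: str) -> str:
    """Same endpoint for every dataset; only the dataset parameter changes."""
    return f"{ENDPOINT}?{urlencode({'symbol': symbol, 'dataset': dataset})}"

sip_url = last_trade_url("AAPL", "SIP")        # routed to SIP's live ticks service
tv_url = last_trade_url("AAPL", "TotalView")   # routed to TotalView's live ticks service
```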

Example: Different Categories

The same routing applies across all API categories: reference, live, and historical endpoints all accept the dataset parameter.

Why This Design?

This orthogonal design provides several benefits:

1. Consistent API Surface

The API endpoints remain the same regardless of which datasets you've deployed. Your code structure doesn't change when you:

  • Add new datasets to your deployment

  • Switch providers for a dataset (e.g., Polygon → Alpaca for SIP)

  • Deploy different dataset combinations for dev vs prod

2. Dataset Flexibility

You can switch between datasets without restructuring your code; only the dataset parameter in your requests changes.
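One way to make that concrete is a thin wrapper where the dataset is supplied once, at construction time. This is an illustrative sketch, not a Datafye client library; the `symbol` query parameter is an assumption.

```python
class MarketData:
    """Thin illustrative wrapper: the dataset is the only thing that varies."""

    def __init__(self, dataset: str,
                 base: str = "http://api.rumi.local/datafye-api/v1"):
        self.dataset = dataset
        self.base = base

    def last_trade_url(self, symbol: str) -> str:
        return (f"{self.base}/stocks/live/trades/lasttrade"
                f"?symbol={symbol}&dataset={self.dataset}")

sip = MarketData("SIP")
totalview = MarketData("TotalView")
# Identical call sites everywhere else; only the constructor argument differs.
```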

3. Multi-Dataset Support

Query different datasets for the same symbol to:

  • Compare data quality across providers

  • Implement failover logic (primary dataset → backup dataset)

  • Use specialized datasets for specific symbols (e.g., TotalView for high-frequency, SIP for others)
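The failover pattern in particular is easy to express once datasets are just strings. A sketch with the HTTP call injected as a callable so the routine stays transport-agnostic; `fetch` and its signature are our own convention, not a Datafye API.

```python
def last_trade_with_failover(symbol, datasets, fetch):
    """Try each dataset in priority order. `fetch(symbol, dataset)` is any
    callable that returns a trade record or raises on failure."""
    last_err = None
    for ds in datasets:
        try:
            return ds, fetch(symbol, ds)
        except Exception as err:  # in real code, catch your client's error type
            last_err = err
    raise RuntimeError(f"all datasets failed for {symbol!r}") from last_err
```

For example, `last_trade_with_failover("AAPL", ["SIP", "TotalView"], fetch)` tries SIP first and falls back to TotalView only if SIP raises.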

4. Provider Independence

The API abstracts away which provider serves a dataset. You specify:

  • In your DataSpec: SIP dataset provided by Polygon

  • In your API calls: dataset=SIP

Your code doesn't need to know that SIP comes from Polygon. If you later switch to Alpaca for SIP, your API calls remain unchanged.

Default Dataset Behavior

If you omit the dataset parameter, the API applies default logic:

Single Dataset Deployed

If only one dataset for that asset class is deployed, it's used automatically.

Multiple Datasets Deployed

If multiple datasets are deployed for that asset class, the API returns an error asking you to specify which dataset to use. You must pass the dataset parameter explicitly.
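The default logic described above can be summarized in a few lines. This is an illustrative mirror of the behavior, not the actual server implementation.

```python
def resolve_dataset(requested, deployed):
    """Mimic the documented default-dataset logic for one asset class.

    `requested` is the dataset parameter (or None if omitted);
    `deployed` is the list of datasets deployed for the asset class.
    """
    if requested is not None:
        if requested not in deployed:
            raise LookupError(f"dataset {requested!r} is not deployed")
        return requested
    if len(deployed) == 1:
        return deployed[0]  # single dataset: used automatically
    raise ValueError("multiple datasets deployed; specify the dataset explicitly")
```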

Best Practices

1. Always Specify Dataset in Multi-Dataset Deployments

Even if you primarily use one dataset, specify it explicitly; the implicit default breaks as soon as a second dataset is added to the deployment.

2. Parameterize Dataset Selection

Make dataset selection configurable rather than hard-coding dataset names throughout your code.
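One common approach is to read the dataset name from the environment. A sketch under our own naming convention (`DATAFYE_<ASSETCLASS>_DATASET` is an assumption, not a Datafye standard):

```python
import os
from typing import Optional

def dataset_from_env(asset_class: str, default: Optional[str] = None) -> str:
    """Read the dataset for an asset class from an environment variable,
    e.g. DATAFYE_STOCKS_DATASET=SIP."""
    name = f"DATAFYE_{asset_class.upper()}_DATASET"
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"set {name} or pass a default")
    return value
```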

3. Document Dataset Dependencies

Document which datasets your algo requires so the deployment can be provisioned to match.
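Beyond prose documentation, the requirement can also be declared in code and checked at startup. A sketch using dataset names from this page; the declaration format is our own, not a Datafye convention.

```python
# Declared once, next to the algo's entry point.
REQUIRED_DATASETS = {
    "stocks": ["SIP", "TotalView"],
}

def check_datasets(required: dict, deployed: dict) -> None:
    """Fail fast at startup if the deployment lacks a required dataset."""
    missing = [(asset, ds)
               for asset, names in required.items()
               for ds in names
               if ds not in deployed.get(asset, [])]
    if missing:
        raise RuntimeError(f"deployment is missing datasets: {missing}")
```

Failing fast at startup is much easier to diagnose than a routing error on the first API call.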

4. Handle Dataset-Specific Schemas

Different datasets may have different schemas for the same data type, so don't assume field names are identical across datasets.
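A common way to cope is to normalize each dataset's records into one shape at the boundary. The field names below are hypothetical; check the actual schemas of the datasets in your deployment.

```python
# Hypothetical per-dataset field names for a trade record.
FIELD_MAP = {
    "SIP":       {"price": "price", "size": "size"},
    "TotalView": {"price": "px",    "size": "qty"},
}

def normalize_trade(raw: dict, dataset: str) -> dict:
    """Map a dataset-specific trade record onto one common shape."""
    fields = FIELD_MAP[dataset]
    return {common: raw[specific] for common, specific in fields.items()}
```

With this in place, the rest of your algo only ever sees the normalized shape.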

Real-World Scenarios

Scenario 1: Development vs Production

Use different datasets for different environments.
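For instance, a consolidated feed may be enough while developing, with the full-depth feed reserved for production. The mapping below is purely illustrative; which dataset suits which environment depends on your deployment and costs.

```python
# Illustrative environment-to-dataset mapping.
DATASET_BY_ENV = {
    "dev":  "SIP",        # consolidated feed while developing
    "prod": "TotalView",  # full-depth feed in production
}

def dataset_for(env: str) -> str:
    """Pick the dataset for the current environment."""
    return DATASET_BY_ENV[env]
```

The rest of the code stays identical across environments; only the configured dataset name changes.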

Scenario 2: Dataset Comparison

Compare data quality for the same symbols across datasets.
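Since both datasets answer the same endpoint, a comparison is just two calls with different dataset parameters. A sketch with the HTTP call injected as a callable; `fetch` and the `price` field name are assumptions for illustration.

```python
def compare_last_trades(symbol, datasets, fetch):
    """Fetch the same symbol from every dataset and report the price spread.
    `fetch(symbol, dataset)` is any callable returning a dict with 'price'."""
    prices = {ds: fetch(symbol, ds)["price"] for ds in datasets}
    spread = max(prices.values()) - min(prices.values())
    return prices, spread
```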

Scenario 3: Specialized Dataset Usage

Use different datasets for different symbols.
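Per-symbol routing reduces to a lookup that picks the dataset name before the request is built. The symbol buckets below are hypothetical; in practice this would come from configuration.

```python
# Hypothetical set of symbols that warrant the full-depth TotalView feed.
HIGH_FREQUENCY_SYMBOLS = {"AAPL", "TSLA"}

def dataset_for_symbol(symbol: str) -> str:
    """Route depth-sensitive symbols to TotalView, everything else to SIP."""
    return "TotalView" if symbol in HIGH_FREQUENCY_SYMBOLS else "SIP"
```

For example, `dataset_for_symbol("AAPL")` selects TotalView while less active names fall back to SIP, without any change to the endpoint paths involved.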


Last updated: 2025-10-14
