Datasets

Datasets YAML reference

A Spicepod can contain one or more datasets referenced by relative path, or defined inline.

`datasets`

Inline example:

spicepod.yaml

datasets:
  - from: spice.ai/eth/beacon/eigenlayer
    name: strategy_manager_deposits
    params:
      app: goerli-app
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_interval: 1h
      refresh_mode: full # / append
      retention: 30m

spicepod.yaml

datasets:
  - from: databricks.com/spiceai/datasets
    name: uniswap_eth_usd
    params:
      environment: prod
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_interval: 1h
      refresh_mode: full # / append
      retention: 30m

spicepod.yaml

datasets:
  - from: local/Users/phillip/data/test.parquet
    name: test
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_interval: 1h
      refresh_mode: full # / append
      retention: 30m

Relative path example:

spicepod.yaml

datasets:
  - from: datasets/uniswap_v2_eth_usdc

datasets/uniswap_v2_eth_usdc/dataset.yaml

name: spiceai.uniswap_v2_eth_usdc
type: overwrite
source: spice.ai
auth: spice.ai
acceleration:
  enabled: true
  refresh_interval: 1h

`name`

The name of the dataset. This is used to reference the dataset in the pod manifest, as well as in external data sources.

`type`

The type of dataset. The following types are supported:

overwrite - Overwrites the dataset with the contents of the dataset source.
append - Appends new data from dataset source to the dataset.
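As a minimal sketch, the relative path example above could be switched from overwriting to appending by changing only `type`. That the remaining fields can stay unchanged is an assumption.

```yaml
# datasets/uniswap_v2_eth_usdc/dataset.yaml
name: spiceai.uniswap_v2_eth_usdc
type: append # append new data from the source instead of overwriting
source: spice.ai
auth: spice.ai
```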

`source`

The source of the dataset. The following sources are supported:

spice.ai
dremio (coming soon)
databricks (coming soon)

`auth`

Optional. The authentication profile to use to connect to the dataset source. Use spice login to create a new authentication profile.

If not specified, the default profile for the data source is used.

`acceleration`

Optional. Accelerate queries to the dataset by caching data locally.

`acceleration.enabled`

Enable or disable acceleration. Defaults to true.

`acceleration.engine`

The acceleration engine to use. Defaults to arrow. The following engines are supported:

arrow - Accelerated in memory, backed by Apache Arrow DataTables.
duckdb - Accelerated by an embedded DuckDB database.
postgres - Accelerated by a PostgreSQL database.

`acceleration.mode`

Optional. The mode of acceleration. The following values are supported:

memory - Store acceleration data in-memory.
file - Store acceleration data in a file.

mode is currently only supported for the duckdb engine.
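Since `mode` only applies to the duckdb engine, a file-backed acceleration would pair the two settings. The following is a sketch that reuses the dataset from the inline example above; treat the combination as illustrative rather than a recommended configuration.

```yaml
# spicepod.yaml
datasets:
  - from: spice.ai/eth/beacon/eigenlayer
    name: strategy_manager_deposits
    params:
      app: goerli-app
    acceleration:
      enabled: true
      engine: duckdb # mode is only supported for the duckdb engine
      mode: file # store acceleration data in a file instead of in memory
```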

`acceleration.refresh_mode`

Optional. How to refresh the dataset. The following values are supported:

full - Refresh the entire dataset.
append - Append new data to the dataset.

`acceleration.refresh_interval`

Optional. How often data should be refreshed. Only supported for full datasets. For append datasets, the refresh interval is not used.

For example, 1h for 1 hour, 1m for 1 minute, 1s for 1 second.
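A sketch of a fully refreshed dataset that reloads once an hour, again reusing the dataset from the inline example above. The choice of interval is illustrative.

```yaml
# spicepod.yaml
datasets:
  - from: spice.ai/eth/beacon/eigenlayer
    name: strategy_manager_deposits
    params:
      app: goerli-app
    acceleration:
      enabled: true
      refresh_mode: full # refresh the entire dataset each time
      refresh_interval: 1h # only applies when refresh_mode is full
```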

`acceleration.retention`

Optional. Only supported for append datasets. Specifies how long to retain data updates from the data source before they are deleted.

If not specified, the default retention is to keep all data.
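For an append dataset, retention bounds how long appended data is kept. The sketch below reuses the dataset from the inline example above and assumes a 30 minute window; omitting `retention` would keep all data.

```yaml
# spicepod.yaml
datasets:
  - from: spice.ai/eth/beacon/eigenlayer
    name: strategy_manager_deposits
    params:
      app: goerli-app
    acceleration:
      enabled: true
      refresh_mode: append # retention is only supported for append datasets
      retention: 30m # delete data updates older than 30 minutes
```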

`acceleration.params`

Optional. Parameters to pass to the acceleration engine. The parameters are specific to the acceleration engine used.

`acceleration.engine_secret`

Optional. The secret store key to use for the acceleration engine connection credential. For supported data connectors, use spice login to store the secret.