# Data Integration Guide

> Build a reliable sync pipeline using Allium's delivery metadata tables

When ingesting Allium blockchain data into your own warehouse (e.g., as a dbt source), a key challenge is knowing **what data was updated** and **when**. Blockchain data is not strictly append-only: late-arriving records, reorgs, and backfills mean that records with older `block_timestamp` values can be delivered after newer ones.

Allium provides **delivery metadata tables** that give you full transparency into every data update, enabling you to build a reliable and cost-efficient sync pipeline.

## Which table answers which question?

| Your question | Table | Indexed by |
| --- | --- | --- |
| "What just arrived?": which data intervals did Allium deliver in a given wall-clock window? | `delivery_metadata.snapshots.aggregated` | delivery time |
| "Which of *my* partitions are now stale?": which hourly partitions were touched/patched, and when? | `delivery_metadata.intervals.changes` | data partition (hour) |
| "Up to what point is this table verified complete and delivered?" | `delivery_metadata.intervals.verified_until` | n/a |

The first two are **sync** tables that drive your ingest pipeline. They are two views of the *same* delivery metadata, just indexed differently:

* `snapshots.aggregated` is indexed by **delivery time** (each row is one batch and the exact `block_timestamp` ranges it carried), which makes it best for pulling new data at the **tip of chain**.
* `intervals.changes` is indexed by **data partition** (each row is one hour and when it was last touched), which makes it best for **full-history partition taint detection** (a backfill or patch anywhere in the past).

See [Recommended Integration Pattern](#recommended-integration-pattern) for how to combine them. The third table, `verified_until`, is a **quality gate**: it answers *how far you can trust the data* (see [Gating on Verification](#gating-on-verification-delivery_metadata-intervals-verified_until)).

`snapshots.aggregated` is the superset: because it carries every delivery's exact load ranges, you *can* run your entire pipeline off it alone, including backfill detection. `intervals.changes` exists as a convenience layer: it pre-computes the per-partition "last touched" rollup so you don't have to derive it yourself (see [Single-table fallback](#single-table-fallback)).

## Key Concepts

| Term | Definition |
| --- | --- |
| `block_timestamp` | The on-chain timestamp of a record: when the event happened on the blockchain. |
| **Snapshot** | A versioned point-in-time view of a table. Each snapshot may contain new records, backfilled records, or both. Snapshots are created regularly, typically every hour or less. |
| **Interval** | An hourly slice of `block_timestamp` (e.g., all records where `block_timestamp` falls within `2026-01-12 05:00` (inclusive) to `2026-01-12 06:00` (exclusive)). |

`block_timestamp` does not always increase monotonically with delivery time. A record delivered today could have a `block_timestamp` from last week (due to a backfill or patch). The metadata tables make this visible so your pipeline can handle it correctly.
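
To make the interval definition concrete, here is a minimal Python sketch of the half-open hourly slice described above. The helper names `hour_interval` and `in_interval` are hypothetical and exist only for illustration; they are not part of any Allium API.

```python
from datetime import datetime, timedelta


def hour_interval(block_timestamp):
    """Return the (start, end) of the hourly interval containing block_timestamp."""
    start = block_timestamp.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


def in_interval(block_timestamp, start, end):
    """Start is inclusive and end is exclusive, matching the Interval definition."""
    return start <= block_timestamp < end


# A record stamped 05:42:07 belongs to the 05:00 interval.
start, end = hour_interval(datetime(2026, 1, 12, 5, 42, 7))
```

Because the end bound is exclusive, a record stamped exactly `06:00:00` belongs to the next interval, so adjacent hourly partitions never overlap.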
## Metadata Tables

### `delivery_metadata.snapshots.aggregated`

Shows the latest state of each snapshot delivered, including what data intervals were loaded in each snapshot. This is the primary table for **steady-state syncing**.

| Column | Type | Description |
| --- | --- | --- |
| `chain` | `VARCHAR` | Blockchain name (e.g., `ethereum`, `base`) |
| `table_name` | `VARCHAR` | Table name (e.g., `raw.logs`, `raw.transactions`) |
| `watermark_column` | `VARCHAR` | The timestamp column used for watermarking (typically `block_timestamp`) |
| `watermark_level` | `TIMESTAMP_NTZ` | The high-water mark of the table at this snapshot |
| `delivery_interval` | `VARCHAR` | The delivery cadence for this table |
| `snapshot_id` | `VARCHAR` | Unique snapshot version identifier |
| `is_full_refresh` | `BOOLEAN` | Whether the dataset was fully refreshed during this snapshot |
| `snapshot_loaded_intervals__count` | `NUMBER` | Number of data loads (delivery events) assigned to this snapshot |
| `snapshot_loaded_intervals__minutes` | `FLOAT` | Total minutes of `block_timestamp` coverage loaded |
| `snapshot_loaded_intervals__min_timestamp` | `TIMESTAMP_NTZ` | Earliest `block_timestamp` of data loaded in this snapshot |
| `snapshot_loaded_intervals__max_timestamp` | `TIMESTAMP_NTZ` | Latest `block_timestamp` of data loaded in this snapshot |
| `snapshot_loaded_intervals__merged_intervals` | `ARRAY` | Consolidated time ranges of loaded data (array of `{start, end, minutes}` objects) |
| `snapshot_loaded_intervals__loads` | `ARRAY` | Individual data loads within the snapshot (each with `load_id`, `trigger_time`, `interval_start`, `interval_end`, `is_full_refresh`) |
| `snapshot_created_at` | `TIMESTAMP_NTZ` | When this snapshot was created and became available to you |
| `created_at` | `TIMESTAMP_NTZ` | When this metadata record was created |

Each row represents a single snapshot. The `snapshot_loaded_intervals__min_timestamp` and `snapshot_loaded_intervals__max_timestamp` columns tell you the overall time range of data that was loaded. Note that this range can be sparse: the `snapshot_loaded_intervals__loads` and `snapshot_loaded_intervals__merged_intervals` fields provide the exact sub-ranges if you need finer granularity.

### `delivery_metadata.intervals.changes`

Shows, for each hourly partition of data, when it was last updated. This is the primary table for **backfill detection**.

| Column | Type | Description |
| --- | --- | --- |
| `chain` | `VARCHAR` | Blockchain name |
| `table_name` | `VARCHAR` | Table name |
| `hour` | `TIMESTAMP_NTZ` | The hourly partition (`date_trunc('hour', block_timestamp)`). One row = one hour of data where `block_timestamp >= hour AND block_timestamp < hour + 1 hour`. |
| `last_updated_at` | `TIMESTAMP_NTZ` | The most recent time this hour's data was modified |
| `updated_timestamps` | `ARRAY` | List of timestamps when this hour's data was updated |
| `last_full_refresh_timestamp` | `TIMESTAMP_NTZ` | The last time this table was fully refreshed |
| `max_hours_late` | `FLOAT` | Maximum hours between `hour` and when the data was actually delivered; useful for measuring backfill lag |
| `_created_at` | `TIMESTAMP_NTZ` | Time the metadata record was created (bookkeeping only; don't use for change detection) |
| `_updated_at` | `TIMESTAMP_NTZ` | When this metadata record was last updated |

For example, if `base.raw.logs` has `hour = 2025-11-06 02:00` with `last_updated_at = 2026-02-20`, that means the hourly partition for Nov 6 2AM was last updated (e.g., patched/backfilled) on Feb 20.
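
Since both tables describe the same delivery events, the per-hour "last touched" view can in principle be derived from snapshot rows. The sketch below illustrates that relationship in Python. It assumes each snapshot row has already been fetched into a dict holding `snapshot_created_at` and a `merged_intervals` list of `{start, end}` datetimes with an exclusive `end`. The key names follow the column descriptions above, but the Python types, the exclusive end, and the helper names `touched_hours` and `rollup_last_touched` are assumptions for illustration. It is not a description of how Allium computes `intervals.changes`.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def touched_hours(start, end):
    """Yield each hourly partition that overlaps the half-open range [start, end)."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        yield hour
        hour += HOUR


def rollup_last_touched(snapshots):
    """Map each hourly partition to the latest snapshot_created_at that touched it."""
    last_touched = {}
    for snap in snapshots:
        delivered_at = snap["snapshot_created_at"]
        for interval in snap["merged_intervals"]:
            for hour in touched_hours(interval["start"], interval["end"]):
                if hour not in last_touched or delivered_at > last_touched[hour]:
                    last_touched[hour] = delivered_at
    return last_touched


# Hypothetical rows: one tip delivery, then a later snapshot carrying both
# a historical patch and fresh tip data (a sparse envelope).
snapshots = [
    {
        "snapshot_created_at": datetime(2026, 1, 12, 6, 5),
        "merged_intervals": [
            {"start": datetime(2026, 1, 12, 5, 0), "end": datetime(2026, 1, 12, 6, 0)},
        ],
    },
    {
        "snapshot_created_at": datetime(2026, 2, 20, 0, 0),
        "merged_intervals": [
            {"start": datetime(2025, 11, 6, 2, 0), "end": datetime(2025, 11, 6, 3, 0)},
            {"start": datetime(2026, 1, 12, 5, 30), "end": datetime(2026, 1, 12, 7, 0)},
        ],
    },
]
last_touched = rollup_last_touched(snapshots)
```

Note that the second snapshot's min/max envelope spans more than two months even though it only touched three hourly partitions, which is why the exact sub-ranges matter when you want to avoid over-syncing.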
## Recommended Integration Pattern

For most pipelines we recommend **two jobs**. This is not because you need two tables, but because the two access patterns scan very different time ranges: the steady-state job only looks at the **tip of chain** (recent hours, via `snapshots.aggregated`), while the backfill job scans **all of history** to catch partition taints anywhere in the past (via `intervals.changes`):

1. A **steady-state job** (high cadence): handles new data at the tip of chain
2. A **backfill job** (low cadence): catches late-arriving historical patches

```mermaid
flowchart TD
    A["Allium Data Delivery"] --> B["Delivery Metadata Tables<br/>(same events, indexed two ways)"]

    subgraph B["Delivery Metadata Tables"]
        B1["snapshots.aggregated<br/>indexed by delivery<br/>What batches arrived, and<br/>what intervals did they carry?"]
        B2["intervals.changes<br/>indexed by data partition<br/>Which hourly partitions<br/>were touched, and when?"]
    end

    B1 --> C1
    B2 --> C2

    subgraph C["Your Pipeline"]
        C1["Job 1: Steady-state (hourly)<br/>Query snapshots.aggregated for new data<br/>Pull block_timestamp range from source<br/>Cap lookback to e.g. 24–36 hours"]
        C2["Job 2: Backfill (daily/weekly)<br/>Query intervals.changes for late patches<br/>Re-sync affected hourly partitions<br/>Check for full refreshes"]
    end
```

### Job 1: Steady-State Sync

Runs every hour (or at whatever cadence your pipeline operates). Picks up newly delivered data snapshots.

Query `delivery_metadata.snapshots.aggregated` to find snapshots created since your last run:

```sql
SELECT
    snapshot_id,
    snapshot_created_at,
    -- floor the pull range to a 36h lookback so steady-state scans stay bounded;
    -- older changes are caught by the backfill job (Job 2)
    GREATEST(
        snapshot_loaded_intervals__min_timestamp,
        CURRENT_TIMESTAMP - INTERVAL '36 hours'
    ) AS pull_from_block_timestamp,
    snapshot_loaded_intervals__max_timestamp,
    snapshot_loaded_intervals__loads,
    is_full_refresh
FROM delivery_metadata.snapshots.aggregated
WHERE chain = 'base'
    AND table_name = 'base.raw.logs'
    -- :last_run_at is a placeholder: bind your pipeline's last successful run time
    AND snapshot_created_at >= :last_run_at - INTERVAL '1 hour'
```

Then use `pull_from_block_timestamp` / `snapshot_loaded_intervals__max_timestamp` to determine the `block_timestamp` range to pull from the source table.

The `GREATEST(..., CURRENT_TIMESTAMP - INTERVAL '36 hours')` floor keeps scan costs predictable: any changes older than the lookback window are caught by the backfill job. Tune the window (24–36 hours is typical) to your delivery latency.

On a **full refresh**, the `snapshot_loaded_intervals__*` columns (including `min_timestamp` / `max_timestamp`) are **NULL** while `is_full_refresh = true`. Read that combination as "re-pull the whole table," not "nothing changed."

### Job 2: Backfill Sync

Runs daily or weekly. Catches any data that was patched or backfilled for historical time periods, i.e. records too old for the steady-state lookback window to catch.
Query `delivery_metadata.intervals.changes` to find hourly partitions that were recently updated:

```sql
SELECT
    chain,
    table_name,
    hour,
    last_updated_at,
    last_full_refresh_timestamp
FROM delivery_metadata.intervals.changes
WHERE chain = 'base'
    AND table_name = 'base.raw.logs'
    AND last_updated_at >= CURRENT_TIMESTAMP - INTERVAL '1 day'
ORDER BY hour
```

The returned `hour` values are the partitions you need to re-sync. Compare `last_updated_at` against your own internal tracking timestamp to determine which hours actually need patching.

You can also check `last_full_refresh_timestamp` to detect if a full table refresh occurred: compare it against your internal timestamp to know if a full re-ingestion is needed.

Filter on **`last_updated_at`**, not `_created_at`. `_created_at` is just the time the metadata record was created, so don't use it for change detection.

### How the Two Jobs Work Together

**Steady-state** optimizes for speed and cost: it handles the common case of new data arriving at the leading edge.

**Backfill** ensures full correctness by catching the uncommon case of late-arriving historical patches (data from weeks or months ago being corrected).

Together, these two jobs ensure your warehouse stays fully aligned with Allium.

### Apply changes with delete + insert (not upsert)

Both jobs share the same write rule: for each changed `block_timestamp` range, **delete all existing rows in that range in your warehouse, then insert the fresh data for that range**. Job 1 applies it over the recent lookback window; Job 2 applies it over each patched hourly partition.

Do not apply changes with an upsert/`MERGE` keyed on a row identifier. Blockchain data is not append-only: reorgs, corrections, and re-processing can **remove** previously delivered rows. Deletions are hard deletes: the row simply disappears from the source table; there is no `_deleted` flag, soft-delete marker, or versioned replacement. A merge only adds or overwrites keys present in the new batch, so a row deleted upstream, whose key never appears again, is left behind stale in your warehouse.

The delivery metadata never flags individual rows as deleted; it flags the **containing time range** as changed. A delete therefore looks like any other change to the enclosing time range: a new load range shows up in `snapshots.aggregated`, and the affected `hour` gets a fresh `last_updated_at` in `intervals.changes`. Follow the changed time ranges and delete + insert each one, and deletions are applied automatically.

### Single-table fallback

If you'd rather not maintain two query paths, you can run everything off `delivery_metadata.snapshots.aggregated` alone. It carries the exact load ranges of every delivery, including backfills, so both the forward-sync and backfill cases are derivable from it.

The trade-off is on your side:

* A single partition's update history is **spread across multiple snapshot rows** (one per delivery that touched it), so a per-partition "is my copy stale?" check means scanning every snapshot whose interval overlaps that partition and taking the latest.
* `min_timestamp` / `max_timestamp` is a **coarse envelope** (one batch can carry both tip data and a historical patch), so the exact coverage lives in the `snapshot_loaded_intervals__loads` / `__merged_intervals` JSON. Filtering on the envelope alone re-syncs more than strictly necessary (safe, but costlier).

`intervals.changes` exists precisely so you don't have to re-implement that rollup. Use the single-table approach only if your pipeline is simple enough that the extra precision isn't worth a second query path.

## Gating on Verification: `delivery_metadata.intervals.verified_until`

`verified_until` reports, per table, the latest point in time up to which Allium has **verified the data is complete and made it available in your environment**, with no verification gaps before that point.
Gate on it when your use case needs **guaranteed, already-verified data (at lower freshness)** rather than fast, eventually-consistent data at the tip of chain, i.e. you'd rather wait for verification than consume the leading edge as soon as it lands.

| Column | Type | Description |
| --- | --- | --- |
| `chain` | `VARCHAR` | Blockchain name |
| `table_name` | `VARCHAR` | Table name |
| `verified_until` | `TIMESTAMP_NTZ` | Latest `block_timestamp` up to which the data is **contiguously** verified and delivered |
| `_updated_at` | `TIMESTAMP_NTZ` | When this result was last computed |

```sql
SELECT verified_until
FROM delivery_metadata.intervals.verified_until
WHERE chain = 'ethereum'
    AND table_name = 'ethereum.raw.logs'
```

Use it as a quality gate: treat only records with `block_timestamp <= verified_until` as final, and hold back anything newer until verification catches up.

`verified_until` is **contiguous and gap-aware**: it is the end of the first unbroken verified run, reading forward over a trailing 7-day window. It **stops at the first gap**; it does not jump to the most recent verified point. A table with no verified data at the start of the window returns **no row** (not `NULL`).

Reading forward from the start of the window, `verified_until` advances only while each interval is verified. The first gap (a failed *or* missing verification) stops it, even if later intervals are verified:

```text
interval     1   2   3   4   5   6   7
verified     ✓   ✓   ✓   ✗   ✓   ✓   ✓
                     └─ verified_until = end of interval 3
                         │
                         └ interval 4 is the first gap; intervals 5–7 are verified
                           but NOT contiguous with the start, so verified_until
                           does not advance.

A window that starts with a gap (interval 1 = ✗ or missing) yields no row:

interval     1   2   3   4
verified     ✗   ✓   ✓   ✓      → no verified_until row
```

The most recent \~1 hour is a normal settling tail and may not yet read as verified even when the data is fine. A `verified_until` that trails `now` by up to \~1–2 hours is expected, not an error.

Coverage: `verified_until` is currently populated for raw (`*.raw.*`) datasets. Enriched/derived tables are delivered and tracked by the two sync tables above but may not yet appear in `verified_until`.

## dbt Integration Example

If you use dbt with Allium as a source, here's how you might structure your incremental model:

A key-based `merge`/upsert will **not** remove rows that were deleted upstream (reorgs, corrections): the deleted row's key never appears in the new batch, so nothing removes it. Delete + insert over the changed time range is required for correctness; see [Apply changes with delete + insert (not upsert)](#apply-changes-with-delete--insert-not-upsert).

```sql
-- models/staging/stg_base_raw_logs.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='block_hour'
    )
}}

-- block_hour is the hourly partition, NOT a row key: delete+insert on it
-- replaces every changed hour wholesale, so upstream deletions are applied too
SELECT
    *,
    DATE_TRUNC('hour', block_timestamp) AS block_hour
FROM {{ source('allium', 'base_raw_logs') }}

{% if is_incremental() %}
-- assumes the model carries a `_loaded_at` load-time column to cursor on
WHERE block_timestamp >= (
    SELECT MIN(snapshot_loaded_intervals__min_timestamp)
    FROM {{ source('allium_delivery_metadata', 'snapshots_aggregated') }}
    WHERE chain = 'base'
        AND table_name = 'base.raw.logs'
        AND snapshot_created_at >= (
            SELECT MAX(_loaded_at) FROM {{ this }}
        ) - INTERVAL '1 hour'
)
{% endif %}
```

The `WHERE block_timestamp >= (SELECT MIN(...) ...)` subquery above is convenient but can defeat **partition pruning**: the engine often can't use a non-constant subquery to prune the scan (see [Filter as much as you can, in CTEs and on time columns](/historical-data/overview/query-optimizations#filter-as-much-as-you-can-in-ctes-and-on-time-columns)). If performance suffers, compute the bound as a **literal at dbt render time** so the predicate becomes a constant the engine can prune on. Use dbt-utils [`get_single_value`](https://github.com/dbt-labs/dbt-utils?tab=readme-ov-file#get_single_value-source) to resolve the timestamp, then inline it:

```sql
{% if is_incremental() %}
    {% set lookback_query %}
        SELECT MIN(snapshot_loaded_intervals__min_timestamp)
        FROM {{ source('allium_delivery_metadata', 'snapshots_aggregated') }}
        WHERE chain = 'base'
            AND table_name = 'base.raw.logs'
            AND snapshot_created_at >= (SELECT MAX(_loaded_at) FROM {{ this }}) - INTERVAL '1 hour'
    {% endset %}
    {% set min_block_timestamp = dbt_utils.get_single_value(lookback_query, default="1970-01-01 00:00:00") %}
WHERE block_timestamp >= '{{ min_block_timestamp }}'
{% endif %}
```

See [dbt tips and tricks](/historical-data/overview/query-optimizations#dbt-tips-and-tricks) for more patterns.

For the backfill job, create a separate model or macro that queries `intervals.changes` and re-processes the affected partitions.

## FAQ: Late-Arriving Data, Deletions, and Finality

Common questions when building a sync pipeline against enriched tables (e.g. `dex.trades`) where late-arriving data, reorgs, and backfills matter.

Anchor on the **per-chain tables** (`ethereum.dex.trades`, `arbitrum.dex.trades`, …). Cross-chain "all-chains" tables such as `crosschain.dex.trades` are **views** (a `UNION ALL` over the per-chain tables), and views don't emit delivery events, so they never appear in `snapshots.aggregated` or `intervals.changes`. Each per-chain table is delivered independently with its own watermark and its own metadata key.
Point your sync jobs at the per-chain keys and the metadata lines up one-to-one.

Late-arriving data falls into two regimes:

1. **Tip settling (continuous, shallow).** A block-hour is delivered within minutes of the chain tip, then re-touched a handful of times over the next few hours as the rolling ingestion window re-scans the leading edge for late-arriving records and shallow reorgs. On EVM chains an hour is typically final within **single-digit hours**.
2. **Reprocessing sweeps (infrequent, wide).** Periodically a batch job re-derives a large contiguous span of history at once, for example when new DEX-protocol coverage is added and backfilled, or a decoding / USD-pricing methodology is revised and re-applied. These are rare (on the order of a handful per quarter) but can touch a large share of a rolling quarter's partitions in a single pass, reaching back **weeks**. On any ordinary day the deep-backfill footprint is effectively zero; the width comes entirely from the occasional sweep.

High-throughput chains carry a heavier ongoing backfill: Solana enriched trades, for instance, can keep settling for **weeks** and be revised **months** back. Size your safety net to the chains you actually ingest.

Late-arriving data has two distinct drivers, mapping onto the two regimes above. Tip settling is **chain-mechanical** (late blocks and shallow reorgs) and is expected and self-healing. Reprocessing sweeps are **data-quality / coverage** driven: a new protocol or version added to coverage and backfilled, or a revised decoding / pricing methodology re-applied across history. Sweeps are deliberate quality improvements, not incident recovery.

Revisions can reach well beyond the tip: reprocessing sweeps routinely revise partitions weeks old on EVM, and months old on the highest-throughput chains. This is the crucial point for pipeline design: **a fixed lookback window is necessary but not sufficient.** A short steady-state re-pull (24–36h) captures the common case of tip settling but will entirely miss a sweep that reaches back weeks. The correct pattern is the two-job split described in [Recommended Integration Pattern](#recommended-integration-pattern): a high-cadence steady-state job with a short lookback, **plus** a low-cadence backfill job that queries `intervals.changes` for *any* historical hour whose `last_updated_at` has advanced, with no lookback bound. The backfill job, not the lookback window, is what keeps you aligned against sweeps.

Rows can also be removed upstream, both by reorgs and by sweep re-derivations that drop previously mis-attributed records. These are **hard deletes**: there is no `_deleted` flag, tombstone, or versioned replacement row. Do **not** try to infer removals by diffing full snapshots. The metadata never flags an individual row as deleted; it flags the **containing hour** as changed (a bumped `last_updated_at` in `intervals.changes`), so a delete surfaces exactly like an insert or update. Apply changes with a **delete-aware re-pull per changed hour** (delete every row you hold for that block-hour, then insert the fresh batch), never an upsert/`MERGE` on the row key. See [Apply changes with delete + insert (not upsert)](#apply-changes-with-delete--insert-not-upsert) for why a merge strands upstream-deleted rows.

To decide when data is final, there are two options, most broadly applicable first:

1. **Settle-gate on `intervals.changes` (works today for every table).** Treat an hour as ready once its `last_updated_at` has been stable for a settle buffer, and keep honoring later advances through the backfill job. For **irreversible** downstream actions, add margin: an EVM hour is empirically final within single-digit hours in normal operation, but a later reprocessing sweep can still revise it. Your backfill job will catch it, so build in a way to reverse or replay affected actions.
2. **`verified_until` (strongest primitive, with a coverage caveat).** [`intervals.verified_until`](#gating-on-verification-delivery_metadata-intervals-verified_until) reports the latest `block_timestamp` that is contiguously verified complete and delivered; you gate on `block_timestamp <= verified_until`. It is currently populated for **raw** (`*.raw.*`) datasets, so enriched/derived tables (e.g. `dex.trades`) may not yet carry a row. For enriched tables today, the settle-gate plus backfill job is the practical finality mechanism.

One reassuring property: late-arriving data to date has been applied as incremental delete+insert of specific hour ranges, and full-table refreshes are rare. Keep a guard for them anyway (`is_full_refresh` / `last_full_refresh_timestamp` in the metadata), but you should not expect routine "re-ingest the whole table" events.

An alternative to the metadata tables is to cursor on row-level columns. Some Allium tables expose `_created_at` (first write) and `_updated_at` (bumped on every re-write, including sweeps). Cursoring on `_updated_at` with a \~1h overlap buffer captures inserts and sweep re-derivations without touching the metadata tables. Three caveats:

* These columns are **experimental / not officially supported** (see [Metadata columns](/historical-data/overview/data-faq/metadata-columns)).
* A **deleted row leaves no `_updated_at` to advance past**, so you still need the same delete-aware trailing re-pull to apply reorgs and removals.
* These tables are **partitioned by their timestamp column** (e.g. `block_timestamp`), so a filter on `_updated_at` alone does **not** trigger partition pruning, and the query full-scans the target table. Adding a `block_timestamp` filter restores pruning, but any late-arriving data in a partition outside that range is then missed. You're forced to trade scan cost against completeness; the delivery metadata tables avoid the bind because they're indexed for exactly this lookup.

The delivery metadata tables remain the recommended source of truth.
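
The settle-gate and the per-hour staleness check described in this FAQ can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rows from `intervals.changes` are assumed to be fetched into dicts with `hour` and `last_updated_at` datetimes, `synced_at` is a hypothetical local map from hour to the time your warehouse last re-pulled it, and the 6-hour settle buffer is an arbitrary placeholder rather than a recommended value.

```python
from datetime import datetime, timedelta


def settled_hours(changes, now, settle_buffer):
    """Hours whose last_updated_at has been stable for at least settle_buffer."""
    return sorted(
        row["hour"]
        for row in changes
        if now - row["last_updated_at"] >= settle_buffer
    )


def hours_needing_resync(changes, synced_at):
    """Hours never synced locally, or touched upstream after the local sync."""
    stale = []
    for row in changes:
        local = synced_at.get(row["hour"])
        if local is None or row["last_updated_at"] > local:
            stale.append(row["hour"])
    return sorted(stale)


now = datetime(2026, 1, 12, 12, 0)
changes = [
    # quiet for 7.5 hours: past the settle buffer
    {"hour": datetime(2026, 1, 12, 3, 0), "last_updated_at": datetime(2026, 1, 12, 4, 30)},
    # touched 20 minutes ago: still settling
    {"hour": datetime(2026, 1, 12, 10, 0), "last_updated_at": datetime(2026, 1, 12, 11, 40)},
]
synced_at = {datetime(2026, 1, 12, 3, 0): datetime(2026, 1, 12, 5, 0)}

ready = settled_hours(changes, now, settle_buffer=timedelta(hours=6))
stale = hours_needing_resync(changes, synced_at)
```

A settled hour is only provisionally final: if a later sweep bumps its `last_updated_at`, `hours_needing_resync` will report it again, which is the behavior the backfill job relies on.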

## Further Reading

* [Metadata columns](/historical-data/overview/data-faq/metadata-columns)