Modern data pipelines must gracefully handle data that evolves over time. In real-world scenarios, source data is rarely static: new records are created, existing records are updated, and records are deleted as needed. Managing temporal (time-based) change means designing your pipeline to capture these changes, process them efficiently, and store results in a way that preserves the right amount of history. This design involves three specific stages:
Ingestion – Change Data Capture (CDC): How to capture and ingest only the changes (deltas) from source systems, rather than full data extracts.
Transformation – Incremental Models: How to process data in an incremental fashion, so that each pipeline run handles just new or changed data (rather than recomputing everything).
Storage/Serving – SCD Patterns: How to store transformed data, especially dimensional data, in ways that reflect changes over time (using Slowly Changing Dimensions patterns), and how to serve both current and historical views.
Each of these tiers addresses temporal change at a different stage of the data pipeline. Together, they form a continuum – from the moment a change occurs in a source system, to how it flows through transformations, and finally to how it is stored for end-user consumption. Let's explore each stage, understand the key concepts (CDC, incremental processing, SCD Type 1 vs Type 2, etc.), and show where they fit in a modern pipeline architecture. We’ll also touch on related concepts like snapshot tables, late-arriving data, and backfills to round out the picture.
Ingestion (Change Data Capture)
Example: A change data capture feed captures just the changed row. Here, between January 1 and January 2, an employee’s role changed from "Owner" to "Manager," so the CDC feed emits a single UPDATE event for that row (id=5) instead of a full data dump.
Change Data Capture (CDC) is a data integration pattern that identifies and tracks changes in a source system and delivers those changes downstream. Rather than extracting entire tables, CDC allows pipelines to ingest only the deltas – the inserts, updates, and deletes that have occurred since the last capture. This stream of changes (called the CDC feed) includes metadata about each change, such as the type of operation (INSERT/UPDATE/DELETE) and an ordering mechanism (e.g. an incrementing sequence number or timestamp) to ensure changes can be replayed in the correct order. By consuming a CDC feed instead of full snapshots, data engineers can move data much faster and more efficiently.
In practice, CDC is implemented via database transaction logs, triggers, or publication/subscription (usually referred to as "pub/sub") mechanisms. For example, a tool like Debezium can monitor a relational database’s write-ahead log and publish every row-level change to a messaging system. Similarly, modern cloud data warehouses and lakes often have built-in CDC or change feed capabilities (for instance, Delta Lake’s Change Data Feed). The result is an ingestion pipeline that continuously (or periodically in micro-batches) streams only the new and changed records into the data platform.
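To make that concrete, here is a rough sketch of what a raw change-log table landed by a CDC tool might look like, and how a downstream job would read only the new deltas. The table and column names are purely illustrative, not any specific tool's schema:

-- Hypothetical raw change-log table populated by a CDC tool
CREATE TABLE raw.customer_changes (
  op           STRING,    -- 'INSERT', 'UPDATE', or 'DELETE'
  commit_ts    TIMESTAMP, -- when the change was committed in the source
  commit_seq   BIGINT,    -- monotonically increasing sequence used for ordering
  customer_id  INT,       -- business key of the changed row
  name         STRING,
  email        STRING
);

-- Downstream jobs read only the deltas past the last processed position
SELECT *
FROM raw.customer_changes
WHERE commit_seq > 1024 -- last processed sequence number (illustrative)
ORDER BY commit_seq;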
Without CDC, you could alternatively do a periodic snapshot ingestion. In a snapshot-based approach, the entire source table (or the portion of it for a given date) is extracted on a schedule (daily, hourly, etc.). This is simpler but comes at the cost of handling a lot of redundant data and missing fine-grained change information. Downstream processes then have to diff snapshots or use techniques like SCD to figure out what changed. For instance, a source system might export a full dump of a table each day (a daily snapshot). The data pipeline could ingest these daily snapshots and later apply logic to detect changes between one day and the next. This works, but it's far less efficient than CDC, can delay the availability of new changes, and only becomes more costly as data volumes grow.
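To see why, here is a hedged sketch of what snapshot diffing looks like in practice: detecting inserts and updates means comparing columns between two full copies, and deletes require a second pass in the opposite direction (the snapshot table names are illustrative; IS DISTINCT FROM is a NULL-safe comparison supported by most modern engines):

-- New or changed rows between yesterday's and today's full snapshots
SELECT today.*
FROM snapshots.customers_2024_01_02 AS today
LEFT JOIN snapshots.customers_2024_01_01 AS yesterday
  ON today.id = yesterday.id
WHERE yesterday.id IS NULL                       -- newly inserted rows
   OR today.role IS DISTINCT FROM yesterday.role -- changed rows (repeat per column)
;

-- Deleted rows: present yesterday, missing today
SELECT yesterday.id
FROM snapshots.customers_2024_01_01 AS yesterday
LEFT JOIN snapshots.customers_2024_01_02 AS today
  ON yesterday.id = today.id
WHERE today.id IS NULL;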
Where CDC fits in the pipeline: CDC primarily addresses the ingestion layer of the pipeline. It’s how raw changes enter your data lake or warehouse. By capturing changes as they happen, CDC lays the groundwork for incremental processing further down the line. The ingestion tier using CDC ensures that downstream transformation jobs always have the latest changes available (often in a raw change-log table). It also guarantees that no change is lost: even deletions and updates are explicitly represented, rather than being implicit differences between snapshots. This reliable capture of change events is crucial for building temporal awareness into the rest of the pipeline.
One important aspect of CDC ingestion is handling out-of-order events or late arrivals. Production systems may generate changes in one order, but network delays or batching could deliver them out of sequence (what? not everything works perfectly all of the time? 😂). The metadata in a CDC feed (such as an event timestamp or a log sequence number) lets the pipeline apply changes in the correct order. If an event arrives late, the pipeline can still process it correctly because the event carries a timestamp or sequence indicating its position in time. Late arrivals do still pose challenges for the transformation layer, because downstream jobs need to reconcile out-of-order data.
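One common way to tolerate out-of-order delivery is to keep the full change log and let consumers collapse it to the latest event per key using the sequence number, along these lines (a sketch reusing the illustrative change-log table from above):

-- Latest known state per key, regardless of arrival order
SELECT *
FROM (
  SELECT
    c.*,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY commit_seq DESC
    ) AS rn
  FROM raw.customer_changes AS c
) AS ranked
WHERE rn = 1;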
Change data capture (CDC) ingestion is about capturing the “when” and “what” of data changes as they happen. It is the first step in managing temporal change: by the time data hits your platform, you already know what changed and when. This is the foundational input for your incremental transformation logic.
Transformation (Incremental Models)
Once changes are flowing into the system, the next challenge is how to transform and aggregate this data without reprocessing everything from scratch each time. This is where incremental models come in. An incremental transformation process only processes new or updated data in each run, rather than recomputing an entire dataset. In contrast, a naive approach (often called a full refresh or full recompute) would read all input data and rebuild the output from zero on every execution, which is wasteful when most data hasn’t changed. Fun fact - this is what other transformation frameworks do...
Tobiko SQLMesh is the only open-source transformation framework to provide explicit support for incremental processing. You may think, "Oh, well, all open-source transformation frameworks must do this...it seems really important." Well, you'd be right that it's really important, but you'd be wrong to assume that the alternatives provide the same level of support. SQLMesh is the only open-source transformation framework to fully address incremental models 🤯. Read that again. It's important. Okay, enough pointing out the flaws in other solutions. Let's continue learning.

One of the most common patterns is incremental processing by time range. In a model of kind INCREMENTAL_BY_TIME_RANGE, SQLMesh knows to break the data into time-based windows and process only the missing intervals on each run. If yesterday's data has already been processed and only today's new data is needed, the framework automatically restricts the query to today's partition. This is in contrast to a FULL model kind, which would scan and recompute the entire dataset every time. The incremental model requires two key pieces of configuration: a designated time column to partition the data, and a filter (WHERE clause) on that time column to limit each run's scope. The framework uses these to determine what range of time to process on a given run (for example, "process only data where event_date is between X and Y").
A simple example of defining an incremental model by time in SQLMesh might look like this:
MODEL (
  name db.events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date -- designate the timestamp column for partitioning
  )
);

SELECT
  *
FROM source_table
WHERE event_date BETWEEN @start_date AND @end_date;
In this snippet, the model is configured to be incremental on event_date, and the query uses the special macros @start_date and @end_date to filter the source data to the correct interval. Each time the pipeline runs, those macros resolve to the next missing time range that needs processing. Only that slice of data is read and inserted into the target table, leaving previously processed data intact.
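For example, on a daily cadence where everything through yesterday has already been processed, the rendered query for today's run conceptually boils down to something like this (the exact literal format depends on the engine and the time column's type):

SELECT
  *
FROM source_table
WHERE event_date BETWEEN '2025-01-02' AND '2025-01-02'; -- the single missing day for this run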
Incremental modeling can be tuned to handle real-world complications:
Late-arriving data: What if an event arrives late, i.e., with a timestamp that falls into a past time period that’s already been processed? This is a common issue in distributed systems (for example, a mobile app might record an event with yesterday’s date but only sync to the server today). To handle this, incremental models often include a lookback window. The lookback tells the pipeline to re-process a small slice of recent past data every run, just in case new data shows up for that interval. In SQLMesh, for instance, you can configure lookback to a number of periods (days, hours, etc.) to include in each run. For example:
MODEL (
  name demo.incrementals_demo,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column transaction_date,
    lookback 2 -- reprocess past 2 days for late events
  ),
  start '2024-10-25', -- don't backfill data before this date
  cron '@daily'
);

SELECT
  ...
FROM transactions
WHERE transaction_date BETWEEN @start_date AND @end_date;
In the above model configuration, lookback 2 ensures that each daily run will not only process the new day's partition but also reprocess the previous 2 days as a safety net for late-arriving records. This means if any data for "yesterday" or the day before arrives late, it will still be included. The start '2024-10-25' parameter indicates the pipeline's start date, effectively telling SQLMesh not to process any data older than October 25, 2024 (i.e., do not backfill before this date). Using a start date is a way to limit or avoid an initial huge backfill, especially if the source has history going back years – you might choose to start processing from a certain point and handle earlier history separately.
Backfills: A backfill is the process of loading historical data that wasn’t previously processed (or reprocessing data due to a fix or change). In incremental pipelines, you typically don’t want to backfill everything by default (to save time), hence the use of a start date as above. However, you do need the capability to perform backfills when required. With SQLMesh, you could adjust the start date backward or run a special backfill job to process older partitions in batches. For example, if your pipeline initially started at 2024-10-25 but later you need data from September 2024, you might run a one-time backfill for the earlier dates (possibly using the same model definition but in a separate plan or by temporarily changing the start). The key is that the incremental model definition stays the same; you are simply instructing the system to cover an earlier range of dates. Since INCREMENTAL_BY_TIME_RANGE models process intervals independently, they make it straightforward to backfill missing ranges without reprocessing the entire dataset.
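As a sketch, one way to cover September as well is simply to widen the model's start date so the earlier intervals become eligible for backfill; whether you do that permanently, in a separate plan, or via a restatement depends on how you operate SQLMesh:

MODEL (
  name demo.incrementals_demo,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column transaction_date,
    lookback 2
  ),
  start '2024-09-01', -- widened from '2024-10-25' so earlier intervals can be backfilled
  cron '@daily'
);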
Incremental by keys (upserts): Not all incremental logic is time-based. Sometimes you have a dimension or fact table that should be kept up-to-date by applying new records and updates based on a unique key (for example, an up-to-date product catalog or user profile table). SQLMesh addresses this with INCREMENTAL_BY_UNIQUE_KEY models. These models use a key (or composite key) to merge incoming rows into the existing table. Each run will take new or changed rows from the source and perform an upsert: if a row with the same key already exists in the target, it will be updated; if not, it will be inserted. Under the hood, this is often implemented with a MERGE operation in SQL. By default, all columns from the source replace the target’s columns on a match (i.e. a full overwrite of the record). However, you can customize the update behavior. For example, perhaps you only want to update certain columns and leave others untouched if they are NULL in the incoming data. SQLMesh allows a when_matched clause in the model definition to override the default merge logic. Consider this example:
MODEL (
  name db.employees,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key name,
    when_matched (
      WHEN MATCHED THEN UPDATE SET target.salary = COALESCE(source.salary, target.salary)
    )
  )
);
Here, the unique_key is name (just for illustration; in practice, a more stable key like an ID would be used). The when_matched clause specifies that when a matching employee record is found, only the salary column should be updated – and even then, use COALESCE to keep the existing salary if the incoming source.salary is NULL. This kind of fine-grained control is important for incremental merges, especially if not all source changes should overwrite target data. Engines that support SQL MERGE (e.g. BigQuery, Snowflake, Databricks, Postgres, etc.) can take advantage of this syntax natively.
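Conceptually, the statement generated for such a model is similar to the MERGE below. The incoming_employees relation is illustrative (in reality the merge source is the model's query results), and the exact SQL SQLMesh emits will differ by engine:

MERGE INTO db.employees AS target
USING incoming_employees AS source -- illustrative staging relation of new/changed rows
  ON target.name = source.name
WHEN MATCHED THEN
  UPDATE SET target.salary = COALESCE(source.salary, target.salary)
WHEN NOT MATCHED THEN
  INSERT (name, salary)
  VALUES (source.name, source.salary);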
Incremental-by-key models are essentially how you would implement an SCD Type 1 change in a transformation: you apply updates in place, so the target table always has the latest version of each record (with no history of old values). In fact, performing an upsert on a dimension table based on the primary key is a typical way to maintain an SCD Type 1 dimension. We’ll discuss SCD types more in the next section, but it’s worth noting that an incremental pipeline can handle both kinds of slowly changing dimensions: Type 1 (overwrite on update) via merges, and Type 2 (history on update) via a different strategy.
Idempotency and re-processing: With incremental models, ensuring idempotency (that re-running the same batch yields the same result without duplicates or inconsistencies) is critical. Time-range incremental models are usually idempotent by design – e.g., if you re-run the job for the same date partition, it will typically overwrite or replace that partition’s data, ending up with one clean set of rows for that date. Key-based incremental models (upserts) can be trickier: if not handled carefully, re-processing the same input could double-update or duplicate data. SQLMesh notes that INCREMENTAL_BY_UNIQUE_KEY models are inherently non-idempotent across arbitrary restatements – you cannot partially "undo" an update to one row without potentially affecting others, so if you need to restate data (fix past mistakes), the safest route is to rebuild the entire table from scratch. In practice, this means frameworks often disable partial backfill or selective restatement for such models. Instead, a full refresh of the table would be done if absolutely needed to correct historical errors.
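To make the time-range case concrete: re-running an interval behaves roughly like replacing that slice of the table wholesale, so running it twice leaves the same rows behind. A conceptual sketch (not the literal statements the framework executes):

-- Recompute and replace the 2025-01-02 slice; running this twice yields the same result
DELETE FROM db.events
WHERE event_date BETWEEN '2025-01-02' AND '2025-01-02';

INSERT INTO db.events
SELECT *
FROM source_table
WHERE event_date BETWEEN '2025-01-02' AND '2025-01-02';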
The transformation layer’s job in managing temporal change is to process new data incrementally while preserving correctness. This layer sits in the middle of the pipeline: it consumes the change data (from CDC or snapshots) and produces up-to-date derived tables. By leveraging incremental processing, the pipeline can scale to large volumes of data and long time spans, since each run does a bounded amount of work. This is crucial when dealing with temporal data, because the total volume of data (across all time) keeps growing – a full recompute would grow more expensive every day. Incremental models ensure that adding the next slice of time or batch of changes is a manageable effort.
Storage/Serving (SCD Patterns)
After transformation, we have data ready to be stored and served for analysis. At this stage, a key question is how to represent the data’s changing nature in the storage schema. For datasets that don’t require history, we might only store the latest state (called a current table or Type 1 dimension). But for many important entities – customers, products, accounts, etc. – the business wants to ask questions about how they changed over time. This is where Slowly Changing Dimension (SCD) patterns come into play. SCD techniques allow a table to retain historical versions of records in addition to the current version, enabling point-in-time analysis.
The two most common SCD patterns are Type 1 and Type 2. SCD Type 1 means no history is kept: when an attribute changes, you update the existing record in place, and the old value is lost. This is straightforward and storage-efficient – for example, if a customer’s address changes, the table simply overwrites the old address with the new one. SCD Type 1 is useful when you only care about the current state (or when corrections need to be applied retroactively with no need to preserve the error). However, it cannot answer questions like “what was this customer’s address last month?” because the last month’s value is gone 🪦
SCD Type 1: The dimension table keeps only the latest data. In this example, Chris’s role has been updated to "Manager," and the previous role ("Owner") is no longer stored.
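In SQL terms, a Type 1 change is just an in-place update. Here's a minimal sketch using a hypothetical dim_customer table:

-- SCD Type 1: overwrite in place; the previous address is not preserved
UPDATE dim_customer
SET address = '42 New Street'
WHERE customer_id = 1742; -- illustrative key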
Conversely, SCD Type 2 creates a new record whenever a change happens, preserving the old record for history. An SCD Type 2 dimension table typically has additional columns to indicate the validity period of each record version – commonly valid_from and valid_to timestamps (or effective start/end dates). When a record changes, the existing version is closed out by setting its valid_to to the change time, and a new version is inserted with a fresh valid_from (the change time) and valid_to = NULL to signify it is currently active. This way, the table keeps all historical changes. For example, if a product’s price changes from $100 to $120 on Jan 5, 2025, a Type 2 table would still have the old price $100 record with valid_to = '2025-01-05', and a new record with price $120 starting on that date. Analysts can then reconstruct the price at any given date by examining these validity ranges.
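Laid out as rows, that price history would look something like this (the id and the original start date are made up for illustration; only the January 5, 2025 change comes from the example above):

id   | price | valid_from          | valid_to
1001 | 100   | 2024-06-01 00:00:00 | 2025-01-05 00:00:00
1001 | 120   | 2025-01-05 00:00:00 | NULL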
SCD Type 2: The dimension table keeps all historical changes. Here, for Chris (id 5), the table has two records: one for the "Owner" role (valid from 2024-01-01 until 2024-01-01 12:00:00), and one for the "Manager" role (valid from 2024-01-02 onward, with no end date). This allows queries to see the state as of any point in time.
SQLMesh provides first-class support for SCD Type 2 as a model kind. When you declare a model as SCD_TYPE_2 (with either the “by time” or “by column” method), SQLMesh will automatically add the necessary valid_from and valid_to fields and manage the generation of new records on change. The basic configuration for an SCD2 model includes specifying the unique key that identifies a record (e.g. a primary key like id or composite business key), and depending on the variant, either specifying an update timestamp column or a set of columns to watch for changes. Let’s look at the two approaches:
SCD Type 2 by Time: This variant assumes the source data has an "updated_at" timestamp (or similar) that indicates when the row was last updated. It is the recommended approach because having an explicit timestamp for changes allows precise tracking. In the model configuration, you provide the name of the timestamp column. For example:
MODEL (
  name db.menu_items,
  kind SCD_TYPE_2_BY_TIME (
    unique_key id
    -- (optionally, updated_at_name <colName> if the timestamp column isn't the default name)
  )
);

SELECT
  id::INT,
  name::STRING,
  price::DOUBLE,
  updated_at::TIMESTAMP
FROM stg.current_menu_items;
In this definition, unique_key id tells SQLMesh that the id column uniquely identifies a menu item. With SCD_TYPE_2_BY_TIME, SQLMesh will look at the updated_at timestamp in the source (here we use the default name updated_at; we could rename it via updated_at_name if needed) to decide when a change has occurred. Every time the pipeline runs, it will compare incoming records to the latest version in the target table:
If a record is new (an id not seen before), SQLMesh inserts it with valid_from = updated_at (the time it became valid) and valid_to = NULL (currently valid).
If a record with an existing id has a different updated_at (meaning the source says it was changed) or different values, then the existing record in the SCD table is expired (its valid_to set to the new record’s valid_from time) and a new record version is inserted with the updated values.
If nothing changed for that id, no new record is added; the existing version remains with valid_to = NULL.
The resulting table schema from the above model will include the business columns plus the two audit columns. For example, after materialization the table might look like:
TABLE db.menu_items (
  id INT,
  name STRING,
  price DOUBLE,
  updated_at TIMESTAMP,
  valid_from TIMESTAMP,
  valid_to TIMESTAMP
);
SCD Type 2 by Column: This variant is for sources that do not have a reliable timestamp on each row update. Instead of relying on a timestamp, the pipeline will detect changes by comparing specified columns. In the model configuration, you list which columns should be monitored for changes. On each run, for each key, SQLMesh will check if any of those columns have different values than the last version in the dimension table:
If a change in any monitored column is detected, it treats that as a new version event. The valid_from of the new version will be the time the pipeline run is executing (essentially the ingestion time, since we don’t have an explicit event time from the source).
As before, the old version’s valid_to is set to that timestamp and the new version gets valid_to = NULL.
For example:
MODEL (
  name db.menu_items,
  kind SCD_TYPE_2_BY_COLUMN (
    unique_key id,
    columns [name, price]
  )
);

SELECT
  id::INT,
  name::STRING,
  price::DOUBLE
FROM stg.current_menu_items;
Here we watch the name and price columns. If either changes for a given id, a new record version is created. The structure of the table will again have valid_from and valid_to (but no updated_at column, unless we carry along a snapshot date or similar). The valid_from in this case is effectively the pipeline’s execution time when the change was detected. This approach is slightly less precise in terms of historical fidelity (since you only know when the change was observed, not exactly when it happened in the source), but it is useful for sources that simply emit the current state of records with no timestamps.
Both SCD Type 2 methods in SQLMesh support additional configurations. For instance, you can rename the valid_from/valid_to columns if you need different names (perhaps for clarity or to meet schema standards). You can also control how deletions (hard deletes) from the source are handled. By default, if a record disappears from the source, SQLMesh’s SCD Type 2 will not immediately invalidate the last version in the dimension table – it will keep the valid_to as NULL (as if the record is still valid). This is because sometimes a temporary glitch or omission could remove a record, and if it reappears later, you might not want to treat it as two separate periods of validity. However, if the record does reappear later, the framework will then close out the old entry at the reappearance time to avoid overlapping intervals. Alternatively, you can choose to invalidate on delete (invalidate_hard_deletes = true), which will mark the record as expired (set valid_to to the time of deletion detection) whenever a source deletion is detected. This choice depends on how you want to model deleted records – either keep them "alive" until confirmed gone for good, or mark them as gone immediately. In any case, SCD modeling requires thinking about these edge cases (deletions, reactivations, etc.) so that the history remains consistent.
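As a sketch, opting into immediate invalidation is a small addition to the model definition; the property name matches the behavior described above, but double-check the exact syntax and default for your SQLMesh version:

MODEL (
  name db.menu_items,
  kind SCD_TYPE_2_BY_TIME (
    unique_key id,
    invalidate_hard_deletes true -- expire a record (set valid_to) when it disappears from the source
  )
);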
It's also worth spelling out the relationship between SCD Type 2 models and upstream CDC ingestion. If your ingestion layer is already providing a change log (with explicit INSERT/UPDATE/DELETE events), you have some flexibility in implementation:
One approach is to feed those changes directly into an SCD2 table via a merge or processing logic that mirrors the SCD2 algorithm (closing out records on update/delete, etc.). Essentially, the CDC feed drives the Type 2 table maintenance.
Another approach, if the source does not directly give change events but only full snapshots (like a daily full extract of a table), is to use the snapshot method in combination with SCD Type 2. As discussed, a source "daily snapshot" table contains multiple versions of each record (one per day, say). SQLMesh’s SCD Type 2 can consume such snapshots by processing them one snapshot at a time in order. For example, setting batch_size = 1 along with a daily schedule will make the SCD model process one day’s partition of data at a time sequentially. This ensures that the SCD logic evaluates changes in the correct chronological sequence of days. A simplified example configuration for that scenario might look like:
MODEL (
  name db.daily_dim_customer,
  kind SCD_TYPE_2_BY_COLUMN (
    unique_key customer_id,
    columns [status, plan],
    updated_at_name ds, -- 'ds' is the date stamp of the snapshot
    batch_size 1
  ),
  start '2025-01-01',
  cron '@daily'
);

SELECT
  customer_id,
  status,
  plan,
  ds
FROM source_system.daily_customer_snapshot
WHERE ds BETWEEN @start_ds AND @end_ds;
In this example, the source daily_customer_snapshot table contains a full snapshot of customers each day (partitioned by ds date). We configure the SCD model to process one day at a time (batch_size 1 on a daily cron) so that it will first load Jan 1, then next run load Jan 2, and so on, applying the Type 2 logic for changes between days. We also use updated_at_name ds so that SQLMesh treats the snapshot date as the change timestamp for new versions. This way, if a customer’s status or plan changes from one day to the next, the SCD table will get a new version with valid_from = that day's ds. If on a particular day the value doesn’t change, no new record is added for that customer on that day. This approach essentially leverages a snapshot source to build a Type 2 dimension. The continuum is preserved: even without a CDC event stream, the daily snapshots serve as a pseudo-change log when processed sequentially.
Querying SCD data: Once you have SCD Type 2 tables in your serving layer, consuming them effectively is the final step. For current-state reporting (where you only care about the latest value per entity), you can simply filter for valid_to IS NULL to get one record per key – the current version. In fact, many data teams create a view on top of an SCD2 table that selects only the current records (sometimes called a "Type 1 view" or current view), making it easy for downstream consumers who aren’t concerned with history. For point-in-time analysis, you’ll write predicates against the validity interval. For example, to find the version of a record active at a specific time T, you query for valid_from <= T < valid_to (with valid_to treated as infinity for current records). SQLMesh’s documentation shows an example of such a query: to get the state of menu_items as of 2020-01-02 01:00:00, you ensure the record’s valid_from is before that timestamp and valid_to is either NULL or after that timestamp. When joining fact tables to a Type 2 dimension, you include similar predicates so that each fact (e.g. an order) joins to the correct dimension record effective at the time of the fact. All of this means that with SCD Type 2, you can answer historical questions like “What was the product price at the time of this sale?” or “How did this customer’s attributes change over the past year?” which would be impossible with only a Type 1 approach.
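As a sketch against the menu_items table from earlier (the fact table and its columns are illustrative):

-- Current state only: one row per id
SELECT *
FROM db.menu_items
WHERE valid_to IS NULL;

-- State as of a point in time
SELECT *
FROM db.menu_items
WHERE valid_from <= '2020-01-02 01:00:00'
  AND (valid_to IS NULL OR valid_to > '2020-01-02 01:00:00');

-- Join facts to the dimension version effective at the time of each fact
SELECT
  o.order_id,
  m.price
FROM fact_orders AS o -- illustrative fact table
JOIN db.menu_items AS m
  ON o.menu_item_id = m.id
  AND m.valid_from <= o.ordered_at
  AND (m.valid_to IS NULL OR m.valid_to > o.ordered_at);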
One important consideration with SCD Type 2 storage is performance and maintenance. Keeping history naturally means these tables grow indefinitely as changes accumulate. This can impact query performance and storage costs. It’s wise to partition SCD tables by a date (e.g. by valid_from or an ingestion date) if supported, and to age off history if it becomes truly irrelevant (some organizations keep Type 2 history only for X years for regulatory or practical reasons). Also, because SCD2 tables are not idempotent in the sense of partial updates (once history is written, modifying it retrospectively is complex), corrections often involve either soft corrections (inserting another row to represent a corrected value effective from now) or a complete rebuild. SQLMesh explicitly disables partial data restatements on SCD Type 2 models by default – if you attempt to “fix” a past interval, the system would end up recomputing the entire table from scratch to maintain consistency. This is a safety measure to prevent unintentional loss of historical data. In practice, if a mistake is found in an SCD2 dimension (say a bug caused wrong values in certain historical records), the typical approach is either to live with it and possibly annotate corrections, or to rebuild the dimension from scratch with corrected logic. This is why careful testing of SCD logic (and perhaps keeping a backup of the raw change logs or snapshots) is important – once history is in place, updating it is non-trivial.
Where SCD fits in the pipeline: The SCD patterns operate at the storage/serving layer – they define how the final tables in your warehouse or lake are organized to handle change over time. SCD Type 1 gives you a clean, easy-to-query current state (sacrificing history), whereas SCD Type 2 gives you a rich historical record at the cost of complexity. It’s common to use a mix of both in a single warehouse: for example, fact tables (transactions, events) are often just appended and are intrinsically historical (each fact has a timestamp and we keep them all), whereas dimension tables can be Type 1 or Type 2 depending on requirements. If an attribute is useful for analysis in its historical context (e.g. a customer’s segment or a product’s category over time), then Type 2 is warranted. If not (e.g. a typo in a name that was corrected – you don’t care to ever see the old misspelling), Type 1 is sufficient. Additionally, there’s the concept of snapshot tables or Type 2 fact tables – for instance, taking a snapshot of account balances at the end of every day. These are effectively another way to manage temporal data in storage, by creating time-versioned records of an entire table’s state at intervals. They are useful for certain kinds of analysis (like “as of daily close” comparisons) and can be a complement to or even a feeding mechanism for SCD Type 2 dimensions (as we saw with daily snapshot sources).
Finally, tying it all together along the temporal continuum: A modern data pipeline might capture raw changes from a production database via CDC (ingestion of change events), use incremental transformations to continually merge or append those changes into analytical tables (ensuring each update is processed once, efficiently), and store the results in dimensions and facts that either reflect the latest state (Type 1) or maintain full history (Type 2) as needed. For example, imagine a pipeline for customer data: a CDC stream from the app database sends every insert/update/delete of customers into the data lake; an incremental job picks up those changes and applies them, perhaps first landing them in a "customer_staging" table; and then a final SCD Type 2 model populates a historical customers_dim table with valid_from/valid_to for each change in a customer’s details. Downstream, analysts can easily query the current customer info or join facts to the customers_dim with time filters to get correct historical attributes. Meanwhile, if a late-arriving update for a customer comes through the CDC a day late, the incremental job’s lookback catches it and updates the SCD dimension accordingly (adjusting the valid_to of the prior record). If a large backfill of historical customers is needed (like if we onboard a new source with past data), we could run the incremental process in backfill mode to load those older dates, and the SCD model would simply treat them as earlier valid_from versions.
Managing temporal change in data pipelines is about end-to-end time-travel for your data: capturing changes in real-time, computing results in small time-aware increments, and storing outcomes in a temporally queryable form. By combining CDC ingestion, incremental processing, and SCD storage patterns, data engineers and architects can ensure that their pipelines are efficient, scalable, and capable of answering both “right now” questions and “as of then” questions. Each technique addresses a phase of the data’s journey – CDC for the moment of change, incremental models for the movement and transformation of data through the pipeline, and SCD for the state of data at rest over time. Together, they form a cohesive framework for handling the ever-evolving tapestry of enterprise data.
Sources: I stole the pictures from Databricks documentation. I could make them myself...but why? 😂
Head of Education & Evangelism at Tobiko
Founder, Insights x Design