Slowly changing dimension
In data management and data warehousing, a slowly changing dimension (SCD) is a dimension that stores data which, while generally stable, may change over time, often in an unpredictable manner.[1] This contrasts with a rapidly changing dimension, such as transactional parameters like customer ID, product ID, quantity, and price, which undergo frequent updates. Common examples of SCDs include geographical locations, customer details, or product attributes.
Various methodologies address the complexities of SCD management. Ralph Kimball's Data Warehouse Toolkit popularized a categorization of techniques for handling SCD attributes as Types 1 through 6.[1] These range from simple overwrites (Type 1), to creating new rows for each change (Type 2), adding new attributes (Type 3), maintaining separate history tables (Type 4), and employing hybrid approaches (Types 6 and 7). Type 0 models an attribute that does not change at all. Each type offers a trade-off between historical accuracy, data complexity, and system performance, catering to different analytical and reporting needs.
The challenge with SCDs lies in preserving historical accuracy while maintaining data integrity and referential integrity. For instance, a fact table tracking sales might be linked to a dimension table containing information about salespeople and their assigned regional offices. If a salesperson is transferred to a new office, historical sales reports need to reflect their previous assignment without breaking the relationships between the fact and dimension tables. SCDs provide mechanisms to manage such changes effectively.
Type 0: retain original
The Type 0 dimension attributes never change and are assigned to attributes that have durable values or are described as 'Original'. Examples: Date of Birth, Original Credit Score. Type 0 applies to most date dimension attributes.[2]
Type 1: overwrite
This method overwrites old data with new data, and therefore does not track historical data.
Example of a supplier table:
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA |
In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code).
If the supplier relocates the headquarters to Illinois the record would be overwritten:
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
|---|---|---|---|
| 123 | ABC | Acme Supply Co | IL |
The disadvantage of the Type 1 method is that no history is kept in the data warehouse. Its advantage, however, is that it is easy to maintain.
If one has calculated an aggregate table summarizing facts by supplier state, it will need to be recalculated when the Supplier_State is changed.[1]
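The change itself is a simple UPDATE keyed on the natural key. The following is a minimal sketch using the supplier table above:

UPDATE supplier
SET supplier_state = 'IL'
WHERE supplier_code = 'ABC';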
Type 2: add new row
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert. The natural key in these examples is the "Supplier_Code" of "ABC".
For example, if the supplier relocates to Illinois the version numbers will be incremented sequentially:
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Version |
|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA | 0 |
| 123 | ABC | Acme Supply Co | IL | 1 |
Another method is to add 'effective date' columns.
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date |
|---|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA | 2000-01-01T00:00:00 | 2004-12-22T00:00:00 |
| 124 | ABC | Acme Supply Co | IL | 2004-12-22T00:00:00 | NULL |
The start date/time of the second row is equal to the end date/time of the previous row. The null End_Date in row two indicates the current tuple version. A standardized surrogate high date (e.g. 9999-12-31) may instead be used as the end date, so that null-value substitution is not required when querying. In some database software, an artificial high date value can cause performance issues that a null value would avoid.
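In this variant, a relocation is typically applied by closing the current row and inserting a new one in the same transaction. The following is a minimal sketch using the table above; the new surrogate key (124) and the timestamp literals are illustrative and would normally come from a key generator and the ETL run:

-- Close the current version of the supplier
UPDATE supplier
SET end_date = '2004-12-22T00:00:00'
WHERE supplier_code = 'ABC'
  AND end_date IS NULL;

-- Insert the new version, valid from the relocation date onward
INSERT INTO supplier (supplier_key, supplier_code, supplier_name, supplier_state, start_date, end_date)
VALUES (124, 'ABC', 'Acme Supply Co', 'IL', '2004-12-22T00:00:00', NULL);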
And a third method uses an effective date and a current flag.
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Effective_Date | Current_Flag |
|---|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA | 2000-01-01T00:00:00 | N |
| 124 | ABC | Acme Supply Co | IL | 2004-12-22T00:00:00 | Y |
The Current_Flag value of 'Y' indicates the current tuple version.
Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slice defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by supplier state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed. To reference the entity via the natural key, the unique constraint on it must be removed, which makes referential integrity enforcement by the DBMS (database management system) impossible.
If there are retroactive changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to frequent change.[1]
Type 3: add new attribute
This method tracks changes using separate columns and preserves limited history; the amount of history is bounded by the number of columns designated for storing historical data. The table structure is the same as in Type 1 and Type 2, but Type 3 adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state; only this single piece of prior history is stored.
| Supplier_Key | Supplier_Code | Supplier_Name | Original_Supplier_State | Effective_Date | Current_Supplier_State |
|---|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA | 2004-12-22T00:00:00 | IL |
This record contains a column for the original state and one for the current state; it cannot track the changes if the supplier relocates a second time.
One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State which would track only the most recent historical change.[1]
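For the Previous_Supplier_State variation, the change shifts the outgoing current value into the history column before overwriting it. A minimal sketch for the CA-to-IL move above, assuming that column naming:

-- Shift the old current value into the history column, then overwrite it.
-- Standard SQL evaluates the right-hand sides against the row's values
-- as they were before the update (MySQL is a notable exception).
UPDATE supplier
SET previous_supplier_state = current_supplier_state,
    current_supplier_state = 'IL',
    effective_date = '2004-12-22T00:00:00'
WHERE supplier_code = 'ABC';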
Type 4: add history table
The Type 4 method is usually referred to as using "history tables", where one table keeps the current data and an additional table keeps a record of some or all changes. The surrogate keys of both tables are referenced in the fact table to enhance query performance.
For the example below, the original table name is Supplier and the history table is Supplier_History:
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
|---|---|---|---|
| 123 | ABC | Acme & Johnson Supply Co | IL |
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date |
|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA | 2003-06-14T00:00:00 |
| 123 | ABC | Acme & Johnson Supply Co | IL | 2004-12-22T00:00:00 |
This method resembles how database audit tables and change data capture techniques function.
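A Type 4 load typically records the new version in the history table and then overwrites the current table, as in Type 1 processing. A minimal sketch using the two tables above; the timestamp literal is taken from the example data:

-- Record the new version in the history table
INSERT INTO supplier_history (supplier_key, supplier_code, supplier_name, supplier_state, create_date)
VALUES (123, 'ABC', 'Acme & Johnson Supply Co', 'IL', '2004-12-22T00:00:00');

-- Overwrite the current table, as in Type 1
UPDATE supplier
SET supplier_name = 'Acme & Johnson Supply Co',
    supplier_state = 'IL'
WHERE supplier_code = 'ABC';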
Type 5
The Type 5 technique builds on the Type 4 mini-dimension by embedding a "current profile" mini-dimension key in the base dimension, where it is overwritten as a Type 1 attribute. This approach is called Type 5 because 4 + 1 equals 5. The Type 5 slowly changing dimension allows the currently assigned mini-dimension attribute values to be accessed along with the base dimension's other attributes without linking through a fact table. Logically, the base dimension and current mini-dimension profile outrigger are typically represented as a single table in the presentation layer. The outrigger attributes should have distinct column names, like "Current Income Level", to differentiate them from the attributes in the mini-dimension linked to the fact table. The ETL team must update/overwrite the Type 1 mini-dimension reference whenever the current mini-dimension changes over time. If the outrigger approach does not deliver satisfactory query performance, the mini-dimension attributes could be physically embedded (and updated) in the base dimension.[2]
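A query for the current profile can then reach the mini-dimension directly through the embedded key, without going through a fact table. A minimal sketch, assuming hypothetical customer and customer_profile tables with a current_profile_key column:

SELECT c.customer_name,
       p.income_level AS current_income_level
FROM customer c
INNER JOIN customer_profile p
  ON c.current_profile_key = p.profile_key;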
Type 6: combined approach
The Type 6 method combines the approaches of Types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term is that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido.[citation needed] Kimball calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.[1]
The Supplier table starts out with one record for our example supplier:
| Supplier_Key | Row_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|
| 123 | 1 | ABC | Acme Supply Co | CA | CA | 2000-01-01T00:00:00 | 9999-12-31T23:59:59 | Y |
The Current_State and the Historical_State are the same. The optional Current_Flag attribute indicates that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing; however, a Row_Key is included to ensure a unique key for each row:
| Supplier_Key | Row_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|
| 123 | 1 | ABC | Acme Supply Co | IL | CA | 2000-01-01T00:00:00 | 2004-12-22T00:00:00 | N |
| 123 | 2 | ABC | Acme Supply Co | IL | IL | 2004-12-22T00:00:00 | 9999-12-31T23:59:59 | Y |
We overwrite the Current_State information in the first record (Row_Key = 1) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.
For example, if the supplier were to relocate again, we would add another record to the Supplier dimension, and we would overwrite the contents of the Current_State column:
| Supplier_Key | Row_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|---|
| 123 | 1 | ABC | Acme Supply Co | NY | CA | 2000-01-01T00:00:00 | 2004-12-22T00:00:00 | N |
| 123 | 2 | ABC | Acme Supply Co | NY | IL | 2004-12-22T00:00:00 | 2008-02-04T00:00:00 | N |
| 123 | 3 | ABC | Acme Supply Co | NY | NY | 2008-02-04T00:00:00 | 9999-12-31T23:59:59 | Y |
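A Type 6 change therefore touches the dimension twice: a Type 2 insert for the new time slice, plus a Type 1 overwrite of Current_State across every row for the natural key. A minimal sketch of how the second relocation above could be applied:

-- Close the previous time slice (Type 2)
UPDATE supplier
SET end_date = '2008-02-04T00:00:00',
    current_flag = 'N'
WHERE supplier_code = 'ABC'
  AND current_flag = 'Y';

-- Insert the new time slice (Type 2)
INSERT INTO supplier (supplier_key, row_key, supplier_code, supplier_name, current_state, historical_state, start_date, end_date, current_flag)
VALUES (123, 3, 'ABC', 'Acme Supply Co', 'NY', 'NY', '2008-02-04T00:00:00', '9999-12-31T23:59:59', 'Y');

-- Overwrite the current state on all rows for the supplier (Type 1, incorporating Type 3)
UPDATE supplier
SET current_state = 'NY'
WHERE supplier_code = 'ABC';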
Type 2 / type 6 fact implementation
Type 2 surrogate key with type 3 attribute
In many Type 2 and Type 6 SCD implementations, the surrogate key from the dimension is put into the fact table in place of the natural key when the fact data is loaded into the data repository.[1] The surrogate key is selected for a given fact record based on its effective date and the Start_Date and End_Date from the dimension table. This allows the fact data to be easily joined to the correct dimension data for the corresponding effective date.
Here is the Supplier table as we created it above using Type 6 Hybrid methodology:
| Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | NY | CA | 2000-01-01T00:00:00 | 2004-12-22T00:00:00 | N |
| 124 | ABC | Acme Supply Co | NY | IL | 2004-12-22T00:00:00 | 2008-02-04T00:00:00 | N |
| 125 | ABC | Acme Supply Co | NY | NY | 2008-02-04T00:00:00 | 9999-12-31T23:59:59 | Y |
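During the fact load, the ETL process can look up the surrogate key whose time slice covers the delivery date. A minimal sketch, assuming a hypothetical staging table named delivery_staging that still carries the natural key:

INSERT INTO delivery (supplier_key, delivery_date, delivery_cost)
SELECT s.supplier_key, d.delivery_date, d.delivery_cost
FROM delivery_staging d
INNER JOIN supplier s
  ON s.supplier_code = d.supplier_code
  AND d.delivery_date >= s.start_date
  AND d.delivery_date < s.end_date;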
Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table using that key. The following SQL retrieves, for each fact record, the current supplier state and the state the supplier was located in at the time of the delivery:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.historical_state,
supplier.current_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_key = supplier.supplier_key;
Pure type 6 implementation
Having a Type 2 surrogate key for each time slice can cause problems if the dimension is subject to change.[1] A pure Type 6 implementation does not use this, but uses a surrogate key for each master data item (e.g. each unique supplier has a single surrogate key). This avoids any changes in the master data having an impact on the existing transaction data. It also allows more options when querying the transactions.
Here is the Supplier table using the pure Type 6 methodology:
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date |
|---|---|---|---|---|---|
| 456 | ABC | Acme Supply Co | CA | 2000-01-01T00:00:00 | 2004-12-22T00:00:00 |
| 456 | ABC | Acme Supply Co | IL | 2004-12-22T00:00:00 | 2008-02-04T00:00:00 |
| 456 | ABC | Acme Supply Co | NY | 2008-02-04T00:00:00 | 9999-12-31T23:59:59 |
The following example shows how the query must be extended to ensure a single supplier record is retrieved for each transaction.
SELECT
supplier.supplier_code,
supplier.supplier_state
FROM supplier
INNER JOIN delivery
ON supplier.supplier_key = delivery.supplier_key
AND delivery.delivery_date >= supplier.start_date AND delivery.delivery_date < supplier.end_date;
A fact record with an effective date (Delivery_Date) of August 9, 2001 will be linked to Supplier_Code of ABC, with a Supplier_State of 'CA'. A fact record with an effective date of October 11, 2007 will also be linked to the same Supplier_Code ABC, but with a Supplier_State of 'IL'.
While more complex, there are a number of advantages of this approach, including:
- DBMS-enforced referential integrity is now possible, but Supplier_Code cannot be used as a foreign key on, say, a Product table, and using Supplier_Key as the foreign key would tie each product to a specific time slice.
- If there is more than one date on the fact (e.g. Order_Date, Delivery_Date, Invoice_Payment_Date) one can choose which date to use for a query.
- You can do "as at now", "as at transaction time" or "as at a point in time" queries by changing the date filter logic.
- You don't need to reprocess the fact table if there is a change in the dimension table (e.g. adding additional fields retrospectively which change the time slices, or correcting mistakes in the dimension table's dates).
- You can introduce bi-temporal dates in the dimension table.
- You can join the fact to the multiple versions of the dimension table to allow reporting of the same information with different effective dates, in the same query.
The following example shows how a specific date such as '2012-01-01T00:00:00' (which could be the current datetime) can be used.
SELECT
supplier.supplier_code,
supplier.supplier_state
FROM supplier
INNER JOIN delivery
ON supplier.supplier_key = delivery.supplier_key
AND supplier.start_date <= '2012-01-01T00:00:00' AND supplier.end_date > '2012-01-01T00:00:00';
Type 7: Hybrid - Both surrogate and natural key
Source:[3]
An alternative implementation is to place both the surrogate key and the natural key into the fact table.[4] This allows the user to select the appropriate dimension records based on:
- the primary effective date on the fact record (above),
- the most recent or current information,
- any other date associated with the fact record.
This method allows more flexible links to the dimension, even if one has used the Type 2 approach instead of Type 6.
Here is the Supplier table as we might have created it using Type 2 methodology:
| Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date | Current_Flag |
|---|---|---|---|---|---|---|
| 123 | ABC | Acme Supply Co | CA | 2000-01-01T00:00:00 | 2004-12-22T00:00:00 | N |
| 124 | ABC | Acme Supply Co | IL | 2004-12-22T00:00:00 | 2008-02-04T00:00:00 | N |
| 125 | ABC | Acme Supply Co | NY | 2008-02-04T00:00:00 | 9999-12-31T23:59:59 | Y |
To get current records:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.supplier_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_code = supplier.supplier_code
WHERE supplier.current_flag = 'Y';
To get history records:
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.supplier_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_key = supplier.supplier_key;
To get history records based on a specific date (if more than one date exists in the fact table):
SELECT
delivery.delivery_cost,
supplier.supplier_name,
supplier.supplier_state
FROM delivery
INNER JOIN supplier
ON delivery.supplier_code = supplier.supplier_code
AND delivery.delivery_date BETWEEN supplier.start_date AND supplier.end_date;
Some cautions:
- Referential integrity by DBMS is not possible since there is not a unique key to create the relationship.
- If the relationship is instead made using the surrogate key to solve the problem above, then each entity ends up tied to a specific time slice.
- If the join query is not written correctly, it may return duplicate rows and/or give incorrect answers.
- The date comparison might not perform well.
- Some business intelligence tools do not handle generating complex joins well.
- The ETL processes needed to create the dimension table need to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.
Combining types
Different SCD Types can be applied to different columns of a table. For example, we can apply Type 1 to the Supplier_Name column and Type 2 to the Supplier_State column of the same table.
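In that case the load logic applies each rule to its own column. A minimal sketch combining the two, assuming the Type 2 layout with Start_Date, End_Date and Current_Flag from above; the new surrogate key (126), the 'TX' state, and the dates are hypothetical illustration values:

-- Type 1: correct the name on every version of the supplier
UPDATE supplier
SET supplier_name = 'Acme Supply Company'
WHERE supplier_code = 'ABC';

-- Type 2: a state change closes the current row and adds a new one
UPDATE supplier
SET end_date = '2012-06-01T00:00:00',
    current_flag = 'N'
WHERE supplier_code = 'ABC'
  AND current_flag = 'Y';

INSERT INTO supplier (supplier_key, supplier_code, supplier_name, supplier_state, start_date, end_date, current_flag)
VALUES (126, 'ABC', 'Acme Supply Company', 'TX', '2012-06-01T00:00:00', '9999-12-31T23:59:59', 'Y');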
References
[edit]- ^ a b c d e f g h Kimball, Ralph; Ross, Margy. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling.
- ^ a b "Design Tip #152 Slowly Changing Dimension Types 0, 4, 5, 6 and 7". 5 February 2013.
- ^ Kimball, Ralph; Ross, Margy (July 1, 2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition. John Wiley & Sons, Inc. p. 122. ISBN 978-1-118-53080-1.
- ^ Ross, Margy; Kimball, Ralph (March 1, 2005). "Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3". Intelligent Enterprise.
Slowly changing dimension
Introduction
Definition and Purpose
A slowly changing dimension (SCD) is a data warehousing technique designed to manage changes in dimension attributes that occur infrequently and unpredictably over time, in contrast to the more frequent updates typical of fact tables that capture transactional metrics.[6] These dimensions represent descriptive entities, such as customers or products, whose attributes evolve gradually rather than rapidly, requiring special handling to preserve data integrity across analytical queries.[7] The primary purpose of SCD is to enable accurate historical analysis and business intelligence by maintaining multiple versions of dimension data, thereby avoiding distortions in past reports when attributes change.[6] This approach supports trend analysis, compliance reporting, and decision-making by ensuring that metrics from prior periods reflect the dimension states valid at that time, without retroactive alterations that could skew insights.[4] In dimension tables of a data warehouse's star schema, SCD techniques thus balance current accuracy with historical fidelity.[7]
For instance, consider a sales database where a customer's address changes from New York to Los Angeles; without SCD, updating the address could incorrectly attribute past sales to the new location, inflating regional revenue figures retroactively and misleading trend analysis.[4] By preserving the original address alongside the update, SCD prevents such errors, allowing queries to correctly associate historical transactions with the appropriate geographic context.[6] Common scenarios for SCD include employee roles, where promotions or transfers occur sporadically but affect compensation or performance reporting over time, and product categories, which may shift due to reclassifications impacting inventory or sales categorization without frequent revisions.[8][9] These cases highlight how SCD addresses infrequent yet impactful changes to support reliable longitudinal analysis in business intelligence systems.[6]
Historical Development
The concept of slowly changing dimensions (SCDs) emerged in the 1990s as a key element of dimensional modeling in data warehousing, pioneered by Ralph Kimball to address how dimension attributes evolve over time without frequent updates.[2] Kimball formalized the initial approaches in his 1996 book The Data Warehouse Toolkit, introducing Type 1 (overwrite), Type 2 (versioning), and Type 3 (limited history) methods as practical solutions for maintaining historical accuracy in star schema designs.[1] This framework quickly became foundational for business intelligence applications, emphasizing the need to balance current and historical data integrity.[10]
In the early 2000s, SCD techniques gained widespread adoption through integration into extract, transform, and load (ETL) tools, facilitating automated implementation in enterprise environments. Informatica PowerCenter, a leading ETL platform since the late 1990s, supports SCD through mapping templates.[11] Similarly, Microsoft's SQL Server Integration Services (SSIS), introduced in 2005, featured a dedicated Slowly Changing Dimension transformation component, streamlining Type 1 and Type 2 processing for data warehouse pipelines.[12] These tools accelerated SCD deployment across industries, reducing manual coding for historical data management.[13]
The evolution of SCD concepts continued into the 2010s with expansions beyond the original types, including Type 0 (fixed), Type 4 (mini-dimensions), and hybrids like Types 5, 6, and 7, as refined by Kimball and industry practitioners to handle more complex scenarios such as frequently changing attributes.[2] A significant milestone occurred in 2011 with the SQL:2011 standard (ISO/IEC 9075-2), which introduced temporal table specifications supporting period-based validity and system-versioning, directly influencing SCD patterns by providing native database mechanisms for bitemporal data tracking.[14] By 2025, SCD practices have integrated seamlessly with cloud data warehouses, leveraging features in platforms like Snowflake and BigQuery, such as MERGE operations and automated change data capture, to scale historical versioning for big data volumes.[15]
Core Concepts
Dimension Tables in Data Warehousing
Dimension tables serve as the foundational descriptive components in dimensional modeling, particularly within star and snowflake schemas of data warehouses. These tables store non-numeric attributes that provide context to the quantitative measures held in fact tables, such as customer demographics, product categories, or geographic locations.[16][17] In a star schema, dimension tables radiate from a central fact table, forming a structure that simplifies querying by denormalizing data to minimize joins while preserving business meaning.[18]
In online analytical processing (OLAP) environments, dimension tables enable key operations like slicing, dicing, and drill-down, allowing analysts to explore data from multiple perspectives without complex restructuring. Slicing reduces the dataset to a single dimension value, such as sales for a specific region; dicing selects sub-cubes across multiple dimensions, like quarterly sales by product line in certain geographies; and drill-down navigates hierarchies within a dimension, such as from yearly to monthly sales data.[19][20] This dimensionality supports efficient aggregation and analysis against fact tables, which contain the core metrics like sales amounts or quantities.[21]
Key components of dimension tables include natural keys, which are business identifiers like customer IDs or product codes that uniquely identify entities in source systems; surrogate keys, which are system-generated integers (often starting from 1) that ensure referential integrity and handle versioning without relying on potentially unstable natural keys; and hierarchies, such as time (year-quarter-month) or geography (country-region-city), that organize attributes for navigational queries.[22][7] For example, a customer dimension table might include columns for surrogate key, natural customer ID, name, address, status, and join date, linked via foreign keys to a sales fact table that records transaction details like amount and quantity.[23][17]
Unlike fact tables, which capture volatile, additive measures that change frequently with each business event, dimension tables require support for versioning to maintain historical accuracy in reporting, as their attributes evolve slowly over time but impact the interpretation of past facts.[4][7] This prerequisite arises because preserving dimension history ensures consistent analysis of trends, such as how a customer's segment changes affect prior sales evaluations, without altering immutable fact records.[4]
Challenges of Slowly Changing Data
Slowly changing data refers to dimension attributes in a data warehouse that update infrequently, such as customer addresses or product categories that might change annually or less often, yet require precise historical tracking to support accurate analytics and reporting.[6] These attributes contrast with rapidly changing or transactional data, as their sporadic evolution demands methods to preserve past states without disrupting ongoing queries. Failure to handle such changes appropriately can undermine the integrity of dimensional models, where dimensions serve as stable descriptors for facts.[24]
A primary challenge arises from the potential loss of historical context when dimension attributes are simply overwritten during updates, leading to distorted retrospective analyses; for instance, sales reports tied to a customer's former location would incorrectly reflect current data, misrepresenting past trends.[6] This issue is compounded by performance overhead in maintaining multiple versions of dimension rows, which can inflate table sizes and slow query execution, particularly in environments with precomputed aggregates that require frequent recomputation.[6] Additionally, key management becomes problematic, as duplicate natural keys may emerge from versioning attempts, complicating joins between fact and dimension tables and risking data inconsistencies across distributed systems.[25]
From a business perspective, these challenges result in inaccurate trend analysis, where historical queries yield unreliable insights, potentially misleading strategic decisions in areas like marketing or inventory planning. Regulatory non-compliance is another critical impact, especially in financial or audit-heavy sectors, where overwriting historical data violates requirements for verifiable audit trails and data provenance.[6] Extract, transform, and load (ETL) processes also face heightened complexity, as detecting subtle, infrequent changes demands robust mechanisms to avoid missing updates or introducing errors.
Detection of slowly changing data typically involves comparing source data against the target dimension table using techniques like hashing entire rows for quick change identification or leveraging timestamps to flag modifications since the last load, as sketched below.[9] These methods enable efficient ETL pipelines but require careful tuning to handle large volumes without excessive computational load. Balancing these elements involves inherent trade-offs: preserving granular historical data enhances accuracy but escalates storage costs and query complexity, while simplifying updates improves performance at the expense of analytical depth.[26] In practice, organizations must weigh these factors against specific use cases, such as compliance needs versus operational speed, to maintain a viable data warehouse architecture.
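As an illustration of hash-based detection, the comparison can be pushed into SQL itself. A minimal sketch, assuming hypothetical stg_customer and dim_customer tables, an is_current flag, and an MD5 function with || string concatenation as provided by many SQL dialects (nullable columns would need COALESCE in practice):

-- Rows whose attribute hash differs from the current dimension row have changed
SELECT stg.customer_id
FROM stg_customer stg
INNER JOIN dim_customer dim
  ON dim.customer_id = stg.customer_id
  AND dim.is_current = 'Y'
WHERE MD5(stg.name || '|' || stg.address || '|' || stg.segment)
   <> MD5(dim.name || '|' || dim.address || '|' || dim.segment);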
SCD Implementation Types
Type 0: Fixed Dimensions
Type 0 slowly changing dimensions, also known as fixed or immutable dimensions, treat dimension attributes as unchanging after their initial population, retaining the original values indefinitely without any updates or versioning.[2][27] In this approach, facts are always associated with the original attribute values, ensuring consistency for static data elements in the data warehouse.[2]
This type is particularly suited for static reference data where changes are either nonexistent or extremely rare, such as currency codes, country lists, fixed tax rates, customer birth dates, product launch dates, or durable identifiers like original credit scores.[2][4] Date dimensions, which include attributes like day of the week or month names, are commonly implemented as Type 0 since they remain constant over time.[2] If modifications are needed, they are typically managed through a full reload of the dimension rather than incremental updates.[3]
Implementation of Type 0 dimensions is straightforward, involving no change detection or versioning mechanisms; natural keys are used directly to link facts to the dimension, and ETL processes simply insert new rows for any additions while ignoring potential updates to existing ones, as sketched below.[3] For instance, a dimension table for measurement units, such as kilograms (kg) or pounds (lb), would load initial values once and never alter them, providing a stable lookup for fact records involving weights.[4]
The primary advantages include minimal storage requirements due to the absence of historical rows and optimal query performance from simplified joins without surrogate keys or version checks.[4] However, this rigidity poses disadvantages, as any required changes demand a complete rebuild of the dimension and potentially affected fact tables, making it unsuitable for data prone to even infrequent modifications.[3][27]
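A minimal sketch of the insert-only load just described, using hypothetical dim_unit and stg_unit tables matching the measurement-unit example:

-- Type 0: insert units not yet present; existing rows are never updated
INSERT INTO dim_unit (unit_code, unit_name)
SELECT s.unit_code, s.unit_name
FROM stg_unit s
WHERE NOT EXISTS (
  SELECT 1 FROM dim_unit d WHERE d.unit_code = s.unit_code
);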
Type 1: Overwrite Updates
Type 1 slowly changing dimensions involve updating existing rows in the dimension table by overwriting changed attributes with the most current values, thereby retaining only the latest state of the data without preserving any historical versions.[28][29] This approach treats the dimension as a snapshot of current reality, discarding prior attribute values to reflect ongoing corrections or minor updates.[25]
This method is particularly suited for non-critical attributes where historical tracking is unnecessary, such as minor corrections to product names (e.g., fixing spelling errors) or supplementary customer details like email addresses or phone numbers that do not impact analytical integrity over time.[29][25] It is commonly applied in scenarios prioritizing the current data profile for reporting, such as real-time dashboards or predictive models that rely solely on up-to-date information.[29]
Implementation typically occurs within an extract, transform, and load (ETL) process, where changes are detected through full or incremental data loads from source systems.[29] The ETL logic compares incoming records against existing dimension rows using natural keys to identify matches, then performs upsert operations, updating matching rows with new values or inserting entirely new rows for unmatched records, without adding version indicators or timestamps.[29] This straightforward update mechanism ensures the dimension table remains aligned with the latest source data.[28]
The primary advantages of Type 1 include its simplicity, as it requires minimal schema changes or additional storage, making it space-efficient and quick to process for current-state queries.[28][29] It also facilitates fast performance in reporting environments focused on the present, avoiding the overhead of managing historical rows.[28] However, a key disadvantage is the permanent loss of historical data, which can distort retrospective analytics by retroactively applying current values to past facts.[28][25] For instance, if a customer's address is overwritten, prior sales reports linked to the old address would inaccurately reflect the new location, potentially skewing geographic trend analysis.[28] This necessitates recomputation of affected aggregates or OLAP structures whenever updates occur, adding maintenance complexity.[28]
A practical example is managing a customer's contact information in a retail dimension table: if the phone number changes from (555) 123-4567 to (555) 987-6543, the ETL process overwrites the existing entry, ensuring all subsequent queries use the updated number while rendering the previous one irretrievable.[29][25]
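The upsert just described is often expressed as a single MERGE statement where the dialect supports it. A minimal sketch with hypothetical staging and dimension tables:

MERGE INTO dim_customer dim
USING stg_customer stg
  ON (dim.customer_id = stg.customer_id)
WHEN MATCHED THEN
  -- Type 1: overwrite the changed attribute in place
  UPDATE SET dim.phone_number = stg.phone_number
WHEN NOT MATCHED THEN
  -- New natural keys become new dimension rows
  INSERT (customer_id, phone_number)
  VALUES (stg.customer_id, stg.phone_number);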
Type 2: New Row Versioning
Type 2 slowly changing dimensions (SCD Type 2) maintain a complete historical record of changes by inserting a new row into the dimension table each time an attribute value updates, while preserving all previous versions of the data. This approach uses a surrogate key as the primary identifier to distinguish between versions, rather than relying solely on the natural business key, which remains unchanged across rows.[30]
To track the timeline of changes, SCD Type 2 implementations typically include additional attributes such as an effective date (marking when the version becomes active), an expiration date (indicating when it is superseded, often set to a distant future date like 9999-12-31 for current records), and a current row indicator (a flag, such as 'Y' or 'N', to identify the active version). These attributes enable precise point-in-time queries by filtering on dates or flags.[30][17]
SCD Type 2 is particularly suited for use cases requiring accurate historical analysis, such as tracking employee department assignments over time or shifts in customer segmentation that impact reporting. For instance, in sales analytics, it ensures that past transactions reflect the customer's status at the time of the sale, avoiding distortions from later changes.[30][31]
Implementation involves extract, transform, and load (ETL) processes that detect changes in source data by comparing against existing dimension rows using the natural key. Upon detecting an update, the ETL generates a new surrogate key, inserts a row with the updated attributes, and sets the effective date to the current timestamp while updating the prior row's expiration date to match. In the Kimball-recommended approach, during the ETL process for loading fact tables, the dimension version active at the time of the fact event is determined, and its surrogate key is loaded into the fact table as the foreign key. This enables simple equality joins on the surrogate key alone for historical accuracy, without needing date-range filters in queries. An alternative approach uses joins on the natural key combined with date-range checks on effective/expiration dates (e.g., fact_date >= start_date AND fact_date < end_date), but the surrogate key method is preferred for performance and simplicity.[30][17]
The primary advantages of SCD Type 2 include providing a full audit trail for compliance and enabling accurate historical and trend reporting without data loss. It supports complex analyses, such as year-over-year comparisons that account for attribute evolution.[30][31] However, disadvantages encompass significant table growth due to row proliferation, which can strain storage and performance, as well as increased query complexity from mandatory date-based filtering to retrieve current or historical views. Managing numerous versions also heightens maintenance overhead in ETL pipelines.[30][31]
As an example, consider a product dimension where an item's price changes from $10 to $15 on January 1, 2024. The original row retains its surrogate key, effective date (e.g., 2023-01-01), and updated expiration date (2024-01-01), while a new row is added with a fresh surrogate key, the updated price, effective date (2024-01-01), and expiration date (9999-12-31). This allows sales reports from 2023 to use the $10 price accurately.[30][17]
Type 3: Additional Attribute Columns
Type 3 slowly changing dimensions (SCD) extend the dimension table by adding new columns to capture a limited historical state, typically storing both the current value and one previous value of a changing attribute alongside the natural key. This method preserves the prior attribute value in a dedicated column while updating the primary attribute with the new value, enabling limited tracking without creating new rows.[32][31]
This approach is suitable for scenarios where only recent history is required, such as tracking dual statuses in human resources systems, like an employee's current and former manager, or monitoring basic migrations in customer data without needing extensive versioning.[32][7] Implementation involves detecting changes in the source data using the natural key for row identification; upon detection, the existing current value is shifted to the previous-value column, and the new value populates the current column, avoiding the need for surrogate keys or timestamps in this limited context.[31][33]
Advantages include compact storage that avoids row proliferation, simpler query logic for comparing current and immediate prior states, and reduced complexity compared to full versioning methods.[32][7] Disadvantages encompass its restriction to tracking only one level of history, necessitating schema modifications to accommodate additional attributes, and its unsuitability for dimensions requiring deeper historical analysis.[31][33]
For example, in a customer dimension table, columns such as customer_id (natural key), current_address, and previous_address allow tracking a single address change: if a customer moves, the old address shifts to previous_address while the new one updates current_address, facilitating queries on recent relocation patterns without expanding the table row count.[33][32]
| Customer ID | Current Address | Previous Address |
|---|---|---|
| 123 | 456 New St | 789 Old Ave |
| 456 | 101 Main St | (null) |
Type 4: Mini-Dimension History
Type 4 slowly changing dimensions address rapidly changing attributes by creating a separate mini-dimension table to store combinations of those volatile attributes, while the main dimension holds stable attributes.[2] This keeps the primary dimension compact and the mini-dimension manageable in size, with surrogate keys linking both to the fact table for comprehensive point-in-time reporting.[4] The approach is suited for dimensions with a subset of attributes that change frequently, such as customer demographics or product specifications in retail analytics.[34]
In implementation, the main dimension uses a surrogate key and includes a foreign key to the mini-dimension surrogate key. When volatile attributes change, a new row is added to the mini-dimension with the updated values and an effective date, while the main dimension row is updated to reference the new mini-dimension surrogate key (Type 1 style). The fact table captures both surrogate keys at the time of the transaction, enabling historical reconstruction via joins. ETL processes manage the mini-dimension as a Type 2 SCD for versioning, ensuring low cardinality in the mini-table.[2][34]
Advantages include controlled growth in the main dimension, improved query performance by isolating changes, and efficient handling of high-velocity attributes without bloating the core schema. It supports detailed auditing through the mini-dimension's history. However, it introduces additional join complexity in queries and requires careful design to partition attributes appropriately, with potential ETL overhead for maintaining links. Data consistency risks arise if mini-dimension updates lag.[4][34]
For example, in a customer dimension, stable attributes like name and ID stay in the main table, while changing demographics (age bracket, income range) go to a mini-dimension. A fact table row for a purchase references main surrogate 1001 and mini surrogate 2001 (low income, 30-40 age). If demographics change, a new mini row 2002 is added and the main dimension's reference is updated to 2002, but historical facts retain 2001 for accurate past analysis, as the query sketch below shows.[34][4]
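Because each fact row carries both surrogate keys, a historical query joins through both tables. A minimal sketch with hypothetical table and column names matching the example:

SELECT c.customer_name,
       m.age_bracket,
       m.income_range,
       f.purchase_amount
FROM fact_purchase f
INNER JOIN dim_customer c
  ON f.customer_key = c.customer_key
INNER JOIN dim_customer_demographics m
  ON f.demographics_key = m.demographics_key;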
Type 5: Hybrid Dimension Mapping
Type 5 slowly changing dimensions extend Type 4 by incorporating the current mini-dimension surrogate key as a Type 1 attribute directly in the base (main) dimension table, allowing quick access to the current profile without always joining to the mini-dimension.[2] This hybrid enables both historical accuracy via the mini-dimension and current-state efficiency in the base dimension. It is applicable in environments needing frequent current lookups alongside occasional historical reconstruction, such as customer profiling in marketing analytics.[5][4]
Implementation maintains the mini-dimension as in Type 4 for full history, but the base dimension's mini-key attribute is overwritten (Type 1) with the latest surrogate key upon changes. Fact tables still reference both main and mini surrogates for history, but current queries can use the base dimension's embedded mini-key to fetch the active profile directly. ETL updates involve inserting new mini rows for changes and upserting the base dimension's mini-key. In modern cloud warehouses like Snowflake (as of 2025), change data capture (CDC) streamlines these updates.[2][5]
Advantages include optimized performance for current reporting by avoiding joins, while preserving Type 4's historical depth, and flexibility in query patterns. It reduces latency in BI tools for real-time dashboards. Disadvantages involve added ETL complexity for dual maintenance and potential inconsistency if base updates fail after the mini-dimension insert. Storage is moderate, but schema design requires foresight on attribute volatility.[2][35]
A practical example is an employee dimension where stable role data is in the base dimension and volatile skills are in a mini-dimension. The base embeds the current skills mini-key (e.g., 3001). When skills update, a new mini row 3002 is created and the base mini-key is overwritten to 3002; historical facts use original mini-keys for tenure-based skill analysis.[4]
Type 6: Dual-Row with Attributes
Type 6 slowly changing dimensions build on Type 2 by adding Type 1 attributes for the current values of changing attributes, which are overwritten across all historical rows sharing the natural key, combining versioning with current-state accessibility.[36] This allows facts to access both as-was (historical columns) and as-is (current columns) without complex filtering. It suits compliance-heavy scenarios like financial services, where both audit trails and up-to-date reporting are required.[2][37]
Implementation extends the Type 2 schema with current-value columns for key attributes (e.g., current_address alongside historical_address). Upon change detection in ETL, a new Type 2 row is inserted with the updated historical value and effective/expiration dates, and then the current-value columns are updated (Type 1 overwrite) in all rows for that natural key. This ensures every version reflects the latest current state. In cloud environments like BigQuery (as of 2025), materialized views can optimize these updates for performance. Fact tables join on surrogate keys, with optional current column access.[36][3]
Advantages offer hybrid reporting flexibility, historical via versioned rows and current via overwritten columns, supporting trend analysis and compliance without extra joins. It balances storage between Type 2 growth and Type 1 efficiency. Disadvantages include high ETL overhead from batch updates to historical rows, risking performance in large dimensions, and schema bloat from dual columns. It is not ideal for very frequent changes due to rewrite costs.[36][38]
A representative example is an employee dimension with department changes. Initial row: surrogate 1001, historical_department "Sales", current_department "Sales", effective 2024-01-01, is_current true. On promotion to "Marketing" (2025-01-01), insert a new row with surrogate 1002, historical_department "Marketing", effective 2025-01-01, is_current true; then update current_department to "Marketing" in both rows 1001 and 1002. Queries for the current role use current_department across versions, while historical queries use effective date filters on historical_department.[36][3]
| Aspect | Original Row (Pre-Change) | New Row (Post-Change) | Updated Rows (Current Columns) |
|---|---|---|---|
| Surrogate Key | 1001 | 1002 | 1001 & 1002 (updated) |
| Historical Department | Sales | Marketing | Sales (1001), Marketing (1002) |
| Current Department | Sales | Marketing | Marketing (both) |
| Effective Date | 2024-01-01 | 2025-01-01 | Unchanged |
| Is Current | true | true (after insert) | true (1002), false (1001) |
Type 7: Surrogate-Natural Key Hybrid
Type 7 slowly changing dimensions support both Type 1 current and Type 2 historical reporting using a single dimension table with surrogate keys for versions and a durable natural key linking all versions of an entity.[39] Fact tables include both the surrogate key (for history) and the natural key (for current), enabling dual perspectives via views. This is ideal for systems balancing operational current queries with analytical history, such as inventory tracking with product SKUs as natural keys.[21]
Implementation features the dimension with a surrogate primary key, natural key, effective/expiration dates, and a current flag. Changes add new rows with an updated surrogate, adjusting the prior row's dates and flag, while the natural key persists. Fact tables store both keys at transaction time. Current views join fact.natural_key = dim.natural_key AND dim.current = true; historical views join fact.surrogate = dim.surrogate. ETL handles versioning as Type 2, with natural key integrity. As of 2025, this integrates well with CDC in platforms like Azure Synapse for late-arriving facts.[39][21]
Advantages provide seamless as-is/as-was access without overwriting, enhancing usability and performance via direct natural key joins for current data, while surrogates ensure history. It simplifies late-arriving fact integration. Disadvantages include fact table bloat from dual keys, increased ETL for key management, and query/view maintenance to avoid ambiguity. Referential constraints need care on natural keys.[39][21]
For example, customer natural key "CUST123" starts with surrogate 1001, address "123 Main St.", effective 2025-01-01, current true. On a change to "456 Oak Ave." (2025-07-01), add surrogate 1002 with the same natural key, the new address, effective 2025-07-01, current true; update row 1001's expiration to 2025-06-30, current false. Facts store both keys: early 2025 facts have surrogate 1001, natural "CUST123"; current queries join on the natural key plus current flag for the latest address.[21][39]
Advanced Techniques
Fact Table Integration Strategies
In data warehousing, fact tables integrate with slowly changing dimensions (SCDs) by referencing surrogate keys from the dimension tables, ensuring that each fact record links to the appropriate historical or current version of the dimension at the time the fact was recorded. This approach maintains data integrity for analytical queries, where changes in dimension attributes are handled through mechanisms like effective dating to align facts with the correct dimension state without altering historical attributions.[40] Every join between fact and dimension tables uses surrogate keys rather than natural keys, as surrogate keys provide stability, performance, and insulation from operational system changes.[41]
For SCD Type 2, which preserves full history via new rows for each change, integration relies on the fact table storing the surrogate key of the specific dimension version active during the fact event. During ETL, the process determines the correct surrogate key by matching the fact's event timestamp to the dimension row's effective dates (or current flag for the active version at that time), then loads this surrogate as the foreign key in the fact table. Queries then join simply on this surrogate key, delivering the dimension attributes as they existed at the time of the fact without requiring additional date-range filters in the query itself. This method supports historical accuracy and improves query performance by enabling efficient equality joins on integer keys rather than complex range conditions.[21][42]
An alternative approach, common in some zipper table implementations, stores the natural/business key in the fact table and joins on the natural key while ensuring the fact event timestamp falls within the dimension row's effective start and end dates (e.g., fact_date >= start_date AND fact_date < end_date, or using BETWEEN). Current records are handled with end_date as NULL or a far-future date. While functional, this method can introduce performance overhead from range scans on large tables and more complex query logic.[43]
SCD Type 6, a hybrid combining Type 1 (overwrite for current values), Type 2 (versioning for history), and Type 3 (previous attribute columns), integrates similarly to Type 2 but enhances query efficiency by storing both current and prior attribute values in the same row.[2] Fact tables reference the surrogate key, and joins leverage the Type 3 columns for direct access to previous states without additional subqueries, reducing complexity in scenarios requiring both current and historical attributions.[3] This structure allows for enriched reporting, such as comparing past and present dimension attributes in a single join operation.[2]
A key challenge in SCD-fact integration is handling late-arriving facts, where fact data for past events arrives after the dimensional context has changed since the event date, potentially leading to linkage with an incorrect surrogate key if using the current version. The recommended approach is to search the dimension tables using the fact's natural keys and event date to retrieve the surrogate keys that were effective at the time of the event.[44] Another approach involves assigning the current surrogate key to late facts for approximate matching, though this may introduce minor inaccuracies in historical reporting.
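The date-qualified lookup for late-arriving facts can be sketched as follows, assuming hypothetical late_sales_staging and dim_customer tables with effective_date and expiration_date columns:

-- Assign each late fact the surrogate key that was in effect on its event date
INSERT INTO fact_sales (customer_key, sale_date, sale_amount)
SELECT d.customer_key, s.sale_date, s.sale_amount
FROM late_sales_staging s
INNER JOIN dim_customer d
  ON d.customer_id = s.customer_id
  AND s.sale_date >= d.effective_date
  AND s.sale_date < d.expiration_date;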
Best practices for associating SCD Type 2 (zipper) tables with fact tables follow Kimball dimensional modeling principles. The preferred method uses surrogate keys in the dimension table; during ETL, load the correct surrogate key (matching the fact event time) into the fact table as a foreign key. This enables simple equality joins on the surrogate key for historical accuracy without date-range complexity in queries, improving performance and simplicity. The alternative date-range approach (joining on natural key with timestamp within start/end dates) is viable but less preferred due to potential performance issues with large joins. Encapsulate any complex join logic in views to simplify user queries. Educate users on distinguishing historical versus current views (using current flags or date filters where appropriate). Ensure ETL processes correctly assign keys or dates for late-arriving facts to maintain accuracy. Indexing on surrogate keys, effective dates, and natural keys in both fact and dimension tables optimizes join performance.[21][41]
Additionally, considerations for slowly changing facts—such as retroactive adjustments to metrics like sales totals—often involve additive correction records in the fact table rather than direct updates, preserving auditability while aligning with dimension changes.[45]
For example, in a sales fact table tracking revenue by customer, integrating with a Type 2 customer dimension involves storing the customer's surrogate key at the sale date; a query for regional revenue over time joins on this key to accurately attribute sales to the customer's region at the time of sale, avoiding distortions from subsequent address changes.[25]
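A minimal sketch of such a query, with hypothetical fact_sales and dim_customer names; because each fact row stores the surrogate key valid at the time of sale, the join needs no date filter:

SELECT d.region,
       SUM(f.revenue) AS total_revenue
FROM fact_sales f
INNER JOIN dim_customer d
  ON f.customer_key = d.customer_key
GROUP BY d.region;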
