What is the ‘Data Vault’?
The Data Vault is a database modeling technique that was devised by Dan Linstedt in the 1990s, and released for widespread use in 2000, to provide a cheaper and faster way to meet the data needs of ever-growing and changing businesses. In his own words, Dan Linstedt describes the Data Vault in his book, “Building a Scalable Data Warehouse with Data Vault 2.0,” as:
“…a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.” [a]
To understand how this model provides such benefits, we have to unpack the Data Vault itself and understand its guiding principles and the entities that make up its architecture. This model is built upon the concept of “all the data, all of the time,” [b] meaning that all data loads regardless of data quality, and no data deletes once it’s been loaded. Designed to mimic naturally occurring networks, the Data Vault is composed of three primary entity types: Hubs, Links and Satellites. Although there can be variations of each entity type, their primary functions do not change, and they are as follows:
Hub: Separates business keys from the rest of the model
Link: Stores relationships between business keys (and/or hubs)
Satellites: Store the context/attributes of a business key or relationship [c]
Hubs are the main pillars around which Links and Satellites get built within the model. A visual representation of their relationship to one another, showing the integration of sales data from the fictional WideWorld Importers and AdventureWorks databases, can be seen in Figure 1 below.
In the example above, the Product, Invoice, Customer and SalesPerson hubs are all connected by the SaleLine Link. To understand what that connection means, and what each of these entities contains, let’s take a closer look at each entity type and the data stored within it.
Hubs can be built only after a thorough understanding of the business has been attained because their main function is to store business keys. Business keys are the keys that are supplied by users to identify, track, and locate information, such as a customer number, invoice number, or product number. In some instances, business keys may be composite keys, derived from a combination of objects such as a product number and a version code. Business keys should be unique and all keys within the same hub should be at the same level of granularity. Because hubs, and consequently business keys, are such a pivotal part of building a Data Vault, Dan Linstedt devotes considerable real estate in his book to the appropriate selection of business keys. For our purposes, it is enough to simply understand their role, but if you are interested in finding out more, I would highly recommend consulting his book that was referred to in the opening of this article.
In addition to business keys, Hubs also contain a few other standard columns as described below:
The HashKey for each table is generated from the Business Key and is used to reference the business object within other entities. Hash Keys improve performance and should not be exposed to business users. The date a record is loaded is tracked as the LoadDateTime and is used to maintain historical auditability and allow for quick detection of errors if an entire batch was loaded incorrectly. The Record Source is entered as a hard-coded value and allows for auditability to track the lineage of data. The Record Source should use the lowest level of granularity possible for optimal traceability back to the source.Table 1 Hub Requirements
Additionally, an optional LastSeenDate column can be used with systems that do not use Change Data Capture (CDC) in order to determine when a record can be considered “Deleted.” This value will be updated each time a business key is seen on a data load, and it will be up to the business to determine their threshold for a “Deleted” record. (Note: I’ve put “Deleted” in quotes because these records are not actually deleted from the system – instead they are considered to be obsolete by referencing the last seen date).
Links are used to show the relationships between business keys (hubs), and their level of granularity is determined by the number of hubs they connect. The goal of a link is not to represent the timeline or active status of a relationship, as that relates more to context, and for that reason, links should not contain begin or end dates. Links provide the scalability and flexibility to the Data Vault model because they can easily be added or modified to adjust to changing business needs. The time it takes for IT to respond to business changes is drastically reduced in comparison with how they would respond to changes in other models.
The columns contained in a standard Link are as follows:
Within a Link table, the HashKey is generated from a concatenation of the Business Keys in the hubs it is linking together. It is optional to include a Business Key value for the Link table itself, which would simply be the concatenated values used to generate the HashKey. For example, if a Product Hub used ProductID for its Business Key, and an Invoice Hub used InvoiceID, the Business Key for the Link would be the concatenation of ProductID+InvoiceID, and the Link HashKey would be generated from that. Including a Link BusinessKey column is not a practice that is standard within the approach outlined by Dan Linstedt, but it is something I have found helpful from a user perspective when building out a Data Vault.
In addition to the link HashKey and BusinessKey columns, the standard Link table should also include the load date and time of the link, the specific record source from which the relationship originated, and a column for the HashKey of each Hub that is being connected. The Hub HashKey values that populate these columns should be derived using the same logic that is used in the Hubs themselves, and should not be populated by performing a lookup on those Hub tables. By deriving these HashKey values, a faster loading speed can be maintained.
Here, it is important to note that there can be variations on the standard link to represent different types of relationships. An example of this would be a “Same-As-Link,” which is connected to only one Hub and is used to identify when the same business objects are identified by more than one business key. These different types of links provide more flexibility to the Data Vault model but are outside the scope of this article, and can be reviewed in depth in Dan Linstedt’s book.
The last major entity to review is the Satellite, which holds all of the context within the Data Vault and is used to describe the business objects identified in Hubs or Links. Because attributes of a business object can change, Satellites are used to track these changes over time. Each Satellite is connected to only one Hub or Link, but each Hub or Link can have multiple Satellites. These entities can vary significantly in their set up because different Satellites are used to describe different types of data. For example, a Product Hub may track product ID’s from two disparate source systems, and would therefore have two different Satellites based on the attribute data available in those systems.
The columns contained in a standard Satellite are as follows:
Within a Satellite, the HashKey value is derived from its parent Hub or Link. This value, combined with the LoadDatetime, serves as the Primary Key. In addition to the LoadDatetime, an EndDatetime is also captured to indicate when the entry becomes invalid. This is the only value within the table that should ever be updated. The last required metadata field within the Satellite is the RecordSource, to indicate where the data came from.
Optional fields within the Satellite entity include the HashDiff, which is a hashed value of all the descriptive data of an entry that is used for quick comparisons, and the ExtractDate, which is the date and time the data was actually extracted from the source system. This date should never be used in place of the LoadDatetime, as it can be unreliable due to time zone differences or other issues with the source system.
Outside of the required and optional metadata fields, the Satellite will also contain fields to hold the actual attribute data. For example, a Satellite attached to a Customer Hub might contain fields such as name, address, city, state, and industry. The number of attribute fields will vary for each Satellite, but each satellite must contain at least one. Similar to Links, there are variations on the standard Satellite that provide additional flexibility to the Data Vault model, but they will not be covered in depth here.
Benefits of the Data Vault
Although not as popular as the commonly used star schema model, the advantages of using the Data Vault cannot be denied. Within the classic Data Warehouse design process, the Data Vault is intended to fit between the staging area and data marts. Chief among its many advantages are auditability, scalability, and flexibility, as described below.
The Data Vault is built on the core principle that no data is ever deleted. This means the Data Vault can be used as a single and historic source of truth. Not only is this attribute important for accurate reporting, it’s also crucial for compliance with many modern data regulations.
Because new entities can be added with ease, the ability to scale up or down is considerably faster with a Data Vault model. Adding a new source system can be as simple as adding additional Satellites to existing Hubs or Links, and inserting records into those Hubs and Links for the newly integrated Business Objects. Although this still requires a considerable amount of technical understanding and expertise to accomplish, the overall time and effort are less than with other models.
Because each entity serves a different role within the Data Vault, the model itself is very flexible. The ability to look at data on a different level of granularity can be done by creating a new link with greater or fewer Hub connections. Disparate data sources can be brought together without a complex mapping of data by taking advantage of different Satellites per source system.
Each business is unique, and as a result, so are the data needs of that business. What works for one company may not be the right fit for another, but one thing that is constant is the need to be able to access and analyze data to experience continued growth and success. In a time focused on the ability to adapt quickly and meet business needs in an Agile way, the benefit of using a Data Vault cannot be overlooked.
[b] Linstedt, Daniel, and Michael Olschimke. Building a Scalable Data Warehouse with Data Vault 2.0. Morgan Kaufmann, 2016.
[c] Linstedt, Daniel, and Michael Olschimke. Building a Scalable Data Warehouse with Data Vault 2.0. Morgan Kaufmann, 2016.