Data Vault Pipeline Description (DVPD) is a concept and syntax to provide a universal data format, for storing all essential informations, that are needed to implement or generate a data loading process for a data vault model.
When established as standardized interface in the implementation workflow, DVPD decouples the implementation tools. Iterative extention and optimization of the toolset will have less impact on other steps. This will allow projects to select tools, that are taylored to the current needs of the project without blocking the option for later adjustments and enhancements.
Finally, as beeing a document, a DVPD can be treated as an encapsulated deployable artifact and therefore fits nicely into CI/CD workflows.
Motivation
Keeping the Data Vault implementation toolset agile
The Implementation of loading processes for a data warehouse, is a complex task even when taking into account that Data Vault provides a lot of standardization and generalization. Many tools and frameworks try to support the Data Vault modelling and implementation process but in the end none can cover all steps.
That’s why data warehouse platforms often contain a bundle of tools with a mix of commercial or open source products and self written code. One major function needed in the implementation workflow, is the collection of the metadata, that is forged during the analysis and modelling steps and providing it to the implemtation steps. Unfortunatly, even after many projects an products have developed data structures to solve this problem in their own context, there is no solution, that can be used as a standardized and tool independent format.
This is where the DVPD concept steps in.
Scope
All loading phases
DVPD contains information for all phases, that are necessary to load data from a data source. This includes declarations how to retrieve the data from the source, how to parse and transform the data into a tabularized representation and how to map it into a given data vault model. DVPD does not require an execution process, that solves all these steps in a single module. In contrary it enables the teams to combe multiple tools for different phases, all configured by a single arifact.
Full Data Vault support
DVDP will allow the declaration of all model structures and data mappings, that are possible in the Data Vault method. This also covers different approaches, how to implement row comparison, manage the enddating and detect actual data. The DVPD syntax might prevent declarations, that are not allowed/possible in the Data Vault method, but it is not the purpose of DVPD to fully enforce Data Vault compliance of models and mappings.
DVPD consuming tools (especially code generators and execution modules) might not support all Data Vault features and therefore all DVPD features. This is accepted, since projects might be competely satisfied with a reduced featureset. It is the responsibilty of these tools (or a checkscript the CI workflow) to warn about the usage of unsupported DVPD features.
A Simple Example
To give you an impression about the syntax elements, lets take simpe example. It describes the loading of data from a simple database table named „Person“.
Lets first focus on the pure source structure declaration that is driven by the structure of the data source.
There is one declaration for every field of the source.
So lets continute with declaration of the target data vault model.
As you might notice, there are no column definitions in the target model. Only the tables, their data vault stereotype and their relations are declared.
Finally we need to map the fields to the target tables. This is also declared in the list of fields we have seen in the first step.
From the mapping of the source fields to the target tables, the columns of the target tables are induced. There are options to declare different names for the columns, but since it is best practice to name columns like their source fields, column names default to the field names.
Applying DVPD interpretation rules
In many cases, the shown syntax of declaration is all, thats needed. All other informations for the loading process can be derived by following the interpretation rules:
- fields mapped to a hub are business keys in the hub
- fields mapped to a satellite are relevant for change detection
- the hub key column, needed for the hub will be named „HK_“ if not declared otherwise
- the key column name in the satellite will be the same as in the parent
Additionally there are general rule settings (called model profile), that are used by the whole project to keep consistency. In this example the following model profile settings are relevant:
- satellite comparison will use a diff hash
- satellites will be enddated
- satellites contains a deletion flag
- names for all the meta columns
This results in the following detailed data vault structure and mapping:
The complete DVPD of the example
There are some more declarations needed to describe the whole loading process. So here is the full declaratation that contains all meta information:
The main elements in DVPD document
- Definition how to get data from the source (mostly defined by declaring an appropriate code generator or execution module, valid names depend on the toolset of the platform)
- Communication protocol
- Incremental retrieval method and behaviour
- method how to detect deleted data
- Declaration of the Model profile that has to be used (if not the default)
- Target Datamodel
- Names and stereotypes of the data vault tables
- Relations between the tables
- Names of key and diff hash columns
- Definition how to parse the source data into rows and fields (highly depeding on the data format and capabilites of the parsing module)
- Mapping of fields to the data model
- Name of target column (default = field name)
- Flags to exclude columns from hashes or comparison
- identification of the relevant relation (when there are multiple possibilities)
- Properties to the control order of columns in hash concatenation
For the full set of elements and their meaning, please look into the syntax reference.
Content of the Model Profile
The model profile, defines general properties, that should be the same for all loading processes
- Names and types of metadata columns
- Hashing method and value representation for hub keys and diff hashes
- Default settings for sattellite tables (e.g. if they have a diff hash)
- Values for ghost records
For the full set of elements and their meaing, please take a look into the syntax reference.
How to start ?
Read detailled documentation in the repository
Download the git repository from: https://github.com/cimt-ag/data_vault_pipelinedescription
It’s free
The DVPD concept is licensed under CC BY-ND 4.0, and the contained code under Apache-2.0
What’s in the repo ?
- Concept Documentation
- Description of the concept
- Reference of the core syntax of DVPD
- Analysis about the use case variations to cover by the syntax
- Data Mapping variation taxonomy
- Data Mapping dependend process generation
- Partitioned deletion scenarios
- Reference implementation of a DVPD compiler in python
- Testsets for the DVPD compiler
- Examples for generator scripts in python
- DDL script generator
- „Developer cheat sheet“ generator
- HTML Dokumentation generator