Data Vault Pipeline Description (DVPD) is a concept and syntax that provides a universal data format for storing all the essential information needed to implement or generate a data loading process for a data vault model.

When established as a standardized interface in the implementation workflow, DVPD decouples the implementation tools. Iterative extension and optimization of the toolset will have less impact on other steps. This allows projects to select tools that are tailored to their current needs without blocking later adjustments and enhancements.

Finally, because it is a document, a DVPD can be treated as an encapsulated deployable artifact and therefore fits nicely into CI/CD workflows.

A Simple Example

To give you an impression of the syntax elements, let's take a simple example. It describes the loading of data from a simple database table named “Person”.

Let's first focus on the source structure declaration, which is driven by the structure of the data source.

There is one declaration for every field of the source.

Let's continue with the declaration of the target data vault model.

As you might notice, there are no column definitions in the target model. Only the tables, their data vault stereotype and their relations are declared.

Finally, we need to map the fields to the target tables. This mapping is declared in the same list of fields we saw in the first step.

The columns of the target tables are derived from the mapping of the source fields to the target tables. Column names can be declared explicitly, but since it is best practice to name columns after their source fields, they default to the field names.
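The derivation of columns from the field mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys (`field_name`, `targets`, `table_name`, `column_name`), the field names and the table names are hypothetical and are not taken from the DVPD core syntax reference.

```python
from collections import defaultdict

# Hypothetical field list for the "Person" source table. Key names and
# values are illustrative assumptions, not the official DVPD syntax.
fields = [
    {"field_name": "PERSON_ID", "targets": [{"table_name": "person_hub"}]},
    {"field_name": "NAME", "targets": [{"table_name": "person_sat"}]},
    {"field_name": "BIRTHDATE",
     "targets": [{"table_name": "person_sat", "column_name": "DATE_OF_BIRTH"}]},
]


def derive_columns(fields):
    """Collect the columns of each target table from the field mappings."""
    columns = defaultdict(list)
    for field in fields:
        for target in field["targets"]:
            # The column name defaults to the field name unless declared.
            column = target.get("column_name", field["field_name"])
            columns[target["table_name"]].append(column)
    return dict(columns)


table_columns = derive_columns(fields)
```

In this sketch the target tables never list their columns themselves. Every column appears only because a field was mapped to the table, and only the third field overrides the default name.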

Applying DVPD interpretation rules

In many cases, the declaration syntax shown so far is all that is needed. All other information for the loading process can be derived by following the interpretation rules:

  • fields mapped to a hub are business keys in the hub
  • fields mapped to a satellite are relevant for change detection
  • the hub key column needed for the hub will be named “HK_” if not declared otherwise
  • the key column name in the satellite will be the same as in the parent
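The four rules above can be expressed as a small Python function. This is a minimal sketch, not the reference compiler: the input structures, the key `hub_key_column_name`, the role labels and the table names are hypothetical. The text names the default hub key only as “HK_”, so the sketch takes the naming rule as a parameter, and the lambda used in the usage lines is a placeholder assumption.

```python
def apply_interpretation_rules(tables, mappings, default_hub_key_name):
    """Derive field roles and key column names for hubs and satellites.

    tables:   {table_name: {"stereotype": "hub" or "sat", "parent": ...}}
    mappings: list of (field_name, table_name) pairs
    default_hub_key_name: function returning the default key name of a hub
    """
    key_columns = {}
    # Rule 3: a hub key gets its default name unless declared otherwise.
    for name, table in tables.items():
        if table["stereotype"] == "hub":
            declared = table.get("hub_key_column_name")
            key_columns[name] = declared or default_hub_key_name(name)
    # Rule 4: a satellite uses the same key column name as its parent.
    for name, table in tables.items():
        if table["stereotype"] == "sat":
            key_columns[name] = key_columns[table["parent"]]

    roles = {}
    for field_name, table_name in mappings:
        stereotype = tables[table_name]["stereotype"]
        if stereotype == "hub":
            roles[(field_name, table_name)] = "business_key"      # rule 1
        elif stereotype == "sat":
            roles[(field_name, table_name)] = "change_detection"  # rule 2
        else:
            raise ValueError(f"stereotype not covered by this sketch: {stereotype}")
    return roles, key_columns


# Hypothetical model and mapping for the "Person" example.
tables = {
    "person_hub": {"stereotype": "hub"},
    "person_sat": {"stereotype": "sat", "parent": "person_hub"},
}
mappings = [("PERSON_ID", "person_hub"), ("NAME", "person_sat")]

# Placeholder naming rule: only the "HK_" prefix comes from the text,
# everything after it is an assumption made for this example.
roles, key_columns = apply_interpretation_rules(
    tables, mappings, lambda hub_name: "HK_" + hub_name.upper()
)
```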

Additionally, there are general rule settings (called the model profile) that are shared by the whole project to keep it consistent. In this example, the following model profile settings are relevant:

  • satellite comparison will use a diff hash
  • satellites will be enddated
  • satellites contain a deletion flag
  • names for all the meta columns
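To illustrate the first of these settings: a diff hash condenses all change-relevant satellite columns into a single value, so change detection only has to compare one column. The following is a minimal sketch, assuming MD5, a “|” delimiter and an empty string for missing values. The hash algorithm, delimiter and null handling actually used would be defined by the project and are not specified here; the column names and values are made up.

```python
import hashlib


def diff_hash(row, columns, delimiter="|"):
    """Return a hex digest over the given columns of a row (a dict)."""
    parts = ["" if row[c] is None else str(row[c]) for c in columns]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()


# Hypothetical satellite content: the stored row and the incoming row.
stored = {"NAME": "Ada", "CITY": "London"}
incoming = {"NAME": "Ada", "CITY": "Paris"}
columns = ["NAME", "CITY"]

# A new satellite row is only needed when the two hashes differ.
has_changed = diff_hash(stored, columns) != diff_hash(incoming, columns)
```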

Applying the interpretation rules and the model profile results in the following detailed data vault structure and mapping:

The complete DVPD of the example

Some more declarations are needed to describe the whole loading process. Here is the full declaration, containing all meta information:

The main elements in a DVPD document

Content of the Model Profile

How to start?

Read the detailed documentation in the repository

Download the git repository from: https://github.com/cimt-ag/data_vault_pipelinedescription

It’s free

The DVPD concept is licensed under CC BY-ND 4.0, and the contained code under Apache-2.0.

What’s in the repo?

  • Concept Documentation
    • Description of the concept
    • Reference of the core syntax of DVPD
    • Analysis of the use case variations to be covered by the syntax
      • Data Mapping variation taxonomy
      • Data Mapping dependent process generation
      • Partitioned deletion scenarios
  • Reference implementation of a DVPD compiler in Python
    • Test sets for the DVPD compiler
    • Examples for generator scripts in Python
      • DDL script generator
      • “Developer cheat sheet” generator
      • HTML Documentation generator