5 Data modeling

There are several things to consider to this with the final goals of provenance of metadata for all data. Importantly focusing on enable provenance of fact table manigipulatisonw hile tracking effective access metadata.

review of metata handling in DBT
discussions on modeling stratedgy
review of data lineage and data provenance

5.1 Modeling stratedgy: transcational vs semantic layers

We want to leverage dimensional modeling principles in order to model both our data points (facts) but also track the metadata (dimensions) of those data points. As noted in the DBT docuemntation and like an principles … principles are not an implementation plan. Coming down to Earth you need to make decisions on what is fact or dimension and importantly, whether you should keep fact and dimension tables seperate or create wide joined tables.

Our ETL process is complex and it was noted that it a dynamic dimensional processes that creates standardized OBT (wide joined data + metadata tables) that are interoperable. All our source tables should have both data and metadata available in this tidy format. Now we need to make a clear distinction. that this tidy OBT format is not as immediate consumable or understandable that the traditional tabular data. There is value in both. For future reference we will use the term transcational to refer to our OBT format that is useful for internal storage of data/metadata and the semantic format to refer to the traditional tidy datasets that are commonly used.

These two formats are both valuable for example.

Semantic
- assume atomic and work with wide (DBT style)
- pros:
  - we can utilize column level lineage
- cons:
  - we lose metadata of data points because DBT only has column level metadata
  - we cannot not track data point dimensions/metadata
  - metadta is not tracked.
Transcational
- assume atomic and work long (OBT style)
- pros
  - this is interesting if we approach the problem as a purely data point
  - All metadata is tracked in the fact table
- cons
  - we cannot utilize column level linage…. but is this a problem.
  - we lost column level documentation in DBT.
  - All metadata is in the fact table and not available as documentation in DBT.
  - low observability of metadata. Metadata is not semantically accessible

5.2 CCUH: a hyrbid implementation

There are clear valuable use cases for both of these formats. We can clearly define our various use cases and then use the best ‘form’ of of our data as it is needed.

ETL blocks:
- generate transactional format for DBT
- generate semantic format for resusability across ETL blocks
DBT blocks:
- all source, base, intermediate models are modeled transactionally.
- model documentaiton (yml files) are generated semantically

Things to clarify: - use of doc blocks for DRY docuemnation across yml files - use of the DBT meta tag for mroe complex metadata - functions/macros to generate semantic yml files

Known limitation: - The inheritance of identifiers is likely going to be one layer. - this is common in DBT workloads such as in DBT-codegen - I think we can work around this limitation as this inheritance is most oftne used for just haromniozations bewtween source/base into CCUH intermediate models. Any extended levels of inherticance will be handled by either just increased observability tooling or potentially storing this informatino in the ‘meta’ tag. - Observability is going to be challenge see next section

5.3 DBT layers

5.4 1. Source

Definition: Source tables represent the raw data as it is ingested from the data warehouse. These tables are often not transformed and are used as the starting point for further modeling.
Purpose: To clearly define and document the raw data tables, making it easier to reference and understand the source data.

5.5 2. Seed

Definition: Seed tables are tabular files that are loaded into the database. They are often used for static or reference data that doesn’t change frequently.
Purpose: To provide a simple way to manage small datasets that can be version controlled within your DBT project.

5.6 3. Base

Definition: Base tables are foundational models that represent the cleaned and slightly transformed data from the source tables. They typically perform minimal transformations.
Purpose:
- To create a standardized layer of data that can be reused in more complex transformations.
- Here is where we do most of our ‘harmonization’ between SALURBAL and CCUH.

5.7 4. Intermediate (INT)

Definition: Intermediate tables are models that contain more complex transformations and aggregations. They often serve as intermediate steps between base tables and final reporting tables.
Purpose: To modularize and break down complex transformations into manageable steps, making the logic easier to understand and maintain.

5.8 5. Mart

Definition: Mart tables are final models that are ready for consumption by end-users, such as dashboards, reports, or direct queries. These tables are usually highly aggregated and formatted for easy use.
Purpose: To provide a curated and optimized dataset for business intelligence and analytics purposes.

5.9 Overview Diagram

Here’s a simplified view of how these layers typically interact:

Source -> Seed -> Base -> Intermediate (INT) -> Mart

Source: Raw data ingestion.
Seed: Static reference data.
Base: Cleaned and standardized data.
Intermediate (INT): Complex transformations and aggregations.
Mart: Final, user-facing datasets.