3  Abstractions

Abstractions are the fundamental concepts that our data infrastructure is built around.

3.1 Data Block

3.1.1 Intro to Data Blocks

A Data Block is a data product that can be used independently or within a pipeline. A Data Block is essentially a file or system of files located in a unique directory on the UHC server. Each block is documented in the CCUH Blocks Database. This abstraction section covers the function and features of blocks at a high level; please refer to the CCUH SOP on Pipelines and Blocks for implementation details.

There are two main types of blocks.

  • Raw Blocks
    • Raw data is ingested directly from the source. It is not processed in any way and is stored in its original format.
  • ETL Blocks
    • These are data blocks generated from seed data or other processed data.
    • Designed for reuse in other ETL pipelines.
    • Some are also loaded into our DBT system as source files.

Blocks can be stored in many storage formats (parquet, HDFS, data lakes) - see the CCUH Storage SOP for details.

3.1.2 CCUH Block standard

Most CCUH blocks contain three parquet files; a DuckDB sketch of inspecting them appears at the end of this subsection.

  • Semantic dataset - {block_id}__semantic.parquet
    • This file is directly user facing, with one row per observation.
  • Semantic codebook - {block_id}__codebook.parquet
    • This is the traditional minimalist codebook for the semantic data: metadata on columns.
    • Required columns:
      • column_name - the measure/variable name
      • column_description - the column/measure/variable definition
      • column_type - the data type of the column
    • Optional columns: any other column-level metadata we may want to utilize or include in the DBT documentation. So far we include:
      • scope: SALURBAL vs USA
      • current_block: link to current block’s Notion page
      • upstream_block: link to parent blocks’ Notion page
      • block_id: the block id
  • Transactional OBT - {block_id}__transactional.parquet
    • This is a denormalized data model of the data in SALURBAL OBT style: one row per data point with all metadata stored wide. Useful for provenance and internal data/metadata management.

We are currently modeling the data both semantically and transactionally to find a good balance between accessibility and provenance. This section will likely be updated in v0.2.
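
As a concrete illustration, the three files of a block can be inspected directly with DuckDB SQL. This is only a sketch - the block_id (air_quality) and directory used here are made up for the example:

    -- Preview the user-facing semantic dataset (one row per observation)
    SELECT *
    FROM read_parquet('blocks/air_quality/air_quality__semantic.parquet')
    LIMIT 10;

    -- Review the column-level metadata recorded in the codebook
    SELECT column_name, column_description, column_type
    FROM read_parquet('blocks/air_quality/air_quality__codebook.parquet');

    -- Count data points in the denormalized transactional OBT
    SELECT count(*) AS n_data_points
    FROM read_parquet('blocks/air_quality/air_quality__transactional.parquet');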

3.2 Pipeline

Pipelines move data between blocks. They are the fundamental unit of data movement in the UHC data infrastructure, and an inventory of pipelines can be found in the CCUH Blocks Database. This abstraction section covers the function and features of pipelines at a high level; please refer to the CCUH SOP on Pipelines and Blocks for implementation details.

We have two types of pipelines:

  • Processing pipelines
    • The output is a CCUH ETL block.
  • Loading pipelines
    • The output is both a CCUH ETL block and a CCUH DBT source block.

Depending on the block type, the path to which data/metadata are exported will differ.
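
As a rough sketch of that difference: both pipeline types export the ETL block itself, while a loading pipeline additionally copies the output into the DBT source folder. The DuckDB COPY statements below are illustrative only - the table name (etl_result) and paths are hypothetical:

    -- Processing and loading pipelines both export the ETL block
    COPY (SELECT * FROM etl_result)
    TO 'blocks/air_quality/air_quality__semantic.parquet' (FORMAT PARQUET);

    -- Loading pipelines additionally export a copy into the DBT source folder
    COPY (SELECT * FROM etl_result)
    TO 'DBT/ccuh-dbt/source/air_quality_2020.parquet' (FORMAT PARQUET);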

3.3 Data Warehouse (DBT)

Note that this is not a detailed introduction to data warehousing; please refer to DBT's own training resources. This section describes the high-level steps of using DBT for CCUH and assumes familiarity with the tool itself. In the CCUH infrastructure there are a few key milestones in the transition from our ETL pipelines to data warehousing:

  1. storage
  2. source models
  3. base models with origin schemas
  4. intermediate models with standardized schemas
  5. user facing models:
    • core models (OBTs) for downstream applications/EDA
    • mart models for documenting manuscripts or specific use cases

We hope to develop at least one example of each of these model types for the PoC.

3.3.1 1. Storage

For security and ease of access, we will materialize all data warehouse model results into the CCUH encrypted server - \\files.drexel.edu\encrypted\SOPH\UHC\CCUH\DBT\ccuh-dbt - which has a few folders (a model config sketch follows the list):

  • /source: where the ETL loaded data/metadata sits.
  • /dev: where the development dbt model results are stored
  • /staging: where the staging dbt model results are stored
  • /models: where the production dbt model results are stored
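
If we use the DuckDB DBT adapter's external materialization, a model can be pointed at one of these folders from its config block. The sketch below assumes that feature and uses an illustrative model name and a relative path; the exact options depend on the adapter version:

    -- models/core/core_example.sql (illustrative)
    {{ config(
        materialized = 'external',
        location = 'dev/core_example.parquet',
        format = 'parquet'
    ) }}

    select * from {{ ref('int_example') }}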

3.3.2 2. Loaded Data Blocks -> Source Model

The source refers to the original data from which information is derived. These sources can include databases, CSV files, APIs, real-time data streams, and other forms of raw data inputs. In a data pipeline, this is typically where the data journey begins, and it usually involves extracting data from where it is created or stored.

By now our ETL pipelines have structured our raw data enough to start data warehousing and to leverage the provenance/lineage and integrity/testing features of DBT. The DuckDB DBT adapter can operationalize base models from many upstream formats such as parquet, CSV, or JSON. Our loading pipelines load data blocks into the DBT source folder for DBT to use when creating base models. Each of these data blocks consists of two files (for now we do not include dimension tables, so this will likely change in later phases of the project):

  • a fact table as a {block_instance}.parquet
  • a fact table schema/codebook as {block_instance}_codebook.xlsx

Before DBT can utilize these loaded data blocks, we need to explicitly communicate to DBT that these parquet files should be treated as upstream source models. This is done by creating a {block_instance}_source.yml; note that this also utilizes the {block_instance}_codebook.xlsx to generate column-level metadata for the source.
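
Before that yml exists, a quick DuckDB sanity check of what DBT will eventually register as a source can be helpful. The block instance name and path below are hypothetical:

    -- Inspect the schema of a loaded block before registering it as a DBT source
    DESCRIBE SELECT * FROM read_parquet('source/air_quality_2020.parquet');

    -- Spot-check a few rows
    SELECT * FROM read_parquet('source/air_quality_2020.parquet') LIMIT 5;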

3.3.3 3. Source Model -> Base Models

This step brings source tables into DBT so that they can be previewed and queried.

Base models are the initial models built directly from source data. This stage involves cleaning the data and transforming it into a more usable form while staying as true as possible to the original data. It’s about creating a reliable foundation for further transformations and analyses.

Historically, we import source files and do only basic formatting (renaming, type casting, some mutations) against the source file. No major transformations or merges happen at this stage. From here we can really start to do some modeling.
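
A base model is therefore a thin SQL wrapper over the registered source: select, rename, and cast, nothing more. The source name, table name, and columns below are invented for illustration and are not an actual CCUH model:

    -- models/base/base_air_quality.sql (illustrative)
    with src as (
        select * from {{ source('ccuh_source', 'air_quality_2020') }}
    )

    select
        site_id::varchar   as site_id,    -- light renaming / type casting only
        obs_date::date     as obs_date,
        pm25_value::double as pm25,
        'air_quality_2020' as block_instance
    from src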

3.3.4 4. Base Models -> Intermediate Models

This step focuses on harmonizing base model columns with the CCUH standardized schemas - mostly renaming and some mutations (see the sketch after the schema list below).

The standardized schemas are:

  • boundaries
  • record level death records
  • record level hospitalization
  • area level data
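
As a minimal sketch of this step (the column names and target schema are assumed rather than taken from the actual CCUH standards), an intermediate model mostly renames base model columns into one of the schemas above:

    -- models/intermediate/int_air_quality__area_level.sql (illustrative)
    select
        site_id  as geo_id,         -- rename to the standardized area key
        obs_date as period_start,
        'pm25'   as measure_name,   -- reshape-style mutation into measure/value pairs
        pm25     as measure_value
    from {{ ref('base_air_quality') }}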

3.3.5 5. Intermediate Models -> User facing models

This step focuses on merging data and applying transformations to generate the user facing data models (core OBTs and marts).
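
A user facing core model then merges intermediate models that already share a standardized schema. The model and column names below are hypothetical and only sketch the idea:

    -- models/core/core_area_level_obt.sql (illustrative)
    with measures as (
        select * from {{ ref('int_air_quality__area_level') }}
        union all
        select * from {{ ref('int_greenness__area_level') }}
    )

    select
        m.geo_id,
        m.period_start,
        m.measure_name,
        m.measure_value,
        b.geo_name                  -- enrich with boundary metadata
    from measures m
    left join {{ ref('int_boundaries') }} b
        on m.geo_id = b.geo_id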