8 General ETL workflow
There are four overarching steps in the CCUH ETL workflow:
- Design and Project Management
- Development space setup
- ETL Pipeline development
- Documentation and Review
Let's start with a modifiable template for an issue that can be reused across pipeline issues. When you have a new pipeline, create an issue, copy this template into it, and modify it to document your specific pipeline. Below we go through the details of this template.
## Context
- Need to get temperature data from SALURBAL for two countries (PA and GT)
- This pipeline will cache the renovated SALURBAL dataset
- Then operationalize a reusable CCUH block for ETL and load into DBT
## Design and Project Management (Team)
- [ ] GitHub
- [ ] Draft issue
- [ ] Add Context
- [ ] Notion Documentation
- [ ] [Pipeline](https://www.notion.so/drexel-ccuh/US-Tract-Population-by-age-sex-race-ethnicity-2006-2020-SEER-9a2bf7f0a56e428aa891b654b1349c42) (Fill in starred properties)
- [ ] Upstream blocks (Fill in starred properties)
- [ ] Downstream blocks (Fill in starred properties)
- [ ] Project management: added assignee and task priority, size
## Action Items
- [ ] Development Space Setup (Team)
- [ ] Initialize upstream block storage location
- [ ] Initialize downstream block storage location
- [ ] Create branch with `CCUH ID` property
- [ ] Init pipeline folder + README in repo with pipeline documentation template
- [ ] Init pipeline notebook
- [ ] Develop (Contributor)
- [ ] Setup
- [ ] dependencies
- [ ] declare blocks
- [ ] Pre-processing
- [ ] Prod run with full set upstream TMPDAILY renovation func
- [ ] EDA
- [ ] Processing
- [ ] QC
- [ ] Data + Metadata
- [ ] Semantic data
- [ ] Semantic codebook
- [ ] Transactional OBT
- [ ] Export
- [ ] Copy in Notion Export Code
- [ ] Run
- [ ] Copy in Notion Import Code
- [ ] Validate
- [ ] Publish Quarto notebook
- [ ] Review (Team)
- [ ] Request PR Review
- [ ] Review PR
- [ ] Initialize DBT source if the block is a DBT source
- [ ] Merge PR
- [ ] Clean up branch on origin
- [ ] Clean up branch on local
- [ ] Document in Notion
- [ ] Close issue
8.1 Design and Project Management
The first step in any task is to clearly define the need or problem. These often come up through conversations or team meetings, and they should be documented. This is done in GitHub, our primary project management tool. Usually the infrastructure product manager will coordinate with the team to identify and clarify the need and document next steps in a GitHub issue. This has two specific tasks: drafting the issue and adding the context.
Then, depending on the problem, write a plan of action. For ETL pipelines we do pipeline design in Notion, as that is where we manage the relationships between data and code. Note that we are new to this process and are still working out the kinks, but this step will likely happen in real time during team meetings so that everyone can get on the same page and contribute to the design; the documenter's responsibility is to capture whatever logic and decisions emerge in the documentation system (for now, Notion).
At this point we understand the need and have a clearer picture of the scope of work, so we assign tasks to team members, along with task size and timeline. This is project management.
8.2 Development space setup
We now know the problem, the next steps, the who, and the when. Before the assignee can start working we need to set up a safe environment for them to work in. This involves:
- Initialize block storage locations
  - refer to the Notion documentation about storage locations (from the Notion page)
  - here we just initialize folders
- Create a branch
  - a branch is an isolated code space where the assignee can work quickly without worrying about breaking existing infrastructure
  - create a branch, usually named with the pipeline_id (from the Notion page); a sketch of these setup steps is shown after this list
- Initialize the pipeline notebook, which generally follows this outline:
  - Setup
    - 1.1 Dependency management
    - 1.2 Block management
      - declare downstream blocks (copying from the Notion block pages)
      - import upstream blocks (copying from the Notion block pages)
  - Data
    - operationalize the downstream block
    - EDA QC
  - Metadata
    - block level metadata
    - column level definitions
    - schema
    - op. tbl_data
  - Export
    - export data/metadata to the downstream block location
    - export DBT
    - export the block/pipeline manifest (we are starting to think about this for each pipeline)
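As a rough illustration of the storage and branch setup steps above, here is a short Python sketch. The pipeline id and storage paths are hypothetical placeholders; the real values come from the pipeline's Notion page.

```python
# Minimal sketch of the development space setup; all ids and paths are
# hypothetical and would be taken from the pipeline's Notion page.
import subprocess
from pathlib import Path

pipeline_id = "tmp_daily__pa_gt"                         # hypothetical CCUH ID
upstream_dir = Path("data/upstream") / pipeline_id       # assumed storage layout
downstream_dir = Path("data/downstream") / pipeline_id   # assumed storage layout

# Initialize the upstream and downstream block storage locations
for folder in (upstream_dir, downstream_dir):
    folder.mkdir(parents=True, exist_ok=True)

# Create an isolated branch named after the pipeline id
subprocess.run(["git", "checkout", "-b", pipeline_id], check=True)
```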
There are several ways to initialize notebooks, and we are still working out the best approach. The idea is to have a template that is easily modifiable for the specific pipeline. This is a work in progress, but here is a link to a pipeline notebook whose source code you can copy to initialize your own. Note that this is done by the core CCUH team and prepared for analysts prior to the start of development.
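To make the notebook outline above concrete, here is a minimal sketch of what the Setup section might look like in Python, assuming blocks are cached as parquet files. The block ids and paths are hypothetical; in practice the declarations are copied from the corresponding Notion block pages.

```python
# A minimal sketch of the notebook Setup section; all ids and paths below
# are hypothetical placeholders.
import pandas as pd

# 1.1 Dependency management: import everything the pipeline needs up front

# 1.2 Block management: declare blocks (copied from the Notion block pages)
upstream_blocks = {
    "salurbal_tmp_daily": "data/upstream/salurbal_tmp_daily.parquet",   # hypothetical
}
downstream_blocks = {
    "ccuh_tmp_daily__pa_gt": "data/downstream/ccuh_tmp_daily__pa_gt/",  # hypothetical
}

# Import the upstream blocks so the Data section can operationalize the
# downstream block and run EDA/QC on it
df_upstream = pd.read_parquet(upstream_blocks["salurbal_tmp_daily"])
```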
8.3 ETL Pipeline development
The logic here depends on the pipeline and is the responsibility of the assigned analyst. Note that details on each of the components can be found in the CCUH Abstractions page. While each pipeline's outputs may vary, here is a generic output guideline for regular CCUH data blocks.
Most CCUH blocks contain three parquet files.
- Semantic dataset - `{block_id}__semantic.parquet`
  - This one is directly user facing and is one row per observation.
- Semantic codebook - `{block_id}__codebook.parquet`
  - This is the traditional minimalist codebook for your semantic data - metadata on columns.
  - Required columns:
    - `column_name` - this is like the measure name or var_name
    - `column_description` - this is like the column/measure/variable definition
    - `column_type` - this is like the data type of the column
  - Optional columns: these can be any kind of column-level metadata we may want to utilize or include in the DBT documentation. So far we include:
    - `scope`: SALURBAL vs USA
    - `current_block`: link to the current block's Notion page
    - `upstream_block`: link to the parent blocks' Notion pages
    - `block_id`: the block id
- Transactional OBT - `{block_id}__transactional.parquet`
  - This is a denormalized data model of the data in SALURBAL OBT style: one row per data point with all metadata wide. Useful for provenance and internal data/metadata management.
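To illustrate the layout of these three files, here is a hedged Python sketch using pandas. The block id, column names, and values are invented for illustration only; the codebook columns follow the guideline above, and the transactional OBT is built here by melting the semantic data and joining the codebook metadata wide.

```python
# Sketch of the three parquet outputs for a hypothetical block; the data
# and block id are made up for illustration.
import pandas as pd

block_id = "ccuh_tmp_daily__pa_gt"  # hypothetical

# Semantic dataset: one row per observation, directly user facing
semantic = pd.DataFrame({
    "geo_id": ["PA001", "GT001"],
    "date": ["2015-01-01", "2015-01-01"],
    "tmp_mean_c": [27.3, 21.8],
})

# Semantic codebook: one row per column, required + optional metadata
codebook = pd.DataFrame({
    "column_name": semantic.columns,
    "column_description": [
        "Geography identifier",
        "Observation date",
        "Mean daily temperature (C)",
    ],
    "column_type": [str(t) for t in semantic.dtypes],
    "scope": "SALURBAL",     # optional column-level metadata
    "block_id": block_id,    # optional column-level metadata
})

# Transactional OBT: one row per data point with the metadata joined wide
transactional = (
    semantic
    .melt(id_vars=["geo_id", "date"], var_name="column_name", value_name="value")
    .merge(codebook, on="column_name", how="left")
)

semantic.to_parquet(f"{block_id}__semantic.parquet", index=False)
codebook.to_parquet(f"{block_id}__codebook.parquet", index=False)
transactional.to_parquet(f"{block_id}__transactional.parquet", index=False)
```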
We are currently modeling the data both semantically and transactionally to find a good balance between accessibility and provenance. This section will likely be updated in v0.2.
8.4 Documentation and Review
Once the pipeline is developed, the analyst requests a PR review. The reviewer reviews the code, the documentation, the metadata, and the data. Once the PR is approved, the analyst merges it, cleans up the branch on origin and locally, documents the pipeline in Notion, and closes the issue.
Note that each pipeline, and thus each downstream block, will have a maintainer as well as a dedicated QC specialist. This adds layers of redundancy to prevent, as much as possible, costly downstream data issues. So for CCUH we will have dedicated QC specialists who are responsible for reviewing the data:
- USA: Heather Rollins (informal chat done)
- Latin America: Jess/Ran as alternates to each other's pipelines.