Course Summary

Notes and resources

Big Picture

What we want to do

We want to move from raw data to research outputs.

What we do

For each manuscript, we can individually ETL* (a.k.a. data wrangling) the raw data and then store it wherever we want.

This results in a lot of repeated work across projects.

*ETL: Extract, Transform, Load

Many organizations have faced this issue, and the industry solution is data warehousing.

How data warehousing can help

We can organize our work so that data wrangling (including complex methods such as model-based imputation) is done by individual groups. Once the data is structured, each group deposits it into the data warehouse as primary data, a.k.a. seeds.
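
As a minimal sketch of how this looks in DBT (the seed name patient_baseline and its columns are hypothetical), a group's cleaned, structured file goes into the project's seeds/ directory, is loaded with `dbt seed`, and can then be referenced from a base model:

```sql
-- models/base/base_patient_baseline.sql
-- Base model over primary/seed data deposited by an individual group.
-- Assumes a hypothetical seed file seeds/patient_baseline.csv, loaded with `dbt seed`.
select
    patient_id,
    enrollment_date,
    imputed_bmi  -- imputation already done upstream by the contributing group
from {{ ref('patient_baseline') }}
```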

We can then centralize the transformations of primary/seed data into whatever downstream outputs we want, all under the best-practices framework of DBT, which provides a mature workflow for organizing queries, generating documentation, version control, and collaborative environment management.
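
A centralized downstream transformation is itself just another DBT model. Here is a sketch (table and column names are hypothetical, and the exact SQL functions depend on the warehouse dialect) that builds an analysis-ready table from the base model above; because it lives in the project, it is version-controlled with the rest of the SQL and picked up by `dbt docs generate`.

```sql
-- models/marts/enrollment_by_month.sql
-- Centralized downstream transformation built on the seed-backed base model.
{{ config(materialized='table') }}

select
    date_trunc('month', enrollment_date) as enrollment_month,
    count(*) as patients_enrolled,
    avg(imputed_bmi) as mean_bmi
from {{ ref('base_patient_baseline') }}
group by 1
order by 1
```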

Course Content

  • Session 1 (5/10/23): Get Started + Setup
    • Set up the software required for DBT
  • Session 2 (5/24/23): Loading data into DBT
    • Start with source data (.csv, .parquet, or .json)
    • Load source data into DBT (see the sketch after this outline)
    • Generate documentation
  • Session 3 (5/31/23): Intro to Modeling
    • Intro to structure
    • Base models
    • Interactive modeling
  • Session 4 (6/7/23): Modeling Fundamentals
  • Session 5 (6/14/23): Standups + Intermediate Features
    • Standups
    • Working on the cloud
      • Cloud storage
      • Cloud database example
    • Summary
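
To make the Session 2 and 3 topics concrete, here is a small sketch of the kind of staging model covered there (the source name raw_lab, the table assay_results, and the columns are hypothetical): the raw table is declared as a source in a YAML file, the staging model selects from it, and `dbt docs generate` builds the documentation site from the project.

```sql
-- models/staging/stg_assay_results.sql
-- Staging model over a declared source (Session 2/3 material).
-- The raw_lab source is assumed to be declared in a YAML file such as
-- models/staging/sources.yml; `dbt docs generate` then builds the docs site.
select
    sample_id,
    assay_name,
    cast(result_value as numeric) as result_value
from {{ source('raw_lab', 'assay_results') }}
```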

Moving Forward


Note

Some things to keep in mind before we start