4  Tools

Before we go into the actual SOP, we will go over the tools we will use to manage this process:

This really comes back to the idea that in modern data organizations, three inseparable things generate both value and complexity: data, metadata, and software. You have to manage all three to deliver the most value; ignoring the complexity of any one of them will catch up with you and hinder the work.

Alongside data, metadata, and software, we also need knowledge management. This section describes our key tools for managing these complexities and lays out clear boundaries for what work, information, and documentation is accomplished with which tool, both to maximize the team's efficiency and to define sources of truth so that information is clear and consistent across all tools.

4.1 GitHub Project Management (Issues/Projects)

  • Project management (What, when, how, who)
    • All project management will be done in GitHub.
    • Issues describe needs, design/planning, and detailed implementation.
    • Project boards are used to organize issues and assign tasks across team members.
    • PRs focus on documentation, distilling context and insights from issues into higher-level summaries of actions.

4.2 GitHub Code Management (Repo/Branch/PR)

  • Organizing where we do the work (software management, reproducibility, quick and safe progression)
    • Branches for safe and quick non-linear collaborative development.
    • Commits for atomic changes, linked to issues.

4.3 Storage

  • Data storage
    • Drexel Server

4.4 Quarto Notebooks

  • Knowledge management/sharing (within individual tasks)
    • Sharing code/data/text notebooks.
    • These are not meant to serve as single sources of truth (e.g., for block location); they document processes: how we did it.

4.5 Quarto Book (Knowledge Sharing)

  • Knowledge management/sharing (across tasks)
    • Institutionalize and share knowledge base across the team and to other projects.

4.6 Notion (Knowledge Management)

  • Document linkages + Information management interface for organizing all these individual components
    • Where is the data stored?
    • What data is available?
    • Where did it come from, and who is responsible for it?
  • Features:
    • It has relational database features to allow linkages.
    • It has integrations with other tools.
    • It has a built-in interface which we can customize to represent our project as needed.

4.7 DBT (Single Source of Truth for Data)

DBT (Data Build Tool) provides a data warehouse framework for storing metadata and data transformations. It serves as the single source of truth for data and metadata, offering several benefits:

  • Centralized Management: DBT centralizes all data transformations and metadata in one place, ensuring consistency and reducing redundancy.
  • Reproducibility: Every transformation is version-controlled and documented, making it easy to reproduce and understand data workflows.
  • Transparency: Clear documentation and lineage tracking help in understanding the flow of data from raw sources to final outputs.
  • Collaboration: Teams can collaborate effectively by working on shared models and transformations, with clear version history and rollback capabilities.
  • Integration: DBT integrates well with various data sources and BI tools, streamlining the data pipeline from extraction to analysis.

By using DBT, we can ensure that all data and metadata are managed efficiently, consistently, and transparently, providing a robust foundation for our data operations.
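The "lineage" and "reproducibility" points above boil down to DBT building models in dependency order. A minimal sketch of that idea, using a toy dependency graph with hypothetical model names (`stg_raw`, `harmonized`, `final_report` are illustrative, not our actual models):

```python
from graphlib import TopologicalSorter

# Toy dependency graph: each model maps to the set of models it depends on.
# Model names are hypothetical placeholders, not real project models.
deps = {
    "stg_raw": set(),
    "harmonized": {"stg_raw"},
    "final_report": {"harmonized"},
}

# DBT resolves a build order from the dependency graph (topological sort),
# so upstream models are always materialized before their dependents.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # stg_raw before harmonized before final_report
```

Because the graph itself is version-controlled alongside the SQL, the same build order is reproducible by anyone on the team.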

Note that for the PoC we will demonstrate the feasibility of connecting the ETL to DBT, plus some basic modeling; there are many best practices left to expand and implement in the MVP, notably the observability work described next.

4.8 Integrated Repo for Observability

  • These are complex processes, and having visibility is key to managing them.
  • I think there is a strong argument that the DBT docs are not accessible enough for non-engineers.
  • We will need to build an observability layer anyway to connect the ETL to DBT.
  • It might not be worth the effort, at least in v0.1, to build out the YAML sections; instead, we could focus on observability tooling outside of DBT.
    • Identify the SOT; this will be the ETL loading block for source tables.
    • What about metadata for tables within DBT? The harmonization process, where we have var_name_ccuh and var_name_raw, is going to be a challenge. I think the SOT is going to be the DBT externally materialized model. We can build out ETL-to-observability based on two things: 1) the manifest, which will give us the lineage, and 2) reconstructing semantic metadata (var, var_def) from the DBT models for use in the observability tool.
  • Optimizing this CI/CD process is going to be crucial.
  • We also need to evaluate how the lineage view in DBT Power User fulfills this while developing in DBT; perhaps it is not a very big issue.
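The two-step idea above (manifest for lineage, model columns for semantic metadata) could be sketched as follows. The manifest excerpt, node IDs, and column names below are hypothetical assumptions for illustration; a real dbt `manifest.json` is far richer, but the `depends_on.nodes` and `columns` keys used here do exist in the artifact:

```python
import json

# Hypothetical, heavily trimmed excerpt of a dbt manifest.json.
manifest_json = """
{
  "nodes": {
    "model.proj.harmonized": {
      "depends_on": {"nodes": ["model.proj.stg_raw"]},
      "columns": {
        "var_name_ccuh": {
          "name": "var_name_ccuh",
          "description": "Harmonized variable name"
        }
      }
    }
  }
}
"""

manifest = json.loads(manifest_json)

# 1) Lineage: parent nodes for each model.
lineage = {
    node_id: node["depends_on"]["nodes"]
    for node_id, node in manifest["nodes"].items()
}

# 2) Semantic metadata: (var, var_def) pairs reconstructed from model columns.
semantic = {
    node_id: [(c["name"], c["description"]) for c in node["columns"].values()]
    for node_id, node in manifest["nodes"].items()
}

print(lineage["model.proj.harmonized"])   # parents of the harmonized model
print(semantic["model.proj.harmonized"])  # (var, var_def) pairs
```

An external observability tool could load the real manifest the same way and render lineage plus variable definitions without requiring non-engineers to read DBT docs.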