Data Science Working Group: Hello

Introduction to group vision, tooling and next steps.

Usama Bilal
Ran Li

2023-07-06

Overview

  • Big Picture Vision
  • Toolkit
  • Integration

Big Picture Vision

Context

  • The DSWG was created around a directive to think about projects at a ‘system’ level, from raw data to deliverables for researchers, policy makers, and community members.

  • Research is 80% data cleaning (Access) and 20% actual research (Understand)

  • Research and data are much more valuable if they can be communicated to stakeholders (other researchers, policy makers, community).

  • We aim to sustain a workflow built on software engineering best practices and to develop students/staff who can provide this as a service to projects both internal and external to the UHC.

Toolkit

  1. Principles
    • FAIR
    • Keep abreast of industry trends
    • Tool agnostic
  2. Access
    • Grammar of data manipulation
    • Data warehousing
  3. Communicate
    • GitHub
    • Packages
    • Literate Programming
    • Dashboards
    • Web development

1.1 FAIR

  • FAIR: Findable, Accessible, Interoperable, Reusable
  • Code: web documentation, version control, packaging
  • Data: metadata, lineage, versioning, web documentation
  • Research: findable, accessible stories told to target audiences.

1.2 Innovation (pt 1)

  • The NIH created an Office of Data Science Strategy to address:
    • findability, interconnectivity, and interoperability of NIH-funded biomedical data sets and resources
    • integration of existing data management tools and development of new ones
    • universalization of innovative algorithms and tools created by academic scientists into enterprise-ready resources that meet industry standards of ease of use and efficiency of operation
    • growing costs of data management.

1.2 Innovation (pt 2)

1.3 Language Agnostic

  • Move away from having one solution for how we do programming
  • The flexibility to move toward the best tool for solving a problem is paramount
  • The tools we use now are not the destination; they are part of a journey that lets us constantly shift to whatever tool serves us best in the future.
  • Embrace the open source culture of collaboratively building solutions as software developers rather than software consumers.

2.1 Dplyr - Grammar of data manipulation

  rowid species    island bill_length_mm bill_depth_mm flipper_length_mm
1     1  Adelie Torgersen           39.1          18.7               181
2     2  Adelie Torgersen           39.5          17.4               186
3     3  Adelie Torgersen           40.3          18.0               195
4     4  Adelie Torgersen             NA            NA                NA
5     5  Adelie Torgersen           36.7          19.3               193
6     6  Adelie Torgersen           39.3          20.6               190
  body_mass_g    sex year
1        3750   male 2007
2        3800 female 2007
3        3250 female 2007
4          NA   <NA> 2007
5        3450 female 2007
6        3650   male 2007

2.1 Dplyr - flat files (.csv)

Research Question: Calculate the ratio of bill length to depth, then rank it within each species. Return a table arranged by rank and containing only the relevant columns.

  1. Use penguins as the input data
  2. Group by species
  3. Calculate the bill length to depth ratio
  4. Select columns: species, ratio_bill
  5. Calculate the rank of ratio_bill
  6. Arrange rows by rank
data = read.csv(penguins_data_url)
data %>%
  group_by(species) %>%
  mutate(ratio_bill = bill_length_mm/bill_depth_mm) %>% 
  select(species, ratio_bill ) %>% 
  mutate(rank = rank(desc(ratio_bill ))) %>% 
  arrange(rank)
# A tibble: 344 x 3
# Groups:   species [3]
   species   ratio_bill  rank
   <chr>          <dbl> <dbl>
 1 Adelie          2.45     1
 2 Gentoo          3.61     1
 3 Chinstrap       3.26     1
 4 Adelie          2.44     2
 5 Gentoo          3.51     2
 6 Chinstrap       2.93     2
 7 Adelie          2.43     3
 8 Gentoo          3.51     3
 9 Chinstrap       2.88     3
10 Adelie          2.42     4
# i 334 more rows


2.1 Dplyr - Databases (e.g. SQLite)

Research Question: Calculate the ratio of bill length to depth, then rank it within each species. Return a table arranged by rank and containing only the relevant columns.

  1. Use penguins as the input data
  2. Group by species
  3. Calculate the bill length to depth ratio
  4. Select columns: species, ratio_bill
  5. Calculate the rank of ratio_bill
  6. Arrange rows by rank
database <- memdb_frame(data)  # copy the data into an in-memory SQLite database (dbplyr)
query = database %>%
  group_by(species) %>%
  mutate(ratio_bill = bill_length_mm/bill_depth_mm) %>% 
  select(species, ratio_bill ) %>% 
  mutate(rank = rank(desc(ratio_bill ))) %>% 
  arrange(rank)

query %>% collect()
# A tibble: 344 x 3
# Groups:   species [3]
   species   ratio_bill  rank
   <chr>          <dbl> <int>
 1 Adelie          2.45     1
 2 Chinstrap       3.26     1
 3 Gentoo          3.61     1
 4 Adelie          2.44     2
 5 Chinstrap       2.93     2
 6 Gentoo          3.51     2
 7 Adelie          2.43     3
 8 Chinstrap       2.88     3
 9 Gentoo          3.51     3
10 Adelie          2.42     4
# i 334 more rows
query %>% show_query()
<SQL>
SELECT
  *,
  RANK() OVER (PARTITION BY `species` ORDER BY `ratio_bill` DESC) AS `rank`
FROM (
  SELECT `species`, `bill_length_mm` / `bill_depth_mm` AS `ratio_bill`
  FROM `dbplyr_001`
)
ORDER BY `rank`

2.1 Dplyr - Columnar storage (e.g. Parquet)

Research Question: Calculate the ratio of bill length to depth, then rank it within each species. Return a table arranged by rank and containing only the relevant columns.

  1. Use penguins as the input data
  2. Group by species
  3. Calculate the bill length to depth ratio
  4. Select columns: species, ratio_bill
  5. Calculate the rank of ratio_bill
  6. Arrange rows by rank
  dataset = open_dataset('penguins.parquet')  # lazily open the Parquet file with arrow
  dataset %>%
    group_by(species) %>%
    mutate(ratio_bill = bill_length_mm/bill_depth_mm) %>% 
    select(species, ratio_bill) %>% 
    collect() %>%  # pull into R; rank() is not translated by the arrow backend
    mutate(rank = rank(desc(ratio_bill))) %>% 
    arrange(rank)
# A tibble: 344 x 3
# Groups:   species [3]
   species   ratio_bill  rank
   <chr>          <dbl> <dbl>
 1 Adelie          2.45     1
 2 Gentoo          3.61     1
 3 Chinstrap       3.26     1
 4 Adelie          2.44     2
 5 Gentoo          3.51     2
 6 Chinstrap       2.93     2
 7 Adelie          2.43     3
 8 Gentoo          3.51     3
 9 Chinstrap       2.88     3
10 Adelie          2.42     4
# i 334 more rows

2.1 Dplyr - multilingual

2.1 Dplyr - Summary

  • The foundation of R&D is data manipulation
  • The dplyr workflow focuses on semantics rather than syntax, which means easy onboarding
  • It is expressive: complex wrangling logic in less code means faster development and less maintenance
  • It is powerful: it works with databases and modern data formats
  • The skills translate to other languages

2.2 Data Warehousing

  • While dplyr is great for specific tasks and projects (e.g. operationalizing a dataset), it does not have the tools required for data warehousing.
  • Data modeling is like designing the blueprint for a house, but instead of rooms and doors, you’re planning where to store different types of information and how they connect to each other.
  • It is a key tool for implementing FAIR at an organization level: it reduces repeated work, handles big data, documents data lineage, and makes outputs accessible.
  • DBT:
    • HCUP: https://drexel-uhc.github.io/hcup-dbt/
    • DBT training: https://drexel-uhc.github.io/analytics-corner/pages/manuals/dbt/overview.html
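As a small illustration of what data modeling means in practice, here is a hedged sketch (not the actual DBT setup linked above) that builds a two-table model in an in-memory SQLite database; the table and column names (`sites`, `measurements`) are hypothetical:

```r
# Illustrative sketch of a simple data model in an in-memory SQLite
# database; table/column names (sites, measurements) are hypothetical.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A 'dimension' table that describes each study site exactly once...
dbExecute(con, "CREATE TABLE sites (site_id INTEGER PRIMARY KEY, city TEXT)")
# ...and a 'fact' table of measurements that reference sites by key.
dbExecute(con, "CREATE TABLE measurements (
                  site_id INTEGER REFERENCES sites(site_id),
                  year    INTEGER,
                  value   REAL)")

dbExecute(con, "INSERT INTO sites VALUES (1, 'Philadelphia')")
dbExecute(con, "INSERT INTO measurements VALUES (1, 2023, 9.5)")

# Joining by key reassembles the full picture without duplicating site info.
result <- dbGetQuery(con, "SELECT city, year, value
                           FROM measurements JOIN sites USING (site_id)")
print(result)
dbDisconnect(con)
```

Separating descriptive tables from measurement tables is the same blueprint idea DBT operationalizes at warehouse scale.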

3.1 Communicate Software (GitHub)

3.2 Reuse Software: R Packages

  • R packages are a mature set of best practices, guidelines for how to write R code, and tools for sharing it.
  • They have led to very influential open source software ecosystems that rival or surpass enterprise solutions.
    • tidycensus
    • tidyverse
  • Instead of emailing code and writing .doc files, R packages give us an easy way to develop custom solutions for our problems, document them, and share them.
  • A local package ecosystem will help automate many tasks, and packages can also be published to increase the impact of our work.
  • Examples from the DSWG
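As a concrete sketch of that workflow (assuming the usethis and devtools packages are installed; the package name `uhcutils`, the file `clean_acs.R`, and the GitHub repo are hypothetical):

```r
# Hedged sketch of a typical R package workflow with usethis/devtools;
# 'uhcutils' and 'clean_acs' are hypothetical names.
library(usethis)

create_package("uhcutils")   # scaffold the package directory structure
use_r("clean_acs")           # create R/clean_acs.R for a custom function
use_git()                    # put the package under version control
use_github()                 # share it on GitHub

devtools::document()         # build help pages from roxygen comments
devtools::check()            # run R CMD check against best practices
# devtools::install_github("Drexel-UHC/uhcutils")  # how colleagues would install it
```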

3.3 Communicate results: Quarto

3.4 Communicate analytic results: Dashboards

  • Shiny is a multilingual (R, Python) tool for building web applications without needing any web development skills

  • Interactivity is a must as data becomes more complex and higher volume. A dashboard is much more accessible than a 200-page PDF for communicating analytic results.

  • It gives analysts the ability to extend their R/Python expertise by adding interactivity.

  • The UHC already has infrastructure to deploy Shiny applications with the click of a button.

  • It is easy to learn and fast to develop. Examples
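A minimal app shows how little code is needed. This sketch assumes the shiny package and a `penguins` data frame like the one in the dplyr slides:

```r
# Minimal Shiny sketch: a species picker driving a histogram.
# Assumes a 'penguins' data frame like the one in the dplyr examples.
library(shiny)

ui <- fluidPage(
  selectInput("species", "Species",
              choices = c("Adelie", "Chinstrap", "Gentoo")),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    chosen <- penguins[penguins$species == input$species, ]
    hist(chosen$bill_length_mm,
         main = input$species, xlab = "Bill length (mm)")
  })
}

# shinyApp(ui, server)  # launch the app locally in a browser
```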

3.5 Communicate results: Web content

  • Why not dashboards:
    • They are server-based, which entails infrastructure and compute costs
    • Cold starts on consumption plans are not feasible for public-facing content
    • They are great for communicating analytic results but not for telling stories; policy makers and community members don’t have time to click around
    • Dashboards need a server to run (R/Python) in addition to the user’s browser (JS)
  • We adopt the industry trend of moving everything to JS with modern JS frameworks: React.js, Next.js, Svelte.js
    • Low cost
    • Highly flexible
    • Interoperable with many existing service APIs
    • Harness open source JS codebases
    • Utilize serverless computing (Lambda/Azure Functions) for any computations
  • Examples:

Toolkit Recap

  1. Principles
    • FAIR
    • Keep abreast of industry trends
    • Tool agnostic
  2. Access
    • Grammar of data manipulation
    • Data warehousing
  3. Communicate
    • GitHub
    • Packages
    • Literate Programming
    • Dashboards
    • Web development

It takes a village

4. Integration with RDC

  • Data/backend **
  • Statistics
  • Front-end
  • Training/Data-Team **

4. Integration with Other Cores

  • Training
    • Summer Institutes FAIR or DS course
  • Policy
    • front-end
  • Community Engagement
    • front-end

Appendix: Rollout