Data Science Working Group: Hello

Introduction to group vision, tooling and next steps.

Usama Bilal
Ran Li

2023-07-06

Overview

  • Big Picture Vision
  • Toolkit
  • Integration

Big Picture Vision

Context

  • The DSWG was created around a directive to think about projects at a ‘system’ level, from raw data to deliverables for researchers, policy makers, and community members.

  • Research is 80% data cleaning (Access) and 20% actual research (Understand)

  • Research and data are much more valuable if they can be communicated to stakeholders (other researchers, policy makers, community).

  • We aim to sustain a workflow built on software engineering best practices and to develop students/staff who can provide this as a service to projects both internal and external to the UHC.

Toolkit

  1. Principles
    • FAIR
    • Keep abreast of industry trends
    • Tool agnostic
  2. Access
    • Grammar of data manipulation
    • Data warehousing
  3. Communicate
    • GitHub
    • Packages
    • Literate Programming
    • Dashboards
    • Web development

1.1 FAIR

  • FAIR: Findable, Accessible, Interoperable, Reusable
  • Code: web documentation, version control, packaging
  • Data: metadata, lineage, versioning, web documentation
  • Research: findable, accessible stories told to target audiences.

1.2 Innovation (pt 1)

  • The NIH created an Office of Data Science Strategy to address:
    • findability, interconnectivity, and interoperability of NIH-funded biomedical data sets and resources
    • integration of existing data management tools and development of new ones
    • universalization of innovative algorithms and tools created by academic scientists into enterprise-ready resources that meet industry standards of ease of use and efficiency of operation
    • growing costs of data management.

1.2 Innovation (pt 2)

1.3 Language Agnostic

  • Move away from having one solution for how we do programming
  • The flexibility to move toward the best tool for solving a problem is paramount
  • The tools we use now are not the destination; they are part of a journey that lets us constantly shift to whatever tool serves us best in the future.
  • Embrace the open source culture of collaboratively building solutions as software developers rather than software consumers.

2.1 Dplyr - Grammar of data manipulation

  rowid species    island bill_length_mm bill_depth_mm flipper_length_mm
1     1  Adelie Torgersen           39.1          18.7               181
2     2  Adelie Torgersen           39.5          17.4               186
3     3  Adelie Torgersen           40.3          18.0               195
4     4  Adelie Torgersen             NA            NA                NA
5     5  Adelie Torgersen           36.7          19.3               193
6     6  Adelie Torgersen           39.3          20.6               190
  body_mass_g    sex year
1        3750   male 2007
2        3800 female 2007
3        3250 female 2007
4          NA   <NA> 2007
5        3450 female 2007
6        3650   male 2007

2.1 Dplyr - flat files (.csv)

Research Question: Calculate the ratio of bill length to depth, then rank it within each species. Return a table arranged by rank and containing only the relevant columns.

  1. Use penguins as the input data
  2. Group by species
  3. Calculate the bill length to depth ratio
  4. Select columns: species, ratio_bill
  5. Calculate the rank of ratio_bill
  6. Arrange rows by rank
data = read.csv(penguins_data_url)
data %>%
  group_by(species) %>%
  mutate(ratio_bill = bill_length_mm/bill_depth_mm) %>% 
  select(species, ratio_bill ) %>% 
  mutate(rank = rank(desc(ratio_bill ))) %>% 
  arrange(rank)
# A tibble: 344 x 3
# Groups:   species [3]
   species   ratio_bill  rank
   <chr>          <dbl> <dbl>
 1 Adelie          2.45     1
 2 Gentoo          3.61     1
 3 Chinstrap       3.26     1
 4 Adelie          2.44     2
 5 Gentoo          3.51     2
 6 Chinstrap       2.93     2
 7 Adelie          2.43     3
 8 Gentoo          3.51     3
 9 Chinstrap       2.88     3
10 Adelie          2.42     4
# i 334 more rows


2.1 Dplyr - Databases (e.g. SQLite)

Research Question: Calculate the ratio of bill length to depth, then rank it within each species. Return a table arranged by rank and containing only the relevant columns.

  1. Use penguins as the input data
  2. Group by species
  3. Calculate the bill length to depth ratio
  4. Select columns: species, ratio_bill
  5. Calculate the rank of ratio_bill
  6. Arrange rows by rank
database <- memdb_frame(data)  # copy the data into an in-memory SQLite database (dbplyr)
query = database %>%
  group_by(species) %>%
  mutate(ratio_bill = bill_length_mm/bill_depth_mm) %>% 
  select(species, ratio_bill ) %>% 
  mutate(rank = rank(desc(ratio_bill ))) %>% 
  arrange(rank)

query %>% collect()
# A tibble: 344 x 3
# Groups:   species [3]
   species   ratio_bill  rank
   <chr>          <dbl> <int>
 1 Adelie          2.45     1
 2 Chinstrap       3.26     1
 3 Gentoo          3.61     1
 4 Adelie          2.44     2
 5 Chinstrap       2.93     2
 6 Gentoo          3.51     2
 7 Adelie          2.43     3
 8 Chinstrap       2.88     3
 9 Gentoo          3.51     3
10 Adelie          2.42     4
# i 334 more rows
query %>% show_query()
<SQL>
SELECT
  *,
  RANK() OVER (PARTITION BY `species` ORDER BY `ratio_bill` DESC) AS `rank`
FROM (
  SELECT `species`, `bill_length_mm` / `bill_depth_mm` AS `ratio_bill`
  FROM `dbplyr_001`
)
ORDER BY `rank`

2.1 Dplyr - Columnar storage (e.g. Parquet)

Research Question: Calculate the ratio of bill length to depth, then rank it within each species. Return a table arranged by rank and containing only the relevant columns.

  1. Use penguins as the input data
  2. Group by species
  3. Calculate the bill length to depth ratio
  4. Select columns: species, ratio_bill
  5. Calculate the rank of ratio_bill
  6. Arrange rows by rank
  dataset = open_dataset('penguins.parquet')  # lazily open the Parquet file with arrow
  dataset %>%
    group_by(species) %>%
    mutate(ratio_bill = bill_length_mm/bill_depth_mm) %>% 
    select(species, ratio_bill) %>% 
    collect() %>%  # pull into R; rank() is not translated by the arrow backend
    mutate(rank = rank(desc(ratio_bill))) %>% 
    arrange(rank)
# A tibble: 344 x 3
# Groups:   species [3]
   species   ratio_bill  rank
   <chr>          <dbl> <dbl>
 1 Adelie          2.45     1
 2 Gentoo          3.61     1
 3 Chinstrap       3.26     1
 4 Adelie          2.44     2
 5 Gentoo          3.51     2
 6 Chinstrap       2.93     2
 7 Adelie          2.43     3
 8 Gentoo          3.51     3
 9 Chinstrap       2.88     3
10 Adelie          2.42     4
# i 334 more rows

2.1 Dplyr - multilingual

2.1 Dplyr - Summary

  • The foundation of R&D is data manipulation
  • The dplyr workflow focuses on semantics rather than syntax, which means easy onboarding
  • It is expressive: complex wrangling logic in less code means faster development and less maintenance
  • It is powerful: it works with databases and modern data formats
  • The skills translate to other languages

2.2 Data Warehousing

  • While dplyr is great for specific tasks and projects (e.g. operationalizing a dataset), it does not have the tools required for data warehousing.
  • Data modeling is like designing the blueprint for a house, but instead of rooms and doors, you’re planning where to store different types of information and how they connect to each other.
  • It is a key tool for implementing FAIR at an organization level: it reduces repeated work, handles big data, documents data lineage, and makes outputs accessible.
  • DBT:
    • HCUP: https://drexel-uhc.github.io/hcup-dbt/
    • DBT training: https://drexel-uhc.github.io/analytics-corner/pages/manuals/dbt/overview.html
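As a small illustration of what data modeling means in practice, here is a hedged sketch (not the actual DBT setup linked above) that builds a two-table model in an in-memory SQLite database; the table and column names (`sites`, `measurements`) are hypothetical:

```r
# Illustrative sketch of a simple data model in an in-memory SQLite
# database; table/column names (sites, measurements) are hypothetical.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A 'dimension' table that describes each study site exactly once...
dbExecute(con, "CREATE TABLE sites (site_id INTEGER PRIMARY KEY, city TEXT)")
# ...and a 'fact' table of measurements that reference sites by key.
dbExecute(con, "CREATE TABLE measurements (
                  site_id INTEGER REFERENCES sites(site_id),
                  year    INTEGER,
                  value   REAL)")

dbExecute(con, "INSERT INTO sites VALUES (1, 'Philadelphia')")
dbExecute(con, "INSERT INTO measurements VALUES (1, 2023, 9.5)")

# Joining by key reassembles the full picture without duplicating site info.
result <- dbGetQuery(con, "SELECT city, year, value
                           FROM measurements JOIN sites USING (site_id)")
print(result)
dbDisconnect(con)
```

Separating descriptive tables from measurement tables is the same blueprint idea DBT operationalizes at warehouse scale.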

3.1 Communicate Software (GitHub)

3.2 Reuse Software: R Packages

  • R packages are a mature set of best practices, guidelines for how to write R code, and tools for sharing it.
  • They have led to very influential open source software ecosystems that rival or surpass enterprise solutions.
    • tidycensus
    • tidyverse
  • Instead of emailing code and writing .doc files, R packages give us an easy way to develop custom solutions for our problems, document them, and share them.
  • A local package ecosystem will help automate many tasks, and packages can also be published to increase the impact of our work.
  • Examples from the DSWG
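As a concrete sketch of that workflow (assuming the usethis and devtools packages are installed; the package name `uhcutils`, the file `clean_acs.R`, and the GitHub repo are hypothetical):

```r
# Hedged sketch of a typical R package workflow with usethis/devtools;
# 'uhcutils' and 'clean_acs' are hypothetical names.
library(usethis)

create_package("uhcutils")   # scaffold the package directory structure
use_r("clean_acs")           # create R/clean_acs.R for a custom function
use_git()                    # put the package under version control
use_github()                 # share it on GitHub

devtools::document()         # build help pages from roxygen comments
devtools::check()            # run R CMD check against best practices
# devtools::install_github("Drexel-UHC/uhcutils")  # how colleagues would install it
```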

3.3 Communicate results: Quarto

3.4 Communicate analytic results: Dashboards

  • Shiny is a multilingual (R, Python) tool for building web applications without needing any web development skills

  • Interactivity is a must as data becomes more complex and higher volume. A dashboard is much more accessible than a 200-page PDF for communicating analytic results.

  • It gives analysts the ability to extend their R/Python expertise by adding interactivity.

  • The UHC already has infrastructure to deploy Shiny applications with the click of a button.

  • It is easy to learn and fast to develop. Examples
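A minimal app shows how little code is needed. This sketch assumes the shiny package and a `penguins` data frame like the one in the dplyr slides:

```r
# Minimal Shiny sketch: a species picker driving a histogram.
# Assumes a 'penguins' data frame like the one in the dplyr examples.
library(shiny)

ui <- fluidPage(
  selectInput("species", "Species",
              choices = c("Adelie", "Chinstrap", "Gentoo")),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    chosen <- penguins[penguins$species == input$species, ]
    hist(chosen$bill_length_mm,
         main = input$species, xlab = "Bill length (mm)")
  })
}

# shinyApp(ui, server)  # launch the app locally in a browser
```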

3.5 Communicate results: Web content

  • Why not dashboards:
    • They are server-based, which entails infrastructure and compute costs
    • Cold starts on consumption plans are not feasible for public-facing content
    • They are great for communicating analytic results but not for telling stories; policy makers and community members don’t have time to click around
    • Dashboards need a server to run (R/Python) in addition to the user’s browser (JS)
  • We adopt the industry trend of moving everything to JS with modern JS frameworks: React.js, Next.js, Svelte.js
    • Low cost
    • Highly flexible
    • Interoperable with many existing service APIs
    • Harness open source JS codebases
    • Utilize serverless computing (Lambda/Azure Functions) for any computations
  • Examples:

Toolkit Recap

  1. Principles
    • FAIR
    • Keep abreast of industry trends
    • Tool agnostic
  2. Access
    • Grammar of data manipulation
    • Data warehousing
  3. Communicate
    • GitHub
    • Packages
    • Literate Programming
    • Dashboards
    • Web development

It takes a village

4. Integration with RDC

  • Data/backend **
  • Statistics
  • Front-end
  • Training/Data-Team **

4. Integration with Other Cores

  • Training
    • Summer Institutes FAIR or DS course
  • Policy
    • front-end
  • Community Engagement
    • front-end

Appendix: Rollout