SALURBAL logo.
  • Overview
  • FAIR Implementation
  • Renovation manuals
  • Database
  1. FAIR SALURBAL Database Summary

On this page

  • 1. Source
    • 1.1 Files
    • 1.2 V2 Codebooks
    • 1.3 Stratification Scheme
  • 2. Quality Control
    • 2.1 Tests
    • 2.2 Generate Outputs
    • 2.3 Validation of Outputs
  • 3. Database Summary
    • 3.1 Variables Count
    • 3.2 Stratification
    • 3.3 Value Type
    • 3.4 Longitudinal Data

Edit this page

Report an issue

FAIR SALURBAL Database Summary

Birds eye view of the SALURBAL data base and track the progress of implementing FAIR principles.
Published

September 10, 2022

1. Source

1.1 Files

df_codebook_sources = compiled_codebooks %>% 
  count(domain,subdomain, file, v2_codebooks) %>% 
  mutate(codebook = convert_full_path_to_file_name(file),
         v2_codebooks = ifelse(is.na(v2_codebooks),F,v2_codebooks)) %>%
  drop_na() %>% 
  select(domain, subdomain, codebook,v2_codebooks) %>% 
  mutate(domain =  ifelse( str_detect(codebook,'child'), paste0("Child ", domain), domain))

df_datasets_sources = compiled_datasets %>%  
  count(domain,subdomain,file) %>% 
  drop_na() %>% 
  select(domain, subdomain, file)%>% 
  mutate(domain =  ifelse( str_detect(file,'child'), paste0("Child ", domain), domain))


.x = df_datasets_sources %>% 
  left_join(df_codebook_sources) %>% 
  filter(file == 'BEC_L1AD_20210824.csv')
df_sources = df_datasets_sources %>% 
  left_join(df_codebook_sources) %>% 
  mutate(grouper = paste(file, codebook)) %>% 
  group_by(file, codebook) %>% 
  group_modify(~{
    n_domain_tmp = length(unique(.x$domain))
    n_subdomain_tmp = nrow(.x)
    name_tmp = case_when(
      unique(.x$domain) == 'Child Health Risk Factors' ~'Child Health Risk Factors' %>% list(),
      n_subdomain_tmp ==1 ~ unique(.x$subdomain) %>% list(),
      n_domain_tmp > 1 ~ unique(.x$domain) %>% list(),
      TRUE~unique(.x$domain) %>% list()) %>% 
      unique() %>% 
      unlist()
    tibble(data = name_tmp,
           file = unique(.x$file),
           codebook = unique(.x$codebook),
           v2_codebooks = unique(.x$v2_codebooks))
  }) %>% 
  ungroup() %>% 
  select(Data = data,
         `Codebook V2` = v2_codebooks,
         File = file,
         Codebook = codebook) %>% 
  arrange(Data)

df_sources %>% 
  reactable(
    defaultPageSize = 20,
    style = list( fontSize = "11px"),
    columns = list(
      `Codebook V2`  = colDef(
        cell = function(value) qc_icon_formatter(value),
        width = 100)),
  )
Data
Codebook V2
File
Codebook
Air Pollution
APSL1AD_02152022.csv
Codebook_APS_02152022_test3.xlsx
Child Health Risk Factors
child_survey_prevalence_city_02232022.csv
Codebook_child_survey_prevalence_02232022.xlsx
Economy
SECGDPGPC_L1AD_20200124.csv
Codebook_SECGDPGPC_L1AD_20200124.xlsx
Employment
SEC_Census_L1AD_06232021.csv
Codebook_SEC_Survey_09212020.xlsx
Employment
SEC_Survey_L1AD_09212020.csv
Codebook_SEC_CENSUS_20210624.xlsx
Environmental Impact
BEC_L1AD_20210824.csv
BEC_Codebook_20210824.xlsx
Health Risk Factors
adult_survey_prevalence_city_02222022.csv
Codebook_adult_survey_prevalence_02222022.xlsx
Inequality
SEC_GINI_L1AD_01312020.csv
Codebook_SEC_GINI_01312020.xlsx
Life Expectancy
LEMEDIAN_L1_20200824.csv
codebook_LEMEDIAN_L1_20200824.xlsx
Migration
SEC_MIGRATION_L1AD_06142021.csv
Codebook_SEC_MIGRATION_06142021.xlsx
Physical and Natural Environment
BEC_L1AD_20210824.csv
BEC_Codebook_20210824.xlsx
Population Projection Growth
PRJG_L1AD_20200915.csv
Codebook_PRJG_L1AD_20200915.xlsx
Population Projection Total
PRJT_L1AD_20200914.csv
Codebook_PRJT_Population_Total_Projections_20210601.xlsx
Segregation
SEC_SegregationIdx_L1AD_04252021.csv
Codebook_SEC_SegregationIdx_01192021.xlsx
Social, Economic, and Services Environment
SEC_Census_L1AD_06232021.csv
Codebook_SEC_CENSUS_20210624.xlsx
Social, Economic, and Services Environment
SEC_Survey_L1AD_09212020.csv
Codebook_SEC_Survey_09212020.xlsx

1.2 V2 Codebooks

SALURBAL codebooks will be updated to include system level metadata. Below compares v1 (old) and v2 (new) codebooks.

df_v1 = tibble(
  field = codebook__bec %>% select(-file) %>% names(),
  v1 = T
)

df_v2 = tibble(
  field = codebook__air_pollution %>% select(-file,-v2_codebooks) %>% names(),
  v2 = T
)

df_example = codebook__air_pollution %>% 
  filter(var_name == 'APSPM25MEAN') %>% 
  mutate_all(~as.character(.x)) %>% 
  pivot_longer(cols = everything(), names_to = 'field', values_to = 'example')


df_codebook_versions = df_v2 %>% 
  left_join(df_v1) %>% 
  left_join(df_example) %>% 
  select(Field = field, v1, v2, `v2 example` = example) %>% 
  mutate(v1 = ifelse(is.na(v1),F, v1))

df_codebook_versions %>% 
  reactable(
    defaultPageSize = 20,
    style = list( fontSize = "11px"),
    columns = list(
      `v1`  = colDef(
        cell = function(value) qc_icon_formatter(value),
        width = 100),
      `v2`  = colDef(
        cell = function(value) qc_icon_formatter(value),
        width = 100))
  )
Field
v1
v2
v2 example
domain
Physical and Natural Environment
subdomain
Air Pollution
var_name
APSPM25MEAN
var_label
Average PM2.5 concentration
var_def
Arithmetic mean of PM2.5 measurements for all grid cells where the centroid falls within the geographic unit
coding
999998 = No data available
units
micromilligrams per cubic meter (ug/m3)
interpretation
A higher value indicates higher levels of air pollution
source
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
var_year
1998-2020
notes
​
limitations
​
acknowledgment
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
longitudinal
TRUE
downloadable
TRUE
censor
FALSE

1.3 Stratification Scheme

The table below defines how we will organiation stratification within the data. There are three columns: 1) attribute - grouping varaible (e.g. Sex) 2) attribute_value are the actual strata (e.g. Female, Males…) 3. default is to signal which strata to display in a non-stratified context (e.g. data catalog table or city profiles viz)

df_strata %>% 
  select(-default) %>% 
  arrange(attribute) %>% 
  reactable(groupBy = "attribute")
attribute
attribute_value
Age (19)
Sex (4)
Standard Population (18)

2. Quality Control

2.1 Tests

The below displays the results of our individual QC tests. If the test has failed you can click each row to see the a summary of the variables that failed.

tests = ls(pattern = 'test_')
results = map_df(tests, ~{
  result = get(.x)(compiled_datasets,compiled_codebooks,role = 'test')   
  tibble(
    Test = result$test, 
    Result = result$pass,
    Description = result$desc,
    faulty = list(result$faulty))})%>% 
  arrange(desc(Result))
index_faulty = which(!results$Result)
results %>% 
  select(-faulty)  %>% 
  reactable(
    columns = list(
      Test  = colDef(width = 100),
      Result  = colDef(
        cell = function(value) qc_icon_formatter(value),
        width = 100)),
    details = function(index){
      if (index%in%index_faulty){
        nested_data <- results %>% slice(index) %>% pull(faulty) %>% .[[1]]
        is_nestable  = nested_data %>% count(var_name) %>% pull(n)>1
        index_name = nested_data %>% count(var_name) %>% pull(var_name)
        index_nestable = which(is_nestable)
        if (any(is_nestable)){
          nested_data1= nested_data %>% 
            select(file, var_name) %>% 
            distinct()
        } else {
          nested_data1= nested_data 
        }
        htmltools::div(style = "padding: 16px",
                       reactable(nested_data1, 
                                 outlined = TRUE,
                                 resizable = TRUE,
                                 wrap = FALSE,
                                 bordered = TRUE,
                                 details = function(index){
                                   if (index%in%index_nestable){
                                     data_nested2 = nested_data %>% 
                                       filter(var_name == index_name[index]) %>% 
                                       select(-file, -var_name)
                                     htmltools::div(style = "padding: 16px",
                                                    reactable(data_nested2, 
                                                              resizable = TRUE,
                                                              bordered = F,
                                                              wrap = FALSE)) }
                                   
                                 }) )  } })
Test
Result
Description
​
Default variable name
All 197 variables have a single default row in cleaned_codebook.
​
Stratification labels
All stratifcation names + values have labels.
​
Stratification values
All 197 variables adhere to values as defined the stratification schema - see 1.2
​
Checkout Variable Labels
All 197 have check our variable labels (var_name_base +var_name_nested).
​
Definition
3 variables are missing variable definition (var_def).
​
Domain
1 variable is missing domain.
​
Subdomain
1 variable is missing subdomain.
​
Variable label
1 variable is a missing short name (var_label).

2.2 Generate Outputs

We keep subset data to those that have passed all QC tests then merge with codebook metadata for. This object ‘cleaned_datasets’ will be used for downstream datastore generation.

## Get outputs from each test
outputs = map(tests, ~{get(.x)(compiled_datasets,compiled_codebooks,role = 'output')}) 

## Keep only those rows that passed all tests and operationalize other columns
cleaned_datasets = purrr::reduce(outputs, dplyr::inner_join) %>% 
  select(-pass) %>% 
  ## operationalize value_type (categorical/discrete/continuous)
  group_by(var_name) %>% 
  group_modify(~{
    values = .x$value
    .x %>% mutate( value_type = case_when(
      any(str_detect(values,"[:alpha:]"))~"categorical",
      any(str_detect(values,"\\."))~"continuous",
      any(str_detect(var_label,c("Number","Minimum",'Minimim',
                                 "Maximum","GDP per capita",
                                 'City foundation year')))~"continuous",
      TRUE~"discrete"
    ))
  }) %>% 
  ungroup() %>%
  ## Add var_name_base_labels 
  mutate(var_name_base_label = case_when(
    n_attr == 0 ~ NA_character_,
    n_attr == 1 ~ attribute1,
    n_attr == 2 ~ glue("{attribute2} ({attribute1_value})") %>% paste0()
  )) %>% 
  select(var_name, var_raw, geo, salid1,country, city,year, value,value_type,
         everything())
# {## Check value_type
#   ## Distribution of value_type
#   cleaned_datasets %>% 
#     count(var_name,value_type) %>% 
#     count(value_type)
#   ##  Get list of discrete variables
#   cleaned_datasets %>% 
#     filter(value_type == 'discrete') %>% 
#     count(var_name, var_label) %>% View()
#   }
#' Add longitudinal coding labels. This function will categorize variables based on 
#' what year and country available. THere will be four categories:
#'   1. Longitudinal - when variable is longitudinal
#'   2. Completely non-longitudinal - when a variable only has a single year value.
#'   3. Functionally non-longitudinal - when a variable only has a single year value per country; 
#'      this is for when data source differ by a few years across country. For example life expectancy is 
#'      '2010-2014' for El Salvador but '2012-2016' for all other countries.
#'   4. Mixed by Country - some countries non-longitudinal but some countries have dta vailable for multiple years.
{ ##  Prepare  longitudinal_status
  xwalk_years_available =  cleaned_datasets %>% 
    select(var_name, subdomain,longitudinal, country, year) %>% 
    distinct() %>% 
    # filter(!longitudinal)  %>% 
    group_by(subdomain, var_name) %>% 
    group_modify(~{
      # .x =  cleaned_datasets %>%
      #   select(var_name, subdomain,longitudinal, country, year) %>%
      #   distinct() %>% filter(var_name == "LEALE")
      # .x =  cleaned_datasets %>%
      #   select(var_name, subdomain,longitudinal, country, year) %>%
      #   distinct() %>% filter(var_name == "CNSMIGR_FORBORN")
      
      years_all = .x %>% count(year) 
      years = unique(.x$year)
      table_year_country =  .x %>% count(country )
      statusTmp = case_when(
        length(years)==1~ "Completely non-longitudinal",
        all(table_year_country$n == 1) ~"Functionally non-longitudinal",
        length(unique(.x$longitudinal))>1~"Mixed by Country",
        unique(.x$longitudinal)~"Longitudinal",
        TRUE~"Mixed by Country"
      ) %>% 
        unique()
      
      ## Check for non-numeric outliers
      if (is.na( min(as.numeric(years)))|is.na( max(as.numeric(years)))){
        stop("ERROR: Non-numeric year value!")
      }
      
      ## standardize year values for 'functionally non-longitudinal values' 
      # year_min_tmp = ifelse(
      #   str_detect(statusTmp, 'non-longitudinal'),
      #   years_all %>% filter(n == max(n)) %>% slice(1) %>% pull(year) %>% as.numeric(),
      #   min(as.numeric(years))
      # )
      # year_max_tmp = ifelse(
      #   str_detect(statusTmp, 'non-longitudinal'),
      #   years_all %>% filter(n == max(n)) %>% slice(1) %>% pull(year)%>% as.numeric(),
      #   min(as.numeric(years))
      # )
      
      tibble(
        n_years = length(years),
        year_min =  min(as.numeric(years)),
        year_max = max(as.numeric(years)),
        n_countries = length(unique(.x$country)),
        chr_years = paste0(years, collapse = ', '),
        longitudinal_status = statusTmp
      ) %>% 
        mutate(years = list(as.numeric(unique(.x$year))))
    }) %>% 
    ungroup() %>% 
    select(var_name, longitudinal_status, n_years, years, year_min, year_max)
  }

{
  xwalk_var_name_nested = cleaned_datasets %>% select(var_merge, var_name_nested) %>% distinct() %>% 
    filter(!is.na(var_merge))
}

cleaned_codebooks  = compiled_codebooks %>% 
  filter(var_name%in%cleaned_datasets$var_name ) %>% 
  left_join(xwalk_years_available) %>% 
  left_join(xwalk_var_name_nested)





## Check for that nested var_name codebook entries have var_name_nested
checkCodebookNested = cleaned_codebooks %>% 
  add_count(var_name) %>% 
  select(var_name, var_merge, var_name_nested, n) %>% 
  filter(n>1) %>% 
  arrange(desc(n)) %>% 
  filter(is.na(var_merge)| is.na(var_name_nested)) %>% 
  nrow() > 0

if (checkCodebookNested){
  stop("ERROR: Some neste codebook entries are missing var_merge/var_name_nested")
}


cleaned_datasets %>% 
  slice(1:10) %>% 
  reactable( resizable = TRUE, wrap = FALSE, bordered = TRUE)
var_name
var_raw
geo
salid1
country
city
year
value
value_type
file
domain
subdomain
var_label
var_def
coding
units
interpretation
source
var_year
notes
limitations
acknowledgment
longitudinal
downloadable
censor
v2_codebooks
prj_target
prj_start
attribute1
attribute1_value
attribute2_value
attribute2
var_merge
attribute1_label
attribute2_label
attribute1_value_label
attribute2_value_label
n_attr
var_name_base
var_name_nested
var_name_base_label
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
1996
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
1997
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
1998
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
1999
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
2000
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
2001
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
2002
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
2003
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
2004
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​
APSNO2GRIDPTS
apsno2gridptsl1ad
L1AD
514383
Argentina
Bahia Blanca
2005
40
continuous
APSL1AD_02152022.csv
Physical and Natural Environment
Air Pollution
Number of NO2 grid point centroids
Number of grid point centroids used to calculate NO2 average and variance measures for the geographic unit
999998 = No data available
​
A higher number indicates more grid points were used to calculate the NO2 metrics
Atmospheric Composition Analysis Group of the Dalhousie University (ATMOS)
1996-2012
​
​
Citation: Hammer, M. S.; et al, Global Estimates and Long-Term Trends of Fine Particulate Matter Concentrations (1998-2018)., Environ. Sci. Technol, doi: 10.1021/acs.est.0c01764, 2020 AND van Donkelaar, A., R. V. Martin, et al. (2019). Regional Estimates of Chemical Composition of Fine Particulate Matter using a Combined Geoscience-Statistical Method with Information from Satellites, Models, and Monitors. Environmental Science & Technology, 2019, doi:10.1021/acs.est.8b06392
true
true
false
true
​
​
​
​
​
​
​
​
​
​
​
0
APSNO2GRIDPTS
APSNO2GRIDPTS
​

2.3 Validation of Outputs

The another round of QC is run to validate the quality of our final output. The results of the validation are shown in the table below.

## Make sure that all tests pass 
validaiton_results = map_df(tests, ~{
  result = get(.x)(cleaned_datasets,cleaned_codebooks,role = 'test')   
  tibble(
    Test = result$test, 
    Result = result$pass,
    Description = result$desc,
    faulty = list(result$faulty))})

validaiton_results %>% 
  select(-faulty) %>% 
  reactable(
    columns = list(
      Test  = colDef(width = 100),
      Result  = colDef(cell = function(value) qc_icon_formatter(value), width = 100)
    ))
Test
Result
Description
Definition
All 194 have variable definition (var_def).
Default variable name
All 194 variables have a single default row in cleaned_codebook.
Domain
All 194 have a domain.
Stratification labels
All stratifcation names + values have labels.
Stratification values
All 194 variables adhere to values as defined the stratification schema - see 1.2
Subdomain
All 194 have a subdomain
Checkout Variable Labels
All 194 have check our variable labels (var_name_base +var_name_nested).
Variable label
All 194 have a shortname

3. Database Summary

n_variables = cleaned_datasets %>%  count(var_name) %>% nrow()
n_datasets = nrow(df_sources)

From 16 datasets/codebook pairs we have cleaned 194 variables. Below are some metrics which summarize what is available in our compiled database.

3.1 Variables Count

How many variables for each dataset?

## Sankey

df_sankey_count = cleaned_datasets %>% 
  count(domain,subdomain, var_name) %>% 
  mutate(all = glue("Cleaned Variables ({n_variables})")) %>% 
  select(all,domain, subdomain)
hchart(data_to_sankey(df_sankey_count), "sankey", name = "Variable Count")
Created with Highcharts 9.3.1Cleaned Variables (194)​Cleaned Variables (194)Demographics​DemographicsEnvironmental Impact​Environmental ImpactHealth Risk Factors​Health Risk FactorsMortality​MortalityPhysical and Natural Environment​Physical and Natural EnvironmentSocial, Economic, and Services Environment​Social, Economic, and Services EnvironmentMigration​MigrationPopulation Projection Growth​Population Projection GrowthPopulation Projection Total​Population Projection TotalCarbon footprint​Carbon footprintAnthropometry​AnthropometryDiabetes​DiabetesHealth Care​Health CareHealth Status​Health StatusHypertension​HypertensionTobacco Use​Tobacco UseLife Expectancy​Life ExpectancyAir Pollution​Air PollutionNatural environment metrics​Natural environment metricsStreet design metrics​Street design metricsTransportation metrics​Transportation metricsUrban form metrics​Urban form metricsUrban landscape metrics​Urban landscape metricsEconomy​EconomyEducation​EducationEmployment​EmploymentHousing​HousingInequality​InequalityLabor Force​Labor ForceLabor Participation​Labor ParticipationPoverty​PovertySegregation​SegregationUnemployment​Unemployment

3.2 Stratification

How many is our data stratified?

df2a = cleaned_datasets %>% 
  count(var_name, subdomain, domain, attribute1) %>% 
  mutate(attribute1 = ifelse(is.na(attribute1),"Not Stratified", "Stratification Available")) %>% 
  select(stratified = attribute1, domain )
# hchart(data_to_sankey(df2a), "sankey", name = "Variable Count")

table_strata_by_file = cleaned_datasets %>%
  count(domain, var_name, attribute1, attribute2) %>%
  mutate(stratified = ifelse(!is.na(attribute1), 'Available', 'Not Available')) %>%
  count(domain,stratified) %>%
  ## Order domain
  group_by(domain) %>%
  mutate(n_domain = sum(n))%>%
  ungroup() %>%
  mutate(domain = fct_reorder(as.factor(domain), (n_domain)))


{ ## b. Examine strata levels
  
  ## All strata vars
  dfa =  cleaned_datasets %>%
    count(file, var_name, attribute1, attribute2) %>%
    filter(!is.na(attribute1))%>% 
    select(-n)
  
  ## Single attribute variables
  df1 = dfa %>% 
    filter(is.na(attribute2)) %>% 
    left_join(cleaned_datasets) %>% 
    count(domain, var_name, attribute1,attribute1_value)
  # df1
  
  ## Double attribute variables
  df2 = dfa %>% 
    filter(!is.na(attribute2)) %>% 
    left_join(cleaned_datasets) %>% 
    count(domain, var_name,
          attribute1,attribute1_value,
          attribute2,attribute2_value )
  # df2
  
  }

## Plot 
dfb = cleaned_datasets %>%
  mutate(stratified2 =case_when(
    is.na(attribute1)~'Not Stratified',
    is.na(attribute2)~'Single Stratification',
    TRUE~'Double Stratification'
  )  )%>% 
  count(domain,var_name,stratified2,attribute1, attribute2) %>% 
  rowwise() %>% 
  mutate(attribute = paste(attribute1,attribute2, sep  = '+') %>% 
           str_remove_all('NA') %>% 
           str_trim() %>% 
           ifelse(str_sub(.,-1L)=='+',str_sub(.,0,-1L-1),. ) %>% 
           ifelse(.=='',NA,.),
         stratified = ifelse(stratified2=='Not Stratified','Not Stratified','Stratified')) %>% 
  ungroup() %>% 
  select(domain,stratified2, attribute)
hchart(data_to_sankey(dfb), "sankey", name = "Stratification Count")
Created with Highcharts 9.3.1Demographics​DemographicsEnvironmental Impact​Environmental ImpactHealth Risk Factors​Health Risk FactorsMortality​MortalityPhysical and Natural Environment​Physical and Natural EnvironmentSocial, Economic, and Services Environment​Social, Economic, and Services EnvironmentNot Stratified​Not StratifiedSingle Stratification​Single StratificationDouble Stratification​Double StratificationSex+Age​Sex+AgeSex+Standard Population​Sex+Standard PopulationSex​Sex
{## Operationalize statistics
  
  n_var_with_strata = table_strata_by_file %>% filter(stratified == "Available") %>% pull(n) %>% sum()
  n_var_without_strata = table_strata_by_file %>% filter(stratified != "Available") %>% pull(n) %>% sum()
  n_var_with_1_strata = df1 %>% count(var_name) %>% nrow()
  n_var_with_2_strata = df2 %>% count(var_name) %>% nrow()
}

Out of 194 variables, only 23 variables are have stratification available. The different stratas available cane be seen below.

3.3 Value Type

What type of data is available?

dfa = cleaned_datasets %>% 
  count(var_name, value_type) %>% 
  count(value_type)
dfa %>% 
  reactable(
    details = function(index){
      dfa_tmp = cleaned_datasets %>% 
        count(var_name, var_label, value_type) %>% 
        filter(value_type ==dfa %>% slice(index) %>% pull(value_type) ) 
      htmltools::div(style = "padding: 16px",
                     reactable(dfa_tmp, 
                               outlined = TRUE,
                               resizable = TRUE,
                               wrap = FALSE,
                               bordered = TRUE) ) }
  )
value_type
n
​
continuous
186
​
discrete
8

3.4 Longitudinal Data

EDA of longitudinality in SALURBAL data. Note that whether variables are longitudinal is not available in v1 codebook. pending v2 codebooks they will be assigned by groups themselves and thus should be more accurate. Right now only 9 variables from v2 codebooks (Air Pollution have DMC assigned longitudinal value).

cleaned_codebooks %>% 
  count(var_name, longitudinal) %>% 
  tabyl(longitudinal) %>% 
  reactable()
longitudinal
n
percent
valid_percent
true
9
0.0463917525773196
1
​
185
0.95360824742268
​

But we can back-calculate longitudinality from the data it self, for now we will use these ‘longitudinal_status’ as a metadata field for our datastore. Roughly 65% of the variables are not longitudinal. Here we document the four possible cases of longitudinality which were coded for each variable based on year availability by year.

cleaned_codebooks %>% 
  select(var_name, subdomain,longitudinal_status) %>% 
  distinct() %>% 
  tabyl(longitudinal_status) %>% 
  arrange(desc(percent)) %>% 
  reactable()
longitudinal_status
n
percent
Completely non-longitudinal
101
0.520618556701031
Longitudinal
51
0.262886597938144
Mixed by Country
32
0.164948453608247
Functionally non-longitudinal
10
0.0515463917525773
  • Longitudinal is when variable is longitudinal

  • Completely non-longitudinal is when a variable only has a single year value.

  • Functionally non-longitudinal is when a variable only has a single year value per country; this is for when data source differ by a few years across country. For example life expectancy is ‘2010-2014’ for El Salvador but ‘2012-2016’ for all other countries.

  • Mixed When the variable identified as non-longitudinal but does not fall into the expected two cases above. It has single years for some countries but multiple years for other countries. These come from two subdomains 1) migration and 2) segregation. The distribution of data availabl by year and country for these two datasets are displayed below.

Segregation

ABCDEFGHIJ0123456789
country
<chr>
Argentina
Brasil
Chile
Chile
Colombia
Colombia
Costa Rica
El Salvador
Guatemala
Mexico
Next
12
Previous
1-10 of 11 rows | 1-1 of 2 columns

Migration

ABCDEFGHIJ0123456789
country
<chr>
Brasil
Chile
Chile
Costa Rica
El Salvador
Guatemala
Guatemala
Mexico
Mexico
Peru
Next
12
Previous
1-10 of 11 rows | 1-1 of 2 columns

Moving forward we need to to make some decisions about how to address each of these cases during the check out selection and downloads steps.