F2. Data are described with rich metadata (defined by R1 below)

Comprehensive codebooks that cover community needs

Machines are great for computing but can't extrapolate certain things based on context. Consequently, how informative our data portal is depends on how comprehensive and machine-actionable our data and codebooks are. Below we document standards for SALURBAL data and codebooks that improve the FAIRness of our project and make much more context accessible to the data portal.

Data

The structure of legacy SALURBAL data tables was in general pretty FAIR. The only major change in the renovated data table structure is that we enforce strict rules for our within-project variable identifiers. var_name is our workhorse identifier, which links things at the variable level; it should not contain strata information. In some cases metadata is available within a variable by country or by data strata, and we use iso2 or strata_id to link data to metadata. More details can be found in the F3 principle page.
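The linkage described above can be sketched in Python. This is a minimal illustration, not the project's implementation: only the identifier names var_name, iso2 and strata_id come from this page. The row layout, the sample values, the helper `find_metadata`, and the rule of trying the most specific metadata first and falling back to variable-level metadata are all assumptions made for the example.

```python
# Hypothetical metadata rows. None means "applies to every country / stratum".
METADATA = [
    {"var_name": "EXAMPLE_VAR", "iso2": None, "strata_id": None,
     "source": "Generic source"},
    {"var_name": "EXAMPLE_VAR", "iso2": "BR", "strata_id": None,
     "source": "Brazil-specific source"},
]


def find_metadata(row, metadata):
    """Return the most specific metadata entry for a data row, or None.

    Assumed lookup order: variable + country + strata, variable + country,
    variable + strata, then variable only.
    """
    index = {(m["var_name"], m["iso2"], m["strata_id"]): m for m in metadata}
    var_name = row["var_name"]
    iso2 = row.get("iso2")
    strata_id = row.get("strata_id")
    candidates = [
        (var_name, iso2, strata_id),
        (var_name, iso2, None),
        (var_name, None, strata_id),
        (var_name, None, None),
    ]
    for key in candidates:
        if key in index:
            return index[key]
    return None


# Hypothetical data rows: one for Brazil, one for Mexico.
row_br = {"var_name": "EXAMPLE_VAR", "iso2": "BR", "strata_id": "S1"}
row_mx = {"var_name": "EXAMPLE_VAR", "iso2": "MX", "strata_id": "S1"}
print(find_metadata(row_br, METADATA)["source"])
print(find_metadata(row_mx, METADATA)["source"])
```

Because var_name carries no strata information, the same var_name can be matched against variable-level metadata and, where it exists, against country-level or strata-level metadata without parsing the identifier.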

The first tab below gives details on which fields/columns should be present in renovated SALURBAL data tables, and the second shows an example data table. The new data columns/fields can be grouped into the following categories:

  • Identifiers: columns responsible for linkage of data and metadata. var_name is our workhorse identifier which links things at the variable-level - it should not contain strata information
  • Data: data-related fields including SALID, year of data, and geographic level.
  • Internal: internal project-related metadata (intermediate strata details and file directories).
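A check along these lines could flag data tables that break the rules above. It is only a sketch: the column names in `REQUIRED_COLUMNS` (apart from var_name, iso2 and strata_id) and the helper `check_data_rows` are hypothetical, and the substring test is one simple, assumed way to detect strata information inside var_name.

```python
# Hypothetical set of required columns; the real list is given in the first tab.
REQUIRED_COLUMNS = {"var_name", "iso2", "strata_id", "salid", "geo", "year", "value"}


def check_data_rows(rows, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in the data rows."""
    problems = []
    for i, row in enumerate(rows):
        # Every required column must be present in every row.
        missing = sorted(required - set(row))
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        # var_name should not contain strata information (assumed check:
        # the row's strata_id must not appear inside its var_name).
        strata = row.get("strata_id")
        if strata and strata in row.get("var_name", ""):
            problems.append(f"row {i}: var_name embeds strata '{strata}'")
    return problems


# Hypothetical rows: the second one encodes the stratum in var_name.
good = {"var_name": "EXAMPLE_VAR", "iso2": "BR", "strata_id": "female",
        "salid": "101", "geo": "L1", "year": 2015, "value": 1.0}
bad = dict(good, var_name="EXAMPLE_VAR_female")
for problem in check_data_rows([good, bad]):
    print(problem)
```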

Codebooks

Machines are great for computing but are quite dumb; in other words, they can't extrapolate certain things that humans can based on context. While building the web application that interfaces with the SALURBAL data, we found that fundamental metadata (data about the data), which SALURBAL staff may be able to extrapolate, were missing: either absent from the codebooks altogether or stored in machine-unreadable formats.

Below is a more comprehensive codebook structure to address these gaps. The first tab below gives details on which fields/columns should be present in renovated SALURBAL codebooks, and the second shows an example codebook. The new codebook columns/fields can be grouped into the following categories:

  • Identifiers: columns responsible for linkage of data and metadata. var_name is our workhorse identifier which links things at the variable-level - it should not contain strata information
  • Categorization: columns responsible for grouping variables into user-friendly domains and subdomains.
  • Details: research-related variable details, which will be useful for users who want to reuse our data/codebooks.
  • Internal: internal project related metadata (file directories and access status).
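To illustrate how a machine-readable codebook supports the portal, the sketch below checks that every variable in a data table is documented and groups variables by category. The field names `domain` and `subdomain`, the sample rows, and both helper functions are assumptions for illustration; only var_name is taken from this page.

```python
from collections import defaultdict


def missing_from_codebook(data_rows, codebook_rows):
    """Return the sorted var_name values used in the data but absent from the codebook."""
    documented = {entry["var_name"] for entry in codebook_rows}
    used = {row["var_name"] for row in data_rows}
    return sorted(used - documented)


def group_by_domain(codebook_rows):
    """Group var_name values under their (domain, subdomain) pair."""
    groups = defaultdict(list)
    for entry in codebook_rows:
        groups[(entry["domain"], entry["subdomain"])].append(entry["var_name"])
    return dict(groups)


# Hypothetical codebook and data rows.
codebook = [
    {"var_name": "VAR_A", "domain": "Domain 1", "subdomain": "Sub 1"},
    {"var_name": "VAR_B", "domain": "Domain 1", "subdomain": "Sub 1"},
    {"var_name": "VAR_C", "domain": "Domain 2", "subdomain": "Sub 2"},
]
data = [{"var_name": "VAR_A"}, {"var_name": "VAR_D"}]

print(missing_from_codebook(data, codebook))
print(group_by_domain(codebook))
```

A non-empty result from `missing_from_codebook` points at exactly the kind of gap described above: a variable that staff may understand from context but that the portal cannot describe.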

TLDR (Too Long; Didn't Read)

In trying to make a FAIR data portal we found two major challenges: 1) our existing codebooks were not accessible or comprehensive enough to support a FAIR data portal, and 2) there was no existing way to link complex metadata (by strata or country) to data. This page documents a proposed data and codebook standard that will guide the FAIR renovation of existing datasets and serve as a template for future datasets. It consists of:

  • A set of identifier fields (var_name, strata_fields) to link metadata and data at the variable level while accounting for complex by-strata/country/year metadata
  • More comprehensive codebooks to explicitly codify important metadata (identifiers, strata details, categories, internal info, research details)