F3. Metadata clearly and explicitly include the identifier of the data they describe.
ELI5: Metadata is linkable to data
Description
The SALURBAL database is a collection of data variables; each variable has a unique identifier var_name (F1). If life were simple, metadata would match 1 to 1 with each variable and we could do linkage with just var_name. However, the pairing between a variable and its individual metadata fields is not always one to one. SALURBAL metadata/data linkage scenarios are listed below in order of prevalence.
- simple: (Very common) Metadata links 1:1 to data at the variable level via var_name (e.g. domain, subdomain, etc.). This linkage-specific codebook would be called codebook.csv
- by_country: (Very common) Metadata differs by variable + country and needs to be linked by var_name and iso2 (e.g. data source or censor status). This linkage-specific codebook would be called codebook_by_iso2.csv
- by_year: (Uncommon?) Metadata differs by variable + year and needs to be linked by var_name and year (e.g. data source or censor status). This linkage-specific codebook would be called codebook_by_year.csv
- by_strata: (Rare) This case is rare but should be noted. Here metadata differs by variable + strata and thus needs to be linked by var_name and strata_id (e.g. var_def or interpretation). This linkage-specific codebook would be called codebook_by_strata.csv
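To make the scenarios above concrete, here is a minimal sketch in Python (standard library only) of how data rows could pick up metadata from a simple codebook and then from a by_country codebook. Only the key columns var_name and iso2 come from the description above; the example rows, the metadata column names domain and source, and the helpers index_codebook and link are hypothetical illustrations, not part of the SALURBAL pipeline.

```python
def index_codebook(rows, keys):
    """Index codebook rows by a tuple of key columns, failing on duplicate keys."""
    index = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in index:
            raise ValueError(f"duplicate codebook key {key}")
        # Keep only the metadata fields; the key columns already exist in the data.
        index[key] = {k: v for k, v in row.items() if k not in keys}
    return index


def link(data_rows, codebook_rows, keys):
    """Left-join metadata onto each data row using the given key columns."""
    index = index_codebook(codebook_rows, keys)
    linked = []
    for row in data_rows:
        key = tuple(row[k] for k in keys)
        linked.append({**row, **index.get(key, {})})
    return linked


# Hypothetical example rows (made-up variable name and values).
data = [
    {"var_name": "VAR_A", "iso2": "BR", "value": "1.0"},
    {"var_name": "VAR_A", "iso2": "MX", "value": "2.0"},
]
simple_codebook = [{"var_name": "VAR_A", "domain": "Example domain"}]
by_country_codebook = [
    {"var_name": "VAR_A", "iso2": "BR", "source": "Source 1"},
    {"var_name": "VAR_A", "iso2": "MX", "source": "Source 2"},
]

# simple linkage uses var_name only; by_country linkage uses var_name + iso2.
linked = link(data, simple_codebook, ["var_name"])
linked = link(linked, by_country_codebook, ["var_name", "iso2"])
```

Both data rows receive the same domain, while source differs by country. The by_year and by_strata cases would work the same way, with ["var_name", "year"] or ["var_name", "strata_id"] as the keys.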
The direct consequence of having multiple linkages between data and metadata is that for each dataset we need to 1) evaluate what type of linkage works best for each metadata field, then 2) operationalize separate linkage-specific codebooks for each of those linkages. We will discuss each of these two steps further below.
1. Evaluate metadata linkage
The first step is to evaluate what type of linkage works best for each metadata field. The interactive table shows how you should fill this out for the dataset you are trying to process. You can also download salurbal_codebook_evaluation.csv, a CSV template of the required metadata fields in which every field defaults to simple linkage; use this as a starting point to evaluate the metadata linkage for your dataset.
Guidelines
- this linkage categorization for each field is mutually exclusive (only one category per field). For now we assume linkage complexity exists at one level (by a single identifier); let's try this for now and deal with more complex cases later.
- salurbal_codebook_evaluation.csv is a template containing a table of required metadata fields and possible linkage types.
- the template provided assumes everything is simple (which most of the time it is). Please go through each field and assign a linkage type by asking your team ‘is this field going to vary by ….’ then update the template based on your reply.
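The guidelines above can be checked mechanically. The following is a minimal sketch, assuming the evaluation is stored as one row per metadata field with a single linkage column. The column names field and linkage, the inline TEMPLATE text, and the helper read_evaluation are assumptions for illustration; the actual layout of salurbal_codebook_evaluation.csv may differ.

```python
import csv
import io

# The four linkage types described in this document.
ALLOWED_LINKAGES = {"simple", "by_country", "by_year", "by_strata"}

# Hypothetical stand-in for a filled-out evaluation template;
# the real column names and fields may differ.
TEMPLATE = """field,linkage
domain,simple
subdomain,simple
source,by_country
var_def,by_strata
"""


def read_evaluation(text):
    """Return {field: linkage}, enforcing exactly one valid linkage per field."""
    evaluation = {}
    for row in csv.DictReader(io.StringIO(text)):
        field, linkage = row["field"].strip(), row["linkage"].strip()
        if linkage not in ALLOWED_LINKAGES:
            raise ValueError(f"{field}: unknown linkage type {linkage!r}")
        if field in evaluation:
            # Categories are mutually exclusive, so a field may appear only once.
            raise ValueError(f"{field}: listed more than once")
        evaluation[field] = linkage
    return evaluation


evaluation = read_evaluation(TEMPLATE)
```

Storing a single linkage value per field makes the mutual-exclusivity rule easy to enforce: a field either has exactly one category or the check fails.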
2. Operationalize linkage specific codebooks
I think it's better to compartmentalize each linkage into its own table because separation of concerns makes it very transparent which metadata fields are in which linkage tables. Fully merged codebooks that account for potential linkage complexity are often bloated and harder to QC or work with in the data pipeline.
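As a rough illustration of this compartmentalization, the sketch below groups metadata fields by their evaluated linkage type and derives which codebook files a dataset would need, together with their key columns. The evaluation dictionary, the field name source, and the helper plan_codebooks are hypothetical; the key columns and the codebook_<linkage>.csv file names follow the ones used in this section.

```python
# Key columns for each linkage type, as described in this document.
KEYS = {
    "simple": ["var_name"],
    "by_country": ["var_name", "iso2"],
    "by_year": ["var_name", "year"],
    "by_strata": ["var_name", "strata_id"],
}


def plan_codebooks(evaluation):
    """Map each needed codebook file to its key columns and metadata fields."""
    plan = {}
    for field, linkage in evaluation.items():
        entry = plan.setdefault(
            f"codebook_{linkage}.csv",
            {"keys": KEYS[linkage], "fields": []},
        )
        entry["fields"].append(field)
    return plan


# Hypothetical evaluation result: one linkage type per metadata field.
evaluation = {"domain": "simple", "subdomain": "simple", "source": "by_country"}
plan = plan_codebooks(evaluation)
```

Only linkage types that actually occur in the evaluation produce a codebook, so a dataset whose fields are all simple would need just one file.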
After we have evaluated the metadata linkage for a dataset, we will know which codebook and codebook variations to prepare. For each dataset we could potentially have up to four:
- codebook_simple.csv: (Very common) will link to the data via only a single identifier, var_name, and contain all the metadata fields that were categorized as ‘simple’.
- codebook_by_country.csv: (Very common) will link to the data via var_name and iso2; it will contain all the metadata fields that were categorized as ‘by_country’.
- codebook_by_year.csv: (Uncommon?) will link to the data via var_name and year; it will contain all the metadata fields that were categorized as ‘by_year’.
- codebook_by_strata.csv: (Rare) will link to the data via var_name and strata_id; it will contain all the metadata fields that were categorized as ‘by_strata’.
Metadata links to data via