Working with Geodatabases

spatial
boundaries
This post introduces geodatabases and a basic workflow for working with this format in R.
Author

Ran Li

Published

November 20, 2023

What is in a .gdb?

A geodatabase (.gdb) is a modern spatial storage format developed by ESRI. It is not a single file but a folder of files and subfolders that together hold various types of data: feature classes (points, lines, polygons), raster data, tables, and more. It's a robust and scalable way to store spatial data, offering advantages like spatial indexing, topologies, and network datasets for complex spatial analysis.

As an example, let's use the US Census Bureau's 2022 Cartographic Boundary Files. We've downloaded the geodatabase file, so let's do a quick EDA!
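
Since a .gdb is just a folder on disk, you can peek at its raw contents with base R before reaching for any spatial tooling. Here is a minimal sketch, assuming the geodatabase has been unzipped into the working directory under the same name used throughout this post:

## A file geodatabase is a folder of internal files (.gdbtable, .gdbtablx, etc.), not a single file
path_gdb = 'cb_2022_us_all_20m.gdb'
head(list.files(path_gdb))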

Working with Geodatabase (.gdb) in R

Before we start, we'll load our libraries. We'll use just three packages: sf for spatial data, dplyr for data manipulation, and leaflet for visualization.

library(sf)
library(dplyr)
library(leaflet)

Step 1: See what layers are in the .gdb

The first step is to see what ‘layers’ are available in the .gdb.

## Declare path to .gdb
path_gdb = 'cb_2022_us_all_20m.gdb'

## Examine layers
sf::st_layers(path_gdb)
Driver: OpenFileGDB 
Available layers:
               layer_name geometry_type features fields crs_name
1    cb_2022_us_cd118_20m Multi Polygon      437      9    NAD83
2   cb_2022_us_county_20m Multi Polygon     3222     12    NAD83
3 cb_2022_us_division_20m Multi Polygon        9      8    NAD83
4   cb_2022_us_nation_20m Multi Polygon        1      3    NAD83
5   cb_2022_us_region_20m Multi Polygon        4      8    NAD83
6    cb_2022_us_state_20m Multi Polygon       52      9    NAD83

We can see that this geodatabase contains six layers; let's just work with two of them, county and state, focusing on Pennsylvania.

Step 2: Import State and County boundaries (PA)

The sf package in R is a versatile tool for anyone working with spatial data. It simplifies the management and analysis of geospatial information, supporting common formats like shapefiles and geodatabases. With sf, users can seamlessly integrate spatial data with R’s powerful data analysis capabilities, making complex spatial tasks both accessible and efficient.
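
Because sf reads geodatabases through GDAL's OpenFileGDB driver (as shown in the st_layers() output above), you can also push filtering down to the read step with an OGR SQL query instead of loading a whole layer and filtering afterwards. A minimal sketch, not used in the rest of this post; the layer and column names come from the layer listing above:

## Read only Pennsylvania from the state layer via an OGR SQL query
sf_state_pa_direct = sf::st_read(
  path_gdb,
  query = "SELECT * FROM cb_2022_us_state_20m WHERE STUSPS = 'PA'"
)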

Let's import the boundaries as sf objects.

## Import state
sf_state = sf::st_read(path_gdb,"cb_2022_us_state_20m")  %>%  
      sf::st_transform("+proj=longlat +datum=NAD83") 
Reading layer `cb_2022_us_state_20m' from data source 
  `D:\git\analytics-corner\pages\blog\issues\999 - working with gdb\cb_2022_us_all_20m.gdb' 
  using driver `OpenFileGDB'
Simple feature collection with 52 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS:  NAD83
## Check data structure
glimpse(sf_state)
Rows: 52
Columns: 10
$ STATEFP  <chr> "48", "06", "21", "13", "55", "41", "29", "51", "47", "22", "~
$ STATENS  <chr> "01779801", "01779778", "01779786", "01705317", "01779806", "~
$ AFFGEOID <chr> "0400000US48", "0400000US06", "0400000US21", "0400000US13", "~
$ GEOID    <chr> "48", "06", "21", "13", "55", "41", "29", "51", "47", "22", "~
$ STUSPS   <chr> "TX", "CA", "KY", "GA", "WI", "OR", "MO", "VA", "TN", "LA", "~
$ NAME     <chr> "Texas", "California", "Kentucky", "Georgia", "Wisconsin", "O~
$ LSAD     <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "~
$ ALAND    <dbl> 676685555821, 403673617862, 102266581101, 149486268417, 14029~
$ AWATER   <dbl> 18974391187, 20291712025, 2384240769, 4418716153, 29343193162~
$ SHAPE    <MULTIPOLYGON [°]> MULTIPOLYGON (((-106.6234 3..., MULTIPOLYGON (((~
## Map to check
sf_state %>% 
  leaflet() %>% 
  addTiles()   %>%
  addPolygons(
    label = ~NAME,  # Label each state with its name
    labelOptions = labelOptions(
      html = T,
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "13px",
      direction = "auto"
    )
  )
## Import county
sf_county = sf::st_read(path_gdb,"cb_2022_us_county_20m")  %>%  
      sf::st_transform("+proj=longlat +datum=NAD83") 
Reading layer `cb_2022_us_county_20m' from data source 
  `D:\git\analytics-corner\pages\blog\issues\999 - working with gdb\cb_2022_us_all_20m.gdb' 
  using driver `OpenFileGDB'
Simple feature collection with 3222 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS:  NAD83
## Check data structure
glimpse(sf_county)
Rows: 3,222
Columns: 13
$ STATEFP    <chr> "17", "27", "37", "47", "06", "17", "19", "22", "18", "33",~
$ COUNTYFP   <chr> "127", "017", "181", "079", "021", "093", "095", "003", "05~
$ COUNTYNS   <chr> "01784730", "00659454", "01008591", "01639755", "00277275",~
$ AFFGEOID   <chr> "0500000US17127", "0500000US27017", "0500000US37181", "0500~
$ GEOID      <chr> "17127", "27017", "37181", "47079", "06021", "17093", "1909~
$ NAME       <chr> "Massac", "Carlton", "Vance", "Henry", "Glenn", "Kendall", ~
$ NAMELSAD   <chr> "Massac County", "Carlton County", "Vance County", "Henry C~
$ STUSPS     <chr> "IL", "MN", "NC", "TN", "CA", "IL", "IA", "LA", "IN", "NH",~
$ STATE_NAME <chr> "Illinois", "Minnesota", "North Carolina", "Tennessee", "Ca~
$ LSAD       <chr> "06", "06", "06", "06", "06", "06", "06", "15", "06", "06",~
$ ALAND      <dbl> 614218330, 2230473967, 653701481, 1455320362, 3403160299, 8~
$ AWATER     <dbl> 12784614, 36173451, 42190675, 81582236, 33693344, 5136525, ~
$ SHAPE      <MULTIPOLYGON [°]> MULTIPOLYGON (((-88.92876 3..., MULTIPOLYGON (~
## Map to check
sf_county %>% 
  leaflet() %>% 
  addTiles()   %>%
  addPolygons(
    label = ~NAME,  # Label each county with its name
    labelOptions = labelOptions(
      html = T,
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "13px",
      direction = "auto"
    )
  )

Step 3: Wrangling to focus on PA

Once we have imported the boundaries as sf objects, we can start to leverage tidyverse tools for wrangling and analysis. As a simple walkthrough, let's do the following:

  • filter the boundaries for PA
  • append some data for mapping
  • operationalize some nicer labels for counties
## Filter for PA
sf_county_pa = sf_county %>% 
  filter(STUSPS == 'PA')


## Add some data
library(arrow)
df_demographics = arrow::read_parquet("df_demographics.parquet") %>% 
  filter(geo == 'county',
         year == 2020) %>%
  select(GEOID = geoid, median_age)
sf_county_pa = sf_county_pa %>% 
  left_join(df_demographics, by = 'GEOID') 

## Op. labels
sf_county_pa = sf_county_pa %>% 
  rowwise() %>% 
  mutate(
    county_age_label =  paste0(NAME, '<br>', 
                               'Median Age: ', median_age) %>% 
      htmltools::HTML()
  ) %>% 
  ungroup()
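
With the median age joined on and the HTML labels built, a quick choropleth ties everything together. Here is a minimal sketch using leaflet's colorNumeric() palette helper (the viridis palette choice is arbitrary):

## Map PA counties shaded by 2020 median age, labeled with the HTML built above
pal = leaflet::colorNumeric("viridis", domain = sf_county_pa$median_age)

sf_county_pa %>% 
  leaflet() %>% 
  addTiles() %>%
  addPolygons(
    fillColor = ~pal(median_age),
    fillOpacity = 0.7,
    weight = 1,
    color = "white",
    label = ~county_age_label,
    labelOptions = labelOptions(html = TRUE)
  ) %>% 
  addLegend(pal = pal, values = ~median_age, title = "Median Age")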

Summary

In this post we learned how to work with geodatabases in R. We used the sf package to list the layers in a .gdb and import boundaries, dplyr to wrangle them, and leaflet to visualize the data.