GDP Data Checks
GDP Data Checks
Ran Li
4/10/2020
1. Data Management
“GDP_L1AD_09302019.csv” and “SECGDPGPC_L1AD_20200124.csv” were cleaned and merged to include all overlapping L1/year data points. The tables below displays the cleaned data used for this analysis and a summary of its coverage, respectively. The two new variables are:
- gdp.a which is the raw value of “secgdpcapl1ad” fomr the original dataset “GDP_L1AD_09302019”
- gdp.b which is the raw value of “secgdpgppc” from the new dataset “SECGDPGPC_L1AD_20200124.csv”
# First data set ("GDP_L1AD_09302019")
dfa = read.csv("Data/GDP_L1AD_09302019.csv") %>%
as_tibble() %>%
clean_names() %>%
select(iso2,salid1, year,gdp.a = secgdpcapl1ad )
# Second data set ("SECGDPGPC_L1AD_20200124")
dfb = read.csv("Data/SECGDPGPC_L1AD_20200124.csv") %>%
as_tibble() %>%
clean_names() %>%
select(iso2,salid1, year, gdp.b = secgdpgppc)
# join and tidy
dfc = dfa %>%
left_join(dfb, by = c("iso2", "salid1", "year")) %>%
drop_na() %>%
mutate_if(is.numeric,~as.double(.x))
dfc %>%
mutate(seed = runif(nrow(dfc),1310,2310)) %>%
mutate(id = round(salid1*seed,0)) %>%
select(iso2,id,year, gdp.a, gdp.b) %>%
DT::datatable()
# Coverage
dfc %>%
count(year, iso2) %>%
count(iso2, n) %>%
mutate(years_available = c("2002-2015","2001", "2002-2015","2005,2010,2015") ) %>%
select(iso2,
"Number of Years Available" = nn,
"Years" = years_available,
"Number of L1 available"=n) %>%
kable()
iso2 | Number of Years Available | Years | Number of L1 available |
---|---|---|---|
BR | 14 | 2002-2015 | 152 |
CL | 1 | 2001 | 7 |
CL | 14 | 2002-2015 | 9 |
MX | 3 | 2005,2010,2015 | 56 |
We have overlapping data for three countries (BR, CL, and MX) with 14 years available (2001-2015) for BR and CL but only 3 available (2005,2010,2015) for MX. It is noted that for 2001, we are missing GDP for two L1’s.
# Summary statistic for within city long term variation
df.cv = dfc %>%
pivot_longer(cols= contains("gdp"),
names_to = "type",
values_to = "gdp") %>%
group_by(iso2, salid1, type) %>%
summarise(cv = sd(gdp)/mean(gdp)) %>%
ungroup() %>%
arrange(desc(cv))
# Summary Statistics for between dataset difference
df = dfc %>%
mutate(p_change = round(((gdp.b-gdp.a)/gdp.a),3)) %>%
arrange(desc(p_change))
In addition to the raw data, I constructed the following metrics to help describe trends in the data:
- Percent change of GDP (\(Change_{it} = (GDP_{1it}-GDP_{2it})/GDP_{1it}\)) for each city, \(i\), at time, \(t\), between the two data sets, \(d\). This metric allows us to identify cities which have large disparities
- Coefficient of Variation (\(CV_{di} = SD(GDP_{dit})/Mean(GDP_{dit})\)) for each city, \(i\), and GDP type ,\(d\), to summarize variation of GDP within city over time, \(t\). This will allow us to identify cities with potentially erroneous spikes/drops in GDP over time.
We will explore correlation and these two metrics in the data overall to identify potentially erroneous data points before foucsing on individual countries or cities.
2. Overall Analysis
# Overall Correlation
df %>%
filter(year%in%c(2005, 2010, 2015)) %>%
ggplot(aes(x = gdp.a, y= gdp.b, col = iso2))+
geom_point()+
geom_abline(slope = 1, intercept = 0)+
facet_grid(vars(year), vars(iso2))+
theme_bw()+
labs(x = "GDP A", y= "GDP B", title = "Fig 1: Correlation by country and year",
subtitle = "Black line = perfect correlation")
Visual inspection of subset of data suggest CL and MX are pretty tightly correlated. There may be issues with Brazil: 1) the data does not seem to match well and 2) the new GDP dataset (“GDP B” on the y axis) almost seems categorical for brazil rather than continous. We will look at Brazil more closely later on.
df %>% group_by(iso2) %>%
summarize("Correlation" = round(cor(gdp.a,gdp.b),3)) %>%
ungroup() %>% kable()
iso2 | Correlation |
---|---|
BR | 0.385 |
CL | 0.980 |
MX | 0.875 |
The correlation between datasets by country is shown above. Not surprising, Brazil has the lowest correlation coefficient and definitely warrants additional examinination.
In addition to examining the cross sectional correlation between datasets, we will check the volatility of within city GDP over time. CV is a useful statistic to capture volatility and the distribution of CV for each city by country and dataset is shown in the plot below.
# Within City variations.
df2 = df.cv %>%
group_by(iso2, type) %>%
mutate(outlier = cv > median(cv) + IQR(cv) * 2) %>%
ungroup() %>% rename(Dataset = type)
df2 %>%
ggplot(aes(x=iso2, y =cv, col = Dataset, label = salid1))+
geom_boxplot(outlier.shape = NA)+
geom_point(data = filter(df2, outlier),
position = position_jitterdodge(
jitter.width = 0.325
))+
theme_bw() +
labs(x= "", y ="Coefficient of Variation",
title = "Fig 2: Longituindal GDP volatility by dataset and country",
subtitle = "")+
scale_color_brewer(type = "qual", palette=2)
This plot highglights that according to the newer GDP dataset (gdp.n) there are several cities in Brazil which demonstrate huge shifts in GDP over time (CV > 0.6). We will examine the most volatile cities by country and data set in further sections but clearly, this higher level visualization suggests issues in the datasets for Brazil.
3. Brazil
Brazile has a correlation coefficient of only 0.385 for the values of GDP between both datasets. Here we use percent change of the new GDP value relative to the original GDP value to quantify the magnitude of change.
dft = df %>%
filter(iso2 == "BR") %>%
mutate(change_cat = case_when(
abs(p_change)<0.25 ~"<25%",
between(abs(p_change),0.25,0.5)~"25-50%",
abs(p_change)>0.5 ~ ">50%"
)) %>% count(change_cat, name = "Number L1/Years") %>%
mutate( "Proportion of all L1/Years" = round((`Number L1/Years`/2128),2)) %>%
mutate(Difference = case_when(change_cat=="<25%"~"Not Different (<25%)",
change_cat=="25-50%"~"Different (25-50%)",
change_cat==">50%"~"Very Different (>50%)",
)) %>%
mutate(Difference = factor(Difference,levels = c("Not Different (<25%)",
"Different (25-50%)",
"Very Different (>50%)"))) %>%
select(4,2,3)
kable(dft[c(1,3,2),])
Difference | Number L1/Years | Proportion of all L1/Years |
---|---|---|
Not Different (<25%) | 881 | 0.41 |
Different (25-50%) | 695 | 0.33 |
Very Different (>50%) | 552 | 0.26 |
Across all years, only 41% of the data has percent change less than 25% with almost a third of the data having greater than 50% change in the new value of GDP relative to the old. The plot below indicates that the prevalence of this large degree of difference between values of GDP is consistent over all years of the data.
df %>%
filter(iso2 == "BR") %>%
mutate(change_cat = case_when(
abs(p_change)<0.25 ~"<25%",
between(abs(p_change),0.25,0.5)~"25-50%",
abs(p_change)>0.5 ~ ">50%"
)) %>%
mutate(Difference = case_when(change_cat=="<25%"~"Not Different (<25%)",
change_cat=="25-50%"~"Different (25-50%)",
change_cat==">50%"~"Very Different (>50%)",
),
p_change =p_change*100) %>%
ggplot(aes(x = year, y = p_change, group = year))+geom_boxplot()+
labs(title = "Fig 3: Distribution of % change in GDP relative to original value by year",
x= "", y = "% Change",
subtitle = "Red lines indicate + or - 25%")+
geom_hline(yintercept = 25, lty = 2, col = 'red')+
geom_hline(yintercept = -25, lty = 2, col = 'red')+
scale_x_continuous(breaks = 2002:2015)+
theme_bw()
We can see than approximately half of all data for each year has an absoluted change greater than 25%. The scatterplot of the data for two years is displayed below to allow for exploration of cities. Hover over dots to get names
dft = df %>%
filter(iso2 == "BR") %>%
filter(year%in%c(2002,2015)) %>%
mutate(change_cat = case_when(
abs(p_change)<0.25 ~"<25%",
between(abs(p_change),0.25,0.5)~"25-50%",
abs(p_change)>0.5 ~ ">50%"
))
p = dft%>%
left_join(xwalk.br) %>%
mutate(p_change =paste(p_change*100,"%")) %>%
ggplot(aes(gdp.a, gdp.b, col = change_cat,
text=sprintf("Change: %s<br>Name: %s", p_change, name))) +
geom_point() +
geom_abline(slope = 1, intercept = 0)+
facet_wrap(~year) +
labs(title = "Fig 4: Highlighting specific Cities where data is unconsistent")+
theme_bw() %>%
scale_color_manual(values = c("black","red","orange"))
ggplotly(p)
There are several red dots which are associated with up to 90% differences in the values. In the subset of data above, we can see that in 2015 the new data severely underestimates the GDP for a cluster of cities (the red cluster on the bottom left). It looks like the deimals are the wrong place for these values. Also in 2002, the city of Rios of Ostras is a clear outlier.
In terms of the longitudinal trends, we can examine the 25 cities in the new dataset which have excessively large CV values of greater than 0.4.
p=df %>%
left_join(xwalk.br)%>%
filter(iso2 == "BR") %>%
filter(salid1%in%(df.cv %>% filter(cv>0.4) %>% pull(salid1) ) ) %>%
ggplot(aes(x = year, y= gdp.b, group = name, col = name ))+
geom_line()+labs(x = "",y = "GDP",
title = "Fig 5a: GDP from new data set for cities with CV >0.4")+
scale_x_continuous(breaks = seq(2003,2015,2))+
theme_bw();p
This actually incredible confusing. First, there is sharp drop in GDP in 2010 which raises a red flag. Second, the values between cities are so close, that you cannot see the the difference between many of the cities. Their lines are overlapping.I display below a artificially jittered version of the data (below) just to show that the multi city data is there but just masked.
p=df %>%
left_join(xwalk.br)%>%
filter(iso2 == "BR") %>%
filter(salid1%in%(df.cv %>% filter(cv>0.4) %>% pull(salid1) ) ) %>%
ggplot(aes(x = year, y= gdp.b, group = name, col = name ))+
geom_line(position=position_jitter(w=0.9, h=0.9))+
geom_line()+labs(x = "",y = "GDP",
title = "Fig 5b: GDP from new data set for cities with CV >0.4")+
scale_x_continuous(breaks = seq(2003,2015,2))+
theme_bw();p
Longitudinal GDP drops in five states. The majorities of these drops are in : Piauí, Bahia, and Parana, as noted by Josiah. But there are two cities in Para Maranhao which also display this drop. There are two cities in Para which actualyl rise pretty sharply. I added an interactive table below the plot so you can explore all the cities in the plots.
dfaa = df %>%
left_join(xwalk.br)%>%
filter(iso2 == "BR") %>%
filter(salid1%in%(df.cv %>% filter(cv>0.4) %>% pull(salid1) ) )
dfaa %>%
ggplot(aes(x = year, y= gdp.b, group = name, col = state_name ))+
geom_line(position=position_jitter(w=0.9, h=0.9))+
geom_line()+labs(x = "",y = "GDP",
title = "Fig 5c: GDP from new data set for cities with CV >0.4 by state")+
scale_x_continuous(breaks = seq(2003,2015,2))+
theme_bw() +
facet_wrap(~state_name)+
theme(axis.text.x = element_text(angle = 90))
DT::datatable(dfaa %>% filter(year%in%c(2009,2010)) %>%
select(state = state_name, city = name, year, gdp.b) %>%
arrange(state, city,year))
Take Away: The new GDP data for Brazil does not seem usable. The huge difference compared to the old data as well as dramatic drops in GDP 2010 suggest this data is not very reliable for Brazil.
4. Chile and MX
The visualization and correlation from the overall analysis suggest that the new Chile and MX GDP numbers should be pretty similiar to the old ones. We will examine these two countries together to double check.
df %>%
filter(iso2 != "BR") %>%
mutate(change_cat = case_when(
abs(p_change)<0.25 ~"<25%",
between(abs(p_change),0.25,0.5)~"25-50%",
abs(p_change)>0.5 ~ ">50%"
)) %>% count(change_cat, name = "Number L1/Years") %>%
mutate( "Fig 6. Proportion of all L1/Years" = round((`Number L1/Years`/319),2)) %>%
mutate(Difference = case_when(change_cat=="<25%"~"Not Different (<25%)",
change_cat=="25-50%"~"Different (25-50%)",
change_cat==">50%"~"Very Different (>50%)",
)) %>%
mutate(Difference = factor(Difference,levels = c("Not Different (<25%)",
"Different (25-50%)",
"Very Different (>50%)"))) %>%
select(4,2,3) %>%
kable()
Difference | Number L1/Years | Fig 6. Proportion of all L1/Years |
---|---|---|
Not Different (<25%) | 289 | 0.91 |
Very Different (>50%) | 3 | 0.01 |
Different (25-50%) | 9 | 0.03 |
91% of data points are pretty similiar with less than 25% relative percentage change.
df %>%
filter(iso2 != "BR") %>%
filter(abs(p_change)>0.25) %>%
mutate(id = round(salid1*14,0)) %>%
select(iso2,id,year, gdp.a,gdp.b,p_change) %>%
arrange(desc(abs(p_change))) %>%
kable()
iso2 | id | year | gdp.a | gdp.b | p_change |
---|---|---|---|---|---|
MX | 2858534 | 2015 | 17203 | 29090 | 0.691 |
MX | 2858464 | 2005 | 27523 | 10024 | -0.636 |
MX | 2858534 | 2010 | 17250 | 27090 | 0.570 |
MX | 2857526 | 2005 | 15134 | 21657 | 0.431 |
MX | 2858534 | 2005 | 19150 | 26781 | 0.398 |
MX | 2858464 | 2015 | 22348 | 30693 | 0.373 |
CL | 1443442 | 2015 | 57254 | 75318 | 0.316 |
CL | 1443596 | 2001 | 8744 | 11190 | 0.280 |
MX | 2858100 | 2005 | 12233 | 8857 | -0.276 |
CL | 1443484 | 2015 | 13308 | 16939 | 0.273 |
MX | 2857512 | 2005 | 17133 | 21657 | 0.264 |
CL | 1443596 | 2013 | 11615 | 14599 | 0.257 |
The 12 data points which have differences greater than 25% are shown in the table below. THeres may be potential issues, but some inconsistencies are to be expected. Overall doesn’t raise a red flag to me.
In terms of longitudinal volatility, the overall analysis only flagged a single Mexican city with CV greater than 0.2. However, to be through lets plot the trajectories top 5 most volatile cities in MX and CL.
df2 = df.cv %>% filter(iso2 != "BR") %>%
group_by(iso2, type) %>% top_n(5,cv)
df3 =df %>%
filter(iso2 != "BR") %>%
filter(salid1%in%df2$salid1 ) %>%
pivot_longer(cols = contains("gdp"),names_to = "type", values_to = "gdp") %>%
left_join(df.cv) %>%
mutate(tmp = ifelse(cv>0.2,"1","0"))
p=df3%>%
mutate(tmp = ifelse(salid1 == 204176,"1","0")) %>%
filter(iso2 == "CL") %>%
mutate(id = round(salid1*3423,0)) %>%
ggplot(aes(x = year, y= gdp,group = id, col = tmp ))+
geom_line()+
facet_wrap(~type)+
scale_color_manual(values = c("black","red"))+
theme(legend.position = "none")+
labs(x = "",y = "GDP",
title = "Fig 6: Chile cities with top 10 highest CV",
subtitle = "Hover to show pop up over drag to zoom."); ggplotly(p)
The ten most volatile cities for each Chile and country are shown in the plot above. Overall, everything looks smooth. No crazy dramatic shifts.
p=df3%>%
mutate(tmp = ifelse(salid1 == 204176,"1","0")) %>%
filter(iso2 == "MX") %>%
mutate(id = round(salid1*3423,0)) %>%
ggplot(aes(x = year, y= gdp,group = id, col = tmp ))+
geom_line()+
facet_wrap(~type)+
scale_color_manual(values = c("black","red"))+
theme(legend.position = "none")+
labs(x = "",y = "GDP",
title = "Fig 7: Mexican cities with top 10 highest CV",
subtitle = "Hover to show pop up over drag to zoom."); ggplotly(p)
overall, the top ten most volatile cities look okay. Only worryin trend is ID #######, which is highlighted in red. Other than this single abnormality, I would say that
Take Away: Mexico and Chile’s data looks pretty consistent. The only flag that came up was the new GDP value for Mexico city in 2005.
5. Summary
Brazil: New GDP for Brazil does not correlate strongly with old data. There are several cities which show dramatic drops in GDP in the year 2010 which is likely an error.
Chile: New Data looks good. Correlates well with old data and abnormal longitutinal trends. It is usable.
Mexico: New Data Correlates well with old data. Overall, there are no abnormal longitutinal trends. However SALID1 204176 may have an issue. But in general I think this data is also quite reliable.