1 Introduction
Ahhhhh a company summer Friday. For many, it’s a day to relax–perhaps spend some time frolicking in the sun, enjoying a day where things like code, KPIs, and other business-y, work type things are far from their minds.
Well, it was hot outside today y’all–and Quarto now has a native Julia engine that’s been tempting me to write another Julia post…so I decided that I would spend some time getting my hands dirty with census data in Julia! Specifically, I was interested in exploring trends around bike commuting in different cities.
I’m hoping that this post serves as both an example of working with the Census API in Julia as well as a demonstration of Tidier.jl
as a data manipulation library. This post leans heavily on Tidier.jl
, as I’ve been testing it out in place of leaning on DataFramesMeta.jl
for analysis workflows. Spoiler alert, it’s awesome–and if you’re already familiar with R, it’s super easy to pick up.
2 How to Work with the ACS Data in Julia
If you aren’t interested in the how-to steps, you can skip straight to Section 3.
With any data analysis project, the first step is getting the data. Luckily, the Census maintains an API that we can use to access the treasure trove of data that they collect. The data we’ll be working with today comes from the American Community Survey (ACS). The ACS is described by the census as:
an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help inform how trillions of dollars in federal funds are distributed each year. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.
2.1 The Census JSON Response
I won’t review the various inner workings of the Census API (you can read their extensive user guide for that), but I will briefly go over how to work with the API response in Julia.
The Census API returns requests in a non-standard JSON format, described by their user guide as:
[the API uses] a nonstandard version of JSON that is streamlined:
- Data are represented in a two-dimensional array
- Square brackets [ ] hold arrays
- Values are separated by a , (comma).
Since this is a non-standard format, that means we have to do a little work to put this into a DataFrame. There may be a slicker way to do this, but this is my approach! Let’s go over a short example.
We start by getting the response, and then reading this into a JSON array using JSON3
. Let’s use a string that mocks the return of the API as an example.
json_data_ex = """
[
["STNAME","POP","DATE_","state"],
["Alabama","4849377","7","01"],
["Alaska","736732","7","02"],
["Arizona","6731484","7","04"],
["Arkansas","2966369","7","05"],
["California","38802500","7","06"]
]
"""
json_array_ex = JSON3.read(json_data_ex)
6-element JSON3.Array{JSON3.Array, Base.CodeUnits{UInt8, String}, Vector{UInt64}}:
["STNAME", "POP", "DATE_", "state"]
["Alabama", "4849377", "7", "01"]
["Alaska", "736732", "7", "02"]
["Arizona", "6731484", "7", "04"]
["Arkansas", "2966369", "7", "05"]
["California", "38802500", "7", "06"]
Note that the headers (what will be our column names) are element 1 of the array, and the data is everything from the 2nd element on. We’ll store these as header_ex
and data_ex
respectively.
= json_array_ex[1], json_array_ex[2:end] header_ex, data_ex
To get these into the proper format for use in a DataFrame, we need a data structure that contains
- the name of the column
- the values of that column
Ideally, this data structure would also store columns in the order that we enter them into the API call.
As you might have noticed if you took a look at the libraries we’re using, we can use an OrderedDict
for this purpose. We’ll also assign it a type of {Symbol, Vector}
for clarity.
= OrderedDict{Symbol, Vector}() df_ex
So, now we can enumerate the headers, and use list comprehension to create a vector of values that belong to that header–storing these in the OrderedDict
we created.
for (i, col_name) in enumerate(header_ex)
Symbol(col_name)] = [row[i] for row in data_ex]
df_ex[end
df_ex
OrderedDict{Symbol, Vector} with 4 entries:
:STNAME => ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]
:POP => ["4849377", "736732", "6731484", "2966369", "38802500"]
:DATE_ => ["7", "7", "7", "7", "7"]
:state => ["01", "02", "04", "05", "06"]
While reading this, you may have recognized that a Dict
(yes, even an OrderedDict
!) of the format Dict(Symbol => Vector)
is a valid and common constructor for a DataFrame
in Julia. So, the last step is to simply call DataFrame(df)
, which constructs a DataFrame
from the OrderedDict
.
DataFrame(df_ex)
Row | STNAME | POP | DATE_ | state |
---|---|---|---|---|
String | String | String | String | |
1 | Alabama | 4849377 | 7 | 01 |
2 | Alaska | 736732 | 7 | 02 |
3 | Arizona | 6731484 | 7 | 04 |
4 | Arkansas | 2966369 | 7 | 05 |
5 | California | 38802500 | 7 | 06 |
Now we have a DataFrame!
The nice part of this approach is its flexibility to the number of variables we request from the API and returns them in the same order that we input them. The downside is that it seems to return everything as a string, but we can take care of that later on in the analysis process.
2.2 Making it Repeatable with a Function!
We can make our calls to the API from within Julia a bit more pleasant by writing a function which lets us:
- specify the variables to return (
get
in the API docs) - specify the geographic area (
for
in the API docs) - lets us specify an addition (
in
in the API docs)
Rather than carefully re-writing the base URL for our request and painstakingly adding each additional query string, we can specify this as a query
within the HTTP.get()
call.
If there are elements that we aren’t using, we should skip those as they can actually break the API call. I approached this by specifying a Dict
that is initialized with the get
variables. If additional elements are empty, we skip them. We also include the API key (which you should store somewhere that isn’t plain text, as I have done here) by pushing it to my Julia ENV
.
The function is below:
function get_census_data(url; vars = [], _for = "", _in = "")
"""
function to get census data at a specific URL, at a specified level.
"""
census_query = Dict("get" => join(vars, ","))
if !isempty(_for)
census_query["for"] = _for
end
if !isempty(_in)
census_query["in"] = _in
end
#=
my API calls are working without this, but generally the census API requires that your key is included.
please get this in your ENV somehow and don't put this in plain text :)
=#
# census_query["key"] = ENV["CENSUS_API_KEY"]
r = HTTP.get(
url,
query = census_query
)
body = JSON3.read(r.body)
header, data = body[1], body[2:end]
df = OrderedDict{Symbol, Vector}()
for (i, col_name) in enumerate(header)
df[Symbol(col_name)] = [row[i] for row in data]
end
return DataFrame(df)
end
This function gets the API response and parses it to a DataFrame format (albeit where all the columns are strings…).
From a function design perspective, forcing things like vars
and _for
to be explicit keyword arguments–in my opinion–improves the readability of the function when we actually use it in a script. For example, here’s a call to the ACS API which returns the name of the “place”, the total population, the total number of commuters, and the total number of bike commuters for every state.
bike_commuters_by_place_ex = get_census_data(
"https://api.census.gov/data/2022/acs/acs1",
vars = [
"NAME",
"B01001_001E", # total population
"B08006_001E", # total commuters
"B08006_014E" # bike commuters
],
_for = "place:*",
_in = "state:*"
);
first(bike_commuters_by_place_ex, 10)
Row | NAME | B01001_001E | B08006_001E | B08006_014E | state | place |
---|---|---|---|---|---|---|
String | Union… | Union… | Union… | String | String | |
1 | Auburn city, Alabama | 80009 | 01 | 03076 | ||
2 | Birmingham city, Alabama | 196353 | 89157 | 397 | 01 | 07000 |
3 | Dothan city, Alabama | 70524 | 01 | 21184 | ||
4 | Hoover city, Alabama | 92427 | 01 | 35896 | ||
5 | Huntsville city, Alabama | 222363 | 112793 | 0 | 01 | 37000 |
6 | Mobile city, Alabama | 183282 | 01 | 50000 | ||
7 | Montgomery city, Alabama | 196986 | 01 | 51000 | ||
8 | Tuscaloosa city, Alabama | 110598 | 01 | 77256 | ||
9 | Anchorage municipality, Alaska | 287145 | 150599 | 625 | 02 | 03000 |
10 | Avondale city, Arizona | 91618 | 04 | 04720 |
I don’t see any Census-related packages in Julia yet, so this may be the basis of my own Census data package in the future…TBD 👀 but for now, let’s go ahead and start looking into some trends in this data.
3 Digging into Trends Around Bike Commuting
3.1 Washington, DC
I was particularly interested around the bike commuting trends in Washington, DC. After all, DC consistently ranks pretty highly on lists of bike-friendly cities.
A methodological note here, the variable group/table I’m using is B08006 – Sex of Workers By Means of Transportation To Work. I chose to use this because it’s what the League of American Cyclists used to produce this report, and I was able to replicate the numbers from this report (based on the 2016 ACS) as a sanity check for my own numbers.
With the exception of Davis, California…I was not able to verify the top-line 16.6% number because the API didn’t seem to return any data for Davis. The other cities’ numbers matched up though.
I pulled all the ACS data on bike commuting from 2005 to 2022.
Code
# these are some additional helper functions
function pull_acs_by_year(year)
"""
pull the ACS data of interest by year,
"""
df = get_census_data(
"https://api.census.gov/data/$year/acs/acs1",
vars = [
"NAME",
"B01001_001E", # total population
"B08006_001E", # total commuters
"B08006_014E", # bike commuters
"B08006_017E", # worked from home
],
_for = "place:*",
_in = "state:*"
)
return insertcols!(df, :data_year => year)
end
function combine_census_data_of_interest()
census_y_2005_2019 = reduce(vcat, [pull_acs_by_year(year) for year in 2005:2019])
census_y_2021_2022 = reduce(vcat, [pull_acs_by_year(year) for year in 2021:2022]) # there is no 2020 data...for obvious reasons.
return vcat(census_y_2005_2019, census_y_2021_2022)
end
# this combines the ACS data described above for multiple years.
census_all_data = combine_census_data_of_interest()
# just making these variables easier to work with
rename!(
census_all_data,
:B01001_001E => :total_population,
:B08006_001E => :total_commuters,
:B08006_014E => :total_bike_commuters,
:B08006_017E => :total_worked_from_home
)
first(census_all_data, 10)
Row | NAME | total_population | total_commuters | total_bike_commuters | total_worked_from_home | state | place | data_year |
---|---|---|---|---|---|---|---|---|
String | Union… | Union… | Union… | Union… | String | String | Int64 | |
1 | Birmingham city, Alabama | 222154 | 01 | 07000 | 2005 | |||
2 | Hoover city, Alabama | 74473 | 01 | 35896 | 2005 | |||
3 | Huntsville city, Alabama | 158618 | 01 | 37000 | 2005 | |||
4 | Mobile city, Alabama | 193332 | 01 | 50000 | 2005 | |||
5 | Montgomery city, Alabama | 193042 | 01 | 51000 | 2005 | |||
6 | Tuscaloosa city, Alabama | 68006 | 01 | 77256 | 2005 | |||
7 | Anchorage municipality, Alaska | 266281 | 126456 | 0 | 212 | 02 | 03000 | 2005 |
8 | Avondale city, Arizona | 61666 | 04 | 04720 | 2005 | |||
9 | Chandler city, Arizona | 225725 | 118423 | 0 | 440 | 04 | 12000 | 2005 |
10 | Gilbert town, Arizona | 178539 | 04 | 27400 | 2005 |
Let’s check out the trend in DC. After a first pass at the data, I thought it’d make sense to check out the trend over the past 10 years (excluding 2020, where we don’t have ACS data we can access due to COVID).
Code
dc_bikers_by_year = @chain census_all_data begin
@filter contains(NAME, "District")
@transmute(
data_year,
name = NAME,
# here's where we fix the fact that data comes through as strings :)
total_population = as_integer(total_population),
total_commuters = as_integer(total_commuters),
total_bike_commuters = as_integer(total_bike_commuters),
total_worked_from_home = as_integer(total_worked_from_home)
)
@mutate(
percent_bike_commuters = total_bike_commuters / total_commuters,
percent_wfh = total_worked_from_home / total_commuters
)
@filter data_year >= 2012
end
f = Figure(
fonts = (; regular = "IBM Plex Sans", color = :black),
dpi = 1200
)
ax = Axis(
f[1, 1],
xlabel = "ACS Data Year",
title = "Percent of Workers Commuting by Bike",
subtitle = "in Washington, DC (2012 - 2022)",
xticks = [2012, 2014, 2016, 2018, 2020, 2022],
xgridvisible = false,
ygridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black
)
lines!(ax, dc_bikers_by_year[!, :data_year], dc_bikers_by_year[!, :percent_bike_commuters], color = :black, linewidth = 2)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
f
The percent of workers commuting by bike hit a high point in 2017 with 4.96% of workers commuting by bike, before beginning a sharp decline and hitting rock bottom in 2021 at 2.1%. Of course, something happened around 2020 which changed commuting patterns in a significant way–perhaps even forever! So there’s another trend we have to investigate alongside the decline in bike commuting, and that’s the percent of workers working from home.
Code
f = Figure(
fonts = (; regular = "IBM Plex Sans", color = :black),
dpi = 1200
)
ax = Axis(
f[1, 1],
xlabel = "ACS Data Year",
title = "Percent of Workers Working from Home",
subtitle = "in Washington, DC (2012 - 2022)",
xticks = [2012, 2014, 2016, 2018, 2020, 2022],
xgridvisible = false,
ygridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black
)
lines!(ax, dc_bikers_by_year[!, :data_year], dc_bikers_by_year[!, :percent_wfh], color = :black, linewidth = 2)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
f
Unsurprisingly, the percent of workers that said they were working from home explodes after 2020. In 2021, 48.2% of workers in DC said that they were working from home–almost half the workforce!
3.2 Trends in Other Cities
Let’s take a look at two other cities that my partner and I are currently considering moving to: Chicago, IL and Portland, OR.
For clarity, this code chunk generates the datasets we’ll use to make our plot.
Code
chicago_bikers_by_year = @chain census_all_data begin
@filter contains(NAME, "Chicago")
@transmute(
data_year,
name = NAME,
total_population = as_integer(total_population),
total_commuters = as_integer(total_commuters),
total_bike_commuters = as_integer(total_bike_commuters),
total_worked_from_home = as_integer(total_worked_from_home)
)
@mutate(
percent_bike_commuters = total_bike_commuters / total_commuters,
percent_wfh = total_worked_from_home / total_commuters
)
@filter data_year >= 2012
end
portland_bikers_by_year = @chain census_all_data begin
@filter contains(NAME, "Portland city, Oregon")
@transmute(
data_year,
name = NAME,
total_population = as_integer(total_population),
total_commuters = as_integer(total_commuters),
total_bike_commuters = as_integer(total_bike_commuters),
total_worked_from_home = as_integer(total_worked_from_home)
)
@mutate(
percent_bike_commuters = total_bike_commuters / total_commuters,
percent_wfh = total_worked_from_home / total_commuters
)
@filter data_year >= 2012
end
This code chunk generates the plot.
Code
f = Figure(
fonts = (; regular = "IBM Plex Sans", color = :black),
dpi = 1200
)
ax1 = Axis(
f[1, 1],
title = "Bike to Work",
subtitle = "in Chicago, IL (2012 - 2022)",
xticks = [2012, 2014, 2016, 2018, 2020, 2022],
xgridvisible = false,
ygridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black,
ylabel = "% of Workers"
)
ax2 = Axis(
f[2, 1],
title = "Bike to Work",
subtitle = "in Portland, OR (2012 - 2022)",
xticks = [2012, 2014, 2016, 2018, 2020, 2022],
xgridvisible = false,
ygridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black,
ylabel = "% of Workers"
)
ax3 = Axis(
f[1, 2],
title = "Work from Home",
subtitle = "in Chicago, IL (2012 - 2022)",
xticks = [2012, 2014, 2016, 2018, 2020, 2022],
xgridvisible = false,
ygridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black
)
ax4 = Axis(
f[2, 2],
title = "Work from Home",
subtitle = "in Portland, OR (2012 - 2022)",
xticks = [2012, 2014, 2016, 2018, 2020, 2022],
xgridvisible = false,
ygridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black
)
axes = [ax1, ax2, ax3, ax4]
plot_data = [chicago_bikers_by_year, portland_bikers_by_year, chicago_bikers_by_year, portland_bikers_by_year]
plot_columns = ["percent_bike_commuters", "percent_bike_commuters", "percent_wfh", "percent_wfh"]
for (ax, ds, col) in zip(axes, plot_data, plot_columns)
lines!(
ax,
ds[!, :data_year],
ds[!, Symbol(col)],
color = :black,
linewidth = 2
)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
end
f
The WFH spike that we also observed in DC is not really surprising here. What is surprising to me is that Portland, supposedly the best city for biking in the United States, was seeing a pretty steep decline in the percent of bike commuters pre-COVID.
I don’t have any theories on that just from this data. The concept of an investigation into why bike commuting (perhaps cycling in general?) has declined in Portland is an interesting idea I may explore later…though some cursory research shows that the Portland Department of Transportation produced their own research into this exact question.
For Chicago, I don’t have any priors on how Chicago is for biking besides a few arguments over their low city rating from people for bikes. I’m sure the winter weather plays a huge factor into the relatively low percentage of people commuting to work by bike. Chicago also has a decently robust public transit network, which may also contribute to its low numbers.
3.3 Top Cities for Bike Commuting (and working from home)
These are the cities that have the highest share of bike commuters according to the most recent ACS data from 2022.
Code
top_cities = @chain census_all_data begin
@filter(
data_year == 2022,
!isnothing(total_bike_commuters)
)
@transmute(
data_year,
NAME,
total_population = as_integer(total_population),
total_commuters = as_integer(total_commuters),
bike_commuters = as_integer(total_bike_commuters),
worked_from_home = as_integer(total_worked_from_home)
)
@mutate(
percent_bike_commuters = bike_commuters / total_commuters,
percent_wfh = worked_from_home / total_commuters
)
@arrange desc(percent_bike_commuters)
first(10)
@mutate(
rn = row_number(),
name = str_remove(NAME, " city")
)
end
f = Figure(
fonts = (; regular = "IBM Plex Sans"),
dpi = 600,
width = 1200
)
ax = Axis(
f[1, 1],
ygridvisible = false,
title = "Cities with Highest Share of Bike Commuters in 2022",
yticks = (reverse(1:10), [name for name in top_cities[!, :name]]),
xgridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
xticks = [0.0, 0.03, 0.06, 0.09],
xtickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black
)
barplot!(
ax,
top_cities[!, :rn],
reverse(top_cities[!, :percent_bike_commuters]),
bar_labels = [string.(round.(x * 100, digits = 2), "%") for x in reverse(top_cities[!, :percent_bike_commuters])],
direction = :x,
color = :black
)
xlims!(0, 0.10)
f
And for fun, these are the cities with the highest percentage of workers working from home.
Code
top_wfh_cities = @chain census_all_data begin
@filter(
data_year == 2022,
!isnothing(total_bike_commuters)
)
@transmute(
data_year,
NAME,
total_population = as_integer(total_population),
total_commuters = as_integer(total_commuters),
bike_commuters = as_integer(total_bike_commuters),
worked_from_home = as_integer(total_worked_from_home)
)
@mutate(
percent_bike_commuters = bike_commuters / total_commuters,
percent_wfh = worked_from_home / total_commuters
)
@arrange desc(percent_wfh)
first(10)
@mutate(
rn = row_number(),
name = str_remove(NAME, " city")
)
end
f = Figure(
fonts = (; regular = "IBM Plex Sans"),
dpi = 600,
width = 1200
)
ax = Axis(
f[1, 1],
ygridvisible = false,
title = "Cities with Highest Share of Working from Home 2022",
yticks = (reverse(1:10), [name for name in top_wfh_cities[!, :name]]),
xgridstyle = :dash,
ygridcolor = (:lightgrey, 0.45),
xticks = [0.0, 0.15, 0.3, 0.45],
xtickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
titlealign = :left,
titlecolor = :black
)
barplot!(
ax,
top_wfh_cities[!, :rn],
reverse(top_wfh_cities[!, :percent_wfh]),
bar_labels = [string.(round.(x * 100, digits = 2), "%") for x in reverse(top_wfh_cities[!, :percent_wfh])],
direction = :x,
color = :black
)
xlims!(0, 0.46)
f
4 Wrapping Up
I hope that this post illustrates that there are some interesting trends to dig into hiding in the census data. I’m also hoping that this post provides a good basis for how to work with that census data in Julia, including how to manipulate data with Tidier.jl
and how to plot data with CairoMakie.jl
. I’ve been messing with this all day, so now it’s time for me to log off and spend some time with my partner and our cat.
As always, if this post was helpful to you–please let me know!