Code

## these are the main libraries we'll use for this exploration
using CairoMakie
using DataFrames
using DataStructures: OrderedDict
using HTTP
using JSON3
using Tidier

set_theme!(theme_light())

1 Introduction

Ahhhhh a company summer Friday. For many, it’s a day to relax–perhaps spend some time frolicking in the sun, enjoying a day where things like code, KPIs, and other business-y, work type things are far from their minds.

Well, it was hot outside today y’all–and Quarto now has a native Julia engine that’s been tempting me to write another Julia post…so I decided that I would spend some time getting my hands dirty with census data in Julia! Specifically, I was interested in exploring trends around bike commuting in different cities.

I’m hoping that this post serves as both an example of working with the Census API in Julia as well as a demonstration of Tidier.jl as a data manipulation library. This post leans heavily on Tidier.jl, as I’ve been testing it out in place of leaning on DataFramesMeta.jl for analysis workflows. Spoiler alert, it’s awesome–and if you’re already familiar with R, it’s super easy to pick up.

2 How to Work with the ACS Data in Julia

Note

If you aren’t interested in the how-to steps, you can skip straight to Section 3.

With any data analysis project, the first step is getting the data. Luckily, the Census maintains an API that we can use to access the treasure trove of data that they collect. The data we’ll be working with today comes from the American Community Survey (ACS). The ACS is described by the census as:

an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help inform how trillions of dollars in federal funds are distributed each year. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.

2.1 The Census JSON Response

I won’t review the various inner workings of the Census API (you can read their extensive user guide for that), but I will briefly go over how to work with the API response in Julia.

The Census API returns requests in a non-standard JSON format, described by their user guide as:

[the API uses] a nonstandard version of JSON that is streamlined:

Data are represented in a two-dimensional array

Square brackets [ ] hold arrays

Values are separated by a , (comma).

Since this is a non-standard format, that means we have to do a little work to put this into a DataFrame. There may be a slicker way to do this, but this is my approach! Let’s go over a short example.

We start by getting the response, and then reading this into a JSON array using JSON3. Let’s use a string that mocks the return of the API as an example.

json_data_ex = """
[
  ["STNAME","POP","DATE_","state"],
  ["Alabama","4849377","7","01"],
  ["Alaska","736732","7","02"],
  ["Arizona","6731484","7","04"],
  ["Arkansas","2966369","7","05"],
  ["California","38802500","7","06"]
]
"""

json_array_ex = JSON3.read(json_data_ex)

6-element JSON3.Array{JSON3.Array, Base.CodeUnits{UInt8, String}, Vector{UInt64}}:
 ["STNAME", "POP", "DATE_", "state"]
 ["Alabama", "4849377", "7", "01"]
 ["Alaska", "736732", "7", "02"]
 ["Arizona", "6731484", "7", "04"]
 ["Arkansas", "2966369", "7", "05"]
 ["California", "38802500", "7", "06"]

Note that the headers (what will be our column names) are element 1 of the array, and the data is everything from the 2nd element on. We’ll store these as header_ex and data_ex respectively.

header_ex, data_ex = json_array_ex[1], json_array_ex[2:end]

To get these into the proper format for use in a DataFrame, we need a data structure that contains

the name of the column
the values of that column

Ideally, this data structure would also store columns in the order that we enter them into the API call.

As you might have noticed if you took a look at the libraries we’re using, we can use an OrderedDict for this purpose. We’ll also assign it a type of {Symbol, Vector} for clarity.

df_ex = OrderedDict{Symbol, Vector}()

So, now we can enumerate the headers, and use list comprehension to create a vector of values that belong to that header–storing these in the OrderedDict we created.

for (i, col_name) in enumerate(header_ex)
    df_ex[Symbol(col_name)] = [row[i] for row in data_ex]
end

df_ex

OrderedDict{Symbol, Vector} with 4 entries:
  :STNAME => ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]
  :POP    => ["4849377", "736732", "6731484", "2966369", "38802500"]
  :DATE_  => ["7", "7", "7", "7", "7"]
  :state  => ["01", "02", "04", "05", "06"]

While reading this, you may have recognized that a Dict (yes, even an OrderedDict!) of the format Dict(Symbol => Vector) is a valid and common constructor for a DataFrame in Julia. So, the last step is to simply call DataFrame(df), which constructs a DataFrame from the OrderedDict.

DataFrame(df_ex)

5×4 DataFrame

Row	STNAME	POP	DATE_	state
	String	String	String	String
1	Alabama	4849377	7	01
2	Alaska	736732	7	02
3	Arizona	6731484	7	04
4	Arkansas	2966369	7	05
5	California	38802500	7	06

Now we have a DataFrame!

The nice part of this approach is its flexibility to the number of variables we request from the API and returns them in the same order that we input them. The downside is that it seems to return everything as a string, but we can take care of that later on in the analysis process.

2.2 Making it Repeatable with a Function!

We can make our calls to the API from within Julia a bit more pleasant by writing a function which lets us:

specify the variables to return (get in the API docs)
specify the geographic area (for in the API docs)
lets us specify an addition (in in the API docs)

Rather than carefully re-writing the base URL for our request and painstakingly adding each additional query string, we can specify this as a query within the HTTP.get() call.

If there are elements that we aren’t using, we should skip those as they can actually break the API call. I approached this by specifying a Dict that is initialized with the get variables. If additional elements are empty, we skip them. We also include the API key (which you should store somewhere that isn’t plain text, as I have done here) by pushing it to my Julia ENV.

The function is below:

function get_census_data(url; vars = [], _for = "", _in = "")
    """
        function to get census data at a specific URL, at a specified level.
    """
    census_query = Dict("get" => join(vars, ","))

    if !isempty(_for)
        census_query["for"] = _for
    end

    if !isempty(_in)
        census_query["in"] = _in
    end
    
    #=
    my API calls are working without this, but generally the census API requires that your key is included.
    please get this in your ENV somehow and don't put this in plain text :)
    =#
    
    # census_query["key"] = ENV["CENSUS_API_KEY"]

    r = HTTP.get(
        url,
        query = census_query
    )

    body = JSON3.read(r.body)
    header, data = body[1], body[2:end]

    df = OrderedDict{Symbol, Vector}()
    for (i, col_name) in enumerate(header)
        df[Symbol(col_name)] = [row[i] for row in data]
    end

    return DataFrame(df)
end

This function gets the API response and parses it to a DataFrame format (albeit where all the columns are strings…).

From a function design perspective, forcing things like vars and _for to be explicit keyword arguments–in my opinion–improves the readability of the function when we actually use it in a script. For example, here’s a call to the ACS API which returns the name of the “place”, the total population, the total number of commuters, and the total number of bike commuters for every state.

bike_commuters_by_place_ex = get_census_data(
    "https://api.census.gov/data/2022/acs/acs1",
    vars = [
        "NAME",
        "B01001_001E", # total population
        "B08006_001E", # total commuters
        "B08006_014E" # bike commuters
    ],
    _for = "place:*",
    _in = "state:*"
);

first(bike_commuters_by_place_ex, 10)

10×6 DataFrame

Row	NAME	B01001_001E	B08006_001E	B08006_014E	state	place
	String	Union…	Union…	Union…	String	String
1	Auburn city, Alabama	80009			01	03076
2	Birmingham city, Alabama	196353	89157	397	01	07000
3	Dothan city, Alabama	70524			01	21184
4	Hoover city, Alabama	92427			01	35896
5	Huntsville city, Alabama	222363	112793	0	01	37000
6	Mobile city, Alabama	183282			01	50000
7	Montgomery city, Alabama	196986			01	51000
8	Tuscaloosa city, Alabama	110598			01	77256
9	Anchorage municipality, Alaska	287145	150599	625	02	03000
10	Avondale city, Arizona	91618			04	04720

I don’t see any Census-related packages in Julia yet, so this may be the basis of my own Census data package in the future…TBD 👀 but for now, let’s go ahead and start looking into some trends in this data.

3 Digging into Trends Around Bike Commuting

3.1 Washington, DC

I was particularly interested around the bike commuting trends in Washington, DC. After all, DC consistently ranks pretty highly on lists of bike-friendly cities.

Note

A methodological note here, the variable group/table I’m using is B08006 – Sex of Workers By Means of Transportation To Work. I chose to use this because it’s what the League of American Cyclists used to produce this report, and I was able to replicate the numbers from this report (based on the 2016 ACS) as a sanity check for my own numbers.

With the exception of Davis, California…I was not able to verify the top-line 16.6% number because the API didn’t seem to return any data for Davis. The other cities’ numbers matched up though.

I pulled all the ACS data on bike commuting from 2005 to 2022.

Code

# these are some additional helper functions
function pull_acs_by_year(year)
    """
        pull the ACS data of interest by year,
    """
    df = get_census_data(
        "https://api.census.gov/data/$year/acs/acs1",
        vars = [
            "NAME",
            "B01001_001E", # total population
            "B08006_001E", # total commuters
            "B08006_014E", # bike commuters
            "B08006_017E", # worked from home
        ],
        _for = "place:*",
        _in = "state:*"
    )
    return insertcols!(df, :data_year => year)
end

function combine_census_data_of_interest()
    census_y_2005_2019 = reduce(vcat, [pull_acs_by_year(year) for year in 2005:2019])
    census_y_2021_2022 = reduce(vcat, [pull_acs_by_year(year) for year in 2021:2022]) # there is no 2020 data...for obvious reasons.
    return vcat(census_y_2005_2019, census_y_2021_2022)
end

# this combines the ACS data described above for multiple years.
census_all_data = combine_census_data_of_interest()

# just making these variables easier to work with
rename!(
    census_all_data,
    :B01001_001E => :total_population,
    :B08006_001E => :total_commuters,
    :B08006_014E => :total_bike_commuters,
    :B08006_017E => :total_worked_from_home
)

first(census_all_data, 10)

10×8 DataFrame

Row	NAME	total_population	total_commuters	total_bike_commuters	total_worked_from_home	state	place	data_year
	String	Union…	Union…	Union…	Union…	String	String	Int64
1	Birmingham city, Alabama	222154				01	07000	2005
2	Hoover city, Alabama	74473				01	35896	2005
3	Huntsville city, Alabama	158618				01	37000	2005
4	Mobile city, Alabama	193332				01	50000	2005
5	Montgomery city, Alabama	193042				01	51000	2005
6	Tuscaloosa city, Alabama	68006				01	77256	2005
7	Anchorage municipality, Alaska	266281	126456	0	212	02	03000	2005
8	Avondale city, Arizona	61666				04	04720	2005
9	Chandler city, Arizona	225725	118423	0	440	04	12000	2005
10	Gilbert town, Arizona	178539				04	27400	2005

Let’s check out the trend in DC. After a first pass at the data, I thought it’d make sense to check out the trend over the past 10 years (excluding 2020, where we don’t have ACS data we can access due to COVID).

Code

dc_bikers_by_year = @chain census_all_data begin
    @filter contains(NAME, "District")
    @transmute(
        data_year,
        name = NAME,
        # here's where we fix the fact that data comes through as strings :)
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        total_bike_commuters = as_integer(total_bike_commuters),
        total_worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = total_bike_commuters / total_commuters,
        percent_wfh = total_worked_from_home / total_commuters
    )
    @filter data_year >= 2012
end

f = Figure(
    fonts = (; regular = "IBM Plex Sans", color = :black),
    dpi = 1200
)
ax = Axis(
    f[1, 1],
    xlabel = "ACS Data Year",
    title = "Percent of Workers Commuting by Bike",
    subtitle = "in Washington, DC (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
lines!(ax, dc_bikers_by_year[!, :data_year], dc_bikers_by_year[!, :percent_bike_commuters], color = :black, linewidth = 2)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
f

Source: American Community Survey, 2012 - 2022

The percent of workers commuting by bike hit a high point in 2017 with 4.96% of workers commuting by bike, before beginning a sharp decline and hitting rock bottom in 2021 at 2.1%. Of course, something happened around 2020 which changed commuting patterns in a significant way–perhaps even forever! So there’s another trend we have to investigate alongside the decline in bike commuting, and that’s the percent of workers working from home.

Code

f = Figure(
    fonts = (; regular = "IBM Plex Sans", color = :black),
    dpi = 1200
)
ax = Axis(
    f[1, 1],
    xlabel = "ACS Data Year",
    title = "Percent of Workers Working from Home",
    subtitle = "in Washington, DC (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
lines!(ax, dc_bikers_by_year[!, :data_year], dc_bikers_by_year[!, :percent_wfh], color = :black, linewidth = 2)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
f

Unsurprisingly, the percent of workers that said they were working from home explodes after 2020. In 2021, 48.2% of workers in DC said that they were working from home–almost half the workforce!

3.2 Trends in Other Cities

Let’s take a look at two other cities that my partner and I are currently considering moving to: Chicago, IL and Portland, OR.

For clarity, this code chunk generates the datasets we’ll use to make our plot.

Code

chicago_bikers_by_year = @chain census_all_data begin
    @filter contains(NAME, "Chicago")
    @transmute(
        data_year,
        name = NAME,
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        total_bike_commuters = as_integer(total_bike_commuters),
        total_worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = total_bike_commuters / total_commuters,
        percent_wfh = total_worked_from_home / total_commuters
    )
    @filter data_year >= 2012
end

portland_bikers_by_year = @chain census_all_data begin
    @filter contains(NAME, "Portland city, Oregon")
    @transmute(
        data_year,
        name = NAME,
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        total_bike_commuters = as_integer(total_bike_commuters),
        total_worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = total_bike_commuters / total_commuters,
        percent_wfh = total_worked_from_home / total_commuters
    )
    @filter data_year >= 2012
end

This code chunk generates the plot.

Code

f = Figure(
    fonts = (; regular = "IBM Plex Sans", color = :black),
    dpi = 1200
)
ax1 = Axis(
    f[1, 1],
    title = "Bike to Work",
    subtitle = "in Chicago, IL (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black,
    ylabel = "% of Workers"
)
ax2 = Axis(
    f[2, 1],
    title = "Bike to Work",
    subtitle = "in Portland, OR (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black,
    ylabel = "% of Workers"
)
ax3 = Axis(
    f[1, 2],
    title = "Work from Home",
    subtitle = "in Chicago, IL (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
ax4 = Axis(
    f[2, 2],
    title = "Work from Home",
    subtitle = "in Portland, OR (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)

axes = [ax1, ax2, ax3, ax4]
plot_data = [chicago_bikers_by_year, portland_bikers_by_year, chicago_bikers_by_year, portland_bikers_by_year]
plot_columns = ["percent_bike_commuters", "percent_bike_commuters", "percent_wfh", "percent_wfh"]

for (ax, ds, col) in zip(axes, plot_data, plot_columns)
  lines!(
    ax, 
    ds[!, :data_year], 
    ds[!, Symbol(col)], 
    color = :black, 
    linewidth = 2
  )
  vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
end

f

The WFH spike that we also observed in DC is not really surprising here. What is surprising to me is that Portland, supposedly the best city for biking in the United States, was seeing a pretty steep decline in the percent of bike commuters pre-COVID.

I don’t have any theories on that just from this data. The concept of an investigation into why bike commuting (perhaps cycling in general?) has declined in Portland is an interesting idea I may explore later…though some cursory research shows that the Portland Department of Transportation produced their own research into this exact question.

For Chicago, I don’t have any priors on how Chicago is for biking besides a few arguments over their low city rating from people for bikes. I’m sure the winter weather plays a huge factor into the relatively low percentage of people commuting to work by bike. Chicago also has a decently robust public transit network, which may also contribute to its low numbers.

3.3 Top Cities for Bike Commuting (and working from home)

These are the cities that have the highest share of bike commuters according to the most recent ACS data from 2022.

Code

top_cities = @chain census_all_data begin
    @filter(
        data_year == 2022,
        !isnothing(total_bike_commuters)
    )
    @transmute(
        data_year,
        NAME,
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        bike_commuters = as_integer(total_bike_commuters),
        worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = bike_commuters / total_commuters,
        percent_wfh = worked_from_home / total_commuters
    )
    @arrange desc(percent_bike_commuters)
    first(10)
    @mutate(
        rn = row_number(),
        name = str_remove(NAME, " city")
    )
end

f = Figure(
    fonts = (; regular = "IBM Plex Sans"),
    dpi = 600,
    width = 1200
)
ax = Axis(
    f[1, 1],
    ygridvisible = false,
    title = "Cities with Highest Share of Bike Commuters in 2022",
    yticks = (reverse(1:10), [name for name in top_cities[!, :name]]),
    xgridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    xticks = [0.0, 0.03, 0.06, 0.09],
    xtickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
barplot!(
    ax,
    top_cities[!, :rn],
    reverse(top_cities[!, :percent_bike_commuters]),
    bar_labels = [string.(round.(x * 100, digits = 2), "%") for x in reverse(top_cities[!, :percent_bike_commuters])],
    direction = :x,
    color = :black
)
xlims!(0, 0.10)
f

And for fun, these are the cities with the highest percentage of workers working from home.

Code

top_wfh_cities = @chain census_all_data begin
    @filter(
        data_year == 2022,
        !isnothing(total_bike_commuters)
    )
    @transmute(
        data_year,
        NAME,
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        bike_commuters = as_integer(total_bike_commuters),
        worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = bike_commuters / total_commuters,
        percent_wfh = worked_from_home / total_commuters
    )
    @arrange desc(percent_wfh)
    first(10)
    @mutate(
        rn = row_number(),
        name = str_remove(NAME, " city")
    )
end

f = Figure(
    fonts = (; regular = "IBM Plex Sans"),
    dpi = 600,
    width = 1200
)
ax = Axis(
    f[1, 1],
    ygridvisible = false,
    title = "Cities with Highest Share of Working from Home 2022",
    yticks = (reverse(1:10), [name for name in top_wfh_cities[!, :name]]),
    xgridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    xticks = [0.0, 0.15, 0.3, 0.45],
    xtickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
barplot!(
    ax,
    top_wfh_cities[!, :rn],
    reverse(top_wfh_cities[!, :percent_wfh]),
    bar_labels = [string.(round.(x * 100, digits = 2), "%") for x in reverse(top_wfh_cities[!, :percent_wfh])],
    direction = :x,
    color = :black
)
xlims!(0, 0.46)
f

4 Wrapping Up

I hope that this post illustrates that there are some interesting trends to dig into hiding in the census data. I’m also hoping that this post provides a good basis for how to work with that census data in Julia, including how to manipulate data with Tidier.jl and how to plot data with CairoMakie.jl. I’ve been messing with this all day, so now it’s time for me to log off and spend some time with my partner and our cat.

As always, if this post was helpful to you–please let me know!