Working With Census Data in Julia

In this post, we’ll explore some bike commuting data from the American Community Survey (ACS).
Julia
Transit
Author
Published

August 2, 2024

Code
## these are the main libraries we'll use for this exploration
using CairoMakie
using DataFrames
using DataStructures: OrderedDict
using HTTP
using JSON3
using Tidier

set_theme!(theme_light())

1 Introduction

Ahhhhh a company summer Friday. For many, it’s a day to relax–perhaps spend some time frolicking in the sun, enjoying a day where things like code, KPIs, and other business-y, work type things are far from their minds.

Well, it was hot outside today y’all–and Quarto now has a native Julia engine that’s been tempting me to write another Julia post…so I decided that I would spend some time getting my hands dirty with census data in Julia! Specifically, I was interested in exploring trends around bike commuting in different cities.

I’m hoping that this post serves as both an example of working with the Census API in Julia as well as a demonstration of Tidier.jl as a data manipulation library. This post leans heavily on Tidier.jl, as I’ve been testing it out in place of leaning on DataFramesMeta.jl for analysis workflows. Spoiler alert, it’s awesome–and if you’re already familiar with R, it’s super easy to pick up.

2 How to Work with the ACS Data in Julia

Note

If you aren’t interested in the how-to steps, you can skip straight to Section 3.

With any data analysis project, the first step is getting the data. Luckily, the Census maintains an API that we can use to access the treasure trove of data that they collect. The data we’ll be working with today comes from the American Community Survey (ACS). The ACS is described by the census as:

an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help inform how trillions of dollars in federal funds are distributed each year. Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.

2.1 The Census JSON Response

I won’t review the various inner workings of the Census API (you can read their extensive user guide for that), but I will briefly go over how to work with the API response in Julia.

The Census API returns requests in a non-standard JSON format, described by their user guide as:

[the API uses] a nonstandard version of JSON that is streamlined:

  • Data are represented in a two-dimensional array
  • Square brackets [ ] hold arrays
  • Values are separated by a , (comma).

Since this is a non-standard format, that means we have to do a little work to put this into a DataFrame. There may be a slicker way to do this, but this is my approach! Let’s go over a short example.

We start by getting the response, and then reading this into a JSON array using JSON3. Let’s use a string that mocks the return of the API as an example.

json_data_ex = """
[
  ["STNAME","POP","DATE_","state"],
  ["Alabama","4849377","7","01"],
  ["Alaska","736732","7","02"],
  ["Arizona","6731484","7","04"],
  ["Arkansas","2966369","7","05"],
  ["California","38802500","7","06"]
]
"""

json_array_ex = JSON3.read(json_data_ex)
6-element JSON3.Array{JSON3.Array, Base.CodeUnits{UInt8, String}, Vector{UInt64}}:
 ["STNAME", "POP", "DATE_", "state"]
 ["Alabama", "4849377", "7", "01"]
 ["Alaska", "736732", "7", "02"]
 ["Arizona", "6731484", "7", "04"]
 ["Arkansas", "2966369", "7", "05"]
 ["California", "38802500", "7", "06"]

Note that the headers (what will be our column names) are element 1 of the array, and the data is everything from the 2nd element on. We’ll store these as header_ex and data_ex respectively.

header_ex, data_ex = json_array_ex[1], json_array_ex[2:end]

To get these into the proper format for use in a DataFrame, we need a data structure that contains

  1. the name of the column
  2. the values of that column

Ideally, this data structure would also store columns in the order that we enter them into the API call.

As you might have noticed if you took a look at the libraries we’re using, we can use an OrderedDict for this purpose. We’ll also assign it a type of {Symbol, Vector} for clarity.

df_ex = OrderedDict{Symbol, Vector}()

So, now we can enumerate the headers, and use list comprehension to create a vector of values that belong to that header–storing these in the OrderedDict we created.

for (i, col_name) in enumerate(header_ex)
    df_ex[Symbol(col_name)] = [row[i] for row in data_ex]
end

df_ex
OrderedDict{Symbol, Vector} with 4 entries:
  :STNAME => ["Alabama", "Alaska", "Arizona", "Arkansas", "California"]
  :POP    => ["4849377", "736732", "6731484", "2966369", "38802500"]
  :DATE_  => ["7", "7", "7", "7", "7"]
  :state  => ["01", "02", "04", "05", "06"]

While reading this, you may have recognized that a Dict (yes, even an OrderedDict!) of the format Dict(Symbol => Vector) is a valid and common constructor for a DataFrame in Julia. So, the last step is to simply call DataFrame(df), which constructs a DataFrame from the OrderedDict.

DataFrame(df_ex)
5×4 DataFrame
Row STNAME POP DATE_ state
String String String String
1 Alabama 4849377 7 01
2 Alaska 736732 7 02
3 Arizona 6731484 7 04
4 Arkansas 2966369 7 05
5 California 38802500 7 06

Now we have a DataFrame!

The nice part of this approach is its flexibility to the number of variables we request from the API and returns them in the same order that we input them. The downside is that it seems to return everything as a string, but we can take care of that later on in the analysis process.

2.2 Making it Repeatable with a Function!

We can make our calls to the API from within Julia a bit more pleasant by writing a function which lets us:

  • specify the variables to return (get in the API docs)
  • specify the geographic area (for in the API docs)
  • lets us specify an addition (in in the API docs)

Rather than carefully re-writing the base URL for our request and painstakingly adding each additional query string, we can specify this as a query within the HTTP.get() call.

If there are elements that we aren’t using, we should skip those as they can actually break the API call. I approached this by specifying a Dict that is initialized with the get variables. If additional elements are empty, we skip them. We also include the API key (which you should store somewhere that isn’t plain text, as I have done here) by pushing it to my Julia ENV.

The function is below:

function get_census_data(url; vars = [], _for = "", _in = "")
    """
        function to get census data at a specific URL, at a specified level.
    """
    census_query = Dict("get" => join(vars, ","))

    if !isempty(_for)
        census_query["for"] = _for
    end

    if !isempty(_in)
        census_query["in"] = _in
    end
    
    #=
    my API calls are working without this, but generally the census API requires that your key is included.
    please get this in your ENV somehow and don't put this in plain text :)
    =#
    
    # census_query["key"] = ENV["CENSUS_API_KEY"]

    r = HTTP.get(
        url,
        query = census_query
    )

    body = JSON3.read(r.body)
    header, data = body[1], body[2:end]

    df = OrderedDict{Symbol, Vector}()
    for (i, col_name) in enumerate(header)
        df[Symbol(col_name)] = [row[i] for row in data]
    end

    return DataFrame(df)
end

This function gets the API response and parses it to a DataFrame format (albeit where all the columns are strings…).

From a function design perspective, forcing things like vars and _for to be explicit keyword arguments–in my opinion–improves the readability of the function when we actually use it in a script. For example, here’s a call to the ACS API which returns the name of the “place”, the total population, the total number of commuters, and the total number of bike commuters for every state.

bike_commuters_by_place_ex = get_census_data(
    "https://api.census.gov/data/2022/acs/acs1",
    vars = [
        "NAME",
        "B01001_001E", # total population
        "B08006_001E", # total commuters
        "B08006_014E" # bike commuters
    ],
    _for = "place:*",
    _in = "state:*"
);

first(bike_commuters_by_place_ex, 10)
10×6 DataFrame
Row NAME B01001_001E B08006_001E B08006_014E state place
String Union… Union… Union… String String
1 Auburn city, Alabama 80009 01 03076
2 Birmingham city, Alabama 196353 89157 397 01 07000
3 Dothan city, Alabama 70524 01 21184
4 Hoover city, Alabama 92427 01 35896
5 Huntsville city, Alabama 222363 112793 0 01 37000
6 Mobile city, Alabama 183282 01 50000
7 Montgomery city, Alabama 196986 01 51000
8 Tuscaloosa city, Alabama 110598 01 77256
9 Anchorage municipality, Alaska 287145 150599 625 02 03000
10 Avondale city, Arizona 91618 04 04720

I don’t see any Census-related packages in Julia yet, so this may be the basis of my own Census data package in the future…TBD 👀 but for now, let’s go ahead and start looking into some trends in this data.

3 Digging into Trends Around Bike Commuting

3.1 Washington, DC

I was particularly interested around the bike commuting trends in Washington, DC. After all, DC consistently ranks pretty highly on lists of bike-friendly cities.

Note

A methodological note here, the variable group/table I’m using is B08006 – Sex of Workers By Means of Transportation To Work. I chose to use this because it’s what the League of American Cyclists used to produce this report, and I was able to replicate the numbers from this report (based on the 2016 ACS) as a sanity check for my own numbers.

With the exception of Davis, California…I was not able to verify the top-line 16.6% number because the API didn’t seem to return any data for Davis. The other cities’ numbers matched up though.

I pulled all the ACS data on bike commuting from 2005 to 2022.

Code
# these are some additional helper functions
function pull_acs_by_year(year)
    """
        pull the ACS data of interest by year,
    """
    df = get_census_data(
        "https://api.census.gov/data/$year/acs/acs1",
        vars = [
            "NAME",
            "B01001_001E", # total population
            "B08006_001E", # total commuters
            "B08006_014E", # bike commuters
            "B08006_017E", # worked from home
        ],
        _for = "place:*",
        _in = "state:*"
    )
    return insertcols!(df, :data_year => year)
end

function combine_census_data_of_interest()
    census_y_2005_2019 = reduce(vcat, [pull_acs_by_year(year) for year in 2005:2019])
    census_y_2021_2022 = reduce(vcat, [pull_acs_by_year(year) for year in 2021:2022]) # there is no 2020 data...for obvious reasons.
    return vcat(census_y_2005_2019, census_y_2021_2022)
end

# this combines the ACS data described above for multiple years.
census_all_data = combine_census_data_of_interest()

# just making these variables easier to work with
rename!(
    census_all_data,
    :B01001_001E => :total_population,
    :B08006_001E => :total_commuters,
    :B08006_014E => :total_bike_commuters,
    :B08006_017E => :total_worked_from_home
)

first(census_all_data, 10)
10×8 DataFrame
Row NAME total_population total_commuters total_bike_commuters total_worked_from_home state place data_year
String Union… Union… Union… Union… String String Int64
1 Birmingham city, Alabama 222154 01 07000 2005
2 Hoover city, Alabama 74473 01 35896 2005
3 Huntsville city, Alabama 158618 01 37000 2005
4 Mobile city, Alabama 193332 01 50000 2005
5 Montgomery city, Alabama 193042 01 51000 2005
6 Tuscaloosa city, Alabama 68006 01 77256 2005
7 Anchorage municipality, Alaska 266281 126456 0 212 02 03000 2005
8 Avondale city, Arizona 61666 04 04720 2005
9 Chandler city, Arizona 225725 118423 0 440 04 12000 2005
10 Gilbert town, Arizona 178539 04 27400 2005

Let’s check out the trend in DC. After a first pass at the data, I thought it’d make sense to check out the trend over the past 10 years (excluding 2020, where we don’t have ACS data we can access due to COVID).

Code
dc_bikers_by_year = @chain census_all_data begin
    @filter contains(NAME, "District")
    @transmute(
        data_year,
        name = NAME,
        # here's where we fix the fact that data comes through as strings :)
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        total_bike_commuters = as_integer(total_bike_commuters),
        total_worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = total_bike_commuters / total_commuters,
        percent_wfh = total_worked_from_home / total_commuters
    )
    @filter data_year >= 2012
end

f = Figure(
    fonts = (; regular = "IBM Plex Sans", color = :black),
    dpi = 1200
)
ax = Axis(
    f[1, 1],
    xlabel = "ACS Data Year",
    title = "Percent of Workers Commuting by Bike",
    subtitle = "in Washington, DC (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
lines!(ax, dc_bikers_by_year[!, :data_year], dc_bikers_by_year[!, :percent_bike_commuters], color = :black, linewidth = 2)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
f

Source: American Community Survey, 2012 - 2022

The percent of workers commuting by bike hit a high point in 2017 with 4.96% of workers commuting by bike, before beginning a sharp decline and hitting rock bottom in 2021 at 2.1%. Of course, something happened around 2020 which changed commuting patterns in a significant way–perhaps even forever! So there’s another trend we have to investigate alongside the decline in bike commuting, and that’s the percent of workers working from home.

Code
f = Figure(
    fonts = (; regular = "IBM Plex Sans", color = :black),
    dpi = 1200
)
ax = Axis(
    f[1, 1],
    xlabel = "ACS Data Year",
    title = "Percent of Workers Working from Home",
    subtitle = "in Washington, DC (2012 - 2022)",
    xticks = [2012, 2014, 2016, 2018, 2020, 2022],
    xgridvisible = false,
    ygridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    ytickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
lines!(ax, dc_bikers_by_year[!, :data_year], dc_bikers_by_year[!, :percent_wfh], color = :black, linewidth = 2)
vlines!(ax, 2020, linestyle = :dot, color = :grey, label = "COVID")
f

Source: American Community Survey, 2012 - 2022

Unsurprisingly, the percent of workers that said they were working from home explodes after 2020. In 2021, 48.2% of workers in DC said that they were working from home–almost half the workforce!

3.3 Top Cities for Bike Commuting (and working from home)

These are the cities that have the highest share of bike commuters according to the most recent ACS data from 2022.

Code
top_cities = @chain census_all_data begin
    @filter(
        data_year == 2022,
        !isnothing(total_bike_commuters)
    )
    @transmute(
        data_year,
        NAME,
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        bike_commuters = as_integer(total_bike_commuters),
        worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = bike_commuters / total_commuters,
        percent_wfh = worked_from_home / total_commuters
    )
    @arrange desc(percent_bike_commuters)
    first(10)
    @mutate(
        rn = row_number(),
        name = str_remove(NAME, " city")
    )
end

f = Figure(
    fonts = (; regular = "IBM Plex Sans"),
    dpi = 600,
    width = 1200
)
ax = Axis(
    f[1, 1],
    ygridvisible = false,
    title = "Cities with Highest Share of Bike Commuters in 2022",
    yticks = (reverse(1:10), [name for name in top_cities[!, :name]]),
    xgridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    xticks = [0.0, 0.03, 0.06, 0.09],
    xtickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
barplot!(
    ax,
    top_cities[!, :rn],
    reverse(top_cities[!, :percent_bike_commuters]),
    bar_labels = [string.(round.(x * 100, digits = 2), "%") for x in reverse(top_cities[!, :percent_bike_commuters])],
    direction = :x,
    color = :black
)
xlims!(0, 0.10)
f

Source: American Community Survey, 2022

And for fun, these are the cities with the highest percentage of workers working from home.

Code
top_wfh_cities = @chain census_all_data begin
    @filter(
        data_year == 2022,
        !isnothing(total_bike_commuters)
    )
    @transmute(
        data_year,
        NAME,
        total_population = as_integer(total_population),
        total_commuters = as_integer(total_commuters),
        bike_commuters = as_integer(total_bike_commuters),
        worked_from_home = as_integer(total_worked_from_home)
    )
    @mutate(
        percent_bike_commuters = bike_commuters / total_commuters,
        percent_wfh = worked_from_home / total_commuters
    )
    @arrange desc(percent_wfh)
    first(10)
    @mutate(
        rn = row_number(),
        name = str_remove(NAME, " city")
    )
end

f = Figure(
    fonts = (; regular = "IBM Plex Sans"),
    dpi = 600,
    width = 1200
)
ax = Axis(
    f[1, 1],
    ygridvisible = false,
    title = "Cities with Highest Share of Working from Home 2022",
    yticks = (reverse(1:10), [name for name in top_wfh_cities[!, :name]]),
    xgridstyle = :dash,
    ygridcolor = (:lightgrey, 0.45),
    xticks = [0.0, 0.15, 0.3, 0.45],
    xtickformat = x -> string.(round.(x * 100, digits = 1, RoundNearest), "%"),
    titlealign = :left,
    titlecolor = :black
)
barplot!(
    ax,
    top_wfh_cities[!, :rn],
    reverse(top_wfh_cities[!, :percent_wfh]),
    bar_labels = [string.(round.(x * 100, digits = 2), "%") for x in reverse(top_wfh_cities[!, :percent_wfh])],
    direction = :x,
    color = :black
)
xlims!(0, 0.46)
f

Source: American Community Survey, 2022

4 Wrapping Up

I hope that this post illustrates that there are some interesting trends to dig into hiding in the census data. I’m also hoping that this post provides a good basis for how to work with that census data in Julia, including how to manipulate data with Tidier.jl and how to plot data with CairoMakie.jl. I’ve been messing with this all day, so now it’s time for me to log off and spend some time with my partner and our cat.

As always, if this post was helpful to you–please let me know!