Cycling in DC, Part 1

Getting Data from DC’s Open Data Portal using R

Part 1 of a data-driven series exploring the safety of cycling in Washington, D.C. This post focuses on obtaining data from DC’s Open Data portal.
Published

February 17, 2023

views from an evening ride on the Mt. Vernon trail, on the VA side.

Introduction

As my girlfriend and I ponder a move across the Potomac to Washington D.C., we’ve had many conversations about my safety there as a cyclist.

I’m no stranger to road riding; since I got back into cycling a few years ago, I’ve made it a point to practice riding with traffic. I’ve had a few incidents that made me more angry than scared, but the thought of riding in DC is a bit intimidating when stories like this seem commonplace.

So, being a data nerd, I decided to do some investigating into cycling incidents. This post, part 1 of that exploration, covers getting data from the DC Open Data Portal.

Washington D.C. maintains an open data portal that is full of useful datasets about the city. The ones I’m primarily interested in cover car accidents, Vision Zero comments, and anything else related to cycling, pedestrian, or roadway safety.

Working with the API in R: httr and purrr to the rescue.

The DC Open Data portal provides an API for accessing the datasets in GeoJSON format from an application or script. To analyze this data effectively in R, though, we’ll want it in a dataframe.

Luckily, httr and purrr make it pretty easy to 1) get the data and 2) parse the JSON into a dataframe. Whoever says you can’t use R for something like this is wrong and also missing out… 😉

I’ll walk through how I approached this and try to break down the steps as simply as I can.

1) We need to send a request!

In R, we can use the httr package to make requests.

The API docs outline a few parameters we can provide with the request, but I’m only going to use two here: resultOffset and resultRecordCount, both passed in the query.

Let’s get 100 records at first to demonstrate what the result looks like.

library(httr) # optionally, we could specify library(httr, include.only = "GET")
library(jsonlite, include.only = "parse_json")

url <- "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query?outFields=*&where=1%3D1&f=geojson"

resp <- GET(
    url, 
    query = list(
        resultOffset = 0, 
        resultRecordCount = 100
    )
)

resp_json <- resp |> 
    parse_json()

names(resp_json)
[1] "type"                  "features"              "exceededTransferLimit"
[4] "properties"           

Specifically, what we’re after is the “features” element of the JSON. Each “feature” in turn has a set of properties that we’d like to parse.
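To see what a single feature looks like at the top level, we can peek at its names (sticking with base R subsetting for now):

names(resp_json[["features"]][[1]])
[1] "type"       "id"         "geometry"   "properties"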

2) We need this to be in a format we can work with in R!

We can use the purrr package to parse JSON elegantly and coerce the result into a dataframe using map_dfr.

library(purrr)

resp_df <- resp_json |> 
    pluck("features") |>
    map_dfr(
      ~pluck(.x, "properties")
    )

head(resp_df, 10)
# A tibble: 10 × 52
   OBJECTID CRIMEID CCN   REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
      <int> <chr>   <chr>   <dbl> <chr>     <dbl>  <dbl>   <int>   <int>   <dbl>
 1   1.69e8 234252… 1012… 1.28e12 130594…   2613.   0.03   11050   16250 1.28e12
 2   1.69e8 234255… 1012… 1.28e12 110251…   1908.   0.02    1646    2293 1.28e12
 3   1.69e8 234257… 1118… 1.32e12 120153…   3696.  44.9       -9   29355 1.32e12
 4   1.69e8 234258… 1118… 1.32e12 120376…    865.  13.8       -9   23334 1.32e12
 5   1.69e8 234259… 1118… 1.32e12 110274…   1742.  38.9       -9   29708 1.32e12
 6   1.69e8 234260… 1118… 1.32e12 Route …      0    0         -9   31702 1.32e12
 7   1.69e8 234261… 1118… 1.32e12 110646…    904.   0.03    5213   35839 1.32e12
 8   1.69e8 234279… 1012… 1.28e12 110011…   1300.   0.04    6029   18959 1.28e12
 9   1.69e8 234282… 1012… 1.28e12 Route …      0    0         -9   32393 1.28e12
10   1.69e8 234283… 1012… 1.28e12 110338…   1274.   0.01    2221    2732 1.28e12
# … with 42 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
#   XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
#   MAR_SCORE <dbl>, MAJORINJURIES_BICYCLIST <int>,
#   MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
#   FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
#   MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
#   FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …

A quick rundown on map_dfr

Let me explain what’s happening here briefly, because I recognize not everyone is familiar with the map_ set of functions from purrr.

pluck() here is the equivalent of doing resp_json[["features"]]. As I mentioned before, that’s the part of the data we’re interested in.
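If you want to convince yourself of that equivalence:

identical(
    resp_json |> pluck("features"), 
    resp_json[["features"]]
)
[1] TRUE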

For each record we pulled, there is a set of properties attached to it. We can see this by running:

resp_json |> 
    pluck("features", 1)
$type
[1] "Feature"

$id
[1] 168878669

$geometry
$geometry$type
[1] "Point"

$geometry$coordinates
$geometry$coordinates[[1]]
[1] -76.99605

$geometry$coordinates[[2]]
[1] 38.84717



$properties
$properties$OBJECTID
[1] 168878669

$properties$CRIMEID
[1] "23425210"

$properties$CCN
[1] "10121974"

$properties$REPORTDATE
[1] 1.282626e+12

$properties$ROUTEID
[1] "13059452"

$properties$MEASURE
[1] 2612.96

$properties$OFFSET
[1] 0.03

$properties$STREETSEGID
[1] 11050

$properties$ROADWAYSEGID
[1] 16250

$properties$FROMDATE
[1] 1.282622e+12

$properties$TODATE
NULL

$properties$ADDRESS
[1] "2737 MARTIN LUTHER KING JR AVE SE"

$properties$LATITUDE
[1] 38.84717

$properties$LONGITUDE
[1] -76.99605

$properties$XCOORD
[1] 400343.1

$properties$YCOORD
[1] 131040.3

$properties$WARD
[1] "Ward 8"

$properties$EVENTID
[1] "{8C019248-1E0C-4380-A62D-DA34B38F79F7}"

$properties$MAR_ADDRESS
[1] "2737 MARTIN LUTHER KING JR AVENUE SE"

$properties$MAR_SCORE
[1] 100

$properties$MAJORINJURIES_BICYCLIST
[1] 0

$properties$MINORINJURIES_BICYCLIST
[1] 0

$properties$UNKNOWNINJURIES_BICYCLIST
[1] 0

$properties$FATAL_BICYCLIST
[1] 0

$properties$MAJORINJURIES_DRIVER
[1] 1

$properties$MINORINJURIES_DRIVER
[1] 0

$properties$UNKNOWNINJURIES_DRIVER
[1] 0

$properties$FATAL_DRIVER
[1] 0

$properties$MAJORINJURIES_PEDESTRIAN
[1] 0

$properties$MINORINJURIES_PEDESTRIAN
[1] 0

$properties$UNKNOWNINJURIES_PEDESTRIAN
[1] 0

$properties$FATAL_PEDESTRIAN
[1] 0

$properties$TOTAL_VEHICLES
[1] 3

$properties$TOTAL_BICYCLES
[1] 0

$properties$TOTAL_PEDESTRIANS
[1] 0

$properties$PEDESTRIANSIMPAIRED
[1] 0

$properties$BICYCLISTSIMPAIRED
[1] 0

$properties$DRIVERSIMPAIRED
[1] 0

$properties$TOTAL_TAXIS
[1] 0

$properties$TOTAL_GOVERNMENT
[1] 0

$properties$SPEEDING_INVOLVED
[1] 0

$properties$NEARESTINTROUTEID
[1] "13026862"

$properties$NEARESTINTSTREETNAME
[1] "CYPRESS ST SE"

$properties$OFFINTERSECTION
[1] 39.91

$properties$INTAPPROACHDIRECTION
[1] "South"

$properties$LOCATIONERROR
NULL

$properties$LASTUPDATEDATE
NULL

$properties$MPDLATITUDE
NULL

$properties$MPDLONGITUDE
NULL

$properties$MPDGEOX
NULL

$properties$MPDGEOY
NULL

$properties$FATALPASSENGER
[1] 0

$properties$MAJORINJURIESPASSENGER
[1] 0

$properties$MINORINJURIESPASSENGER
[1] 0

$properties$UNKNOWNINJURIESPASSENGER
[1] 0

$properties$MAR_ID
[1] 44965

Additionally, we could extract the properties of each point individually and turn them into a dataframe like this:

resp_json |> 
    pluck("features") |> 
    pluck(1, "properties") |> # properties of record number 1
    flatten_df()
# A tibble: 1 × 49
   OBJECTID CRIMEID CCN   REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
      <int> <chr>   <chr>   <dbl> <chr>     <dbl>  <dbl>   <int>   <int>   <dbl>
1 168878669 234252… 1012… 1.28e12 130594…   2613.   0.03   11050   16250 1.28e12
# … with 39 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
#   XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
#   MAR_SCORE <int>, MAJORINJURIES_BICYCLIST <int>,
#   MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
#   FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
#   MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
#   FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …

That looks a bit familiar! Perhaps by this point you can even envision how to do this across all of the records with a loop, or with lapply() if you’re a bit more advanced.

## with a loop
res <- dplyr::tibble()

for (x in 1:100) {
    res <- dplyr::bind_rows(
        res, 
        resp_json |> 
        pluck("features") |> 
        pluck(x, "properties") |> 
        flatten_df()
    )
}

head(res, 10)
# A tibble: 10 × 52
   OBJECTID CRIMEID CCN   REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
      <int> <chr>   <chr>   <dbl> <chr>     <dbl>  <dbl>   <int>   <int>   <dbl>
 1   1.69e8 234252… 1012… 1.28e12 130594…   2613.   0.03   11050   16250 1.28e12
 2   1.69e8 234255… 1012… 1.28e12 110251…   1908.   0.02    1646    2293 1.28e12
 3   1.69e8 234257… 1118… 1.32e12 120153…   3696.  44.9       -9   29355 1.32e12
 4   1.69e8 234258… 1118… 1.32e12 120376…    865.  13.8       -9   23334 1.32e12
 5   1.69e8 234259… 1118… 1.32e12 110274…   1742.  38.9       -9   29708 1.32e12
 6   1.69e8 234260… 1118… 1.32e12 Route …      0    0         -9   31702 1.32e12
 7   1.69e8 234261… 1118… 1.32e12 110646…    904.   0.03    5213   35839 1.32e12
 8   1.69e8 234279… 1012… 1.28e12 110011…   1300.   0.04    6029   18959 1.28e12
 9   1.69e8 234282… 1012… 1.28e12 Route …      0    0         -9   32393 1.28e12
10   1.69e8 234283… 1012… 1.28e12 110338…   1274.   0.01    2221    2732 1.28e12
# … with 42 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
#   XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
#   MAR_SCORE <dbl>, MAJORINJURIES_BICYCLIST <int>,
#   MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
#   FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
#   MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
#   FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …
## with lapply
lapply(
    seq_along(resp_json[["features"]]), 
    \(x) resp_json |> 
        pluck("features") |> 
        pluck(x, "properties") |> 
        flatten_df()
) |> 
    dplyr::bind_rows() |> 
    head(10)
# A tibble: 10 × 52
   OBJECTID CRIMEID CCN   REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
      <int> <chr>   <chr>   <dbl> <chr>     <dbl>  <dbl>   <int>   <int>   <dbl>
 1   1.69e8 234252… 1012… 1.28e12 130594…   2613.   0.03   11050   16250 1.28e12
 2   1.69e8 234255… 1012… 1.28e12 110251…   1908.   0.02    1646    2293 1.28e12
 3   1.69e8 234257… 1118… 1.32e12 120153…   3696.  44.9       -9   29355 1.32e12
 4   1.69e8 234258… 1118… 1.32e12 120376…    865.  13.8       -9   23334 1.32e12
 5   1.69e8 234259… 1118… 1.32e12 110274…   1742.  38.9       -9   29708 1.32e12
 6   1.69e8 234260… 1118… 1.32e12 Route …      0    0         -9   31702 1.32e12
 7   1.69e8 234261… 1118… 1.32e12 110646…    904.   0.03    5213   35839 1.32e12
 8   1.69e8 234279… 1012… 1.28e12 110011…   1300.   0.04    6029   18959 1.28e12
 9   1.69e8 234282… 1012… 1.28e12 Route …      0    0         -9   32393 1.28e12
10   1.69e8 234283… 1012… 1.28e12 110338…   1274.   0.01    2221    2732 1.28e12
# … with 42 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
#   XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
#   MAR_SCORE <dbl>, MAJORINJURIES_BICYCLIST <int>,
#   MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
#   FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
#   MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
#   FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …

Basically, map_dfr abstracts these steps away: it 1) applies a function (pluck, in this case) to each element of a list and 2) returns the result as a single dataframe created by binding the rows together, which is exactly what we demonstrated with the for loop and lapply().
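If it helps, you can think of map_dfr as shorthand for a map() followed by a bind_rows(). Here’s a rough sketch of that equivalence, which returns the same tibble as the map_dfr call above:

resp_json |> 
    pluck("features") |> 
    map(~pluck(.x, "properties")) |> 
    dplyr::bind_rows()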

Thanks to httr and purrr, in just a few lines of code we’ve gotten a response from the API and parsed it into a usable dataframe format.

3) We need to do some more cleanup!

Some additional cleanup is necessary here, though: you may notice that columns that should be timestamps (REPORTDATE and FROMDATE) are currently stored as Unix epoch milliseconds. The simplest solution is to add a mutate step to the purrr pipe.
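To make the conversion concrete: dividing by 1,000 turns those values into seconds since 1970-01-01, which as_datetime() understands (the mutate below also shifts everything back six hours). A quick illustration with a round value close to the first record’s REPORTDATE:

lubridate::as_datetime(1282626000000 / 1000, tz = "UTC")
[1] "2010-08-24 05:00:00 UTC"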

library(dplyr)
library(lubridate, include.only = "as_datetime")

resp_df <- resp_json |> 
    pluck("features") |>
    map_dfr(
      ~pluck(.x, "properties")
    ) |> 
    mutate(
        REPORTDATE = as_datetime(
            (REPORTDATE / 1000) - (6 * 60 * 60), tz = "UTC"
        ), 
        FROMDATE = as_datetime(
            (FROMDATE / 1000) - (6 * 60 * 60), tz = "UTC"
        )
        # with additional records, we need this line as well:
        # LASTUPDATEDATE = as_datetime(
        #     (LASTUPDATEDATE / 1000) - (6 * 60 * 60), tz = "UTC"
        # )
    )

head(resp_df, 10)
# A tibble: 10 × 52
    OBJECTID CRIMEID  CCN     REPORTDATE          ROUTEID MEASURE OFFSET STREE…¹
       <int> <chr>    <chr>   <dttm>              <chr>     <dbl>  <dbl>   <int>
 1 168878669 23425210 101219… 2010-08-23 23:00:00 130594…   2613.   0.03   11050
 2 168878670 23425550 101211… 2010-08-21 23:00:00 110251…   1908.   0.02    1646
 3 168878671 23425792 111843… 2011-12-17 02:16:00 120153…   3696.  44.9       -9
 4 168878672 23425816 111843… 2011-12-17 01:00:00 120376…    865.  13.8       -9
 5 168878673 23425979 111848… 2011-12-17 23:00:00 110274…   1742.  38.9       -9
 6 168878674 23426084 111841… 2011-12-16 19:48:00 Route …      0    0         -9
 7 168878675 23426193 111843… 2011-12-17 00:53:00 110646…    904.   0.03    5213
 8 168878676 23427912 101234… 2010-08-27 03:40:00 110011…   1300.   0.04    6029
 9 168878677 23428279 101230… 2010-08-25 23:00:00 Route …      0    0         -9
10 168878678 23428375 101235… 2010-08-26 23:00:00 110338…   1274.   0.01    2221
# … with 44 more variables: ROADWAYSEGID <int>, FROMDATE <dttm>, ADDRESS <chr>,
#   LATITUDE <dbl>, LONGITUDE <dbl>, XCOORD <dbl>, YCOORD <dbl>, WARD <chr>,
#   EVENTID <chr>, MAR_ADDRESS <chr>, MAR_SCORE <dbl>,
#   MAJORINJURIES_BICYCLIST <int>, MINORINJURIES_BICYCLIST <int>,
#   UNKNOWNINJURIES_BICYCLIST <int>, FATAL_BICYCLIST <int>,
#   MAJORINJURIES_DRIVER <int>, MINORINJURIES_DRIVER <int>,
#   UNKNOWNINJURIES_DRIVER <int>, FATAL_DRIVER <int>, …

4) We need to get all of the data!

There are hundreds of thousands of crash records in the endpoint we’re working with, but the API limits each request to 1000 records, meaning we can only get 1000 records at a time.

Luckily, there’s a pretty simple solution here. Basically, we want to:

  1. request 1000 records at a time
  2. with each batch of records we get back, increase the offset for the next API call
  3. accumulate the results (the “features” section of each JSON response) into a list
  4. parse the elements of this list into a dataframe.

My suggestion is to make this into a function! That way, we can keep the logic contained in one spot.

One improvement I would suggest over my original design is to create two functions: one for getting the data and one for parsing the results into a dataframe. I’m also going to add a record_limit argument, since otherwise the function would run until it collected all 200k+ records!

get_dc_crashes_data <- function(rate_limit = 1000, offset = 0, record_limit = NULL) {
  # the API caps each request at 1000 records
  if (rate_limit > 1000) {
    warning(
      sprintf("A limit value of %s exceeds the allowed rate limit. Defaulting to maximum of 1000.", rate_limit)
    )
    rate_limit <- 1000
  }
  
  # if a record limit is supplied, make a single request of that size
  if (!is.null(record_limit)) {
    rate_limit <- record_limit
  }
  
  url <- "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query?outFields=*&where=1%3D1&f=geojson"
  
  results_list <- list()
  n_responses <- rate_limit
  
  # keep requesting until a batch comes back empty (or the record limit is hit)
  while (n_responses > 0) {
    resp <- GET(
      url, 
      query = list(
        resultOffset = offset, 
        resultRecordCount = rate_limit
      )
    )
    
    if (status_code(resp) != 200) {
      stop(sprintf("Bad request: status code %s.", status_code(resp)))
    }
    
    json_resp <- resp |> parse_json()
    
    results_list <- append(results_list, json_resp[["features"]])
    
    if (!is.null(record_limit)) {
      n_responses <- 0
    } else {
      n_responses <- length(json_resp[["features"]])
    }
    
    offset <- offset + n_responses
  }
  
  if (offset > 0 && length(results_list) == 0) {
    stop("Offset of this length returned no results, do you already have all the results?")
  }
 
  results_list
}

parse_crash_data_to_df <- function(list_of_results) {
  list_of_results |> 
    # pull each feature's properties and stack them into a dataframe
    map_dfr(
      ~pluck(.x, "properties")
    ) |> 
    # convert the epoch-millisecond date columns to datetimes
    mutate(
      REPORTDATE = lubridate::as_datetime(
        (REPORTDATE / 1000) - (6 * 60 * 60), tz = "UTC"
      ), 
      FROMDATE = lubridate::as_datetime(
        (FROMDATE / 1000) - (6 * 60 * 60), tz = "UTC"
      )
    )
}

Using these together should give us the same result as before.

get_dc_crashes_data(record_limit = 100) |> 
    parse_crash_data_to_df()
# A tibble: 100 × 52
    OBJECTID CRIMEID  CCN     REPORTDATE          ROUTEID MEASURE OFFSET STREE…¹
       <int> <chr>    <chr>   <dttm>              <chr>     <dbl>  <dbl>   <int>
 1 168878669 23425210 101219… 2010-08-23 23:00:00 130594…   2613.   0.03   11050
 2 168878670 23425550 101211… 2010-08-21 23:00:00 110251…   1908.   0.02    1646
 3 168878671 23425792 111843… 2011-12-17 02:16:00 120153…   3696.  44.9       -9
 4 168878672 23425816 111843… 2011-12-17 01:00:00 120376…    865.  13.8       -9
 5 168878673 23425979 111848… 2011-12-17 23:00:00 110274…   1742.  38.9       -9
 6 168878674 23426084 111841… 2011-12-16 19:48:00 Route …      0    0         -9
 7 168878675 23426193 111843… 2011-12-17 00:53:00 110646…    904.   0.03    5213
 8 168878676 23427912 101234… 2010-08-27 03:40:00 110011…   1300.   0.04    6029
 9 168878677 23428279 101230… 2010-08-25 23:00:00 Route …      0    0         -9
10 168878678 23428375 101235… 2010-08-26 23:00:00 110338…   1274.   0.01    2221
# … with 90 more rows, 44 more variables: ROADWAYSEGID <int>, FROMDATE <dttm>,
#   ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, XCOORD <dbl>, YCOORD <dbl>,
#   WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>, MAR_SCORE <dbl>,
#   MAJORINJURIES_BICYCLIST <int>, MINORINJURIES_BICYCLIST <int>,
#   UNKNOWNINJURIES_BICYCLIST <int>, FATAL_BICYCLIST <int>,
#   MAJORINJURIES_DRIVER <int>, MINORINJURIES_DRIVER <int>,
#   UNKNOWNINJURIES_DRIVER <int>, FATAL_DRIVER <int>, …
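
To pull the full dataset rather than a 100-record sample, you’d simply omit record_limit. Fair warning: this pages through all 200k+ records, so it takes a while, and (as noted earlier) with the full set you’d also want to convert LASTUPDATEDATE in the parsing step.

## not run here: collects every record, roughly 1000 per request
all_crashes <- get_dc_crashes_data() |> 
    parse_crash_data_to_df()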

Next Steps

From here, as I discussed on the GitHub project page, I pushed this data to a local PostgreSQL server so I could query it directly. I’ll talk more about that in my next post on this subject!
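
In the meantime, if you’re curious what that write step can look like, here’s a minimal sketch using DBI and RPostgres; the connection details, database name, and table name below are placeholders rather than my actual setup:

library(DBI)

## placeholder connection to a hypothetical local PostgreSQL database
con <- dbConnect(
    RPostgres::Postgres(),
    dbname = "dc_open_data",  # hypothetical database name
    host = "localhost",
    port = 5432,
    user = "your_user"
)

## write the parsed crash data to a table, then disconnect
dbWriteTable(con, "crashes", all_crashes, overwrite = TRUE)
dbDisconnect(con)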