Views from an evening ride on the Mount Vernon Trail, on the VA side.
Introduction
As my girlfriend and I ponder a move across the Potomac to Washington D.C., we’ve had many conversations about my safety there as a cyclist.
I’m no stranger to road riding – since I got back into cycling a few years ago, I’ve made it a point to practice riding with traffic. I’ve had a few incidents that mostly made me angry rather than scared, but the thought of riding in DC is a bit intimidating when stories like this seem commonplace.
So, being a data nerd, I decided to do some investigation into cycling incidents. This post is part 1 of my exploration which discusses getting data from the DC Open Data Portal.
Washington D.C. maintains an open data portal which is full of useful datasets about the city. The datasets I’m primarily interested in relate to car accidents, Vision Zero comments, and anything else related to cycling, pedestrian, or roadway safety.
Working with the API in R: httr and purrr to the rescue
The DC Open Data portal provides an API to access the datasets in GeoJSON format from an application or script. To analyze this data effectively in R, we’ll want it formatted as a dataframe.
Luckily, httr and purrr make it pretty easy to 1) get the data and 2) parse the JSON into a dataframe. Whoever says you can’t use R for something like this is wrong and also missing out… 😉
I’ll walk through how I approached this and try to break down the steps as simply as I can.
1) we need to send a request!
In R, we can use the httr package to make requests.
There are a few parameters we can provide to the request that are outlined in the API docs, but I’m only going to be using two here. These can be provided to the request in the query.
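As an aside, ArcGIS-style query endpoints like this one generally accept other handy parameters too. One I find useful is returnCountOnly, which asks the server for just the total number of matching records — a cheap way to see how much data you’d be paging through. A rough sketch (assuming this MapServer layer supports the standard query parameters, which most do):

```r
library(httr)
library(jsonlite, include.only = "parse_json")

# Base query endpoint, without the pre-encoded parameters
url <- "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query"

# Ask the server only for the total record count, not the records themselves
count_resp <- GET(
  url,
  query = list(
    where = "1=1",
    returnCountOnly = "true",
    f = "json" # the count comes back as a small JSON object
  )
)

parse_json(count_resp)[["count"]]
```

Knowing the count up front makes it easier to sanity-check that a paging loop eventually collected everything.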
Let’s get 100 records at first to demonstrate what the result looks like.
```r
library(httr) # optionally, we could specify library(httr, include.only = "GET")
library(jsonlite, include.only = "parse_json")

url <- "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query?outFields=*&where=1%3D1&f=geojson"

resp <- GET(
  url,
  query = list(
    resultOffset = 0,
    resultRecordCount = 100
  )
)
```
```r
resp_json <- resp |>
  parse_json()

names(resp_json)
```
[1] "type" "features" "exceededTransferLimit"
[4] "properties"
Specifically, what we’re after here is in the “features” of the JSON. Each “feature” additionally has properties that we’d like to parse.
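It’s also worth noticing the exceededTransferLimit element in the names above: ArcGIS-style endpoints use it to flag that the response was truncated by the server’s record cap, which (if I’m reading the API behavior right) makes it a convenient signal for whether there are more pages to fetch:

```r
# TRUE when the server returned fewer records than matched the query,
# i.e. there are more pages left to request
resp_json |>
  pluck("exceededTransferLimit")
```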
2) We need this to be in a format we can work with in R!
We can use the purrr package to parse JSON elegantly and coerce the result into a dataframe using map_dfr.
```r
library(purrr)

resp_df <- resp_json |>
  pluck("features") |>
  map_dfr(
    ~pluck(.x, "properties")
  )

head(resp_df, 10)
```
# A tibble: 10 × 52
OBJECTID CRIMEID CCN REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
<int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <dbl>
1 1.69e8 234252… 1012… 1.28e12 130594… 2613. 0.03 11050 16250 1.28e12
2 1.69e8 234255… 1012… 1.28e12 110251… 1908. 0.02 1646 2293 1.28e12
3 1.69e8 234257… 1118… 1.32e12 120153… 3696. 44.9 -9 29355 1.32e12
4 1.69e8 234258… 1118… 1.32e12 120376… 865. 13.8 -9 23334 1.32e12
5 1.69e8 234259… 1118… 1.32e12 110274… 1742. 38.9 -9 29708 1.32e12
6 1.69e8 234260… 1118… 1.32e12 Route … 0 0 -9 31702 1.32e12
7 1.69e8 234261… 1118… 1.32e12 110646… 904. 0.03 5213 35839 1.32e12
8 1.69e8 234279… 1012… 1.28e12 110011… 1300. 0.04 6029 18959 1.28e12
9 1.69e8 234282… 1012… 1.28e12 Route … 0 0 -9 32393 1.28e12
10 1.69e8 234283… 1012… 1.28e12 110338… 1274. 0.01 2221 2732 1.28e12
# … with 42 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
# XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
# MAR_SCORE <dbl>, MAJORINJURIES_BICYCLIST <int>,
# MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
# FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
# MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
# FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …
a quick rundown on map_dfr
Let me explain what’s happening here briefly, because I recognize not everyone is familiar with the map_ set of functions from purrr.
pluck() here is the equivalent of doing resp_json[["features"]]. As I mentioned before, that’s the part of the data we’re interested in.
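If pluck() is new to you, a toy example (unrelated to our crash data) shows the equivalence, plus a couple of extras that make it nicer than raw [[-indexing:

```r
library(purrr)

toy <- list(features = list("a", "b"))

# pluck by name is the same as [[-indexing
identical(pluck(toy, "features"), toy[["features"]])
#> [1] TRUE

# pluck also accepts a chain of indices...
pluck(toy, "features", 1)
#> [1] "a"

# ...and can return a default instead of NULL for missing elements
pluck(toy, "missing", .default = NA)
#> [1] NA
```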
For each record we pulled, there is a set of properties attached to it. We can see this by running:
```r
resp_json |>
  pluck("features", 1)
```
$type
[1] "Feature"
$id
[1] 168878669
$geometry
$geometry$type
[1] "Point"
$geometry$coordinates
$geometry$coordinates[[1]]
[1] -76.99605
$geometry$coordinates[[2]]
[1] 38.84717
$properties
$properties$OBJECTID
[1] 168878669
$properties$CRIMEID
[1] "23425210"
$properties$CCN
[1] "10121974"
$properties$REPORTDATE
[1] 1.282626e+12
$properties$ROUTEID
[1] "13059452"
$properties$MEASURE
[1] 2612.96
$properties$OFFSET
[1] 0.03
$properties$STREETSEGID
[1] 11050
$properties$ROADWAYSEGID
[1] 16250
$properties$FROMDATE
[1] 1.282622e+12
$properties$TODATE
NULL
$properties$ADDRESS
[1] "2737 MARTIN LUTHER KING JR AVE SE"
$properties$LATITUDE
[1] 38.84717
$properties$LONGITUDE
[1] -76.99605
$properties$XCOORD
[1] 400343.1
$properties$YCOORD
[1] 131040.3
$properties$WARD
[1] "Ward 8"
$properties$EVENTID
[1] "{8C019248-1E0C-4380-A62D-DA34B38F79F7}"
$properties$MAR_ADDRESS
[1] "2737 MARTIN LUTHER KING JR AVENUE SE"
$properties$MAR_SCORE
[1] 100
$properties$MAJORINJURIES_BICYCLIST
[1] 0
$properties$MINORINJURIES_BICYCLIST
[1] 0
$properties$UNKNOWNINJURIES_BICYCLIST
[1] 0
$properties$FATAL_BICYCLIST
[1] 0
$properties$MAJORINJURIES_DRIVER
[1] 1
$properties$MINORINJURIES_DRIVER
[1] 0
$properties$UNKNOWNINJURIES_DRIVER
[1] 0
$properties$FATAL_DRIVER
[1] 0
$properties$MAJORINJURIES_PEDESTRIAN
[1] 0
$properties$MINORINJURIES_PEDESTRIAN
[1] 0
$properties$UNKNOWNINJURIES_PEDESTRIAN
[1] 0
$properties$FATAL_PEDESTRIAN
[1] 0
$properties$TOTAL_VEHICLES
[1] 3
$properties$TOTAL_BICYCLES
[1] 0
$properties$TOTAL_PEDESTRIANS
[1] 0
$properties$PEDESTRIANSIMPAIRED
[1] 0
$properties$BICYCLISTSIMPAIRED
[1] 0
$properties$DRIVERSIMPAIRED
[1] 0
$properties$TOTAL_TAXIS
[1] 0
$properties$TOTAL_GOVERNMENT
[1] 0
$properties$SPEEDING_INVOLVED
[1] 0
$properties$NEARESTINTROUTEID
[1] "13026862"
$properties$NEARESTINTSTREETNAME
[1] "CYPRESS ST SE"
$properties$OFFINTERSECTION
[1] 39.91
$properties$INTAPPROACHDIRECTION
[1] "South"
$properties$LOCATIONERROR
NULL
$properties$LASTUPDATEDATE
NULL
$properties$MPDLATITUDE
NULL
$properties$MPDLONGITUDE
NULL
$properties$MPDGEOX
NULL
$properties$MPDGEOY
NULL
$properties$FATALPASSENGER
[1] 0
$properties$MAJORINJURIESPASSENGER
[1] 0
$properties$MINORINJURIESPASSENGER
[1] 0
$properties$UNKNOWNINJURIESPASSENGER
[1] 0
$properties$MAR_ID
[1] 44965
Additionally, we could extract the properties of each point individually and turn them into a dataframe like this:
```r
resp_json |>
  pluck("features") |>
  pluck(1, "properties") |> # properties of record number 1
  flatten_df()
```
# A tibble: 1 × 49
OBJECTID CRIMEID CCN REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
<int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <dbl>
1 168878669 234252… 1012… 1.28e12 130594… 2613. 0.03 11050 16250 1.28e12
# … with 39 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
# XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
# MAR_SCORE <int>, MAJORINJURIES_BICYCLIST <int>,
# MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
# FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
# MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
# FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …
That looks a bit familiar! Perhaps by this point you can even envision how to do this across all of the records with a loop, or with lapply() if you’re a bit advanced.
```r
## with a loop
res <- dplyr::tibble()

for (x in 1:100) {
  res <- dplyr::bind_rows(
    res,
    resp_json |>
      pluck("features") |>
      pluck(x, "properties") |>
      flatten_df()
  )
}

head(res, 10)
```
# A tibble: 10 × 52
OBJECTID CRIMEID CCN REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
<int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <dbl>
1 1.69e8 234252… 1012… 1.28e12 130594… 2613. 0.03 11050 16250 1.28e12
2 1.69e8 234255… 1012… 1.28e12 110251… 1908. 0.02 1646 2293 1.28e12
3 1.69e8 234257… 1118… 1.32e12 120153… 3696. 44.9 -9 29355 1.32e12
4 1.69e8 234258… 1118… 1.32e12 120376… 865. 13.8 -9 23334 1.32e12
5 1.69e8 234259… 1118… 1.32e12 110274… 1742. 38.9 -9 29708 1.32e12
6 1.69e8 234260… 1118… 1.32e12 Route … 0 0 -9 31702 1.32e12
7 1.69e8 234261… 1118… 1.32e12 110646… 904. 0.03 5213 35839 1.32e12
8 1.69e8 234279… 1012… 1.28e12 110011… 1300. 0.04 6029 18959 1.28e12
9 1.69e8 234282… 1012… 1.28e12 Route … 0 0 -9 32393 1.28e12
10 1.69e8 234283… 1012… 1.28e12 110338… 1274. 0.01 2221 2732 1.28e12
# … with 42 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
# XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
# MAR_SCORE <dbl>, MAJORINJURIES_BICYCLIST <int>,
# MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
# FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
# MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
# FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …
```r
## with lapply
lapply(
  1:length(resp_json[["features"]]),
  \(x) resp_json |>
    pluck("features") |>
    pluck(x, "properties") |>
    flatten_df()
) |>
  dplyr::bind_rows() |>
  head(10)
```
# A tibble: 10 × 52
OBJECTID CRIMEID CCN REPOR…¹ ROUTEID MEASURE OFFSET STREE…² ROADW…³ FROMD…⁴
<int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <dbl>
1 1.69e8 234252… 1012… 1.28e12 130594… 2613. 0.03 11050 16250 1.28e12
2 1.69e8 234255… 1012… 1.28e12 110251… 1908. 0.02 1646 2293 1.28e12
3 1.69e8 234257… 1118… 1.32e12 120153… 3696. 44.9 -9 29355 1.32e12
4 1.69e8 234258… 1118… 1.32e12 120376… 865. 13.8 -9 23334 1.32e12
5 1.69e8 234259… 1118… 1.32e12 110274… 1742. 38.9 -9 29708 1.32e12
6 1.69e8 234260… 1118… 1.32e12 Route … 0 0 -9 31702 1.32e12
7 1.69e8 234261… 1118… 1.32e12 110646… 904. 0.03 5213 35839 1.32e12
8 1.69e8 234279… 1012… 1.28e12 110011… 1300. 0.04 6029 18959 1.28e12
9 1.69e8 234282… 1012… 1.28e12 Route … 0 0 -9 32393 1.28e12
10 1.69e8 234283… 1012… 1.28e12 110338… 1274. 0.01 2221 2732 1.28e12
# … with 42 more variables: ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
# XCOORD <dbl>, YCOORD <dbl>, WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>,
# MAR_SCORE <dbl>, MAJORINJURIES_BICYCLIST <int>,
# MINORINJURIES_BICYCLIST <int>, UNKNOWNINJURIES_BICYCLIST <int>,
# FATAL_BICYCLIST <int>, MAJORINJURIES_DRIVER <int>,
# MINORINJURIES_DRIVER <int>, UNKNOWNINJURIES_DRIVER <int>,
# FATAL_DRIVER <int>, MAJORINJURIES_PEDESTRIAN <int>, …
Basically, map_dfr abstracts these steps away and allows us to 1) apply a function (pluck in this case) to each element of a list and 2) return the resulting object as a dataframe created by binding rows together, which is what we demonstrated with the for-loop and lapply().
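To make that concrete with a toy list (again, not our crash data), each element’s named list becomes one row of the output tibble:

```r
library(purrr)

toy <- list(
  list(properties = list(a = 1, b = "x")),
  list(properties = list(a = 2, b = "y"))
)

# One row per list element; columns come from the names in each result
map_dfr(toy, ~pluck(.x, "properties"))
#> # A tibble: 2 × 2
#>       a b
#> 1     1 x
#> 2     2 y
```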
Thanks to httr and purrr, in just a few lines of code we’ve gotten a response from the API and parsed it into a usable dataframe format.
3) we need to do some more cleanup!
Some additional cleanup is necessary here, though: you may notice that columns which should be timestamps are currently in an odd format. The simplest solution is to add a mutate step to the purrr pipe.
```r
library(dplyr)
library(lubridate, include.only = "as_datetime")

resp_df <- resp_json |>
  pluck("features") |>
  map_dfr(
    ~pluck(.x, "properties")
  ) |>
  mutate(
    REPORTDATE = as_datetime(
      (REPORTDATE / 1000) - (6 * 60 * 60), tz = "UTC"
    ),
    FROMDATE = as_datetime(
      (FROMDATE / 1000) - (6 * 60 * 60), tz = "UTC"
    )
    # with additional records, we need this line as well:
    # LASTUPDATEDATE = as_datetime(
    #   (LASTUPDATEDATE / 1000) - (6 * 60 * 60), tz = "UTC"
    # )
  )

head(resp_df, 10)
```
# A tibble: 10 × 52
OBJECTID CRIMEID CCN REPORTDATE ROUTEID MEASURE OFFSET STREE…¹
<int> <chr> <chr> <dttm> <chr> <dbl> <dbl> <int>
1 168878669 23425210 101219… 2010-08-23 23:00:00 130594… 2613. 0.03 11050
2 168878670 23425550 101211… 2010-08-21 23:00:00 110251… 1908. 0.02 1646
3 168878671 23425792 111843… 2011-12-17 02:16:00 120153… 3696. 44.9 -9
4 168878672 23425816 111843… 2011-12-17 01:00:00 120376… 865. 13.8 -9
5 168878673 23425979 111848… 2011-12-17 23:00:00 110274… 1742. 38.9 -9
6 168878674 23426084 111841… 2011-12-16 19:48:00 Route … 0 0 -9
7 168878675 23426193 111843… 2011-12-17 00:53:00 110646… 904. 0.03 5213
8 168878676 23427912 101234… 2010-08-27 03:40:00 110011… 1300. 0.04 6029
9 168878677 23428279 101230… 2010-08-25 23:00:00 Route … 0 0 -9
10 168878678 23428375 101235… 2010-08-26 23:00:00 110338… 1274. 0.01 2221
# … with 44 more variables: ROADWAYSEGID <int>, FROMDATE <dttm>, ADDRESS <chr>,
# LATITUDE <dbl>, LONGITUDE <dbl>, XCOORD <dbl>, YCOORD <dbl>, WARD <chr>,
# EVENTID <chr>, MAR_ADDRESS <chr>, MAR_SCORE <dbl>,
# MAJORINJURIES_BICYCLIST <int>, MINORINJURIES_BICYCLIST <int>,
# UNKNOWNINJURIES_BICYCLIST <int>, FATAL_BICYCLIST <int>,
# MAJORINJURIES_DRIVER <int>, MINORINJURIES_DRIVER <int>,
# UNKNOWNINJURIES_DRIVER <int>, FATAL_DRIVER <int>, …
4) We need to get all of the data!
There are hundreds of thousands of crash records in the endpoint we’re working with, but the API caps each request at 1,000 records, meaning we can only get 1,000 records at a time.
Luckily, there’s a pretty simple solution here. Basically, we want to:
- collect 1000 responses at a time
- with each batch of responses we get, increase the offset for the next API call
- collect the responses (the “features” section of the JSON) into a list
- parse the elements of this list into a dataframe.
My suggestion is to make this into a function! That way, we can easily contain the logic to one spot.
One improvement I would suggest over my original design is actually to create two functions: one for getting the data and one for parsing the results to a dataframe. I’m also going to add some sections implementing a record limit, as otherwise this function would run until it collected all 200k+ responses!
```r
get_dc_crashes_data <- function(rate_limit = 1000, offset = 0, record_limit = NULL) {
  if (rate_limit > 1000) {
    warning(
      sprintf("A limit value of %s exceeds the allowed rate limit. Defaulting to maximum of 1000.", rate_limit)
    )
    rate_limit <- 1000
  }

  if (!is.null(record_limit)) {
    rate_limit <- record_limit
  }

  url <- "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query?outFields=*&where=1%3D1&f=geojson"

  results_list <- list()
  n_responses <- rate_limit

  while (n_responses > 0) {
    resp <- GET(
      url,
      query = list(
        resultOffset = offset,
        resultRecordCount = rate_limit
      )
    )

    if (status_code(resp) != 200) {
      stop("Bad request.")
    }

    json_resp <- resp |> parse_json()

    results_list <- append(results_list, json_resp[["features"]])

    if (!is.null(record_limit)) {
      n_responses <- 0
    } else {
      n_responses <- length(json_resp[["features"]])
    }

    offset <- offset + n_responses
  }

  if (offset > 0 && length(results_list) == 0) {
    stop("Offset of this length returned no results, do you already have all the results?")
  }

  results_list
}
```
```r
parse_crash_data_to_df <- function(list_of_results) {
  list_of_results |>
    map_dfr(
      ~pluck(.x, "properties")
    ) |>
    mutate(
      REPORTDATE = lubridate::as_datetime(
        (REPORTDATE / 1000) - (6 * 60 * 60), tz = "UTC"
      ),
      FROMDATE = lubridate::as_datetime(
        (FROMDATE / 1000) - (6 * 60 * 60), tz = "UTC"
      )
    )
}
```
Using these together should give us the same result as before.
```r
get_dc_crashes_data(record_limit = 100) |>
  parse_crash_data_to_df()
```
# A tibble: 100 × 52
OBJECTID CRIMEID CCN REPORTDATE ROUTEID MEASURE OFFSET STREE…¹
<int> <chr> <chr> <dttm> <chr> <dbl> <dbl> <int>
1 168878669 23425210 101219… 2010-08-23 23:00:00 130594… 2613. 0.03 11050
2 168878670 23425550 101211… 2010-08-21 23:00:00 110251… 1908. 0.02 1646
3 168878671 23425792 111843… 2011-12-17 02:16:00 120153… 3696. 44.9 -9
4 168878672 23425816 111843… 2011-12-17 01:00:00 120376… 865. 13.8 -9
5 168878673 23425979 111848… 2011-12-17 23:00:00 110274… 1742. 38.9 -9
6 168878674 23426084 111841… 2011-12-16 19:48:00 Route … 0 0 -9
7 168878675 23426193 111843… 2011-12-17 00:53:00 110646… 904. 0.03 5213
8 168878676 23427912 101234… 2010-08-27 03:40:00 110011… 1300. 0.04 6029
9 168878677 23428279 101230… 2010-08-25 23:00:00 Route … 0 0 -9
10 168878678 23428375 101235… 2010-08-26 23:00:00 110338… 1274. 0.01 2221
# … with 90 more rows, 44 more variables: ROADWAYSEGID <int>, FROMDATE <dttm>,
# ADDRESS <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, XCOORD <dbl>, YCOORD <dbl>,
# WARD <chr>, EVENTID <chr>, MAR_ADDRESS <chr>, MAR_SCORE <dbl>,
# MAJORINJURIES_BICYCLIST <int>, MINORINJURIES_BICYCLIST <int>,
# UNKNOWNINJURIES_BICYCLIST <int>, FATAL_BICYCLIST <int>,
# MAJORINJURIES_DRIVER <int>, MINORINJURIES_DRIVER <int>,
# UNKNOWNINJURIES_DRIVER <int>, FATAL_DRIVER <int>, …
Next Steps
From here, as I discussed on the GitHub project page, I pushed this data to a local PostgreSQL server so I could query it directly. I’ll talk more about that in my next post on this subject!
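As a small teaser, the push to PostgreSQL can be as simple as a DBI connection and a dbWriteTable() call. A minimal sketch, assuming a local server and made-up connection details (database name, user, and table name here are all hypothetical):

```r
library(DBI)

# Hypothetical local connection details; adjust for your own server
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "dc_crashes",
  host = "localhost",
  port = 5432,
  user = "postgres"
)

# Pull and parse a batch, then write it to a table
crashes <- get_dc_crashes_data(record_limit = 1000) |>
  parse_crash_data_to_df()

dbWriteTable(con, "crashes", crashes, overwrite = TRUE)
dbDisconnect(con)
```

More on querying that table directly in the next post.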