Revisiting Cycling in DC (this time, make it Rusty 🦀)

I’ve been playing around with Rust more recently, and thought I’d try to apply it to a real-world project.
Rust
Author
Affiliation

Data Analyst at CollegeVine

Published

October 3, 2023

Introduction

In my last post on Julia, I mentioned that I had started playing around with Rust in my free time. I don’t have much experience in lower-level languages, and Rust stood out to me as a great option for diving in. In particular, I was drawn to Rust by the community, its speed, its helpful compiler messages, and its focus on helping you write reliable software.

My main point in writing this post is to capture some musings on learning Rust from the perspective of someone who has mostly written R, so I’ll try to keep this short and sweet 😄

Wait, don’t you want to do data science? Why are you learning Rust?

Great question! My programming journey started with R in college–R was my door into all things data. Around the same time, I discovered Julia (though put off working with it in depth for a few years) and started to learn a bit of Python.

These languages all have something in common despite their differences: they’re all high-level languages, and I haven’t dabbled much in lower-level ones. There are a few other reasons I can think of that make Rust an interesting choice for a Data Scientist:

  • Why not? I feel the same about using Julia. Rust is an interesting language, and exposes you to different ways of programming that can influence your code (hopefully in a positive way) in other languages.
  • Polars recently got a big bump in funding–and I’m watching this tool eagerly.
  • I have a fantasy of eventually using Rust in a production setting as a pipeline language.

Revisiting my Cycling in DC Project

A long time ago, I made a post where I detailed accessing the DC Open Data Portal’s API with R. As part of that project, I set up a process in R to scrape the data, clean it up, and then store it in a dbt project. I intended on returning to that project and doing some analysis on the data and walking through some dbt stuff, but I just never got around to it. 😅

Since I’ve been building up some knowledge of Rust, I thought this would actually be a great project to come back to. I’m pretty familiar with getting data from APIs, so theoretically this should be a simple project.

Learning Rust

I spent some time reading the book (The Rust Programming Language) to get familiar with Rust’s syntax and programming style.

Much like other languages, once you break the surface on the language itself you’ll find that you suddenly have to learn a bunch of additional libraries on top of it (think pandas, the tidyverse, and a number of other packages)!

There are a number of helpful examples in Rust out there. I think the design patterns in Rust particularly start to click when walking through a real example. I fought the compiler for a few hours to get my code working - but at the end of it, it feels pretty solid (though there’s always room for improvement).

My Experience Rewriting my R Code to Rust

Rust is a pretty verbose language, so I don’t want to share everything here. You can find all my code for this post here.

My experience at a high level can be summed up by one word: painful. I wouldn’t necessarily say it was a bad kind of painful. R abstracts away a lot of things from the user, which is nice when I’m trying to write analytics code or create a plot. But when trying to replicate that code in Rust, a few things emerged that I normally just don’t have to think about.

Types, types, types

One is types–generally just having to think about types at all, honestly.

Let’s say I want to grab the output of an API’s method in R that counts the number of records returned by the API. If I wanted to put it in a function, it would look something like this.

library(httr, include.only = "GET")
library(jsonlite, include.only = "parse_json")
library(purrr, include.only = "pluck")

get_total_record_count <- function() {
  url <- "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query?where=1%3D1&outFields=*&returnCountOnly=true&outSR=4326&f=json"
  
  r <- GET(url)
  
  r |> 
    parse_json() |> 
    pluck("count")
}

get_total_record_count()
[1] 295575

In Rust, the equivalent function to do this looked like this:

async fn get_total_record_count() -> Result<i64, Box<dyn std::error::Error>> {
    let count_url = "https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Public_Safety_WebMercator/MapServer/24/query?where=1%3D1&outFields=*&returnCountOnly=true&outSR=4326&f=json";
    
    let client = reqwest::Client::builder() 
        .build()?;

    let api_response = client 
        .get(count_url)
        .send()
        .await?;

    let r: TotalRecordCount = api_response 
        .json::<TotalRecordCount>()
        .await?;
    
    // The last expression is the return value, so no explicit `return` is needed.
    Ok(r.count)
}

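// Deriving Deserialize is what lets `.json::<TotalRecordCount>()` build this struct
// from the response body (needs serde's "derive" feature and reqwest's "json" feature).
#[derive(serde::Deserialize)]
struct TotalRecordCount {
    count: i64,
}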

Having to specify the output type of the function like -> Result<i64, Box<dyn std::error::Error>> is not something I’m used to. Basically, all this is saying is: if the function successfully runs, return an integer. Otherwise, return an error. It’s pretty simple but easy to forget about when R gives you the error for free.

That being said, the R code I’ve written doesn’t guarantee the result will be an integer. In some contexts, this lack of guarantee can come back to bite you if your code really depends on data being of a certain type.
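To see the Result pattern and the type guarantee in isolation, here is a minimal sketch that uses only the standard library. The parse_count function is a hypothetical stand-in for the API call (it parses a string instead of hitting the network), so treat it as an illustration rather than the code from my project.

```rust
use std::error::Error;

// Hypothetical stand-in for the API call: turn raw text into a record count.
// On success the caller is guaranteed an i64; on failure they get an error value.
fn parse_count(raw: &str) -> Result<i64, Box<dyn Error>> {
    // `?` returns early with the error if parsing fails.
    // The ParseIntError is converted into Box<dyn Error> automatically.
    let count = raw.trim().parse::<i64>()?;
    Ok(count)
}

fn main() {
    // The compiler makes us handle both outcomes explicitly.
    for raw in ["295575", "not a number"] {
        match parse_count(raw) {
            Ok(count) => println!("count = {count}"),
            Err(e) => println!("could not parse {raw:?}: {e}"),
        }
    }
}
```

The point is that the signature does the documenting: anything that calls parse_count either gets an i64 or has to deal with the error, and there is no third option.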

Verbosity

Rust is a pretty verbose language compared to what I’m used to. I don’t know how much I need to say about this; the code speaks for itself!

One clear example was the need to write different structs for the different levels of the API response. There are 3 separate structs in my code just to ensure the elements returned are the attributes of each feature layer, and the Crashes struct has to explicitly list out all the columns I want, their types, and indicate that they’re being renamed from their original name in the API. It’s a lot of code!
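To give a feel for the shape of this, here is a rough sketch of three nested structs, one per level of the response. The struct names other than Crashes and all of the field names are assumptions for illustration, not the real columns, and I've left out the serde derives so the snippet runs with only the standard library. With serde you would derive Deserialize on each struct and use #[serde(rename = "...")] on the renamed fields.

```rust
// Top level of the (assumed) response: a list of features.
#[derive(Debug)]
struct ApiResponse {
    features: Vec<Feature>,
}

// Middle level: each feature wraps the attributes we actually care about.
#[derive(Debug)]
struct Feature {
    attributes: Crashes,
}

// Bottom level: every column is listed with its type.
// Field names here are hypothetical placeholders.
#[derive(Debug)]
struct Crashes {
    // With serde this would carry something like #[serde(rename = "OBJECTID")].
    object_id: i64,
    ward: String,
}

// Walk down through all three levels to collect the ids.
fn object_ids(response: &ApiResponse) -> Vec<i64> {
    response
        .features
        .iter()
        .map(|feature| feature.attributes.object_id)
        .collect()
}

fn main() {
    let response = ApiResponse {
        features: vec![
            Feature { attributes: Crashes { object_id: 1, ward: "Ward 1".to_string() } },
            Feature { attributes: Crashes { object_id: 2, ward: "Ward 6".to_string() } },
        ],
    };
    println!("{:?}", object_ids(&response));
    println!("first ward: {}", response.features[0].attributes.ward);
}
```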

To be fair, Rust is designed to prioritize safety and expressiveness - meaning, Rust asks you to write more code up front so that it can catch errors at compile time. The result of this is software which should be more reliable, so there’s a reason for this verbosity.

Nulls

Rust doesn’t have null values - though it does have the Option<T> enum. It was easy enough to account for null values in the API this way by specifying the types in the Crashes struct to be Option<T>.
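As a small sketch of what that looks like in practice, here is a struct with one Option<i64> field and two ways of handling the missing case. The field names are hypothetical, not the actual API columns.

```rust
// Hypothetical subset of a crash record; `total_bicycles` may be null in the API.
struct Crash {
    object_id: i64,
    total_bicycles: Option<i64>,
}

// `match` forces both the present and the missing case to be handled.
fn describe(crash: &Crash) -> String {
    match crash.total_bicycles {
        Some(n) => format!("crash {}: {} bicycle(s)", crash.object_id, n),
        None => format!("crash {}: bicycle count missing", crash.object_id),
    }
}

// `unwrap_or` supplies a default when the value is missing.
fn bicycles_or_zero(crash: &Crash) -> i64 {
    crash.total_bicycles.unwrap_or(0)
}

fn main() {
    let crashes = [
        Crash { object_id: 1, total_bicycles: Some(2) },
        Crash { object_id: 2, total_bicycles: None },
    ];
    for crash in &crashes {
        println!("{} (as number: {})", describe(crash), bicycles_or_zero(crash));
    }
}
```

Whether a missing count should become 0 is a judgment call that depends on the analysis. The nice part is that the compiler won't let you use the value without making that call somewhere.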

Conclusion

In conclusion, it was fun to revisit grabbing this data in Rust. In the future, I’d love to explore building a system in Rust that:

  • integrates with a Postgres database
  • captures the data based on the data we already have available locally
  • processes the data and writes it to the Postgres database

I have a lot of learning to do before getting to that point though! So yeah, Rust is fun and challenging, and I expect that I’ll be writing more of it in the future.