Intro
At CollegeVine, I have had the opportunity to use R professionally, which has been a really cool “full circle” moment for me. I started using R back in 2017 (🤯) in a stats class; it was my gateway into the data world and I still think it’s an excellent language for data science and analytics work.
Around the same time, I discovered a newer language called Julia that my professors had never heard of. I was intrigued by its promise to solve the so-called “two-language problem,” its promised speed, and its syntax, but at the time I didn’t really have the experience to use it for day-to-day tasks and the libraries were not quite as stable as they are now. But over the years, I kept coming back to Julia and ended up doing a lot of my side projects in the language.
I decided to finally take the leap and start using Julia where I could in my workflow. Not surprisingly, it’s been a great addition to my toolkit, and I can use it mostly in place of R.
What my Workflow Looks Like
As a Data Analyst at CollegeVine, my responsibilities can vary greatly from day to day. More recently I’ve felt more like an Analytics Engineer, doing a lot of SQL wrangling and writing a lot of dbt YAML. But when I am using R, generally I’m doing things like:
- running a query to get some baseline data
- manipulating this data, typically with the tidyverse (dplyr, purrr, etc.)
- creating plots
- writing this data elsewhere - either to a CSV, to S3, or maybe even to Slack
My coworker wrote an extremely helpful internal package which we use on a daily basis in all of our R scripts at CollegeVine. The package abstracts away setup steps, provides a set of functions for common data cleaning and manipulation tasks we encounter in our data, and offers a consistent high-level API that we all understand. My first step in incorporating Julia into my workflow would be to recreate our most-used functions from this package.
My post on writing an internal analytics package in Julia provides an idea of the general steps of what goes into creating a package in Julia, and is inspired by the internal R package that we use.
The Package
A General Overview
At a high level, the package contains a number of helper functions which do things like:
- decrypt passwords and store them in Julia’s ENV temporarily (basically a session configuration that doesn’t persist)
- connect to Redshift as well as query it
- read and write data from S3 while ensuring compliance with our bucket’s naming conventions
- generate YAML schemas from Julia dataframes for dbt
Some of the helper functions were just one to two lines long, and often were as simple as wrapping functions from other Julia packages in a more familiar syntax with helpful error messages or warnings. Maintaining the same function names and behavior as our R package made the experience of switching my workflow to Julia relatively painless.
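To give a rough idea of how small these wrappers can be, here is a minimal sketch of a helper that stores a secret in ENV for the current session, plus a getter that fails with a readable message. The function names and the EXAMPLE_DB_PASSWORD key are hypothetical (they are not taken from our internal package), and the decryption step is left out entirely.

```julia
# Hypothetical helpers; the names and the key below are made up for illustration.

# Store a secret in ENV. This only affects the current Julia process
# (and any processes it spawns), so nothing persists after the session ends.
function set_session_secret!(key::AbstractString, value::AbstractString)
    ENV[key] = value
    return nothing
end

# Fetch a secret, raising a descriptive error instead of a bare KeyError.
function get_session_secret(key::AbstractString)
    haskey(ENV, key) || error("Secret $(key) is not set. Call set_session_secret! first.")
    return ENV[key]
end

set_session_secret!("EXAMPLE_DB_PASSWORD", "not-a-real-password")
password = get_session_secret("EXAMPLE_DB_PASSWORD")
```

Wrapping the lookup this way means a missing secret points you at the fix rather than at an opaque key error.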
Comparison to R
Since the code was ported from R, the API of the internal package is basically one-to-one. In fact, moving between R and Julia feels pretty seamless at this point.
There are a few things that have tripped me up. For example, R broadcasts functions by default. So if you have two vectors, foo and bar, it feels perfectly natural to write foo * bar in R.
However, in Julia, this won’t work unless we specify that we want to broadcast the * function.
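A small sketch of the difference, using the same foo and bar names (the values and the double function are made up for illustration):

```julia
foo = [1, 2, 3]
bar = [4, 5, 6]

# `foo * bar` throws a MethodError here: `*` is not defined for two plain vectors.
# The dot asks Julia to broadcast `*` elementwise.
product = foo .* bar

# The same dot syntax works for any function, including your own.
double(x) = 2x
doubled = double.(foo)
```

The dot goes before an operator (`.*`) but after a function name (`double.(foo)`), which is easy to mix up at first.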
I’ve mainly been bitten by this in the context of a DataFramesMeta chain - for some macros/transformations, you can either specify that a function should be broadcast by adding a dot (.) in front of it, or you can use the @byrow macro, which tells DataFramesMeta to apply transformations on a row-by-row basis.
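Here is a minimal sketch of both options, assuming the DataFrames and DataFramesMeta packages are installed. The data frame and its column names are made up for illustration.

```julia
using DataFrames, DataFramesMeta

# Toy data; the columns are hypothetical.
df = DataFrame(price = [1.0, 2.5, 4.0], quantity = [3, 2, 1])

# Option 1: broadcast explicitly with a dot, since :price and :quantity
# refer to whole column vectors inside @transform.
with_dot = @transform(df, :total = :price .* :quantity)

# Option 2: @byrow evaluates the expression one row at a time,
# so the plain scalar `*` is enough.
with_byrow = @transform(df, @byrow :total = :price * :quantity)
```

Both calls should produce the same total column, so which one to use is mostly a matter of taste.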
Speaking of DataFramesMeta, it’s pretty slick! It takes some getting used to coming over from the tidyverse, but for general data transformations it’s been sufficient. There’s also the Tidier.jl package, which basically implements the tidyverse in Julia (I’m attempting not to lean on this package too much and instead approach things the Julian way, but it’s still a cool idea!).
I do miss ggplot2 when working in Julia - I know I could technically call R from Julia to do plotting, but I feel that defeats the purpose of using Julia in the first place. Makie is the plotting library which I have my eye on - but given that we’re primarily using Tableau for data visualizations, I haven’t spent a lot of time attempting to streamline my visualization workflow in Julia.
I’ll probably keep using Julia
…with the caveat that I probably won’t be committing any code in it. Still, it’s been cool to be able to seamlessly integrate Julia into my workflow and use what’s been a hobby language up to this point as a professional tool!
Next up on my learning list: Rust. But more on that another day 😏