Designing an Internal Julia Package

In this post, I’ll explore the benefits of designing and maintaining an internal Julia package for your analytics team, as well as walk through an example of how to set up the package infrastructure.
Julia
Author
Affiliation

Data Analyst at CollegeVine

Published

May 4, 2023

Intro

In a previous post, I discussed the importance of presenting simple, practical projects one can do in Julia. One such project, if you or your team is interested in using Julia for analytics, is an internal package. Internal packages can be used to streamline your team’s workflow, abstract away common boilerplate code, and generally improve knowledge sharing and collaboration across your team.

We use several internal R packages at CollegeVine, which partially inspired this post. Emily Riederer’s fantastic blog post takes a more in-depth look at how to design internal R packages, and I think its advice can be applied to any language!

In this post, I’ll go through a simple example of how an internal package can be used and review how to set this up in Julia.

Why Use an Internal Package?

Imagine for a moment that you’re a Junior Data Analyst starting at a new company called AnalyticWorks. For whatever reason, this company (more specifically, the Chief Data Scientist: John Michaelson) has decided that all of its analytics work will be done with Julia–a language that you may have heard of but are not particularly familiar with.

As a Junior Data Analyst, there are a few basic things that you may already be thinking about needing to do in your day-to-day that you’re now going to have to figure out how to do in Julia. Things like:

  • connecting to the company’s database
  • writing transformations for data
  • producing data visualizations with consistent theming

The anxiety spiral is probably starting already, even though this is a short list! On top of having to learn the language itself, you’ll have to figure out which packages to use for each of these tasks, manage those dependencies and ensure your coworkers can run your code, and adapt to the organization’s best practices for writing Julia. That’s a lot to take in!

An internal package provides a well-documented, well-tested suite of pre-written functions that allows a new analyst to focus on actually doing the analysis rather than spending their time mastering the art of writing boilerplate code. It also allows us to define a shared user experience for working with the company’s data and within the analytics or data science team.

A Short Guide to Package Creation in Julia

I’m going to assume that if you’re reading this tutorial, you have a basic knowledge of Julia (e.g., using the REPL, syntax, etc.) and of version control using git.

Initial Setup

We’ll do the initial setup for our internal package using the PkgTemplates.jl package. We’re going to name this package after our company: AnalyticWorks.jl.

Note

PkgTemplates.jl is recommended by Julia’s official documentation. According to the docs, it makes things “easy, repeatable, and customizable.” I think it is all of those things, though this setup step can be accomplished without it 😄

First, we create an empty GitHub repository called AnalyticWorks.jl.

Now, we’ll run the template setup via PkgTemplates.jl. For example, my basic setup looks something like this:

using PkgTemplates
t = Template(
    user = "mistermichaelll", 
    dir = "~/Documents/github", 
    authors = "Michael Johnson", 
    julia = v"1.8.0", 
    plugins = [GitHubActions(; x86 = true)]
)
t("AnalyticWorks.jl")

PkgTemplates.jl will automatically generate the skeleton of your package, initialize a git repository, and create the initial commits for you, so you’ll want to force push these commits to main using the terminal. On my computer, that looks like:

cd ~/Documents/github/AnalyticWorks 
git push -u -f origin main

The skeleton of our package is now in GitHub, and we can start developing.

Notice that I included an additional plugin in the initial package setup which creates a CI step in GitHub Actions for the package. This allows us to ensure our package builds properly on different Julia versions and also lets us run tests when we make PRs.

Writing Some Functions

Now that we have the initial package created, we can start writing some functions to make our analyst’s life easier.

Remember, the goal here is to abstract away what we can to make the analyst’s life easier. So a great place to start is connecting to the company’s database.

For the purposes of this post, let’s assume that we’ll be connecting to a PostgreSQL database. In Julia, we can use the LibPQ.jl library to connect to and query our Postgres database. But remember - we don’t want the analyst to have to worry about which library to use, how to connect, or any of that!

So, let’s make a very high level function which abstracts this away. The bare-bones function looks like this:

using LibPQ
function connect_to_db()
    username = ENV["DB_USER"]
    password = ENV["DB_PASSWORD"]

    conn = LibPQ.Connection("dbname=postgres user=$username password=$password")

    return conn
end
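
One thing to watch with the string interpolation above: a password containing a space or a quote would break the connection string. Here’s a minimal sketch of a helper that quotes each value, following the libpq rule that values can be wrapped in single quotes with backslashes and single quotes escaped by a backslash. The names quote_conn_value and build_conn_string are my own for illustration and are not part of LibPQ.jl.

```julia
# Quote a value for a libpq keyword/value connection string:
# escape backslashes first, then single quotes, then wrap in single quotes.
function quote_conn_value(value::AbstractString)
    escaped = replace(replace(value, "\\" => "\\\\"), "'" => "\\'")
    return "'" * escaped * "'"
end

# Build "dbname=... user=... password=..." with every value safely quoted.
# (Hypothetical helper, not a LibPQ.jl function.)
function build_conn_string(; dbname, user, password)
    parts = ["dbname" => dbname, "user" => user, "password" => password]
    return join(("$key=$(quote_conn_value(val))" for (key, val) in parts), " ")
end

println(build_conn_string(dbname = "postgres", user = "ana", password = "p'ss word"))
```

Inside connect_to_db(), the result could then be passed to LibPQ.Connection in place of the hand-built string.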

This is a good starting point! From there, a user will probably also want to actually query the database. This is done with the LibPQ.execute function, but again: we want to make this super clear and easy for our analyst, and that means that they should be able to run a query and get a dataframe back easily.

The simplest utility function we could create for this would look like:

function query_db(conn, query)
    LibPQ.execute(conn, query) |> DataFrame
end

Adding Functions to Our Package

To add these functions to our Julia package, we can create a new file in the src folder that we’ll call databaseFunctions.jl. We can put the code for our functions in there to keep things organized.

Note

You can write functions directly in the package module (in this case AnalyticWorks.jl) and use them as long as they’re listed as an export, but I recommend trying to separate things logically and then include() those individual files in the module.

In this case, any utility functions that relate to working with a database will live in databaseFunctions.jl. We may also have a file like visualizationTools.jl for making plots, dataWrangling.jl for transformations, so on and so forth.

Then, we can add that file to our AnalyticWorks.jl file like this:

module AnalyticWorks
include("databaseFunctions.jl")

using LibPQ
using DataFrames

export connect_to_db
export query_db

end

Not every function has to be exported, but generally most of them will be. If you intend for a function to be accessible by the user, make sure you export it!
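
To make the difference concrete, here’s a tiny self-contained sketch. ExportDemo and its two functions are made-up names for illustration and have nothing to do with our actual package; the point is just what export changes for the person typing using.

```julia
module ExportDemo

export shout   # user-facing: available without a prefix after `using .ExportDemo`

# Exported function that relies on an internal helper.
shout(text) = add_emphasis(uppercase(text))

# Not exported: still reachable, but only as `ExportDemo.add_emphasis`.
add_emphasis(text) = text * "!"

end

using .ExportDemo

println(shout("hello"))                       # exported name, no prefix needed
println(ExportDemo.add_emphasis("internal"))  # unexported name needs the module prefix
```

So unexported functions aren’t hidden, they’re just kept out of the user’s namespace, which is a nice way to keep helper functions out of the way.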

Tip

Generally, this is how I like to organize my Julia package modules.

module packageName
## everything that's included
include("foo.jl")
include("bar.jl")

## the dependencies and imports
using DataFrames 
using LibPQ 
import HTTP: request # helpful if you only need specific functions from package

## the user-facing functions
export create_foo 
export delete_bar 
export other_functions

end

For more information on modules, see the Julia docs.

Managing Package Dependencies

To ensure the LibPQ and DataFrames dependencies are accounted for, we need to add them to the Project.toml. To do this¹:

  1. go to the root of your project’s directory via the command line.
  2. open the Julia REPL by running julia, then enter package mode by pressing ].
  3. enter activate ., which activates the Julia environment in the package folder.
  4. write add LibPQ DataFrames, which adds both packages to your dependencies in the Project.toml file.
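
If you want to double-check that the steps above actually recorded the dependencies, one option is to read the Project.toml back with the TOML standard library. This is just a sketch: missing_deps is a hypothetical helper name of mine, and it only looks at the [deps] table.

```julia
using TOML

# Return the names in `required` that are not listed under [deps]
# in the Project.toml found in `project_dir`.
function missing_deps(project_dir::AbstractString, required)
    project = TOML.parsefile(joinpath(project_dir, "Project.toml"))
    deps = get(project, "deps", Dict{String,Any}())
    return [name for name in required if !haskey(deps, name)]
end
```

Running missing_deps(".", ["LibPQ", "DataFrames"]) from the package root should return an empty vector once both packages have been added.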

Trying it Out Locally

Here’s how you can test out the development version of your package locally:

  1. go to the root of your project’s directory via the command line.
  2. open the Julia REPL by running julia, then enter package mode by pressing ].
  3. enter dev .
  4. exit package mode by hitting backspace, then enter using packageName (in our case, using AnalyticWorks)
  5. try out your functions!

This is a great way to see what the user experience of your package is. For example, the use of our very bare bones AnalyticWorks.jl package would look something like this:

using AnalyticWorks

conn = connect_to_db()

query_db(
    conn,
    "
    SELECT 
      *
    FROM my_clean_table
    "
)

Woohoo! We’re now connected to the database and we can query things, and from the user’s perspective we’re only using one package.

Testing Our Package

When developing a package for others to use, we should be absolutely sure that the package is working as we intend, particularly when it comes to functions which transform data or do calculations.

Let’s say that at AnalyticWorks, we calculate a lot of email click-through rates, meaning that we often deal with datasets that look like this:

3×4 DataFrame
 Row │ date        email   clicks  impressions
     │ String      String  Int64   Int64
─────┼─────────────────────────────────────────
   1 │ 2022-01-01  email1     100         1345
   2 │ 2022-01-05  email2      45          780
   3 │ 2022-01-10  email3      12          456

Let’s say that we want to create a function which calculates the click-through rate for us and returns this as a dataframe. The simplest thing we can write that accomplishes that is something like this:

function calculate_ctr(df, clicks_col, impressions_col)
    transform(
        df, 
        [clicks_col, impressions_col] => ByRow(/) => :ctr
    )
end
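
If the ByRow(/) syntax looks a bit opaque, the arithmetic underneath is just clicks divided by impressions, row by row. Here’s that same calculation on plain vectors, using the numbers from the example table, so it runs without DataFrames (the variable names are only for illustration):

```julia
clicks      = [100, 45, 12]
impressions = [1345, 780, 456]

# Elementwise division: the same arithmetic that `ByRow(/)` applies to each row.
ctr = clicks ./ impressions

# Print each row's calculation, rounded for readability.
for (c, i, r) in zip(clicks, impressions, ctr)
    println("$c / $i = $(round(r; digits = 4))")
end
```

Note that dividing two integer columns gives floating-point rates, which is what we want for a :ctr column.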

To make this user-facing, similar to the other functions we wrote, we’ll add it to a file in the src folder of AnalyticWorks.jl called emailAnalysis.jl, include() that file in the module, and add calculate_ctr to the module’s exports.

Now let’s write some tests! Here’s one - there should never be a time when there is a negative click-through rate, as that implies a problem with the data.

To add tests, we can put them in the runtests.jl file in the test folder. I’ll add a copy of the dataframe from above to that file, with a negative clicks value in the first row, to run tests against.

using AnalyticWorks
using Test
using DataFrames

negative_test_df = DataFrame(
    "date" => ["2022-01-01", "2022-01-05", "2022-01-10"],
    "email" => ["email1", "email2", "email3"],
    "clicks" => [-100, 45, 12],
    "impressions" => [1345, 780, 456]
)

3×4 DataFrame
 Row │ date        email   clicks  impressions
     │ String      String  Int64   Int64
─────┼─────────────────────────────────────────
   1 │ 2022-01-01  email1    -100         1345
   2 │ 2022-01-05  email2      45          780
   3 │ 2022-01-10  email3      12          456

What I want to test here is whether or not my transformation function will fail if it encounters bad data. The worst case here would be that it doesn’t!

The easiest way to test this is to see whether the function returns nothing, which is the expected result if we configure an error and a return nothing statement.

@testset "AnalyticWorks.jl" begin
    @test calculate_ctr(negative_test_df, :clicks, :impressions) === nothing
end

As expected, this test fails because we didn’t include that in our initial function! Which means our function is due for a rewrite:

function calculate_ctr(df, clicks_col, impressions_col)
    if any(<(0), df[!, clicks_col])
        @error "❌ ERROR: how can you have negative clicks?"
        return nothing
    elseif any(<(0), df[!, impressions_col])
        @error "❌ ERROR: how can you have negative impressions?"
        return nothing
    end

    transform(
        df,
        [clicks_col, impressions_col] => ByRow(/) => :ctr
    )
end

Our test will now pass as we took care of this edge case and provided our end-user with some helpful (and sassy) guidance. And in CI in our package on GitHub, we can see that the test ran successfully:

┌ Error: ❌ ERROR: how can you have negative clicks?
└ @ AnalyticWorks ~/work/AnalyticWorks.jl/AnalyticWorks.jl/src/emailAnalysis.jl:3
Test Summary:    | Pass  Total  Time
AnalyticWorks.jl |    1      1  2.3s
     Testing AnalyticWorks tests passed

Tip

Make sure your package is well-tested and considers edge cases like these, especially in functions which transform data! Broader point here: test all of your code!

See the Julia docs for more information about unit testing in Julia.
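
As a sketch of what a slightly fuller test file could look like, the example below also checks that the right error is logged, using @test_logs from the Test standard library. To keep it runnable on its own, it uses ctr_or_nothing, a stand-in I made up that works on plain vectors rather than our real calculate_ctr, and it adds a zero-impressions check that the function in this post does not have.

```julia
using Test

# Stand-in for `calculate_ctr` on plain vectors (an assumption of this sketch,
# so the example runs without DataFrames).
function ctr_or_nothing(clicks, impressions)
    if any(<(0), clicks)
        @error "negative clicks found"
        return nothing
    elseif any(<=(0), impressions)
        @error "impressions must be positive"
        return nothing
    end
    return clicks ./ impressions
end

@testset "ctr edge cases" begin
    # Valid input returns one rate per row.
    @test ctr_or_nothing([100, 45], [1345, 780]) ≈ [100 / 1345, 45 / 780]

    # Bad input returns nothing and logs the expected error message.
    result = @test_logs (:error, r"negative clicks") ctr_or_nothing([-100], [1345])
    @test result === nothing
    result = @test_logs (:error, r"impressions") ctr_or_nothing([10], [0])
    @test result === nothing

    # Why zero impressions matter: floating-point division does not throw.
    @test 45 / 0 == Inf
    @test isnan(0 / 0)
end
```

The last two tests are the reason I’d consider guarding against zero impressions as well: without a check, an Inf or NaN click-through rate would flow silently into whatever comes next.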

Tailoring Our Functions to Our Users

Let’s think about how we can improve the first database function we wrote. Here are a few guiding questions:

  • can we assume the user knows how to define their environment variables? If not, how do we make this clear to them?
  • is the database name always the same?

The answer to the first question is “probably not,” given that our audience is a Junior Analyst who has never used Julia in a professional setting. We should put some guardrails in place to ensure that the user understands why something fails and what they can immediately do to fix it.

For the second question, the answer could be “maybe, probably? Idk!”, in which case the safest option would be to provide an optional keyword argument that allows a user to change the database they’re connecting to.

Cleaning Up the connect_to_db() Function

Here’s an improved function which provides the user more context when something goes wrong, and also adds an additional keyword argument which allows the user to specify something other than the default database name:

function connect_to_db(;db_name = "postgres")
    try
        ENV["DB_USER"]
        ENV["DB_PASSWORD"]
    catch e
        @error "❌ You are missing one or more required environment variables." * "\nEnvironment variables are defined in the `startup.jl` file, please ensure the following variables are defined: DB_USER, DB_PASSWORD"
        return nothing
    end

    conn = nothing

    try
        username = ENV["DB_USER"]
        password = ENV["DB_PASSWORD"]
        conn = LibPQ.Connection("dbname=$db_name user=$username password=$password")
    catch e
        @error "❌ Error connecting to database: $e"
    end

    return conn
end

Now our function checks whether the relevant environment variables exist (for us, they’re set in the user’s startup.jl file). If they don’t, it logs an error describing exactly what is wrong and returns nothing. It also provides some helpful context if the connection itself fails.
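
An alternative to the try/catch probe is to ask ENV directly which variables are missing, which also lets the error message name the exact ones. Here’s a minimal sketch; missing_env_vars is a hypothetical helper of mine, and withenv is only used here to simulate a user who has set one variable but not the other.

```julia
# Return the names in `required` that are not set in the current environment.
missing_env_vars(required) = [name for name in required if !haskey(ENV, name)]

# `withenv` sets (or, with `nothing`, unsets) variables only for the
# duration of the block, which makes the check easy to try out.
withenv("DB_USER" => "analyst", "DB_PASSWORD" => nothing) do
    println(missing_env_vars(["DB_USER", "DB_PASSWORD"]))
end
```

In connect_to_db(), an empty result would mean it’s safe to proceed, and a non-empty one could be interpolated straight into the @error message.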

Though there’s a bit more to think through here, I think that the upfront development cost is worth the ease of use for our Junior Analyst.

If a user tries to run this without setting up their startup.jl file, they’ll get something that looks like:

┌ Error: ❌ You are missing one or more required environment variables.
│ Environment variables are defined in the `startup.jl` file, please ensure the following variables are defined: DB_USER, DB_PASSWORD
└ @ Main Untitled-1:8

Make Getting to startup.jl Easier

I’ve made a few references to the startup.jl file, which allows us to set environment variables that Julia can access across projects. The Julia docs make a reference to setting environment variables in this file:

Supposing that you want to set the environment variable JULIA_EDITOR to vim, you can type ENV["JULIA_EDITOR"] = "vim" (for instance, in the REPL) to make this change on a case by case basis, or add the same to the user configuration file ~/.julia/config/startup.jl in the user’s home directory to have a permanent effect. The current value of the same environment variable can be determined by evaluating ENV["JULIA_EDITOR"].

One way I like to use this is to ensure that I’m not hard-coding secrets into my scripts while allowing things like database usernames and passwords, API tokens, or other long variables to persist across my Julia projects.

Caution

Note that there is a ton you can do with startup.jl beyond just setting ENV variables. After all, this is just code that gets run when you start Julia!

I’ve started a discussion on the Julia Discourse about whether it’s OK to use startup.jl in this way, since it isn’t specifically documented as a way of storing credentials like .Renviron is for R. My feeling is that it’s probably fine as long as you aren’t passing around your startup.jl file or committing it to version control. You could also set up a process to load secrets from another file instead of storing the credentials directly.

That being said, you should manage secrets in a way that makes the most sense for your organization, whether that’s .env files or another method, and your package design should reflect that.
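
For the “load secrets from another file” idea, here’s a rough sketch of what a tiny loader for a dotenv-style file could look like. load_env_file is a name I made up, and the parsing is deliberately simple (KEY=VALUE lines, # comments, optional surrounding quotes), so treat it as a starting point rather than a full .env parser.

```julia
# Load KEY=VALUE pairs from a dotenv-style file into ENV.
# Blank lines and lines starting with '#' are skipped; variables that are
# already set are left alone unless `override = true`.
function load_env_file(path::AbstractString; override::Bool = false)
    loaded = String[]
    for raw_line in eachline(path)
        line = strip(raw_line)
        (isempty(line) || startswith(line, "#")) && continue
        occursin("=", line) || continue            # ignore malformed lines
        key, value = split(line, "="; limit = 2)   # split on the first '=' only
        key = strip(key)
        value = strip(strip(value), ['"', '\''])   # drop surrounding quote characters
        if override || !haskey(ENV, key)
            ENV[key] = value
            push!(loaded, key)
        end
    end
    return loaded   # names of the variables that were set
end
```

A function like this could be called at the top of connect_to_db(), with the file itself kept out of version control.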

The startup.jl file is not something that I would expect a Julia novice to be aware of - so if we really want to make life easier on the analyst, we could create a function that opens up startup.jl and point our user towards it in our error message!

Note

This is inspired by the edit_r_environ() function from the usethis R package. That function takes advantage of the rstudioapi to open the file in an RStudio window.

This function just runs a terminal command to open the file in whatever the user’s default editor is (Visual Studio Code in my case). It’s not quite the same, but a similar experience!

function edit_julia_startup(;user = "")
    if user == ""
        user = ENV["USER"]
    end

    path = "/Users/$user/.julia/config/startup.jl"

    command = "open"

    run(`$command $path`)
end

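This function opens up the startup.jl file found in the default Julia path. As written, the /Users/$user path and the open command assume macOS, so other platforms would need a different path and command. We could adjust the error message as needed to point the user here.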
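
If your team isn’t all on the same operating system, a more portable variation is to build the path from Julia’s own depot location and let the edit function from the InteractiveUtils standard library pick the editor. This is a sketch under a couple of assumptions: startup_path and edit_startup_portable are names I made up, and I’m assuming the first entry of DEPOT_PATH is the user depot (usually ~/.julia).

```julia
using InteractiveUtils: edit

# Path of startup.jl inside a Julia depot (defaults to the first entry of
# DEPOT_PATH, which is usually ~/.julia).
startup_path(depot::AbstractString = first(DEPOT_PATH)) =
    joinpath(depot, "config", "startup.jl")

# Create the file if needed, then open it in the editor Julia is configured to use.
function edit_startup_portable(depot::AbstractString = first(DEPOT_PATH))
    path = startup_path(depot)
    mkpath(dirname(path))          # make sure the config folder exists
    isfile(path) || touch(path)    # create an empty startup.jl on first use
    edit(path)
end
```

Because edit respects the JULIA_EDITOR environment variable, this ties in nicely with the docs passage quoted earlier.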

┌ Error: ❌ You are missing one or more required environment variables.
│ Environment variables are defined in the `startup.jl` file, please ensure the following variables are defined: DB_USER, DB_PASSWORD
│ To edit your `startup.jl` file, please use the `edit_julia_startup()` function.
└ @ Main In[90]:1

For credentials that we do not want to persist across projects, we would have to set up a different system for accessing these variables.

Wrapping it All Up

In this post, we explored a very simple example of how to set up an internal Julia package aimed at setting a Junior Data Analyst on the path to successfully using Julia in their day-to-day.

Obviously, there’s a lot more ground we could cover here like utility functions for data visualization, writing good documentation, and more!

Ultimately, by incorporating best practices for code organization and documentation as well as thoughtful design patterns, your internal Julia package can become a valuable asset for your team and facilitate smoother collaboration and knowledge sharing.

Footnotes

  1. Most clearly explained in this article: https://medium.com/coffee-in-a-klein-bottle/developing-your-julia-package-682c1d309507