Intro
In a previous post, I discussed the importance of presenting simple, practical projects one can do in Julia. One such project, if you or your team is interested in using Julia for analytics, is an internal package. Internal packages can be used to streamline your team’s workflow, abstract away common boilerplate code, and generally improve knowledge sharing and collaboration across your team.
We use several internal R packages at CollegeVine, which partially inspired this post. Emily Riederer’s fantastic blog post is also a more in-depth look at how to design internal R packages, and I think its lessons can be applied to any language!
In this post, I’ll go through a simple example of how an internal package can be used and review how to set this up in Julia.
Why Use an Internal Package?
Imagine for a moment that you’re a Junior Data Analyst starting at a new company called AnalyticWorks. For whatever reason, this company (more specifically, the Chief Data Scientist: John Michaelson) has decided that all of its analytics work will be done with Julia–a language that you may have heard of but are not particularly familiar with.
As a Junior Data Analyst, there are a few basic things that you may already be thinking about needing to do in your day-to-day that you’re now going to have to figure out how to do in Julia. Things like:
- connecting to the company’s database
- writing transformations for data
- producing data visualizations with consistent theming
The anxiety spiral is probably starting already, even though this is a short list! On top of having to learn the language itself, you’ll have to figure out which packages to use for each of these tasks, manage those dependencies and ensure your coworkers can run your code, and adapt to the organization’s best practices for writing Julia. That’s a lot to take in!
An internal package provides a well-documented, well-tested suite of pre-written functions that allows a new analyst to focus on actually doing the analysis rather than spending their time mastering the art of writing boilerplate code. It also allows us to define a shared user experience for working with the company’s data and within the analytics or data science team.
A Short Guide to Package Creation in Julia
I’m going to assume if you’re reading this tutorial, you have a basic knowledge of Julia (e.g., using the REPL, syntax, etc.) and version control using git.
Initial Setup
We’ll do the initial setup for our internal package using the PkgTemplates.jl package. We’re going to name this package after our company: AnalyticWorks.jl.
PkgTemplates.jl is recommended by Julia’s official documentation. According to the docs, it makes things “easy, repeatable, and customizable.” I think it is all of those things, though this setup step can be accomplished without it 😄
First, we create an empty GitHub repository called AnalyticWorks.jl.
Now, we’ll run the template setup via PkgTemplates.jl. For example, my basic setup looks something like this:
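A minimal sketch of such a setup, assuming PkgTemplates.jl is installed. The GitHub username, output directory, and minimum Julia version below are placeholders, not the author's actual settings; `Git` and `GitHubActions` are the PkgTemplates.jl plugins for the git repository and the CI workflow.

```julia
using PkgTemplates

# All values here are illustrative placeholders.
t = Template(;
    user = "your-github-username",   # GitHub account or organization that will own the repo
    dir = "~/projects",              # folder in which the package directory is created
    julia = v"1.6",                  # minimum supported Julia version
    plugins = [
        Git(; manifest = false),     # initialize git; do not commit Manifest.toml
        GitHubActions(),             # add a CI workflow for GitHub Actions
    ],
)

# Generate the package skeleton (src/, test/, Project.toml, CI config).
t("AnalyticWorks")
```

Plugins that are not listed keep their defaults, so the generated package still gets a license, a README, and a test folder.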
PkgTemplates.jl
will automatically generate the skeleton of your package, as well as initiate a git project and create the initial commits for you–so you’ll want to force push these commits to master using the terminal. On my computer, that looks like:
The skeleton of our package is now in GitHub, and we can start developing.
Notice that I included an additional plugin in the initial package setup which creates a CI step in GitHub Actions for the package. This allows us to ensure our package builds properly on different Julia versions and also lets us run tests when we make PRs.
Writing Some Functions
Now that we have the initial package created, we can start writing some functions to make our analyst’s life easier.
Remember, the goal here is to abstract away what we can to make the analyst’s life easier. So a great place to start is connecting to the company’s database.
For the purposes of this post, let’s assume that we’ll be connecting to a PostgreSQL database. In Julia, we can use the LibPQ.jl library to connect to and query our Postgres database. But remember - we don’t want the analyst to have to worry about which library to use, and how to connect, or any of that!
So, let’s make a very high level function which abstracts this away. The bare-bones function looks like this:
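A bare-bones version might look like the sketch below. It assumes a database named postgres and credentials stored in the DB_USER and DB_PASSWORD environment variables, matching the improved version later in this post; there is deliberately no error handling yet.

```julia
using LibPQ

# Bare-bones connection helper: no error handling yet.
# Assumes DB_USER and DB_PASSWORD are set in the environment.
function connect_to_db()
    username = ENV["DB_USER"]
    password = ENV["DB_PASSWORD"]
    return LibPQ.Connection("dbname=postgres user=$username password=$password")
end
```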
This is a good starting point! From there, a user will probably also want to actually query the database. This is done with the LibPQ.execute function, but again: we want to make this super clear and easy for our analyst, and that means that they should be able to run a query and get a dataframe back easily.
The simplest utility function we could create for this would look like:
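A sketch of such a helper is below. The name `query_db` is made up for this example; it relies on the fact that a LibPQ query result can be passed straight to the `DataFrame` constructor.

```julia
using LibPQ
using DataFrames

# Run a SQL query on an open connection and return the result as a DataFrame.
# `query_db` is a hypothetical name chosen for this sketch.
function query_db(conn::LibPQ.Connection, query::AbstractString)
    result = LibPQ.execute(conn, query)
    return DataFrame(result)
end
```

With an open connection `conn`, a call like `query_db(conn, "SELECT 1 AS x")` would then hand the analyst a dataframe without them ever touching LibPQ directly.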
Adding Functions to Our Package
To add these functions to our Julia package, we can create a new file in the src folder that we’ll call databaseFunctions.jl. We can put the code for our functions in there to keep things organized.
You can write functions directly in the package module (in this case AnalyticWorks.jl) and use them as long as they’re listed as an export, but I recommend separating things logically and then using include() to pull those individual files into the module.
In this case, any utility functions that relate to working with a database will live in databaseFunctions.jl. We may also have a file like visualizationTools.jl for making plots, dataWrangling.jl for transformations, and so on.
Then, we can add that file to our AnalyticWorks.jl file like this:
Not every function has to be exported, but generally most of them will be. If you intend for a function to be accessible by the user, make sure you export it!
Generally, this is how I like to organize my Julia package modules.
module packageName

## the dependencies and imports
## (loaded first so the included files can use them)
using DataFrames
using LibPQ
import HTTP: request # helpful if you only need specific functions from a package

## everything that's included
include("foo.jl")
include("bar.jl")

## the user-facing functions
export create_foo
export delete_bar
export other_functions

end
For more information on modules, see the Julia docs.
Managing Package Dependencies
To ensure the LibPQ and DataFrames dependencies are accounted for, we need to add them to the Project.toml. To do this [1]:
- go to the root of your project’s directory via the command line.
- open the Julia REPL by running julia, then enter package mode by pressing ].
- enter activate ., which activates the Julia environment in the package folder.
- enter add LibPQ, DataFrames, which adds both packages to your dependencies in the Project.toml file.
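The same steps can also be scripted with the Pkg API instead of the package-mode prompt. A small sketch, where the helper name `add_dependencies` is made up for illustration:

```julia
using Pkg

# Hypothetical helper: activate the package environment at `project_dir`
# and record the given packages in its Project.toml.
function add_dependencies(project_dir::AbstractString, pkgs::Vector{String})
    Pkg.activate(project_dir)   # same as `activate .` when run from the package root
    Pkg.add(pkgs)               # same as `add LibPQ, DataFrames`
end

# From the package root:
# add_dependencies(".", ["LibPQ", "DataFrames"])
```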
Trying it Out Locally
Here’s how you can test out the development version of your package locally:
- go to the root of your project’s directory via the command line.
- open the Julia REPL by running julia, then enter package mode by pressing ].
- enter dev .
- exit package mode by hitting backspace, then enter using packageName
- try out your functions!
This is a great way to see what the user experience of your package is. For example, the use of our very bare bones AnalyticWorks.jl package would look something like this:
Woohoo! We’re now connected to the database and we can query things, and from the user’s perspective we’re only using one package.
Testing Our Package
When developing a package for others to use, we should be absolutely sure that the package is working as we intend, particularly when it comes to functions that transform data or do calculations.
Let’s say that at AnalyticWorks, we calculate email click-through rates a lot, which means we often work with datasets that look like this:
Row | date | email | clicks | impressions |
---|---|---|---|---|
 | String | String | Int64 | Int64 |
1 | 2022-01-01 | email1 | 100 | 1345 |
2 | 2022-01-05 | email2 | 45 | 780 |
3 | 2022-01-10 | email3 | 12 | 456 |
Let’s say that we want to create a function which calculates the click-through rate for us and returns this as a dataframe. The simplest thing we can write that accomplishes that is something like this:
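A first draft, consistent with the rewritten version shown later in this section, could be as small as the sketch below. The variable name `emails` is only for illustration; the data is the table above.

```julia
using DataFrames

# First draft: divide clicks by impressions row by row, with no validation yet.
function calculate_ctr(df, clicks_col, impressions_col)
    transform(df, [clicks_col, impressions_col] => ByRow(/) => :ctr)
end

emails = DataFrame(
    "date" => ["2022-01-01", "2022-01-05", "2022-01-10"],
    "email" => ["email1", "email2", "email3"],
    "clicks" => [100, 45, 12],
    "impressions" => [1345, 780, 456],
)

# Returns a copy of the data with an extra Float64 column named ctr.
calculate_ctr(emails, "clicks", "impressions")
```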
To make this user-facing, similar to the other functions we wrote, we’ll add it to a file in the package’s src folder called emailAnalysis.jl, include that file in the AnalyticWorks.jl module, and add the function to the module’s exports as well.
Now let’s write some tests! Here’s one - there should never be a time when there is a negative click-through rate, as that implies a problem with the data.
To add tests, we can add those to the runtests.jl file in the test folder. I’ll add the same dataframe from above to that file to run tests against, but with a negative click count in the first row.
using AnalyticWorks
using Test
using DataFrames
negative_test_df = DataFrame(
    "date" => ["2022-01-01", "2022-01-05", "2022-01-10"],
    "email" => ["email1", "email2", "email3"],
    "clicks" => [-100, 45, 12],
    "impressions" => [1345, 780, 456]
)
Row | date | email | clicks | impressions |
---|---|---|---|---|
 | String | String | Int64 | Int64 |
1 | 2022-01-01 | email1 | -100 | 1345 |
2 | 2022-01-05 | email2 | 45 | 780 |
3 | 2022-01-10 | email3 | 12 | 456 |
What I want to test here is whether or not my transformation function will fail if it encounters bad data. The worst case here would be that it doesn’t!
The easiest way to test this is to see whether the function returns nothing, which is the expected result if we configure an error and a return nothing statement.
As expected, this test fails because we didn’t include that in our initial function! Which means our function is due for a rewrite:
function calculate_ctr(df, clicks_col, impressions_col)
    if any(df[!, clicks_col] .< 0)
        @error "❌ ERROR: how can you have negative clicks?"
        return nothing
    elseif any(df[!, impressions_col] .< 0)
        @error "❌ ERROR: how can you have negative impressions?"
        return nothing
    end
    # valid data: add a ctr column computed row by row as clicks / impressions
    transform(
        df,
        [clicks_col, impressions_col] => ByRow(/) => :ctr
    )
end
Our test will now pass as we took care of this edge case and provided our end-user with some helpful (and sassy) guidance. And in CI in our package on GitHub, we can see that the test ran successfully:
┌ Error: ❌ ERROR: how can you have negative clicks?
└ @ AnalyticWorks ~/work/AnalyticWorks.jl/AnalyticWorks.jl/src/emailAnalysis.jl:3
Test Summary: | Pass Total Time
AnalyticWorks.jl | 1 1 2.3s
Testing AnalyticWorks tests passed
Make sure your package is well-tested and considers edge cases like these, especially in functions which transform data! Broader point here: test all of your code!
See the Julia docs for more information about unit testing in Julia.
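To try this outside the package, the whole check can be run as one standalone script. This is only a sketch: it repeats a condensed version of the rewritten function so that it runs on its own (inside the package you would write `using AnalyticWorks` instead), and the small dataframes and the testset name are made up for illustration.

```julia
using DataFrames
using Test

# Condensed version of the rewritten function, repeated so this script is standalone.
function calculate_ctr(df, clicks_col, impressions_col)
    if any(df[!, clicks_col] .< 0) || any(df[!, impressions_col] .< 0)
        @error "negative clicks or impressions found"
        return nothing
    end
    transform(df, [clicks_col, impressions_col] => ByRow(/) => :ctr)
end

good_df = DataFrame("clicks" => [100, 45], "impressions" => [1345, 780])
bad_clicks_df = DataFrame("clicks" => [-100, 45], "impressions" => [1345, 780])
bad_impressions_df = DataFrame("clicks" => [100, 45], "impressions" => [-1345, 780])

@testset "calculate_ctr" begin
    # bad data is rejected
    @test isnothing(calculate_ctr(bad_clicks_df, "clicks", "impressions"))
    @test isnothing(calculate_ctr(bad_impressions_df, "clicks", "impressions"))
    # valid data gets a ctr column equal to clicks / impressions
    result = calculate_ctr(good_df, "clicks", "impressions")
    @test result.ctr ≈ [100 / 1345, 45 / 780]
end
```

Testing the happy path alongside the failure cases guards against a validation check that is accidentally too strict.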
Tailoring Our Functions to Our Users
Let’s think about how we can improve the first database function we wrote. Here are a few guiding questions:
- can we assume the user knows how to define their environment variables? If not, how do we make this clear to them?
- is the database name always the same?
The answer to the first question is “probably not,” given that our audience is a Junior Analyst who has never used Julia in a professional setting. We should put some guardrails in place to ensure that the user understands why something fails and what they can immediately do to fix it.
For the second question, the answer could be “maybe, probably? Idk!”, in which case the safest option would be to provide an optional keyword argument that allows a user to change the database they’re connecting to.
Cleaning Up the connect_to_db() Function
Here’s an improved function which provides the user more context when something goes wrong, and also adds an additional keyword argument which allows the user to specify something other than the default database name:
function connect_to_db(; db_name = "postgres")
    # fail early, with guidance, if the credentials are not defined
    if !haskey(ENV, "DB_USER") || !haskey(ENV, "DB_PASSWORD")
        @error "❌ You are missing one or more required environment variables." *
               "\nEnvironment variables are defined in the `startup.jl` file, please ensure the following variables are defined: DB_USER, DB_PASSWORD"
        return nothing
    end

    conn = nothing
    try
        username = ENV["DB_USER"]
        password = ENV["DB_PASSWORD"]
        conn = LibPQ.Connection("dbname=$db_name user=$username password=$password")
    catch e
        # report the failure; conn stays nothing
        @error "❌ Error connecting to database: $e"
    end
    return conn
end
Now our function checks whether the relevant environment variables are defined (for example, in the user’s startup.jl file). If they aren’t, it logs an error describing exactly what is wrong and returns nothing. It also provides some helpful context if the connection itself fails.
Though there’s a bit more to think through here, I think that the upfront development cost is worth the ease of use for our Junior Analyst.
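One possible refinement, not part of the function above, is to tell the user exactly which variables are missing rather than listing all of them. A small sketch, where `missing_env_vars` is a hypothetical helper name and the `env` keyword exists only so the check can be tried without touching the real environment:

```julia
# Hypothetical helper: return the names in `required` that are not set in `env`.
function missing_env_vars(required; env = ENV)
    return [name for name in required if !haskey(env, name)]
end

missing_env_vars(["DB_USER", "DB_PASSWORD"]; env = Dict("DB_USER" => "analyst"))
# returns ["DB_PASSWORD"]
```

The result could be interpolated into the error message so the analyst knows precisely what to add.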
If a user tries to run this without setting up their startup.jl file, they’ll get something that looks like:
Make Getting to startup.jl Easier
I’ve made a few references to the startup.jl file, which allows us to set environment variables that Julia can access across projects. The Julia docs make a reference to setting environment variables in this file:
Supposing that you want to set the environment variable JULIA_EDITOR to vim, you can type ENV["JULIA_EDITOR"] = "vim" (for instance, in the REPL) to make this change on a case by case basis, or add the same to the user configuration file ~/.julia/config/startup.jl in the user’s home directory to have a permanent effect. The current value of the same environment variable can be determined by evaluating ENV["JULIA_EDITOR"].
One way I like to use this is to ensure that I’m not hard-coding secrets into my scripts while allowing things like database usernames and passwords, API tokens, or other long variables to persist across my Julia projects.
Note that there is a ton you can do with startup.jl beyond just setting ENV variables. After all, this is just code that gets run when you start Julia!
I’ve started a discussion on the Julia Discourse about whether it’s ok to use startup.jl in this way, since it isn’t specifically documented as a way of storing credentials like .Renviron is for R. My feeling is that it’s probably fine as long as you aren’t passing around your startup.jl file or committing it to version control. You could also set up a process to load secrets from another file instead of storing the credentials directly.
That being said, you should manage secrets in a way that makes the most sense for your organization, and your package design should reflect that, whether that means using .env files or other methods.
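As an illustration of loading secrets from another file, here is a minimal sketch that reads simple KEY=VALUE lines into ENV. The helper name `load_env_file` is made up, and the parsing is intentionally naive (no quoting or escaping rules), so treat it as a starting point only.

```julia
# Hypothetical helper: read KEY=VALUE lines from a file into `env` (ENV by default).
# Blank lines, lines starting with '#', and lines without '=' are skipped.
function load_env_file(path::AbstractString; env = ENV)
    for line in eachline(path)
        stripped = strip(line)
        (isempty(stripped) || startswith(stripped, "#")) && continue
        occursin("=", stripped) || continue
        # split on the first '=' only, so values may themselves contain '='
        key, value = split(stripped, "="; limit = 2)
        env[String(strip(key))] = String(strip(value))
    end
    return env
end

# Try it on a throwaway file, writing into a Dict instead of the real ENV.
path, io = mktemp()
write(io, "# credentials\nDB_USER=analyst\nDB_PASSWORD=s3cret=x\n")
close(io)
secrets = load_env_file(path; env = Dict{String,String}())
```

A call to `load_env_file` could live in startup.jl or at the top of `connect_to_db()`, depending on how your organization prefers to handle credentials.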
The startup.jl file is not something that I would expect a Julia novice to be aware of - so if we really want to make life easier on the analyst, we could create a function that opens up startup.jl and point our user towards it in our error message!
This is inspired by the edit_r_environ() function from the usethis R package. That function takes advantage of the rstudioapi package to open the file in an RStudio window.
Our Julia version, edit_julia_startup(), just runs a terminal command to open the file in whatever the user’s default editor is (Visual Studio Code in my case). It’s not quite the same, but a similar experience!
This function opens up the startup.jl file found in the default Julia path. We could adjust the error message as needed to point the user here.
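A function along those lines might be sketched as follows. The specific commands are assumptions about how one could do this (`open` on macOS, `xdg-open` on Linux desktops, `start` on Windows), and `startup_path` is a helper name invented for this example.

```julia
# Path of the user-level startup file, as given in the Julia docs quoted above.
startup_path() = joinpath(homedir(), ".julia", "config", "startup.jl")

# Open startup.jl with the operating system's default application,
# creating the file first if it does not exist yet.
function edit_julia_startup()
    path = startup_path()
    mkpath(dirname(path))
    isfile(path) || touch(path)
    cmd = if Sys.isapple()
        `open $path`
    elseif Sys.iswindows()
        `cmd /c start "" $path`
    else
        `xdg-open $path`   # assumes a Linux desktop with xdg-utils installed
    end
    run(cmd)
    return path
end
```

Creating the file when it is missing means a brand-new Julia user lands in an empty editor window rather than on an error.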
┌ Error: ❌ You are missing one or more required environment variables.
│ Environment variables are defined in the `startup.jl` file, please ensure the following variables are defined: DB_USER, DB_PASSWORD
│ To edit your `startup.jl` file, please use the `edit_julia_startup()` function.
└ @ Main In[90]:1
For credentials that we do not want to persist across projects, we would have to set up a different system for accessing these variables.
Wrapping it All Up
In this post, we explored a very simple example of how to set up an internal Julia package aimed at setting a Junior Data Analyst on the path to successfully using Julia in their day-to-day.
Obviously, there’s a lot more ground we could cover here like utility functions for data visualization, writing good documentation, and more!
Ultimately, by incorporating best practices for code organization and documentation as well as thoughtful design patterns, your internal Julia package can become a valuable asset for your team and facilitate smoother collaboration and knowledge sharing.
Footnotes
[1] Most clearly explained in this article: https://medium.com/coffee-in-a-klein-bottle/developing-your-julia-package-682c1d309507