Analysts Should Test Their Code Too!

Unit testing shouldn’t just be for Software Engineers! In this post I explore a real-life situation where unit testing could have saved me from extra work.
Data Analyst at CollegeVine

Published December 27, 2022

In my time at CollegeVine, I’ve been lucky to work within an excellent engineering environment. That environment pushed me to reevaluate some of my bad coding habits and pick up a few tips that Data Analysts can apply to their R workflows.

My first post (as the title implies) is about testing your code, even as a Data Analyst.

Test. Your. Code.

Writing unit tests may seem like an obvious idea (or even a minimum requirement) if you’re a Software Engineer, but I tend to get surprised reactions from other Data Analysts when I mention that we write unit tests even for analytics-focused packages or scripts. Even I was resistant to the idea at first, thinking it was just a friction point in moving quickly and getting analyses done.

My coworker broke this line of thinking by repeatedly posing the question: “How do you know this code is doing what you think it should be doing?” Sometimes the answer is obvious: of course I know what the code is doing, I can see it! For example, if I punch the following into R’s console, I expect the answer to be 2.

1 + 1
#> [1] 2

Simple, right? It just works! And it’s immediately obvious that this does what we expect, because we are interacting with the code directly.

The danger arises when your code becomes more complex or you begin building pipelines which handle data transformation. Often, you are no longer directly interacting with the code; the code is running on a schedule, it’s automated, and you just have to trust that it behaves when you aren’t watching.

The danger here is not loud, clear errors which break the job and bring the whole thing tumbling down. The danger is in silent errors: a more sinister scenario where your code has failed and you are none the wiser.
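To make that concrete, here is a contrived sketch (not code from our pipeline) of how a routine data transformation can drop rows without raising any error:

library(dplyr)

responses <- data.frame(question_id = 1:3, answer = c("a", "b", "c"))
questions <- data.frame(question_id = 1:2, text = c("Q1?", "Q2?"))

# The inner join quietly discards response 3 because its key has no match.
joined <- inner_join(responses, questions, by = "question_id")
nrow(joined)
#> [1] 2

No warning, no error. Unless you happen to be counting rows, the lost record is invisible.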

A Real World Example

We often collect survey data about various topics in college admissions to produce insights reports. Recently, in the process of analyzing results, my coworker asked why I had excluded one of the questions from the summary of results I generated.

The problem was: I hadn’t, or…at least not intentionally.

What had happened was that I changed one of my functions to fix an issue I had noticed. The fix worked, but it also caused the function to drop questions of a certain type.

This was not a loud failure. My function still ran, the rest of the script worked perfectly, and the results still got compiled to a PDF. I hadn’t noticed the missing question in my glance over the results, and it took the coworker who had worked on the survey to bring it to my attention.

I could have avoided this by writing a test. Something very simple would have done the job: when I changed my function, the test would have caught the missing questions and given me an error message pointing to exactly where my code had failed. The testthat package makes this pretty easy.

library(testthat)
library(dplyr)

test_set <- readRDS("test_file.rds") # we saved intermediate objects to test our functions on

test_that("all questions are present in response data", {
    n_questions_in_survey <- 10
    n_questions_in_response <- parse_survey_to_df(test_set) %>%
      pull(question) %>%
      unique() %>% # count distinct questions, not response rows
      length()
    
    expect_equal(n_questions_in_survey, n_questions_in_response)
})
#> ── Failure: all questions are present in response data ────────────────
#> `n_questions_in_survey` not equal to `n_questions_in_response`.
#> 1/1 mismatches
#> [1] 10 - 9 == 1

This would have enabled me to rewrite the function to pass the test case, and this whole scenario would never have happened.

Unit testing isn’t just for software!

You should be able to envision other scenarios where it makes sense to write tests beyond things like creating packages. Even in an analytics workflow, you could test things like (a couple of sketches follow this list):

  • verifying the shape or type of an input file
  • checking whether null data is present where it isn’t expected
  • confirming that functions give the expected output (the right type, shape, values, etc.)
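
As a minimal sketch of the first two ideas, assuming a hypothetical input file with question and answer columns (the file name and columns here are made up for illustration):

library(testthat)

responses <- readRDS("responses.rds") # hypothetical input file

test_that("input file has the expected shape", {
    expect_s3_class(responses, "data.frame")
    expect_named(responses, c("question", "answer"))
})

test_that("no nulls where we don't expect them", {
    expect_false(any(is.na(responses$question)))
})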

It’s not a bad idea to try to think through what might cause your code to fail silently. There is a whole school of thought where you come up with these edge cases first and write your code to pass tests covering them, as opposed to writing tests after the fact (if you’re interested, look into Test-Driven Development).
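
As a tiny illustration of that workflow (count_questions() is a hypothetical function, not one from our codebase), the test comes first and fails, then you write just enough code to make it pass:

library(testthat)

# Step 1: the test is written before the function exists, so it fails.
test_that("count_questions() counts distinct questions", {
    df <- data.frame(question = c("q1", "q1", "q2"))
    expect_equal(count_questions(df), 2)
})

# Step 2: write just enough code to make the test pass.
count_questions <- function(df) length(unique(df$question))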

Final Thoughts

Writing tests, even for your analytics scripts and custom functions, will make you a better and more defensive programmer. Those qualities will serve you well in a data career and carry over to disciplines like Data Science, Data Engineering, and beyond.

At the end of the day, if your code is going to fail…you want it to fail as loudly as possible.