Repeatability

Every data scientist dreams of doing analysis that makes a big impact on their organization, perhaps even altering the trajectory of the company! But any such analysis should be revisited as we continue to learn new information or when its assumptions are called into question.

Have you ever revisited an analysis from 6 months ago, only to struggle even to re-run the code? Perhaps the data are no longer available because of a retention policy in the database. Perhaps a new library version has broken the code. Or maybe you can’t make sense of a complicated data transformation step. Being able to reproduce an analysis we’ve done in the past is perhaps the most important step in increasing our confidence that our recommendations will stand the test of time.

As a starting point, it is vital to control both the data and the code we are using. Version control systems like git are great for code and small datasets, but large datasets do not belong in git! Instead, consider storing datasets in a read-only folder in S3. There has been a recent wave of tools offering “version control for large data sets”, but to be honest I have yet to encounter one I prefer to S3.
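As a sketch of how this can look in practice (assuming boto3, with a hypothetical bucket and key), the analysis can begin every run by pulling its inputs from the read-only location rather than from a mutable local copy:

```python
import boto3

# Hypothetical bucket and key; the analysis always reads from this
# read-only location so every run starts from the same bytes.
BUCKET = "analysis-data"
KEY = "projects/churn-study/2024-01-15/transactions.csv"


def fetch_dataset(local_path: str = "transactions.csv") -> str:
    """Download the frozen input dataset at the start of the analysis."""
    s3 = boto3.client("s3")
    s3.download_file(BUCKET, KEY, local_path)
    return local_path


if __name__ == "__main__":
    print(f"Downloaded dataset to {fetch_dataset()}")
```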

Next, we need to control the libraries we are using. New library versions sometimes introduce breaking changes that make it hard to re-run old code, so we want to pin each library to an explicit version. Assuming we are using Python for our analysis, poetry makes this simple. Moreover, poetry records the SHA256 checksum of each package file, ensuring we install exactly the same artifact every time, which also provides some security benefits.
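poetry records these pins and checksums in its lock file. As a lightweight complement, a sketch like the following (the packages and version numbers are hypothetical placeholders for whatever poetry has locked) can make the analysis fail fast if it is re-run in an environment that has drifted:

```python
from importlib.metadata import version

# Hypothetical pins; in practice these mirror what poetry has locked.
EXPECTED_VERSIONS = {
    "pandas": "2.1.4",
    "numpy": "1.26.3",
}


def check_environment() -> None:
    """Raise if any installed library drifts from the version the analysis expects."""
    for package, expected in EXPECTED_VERSIONS.items():
        installed = version(package)
        if installed != expected:
            raise RuntimeError(
                f"{package}=={installed} is installed, but the analysis expects {expected}"
            )


if __name__ == "__main__":
    check_environment()
```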

We also need to make sure we use the same version of Python each time we run the analysis. New versions of Python tend to be backwards compatible, but it is still prudent to record the version the analysis was originally developed against, which pyenv makes easy.
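pyenv reads the expected interpreter version from a .python-version file checked into the repository. As an extra guard, the analysis itself can assert that it is running under that version; 3.11 below is a hypothetical choice:

```python
import sys

# Hypothetical version; keep this in sync with the .python-version file pyenv reads.
EXPECTED_PYTHON = (3, 11)

if sys.version_info[:2] != EXPECTED_PYTHON:
    raise RuntimeError(
        f"This analysis was developed on Python {EXPECTED_PYTHON[0]}.{EXPECTED_PYTHON[1]}, "
        f"but is running on {sys.version_info.major}.{sys.version_info.minor}"
    )
```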

For the ultimate in repeatability, use Docker! A Docker image captures the entire runtime environment, including system libraries, so the analysis behaves identically whether we run it on our laptops or in the cloud.

Correctness goes hand-in-hand with repeatability, so your analysis code should have test cases. I like to use pytest as my test runner and pytest-cov to check code coverage. A unit test framework also gives us an elegant way to verify repeatability: write a test case that re-runs the analysis and asserts the results against outputs stored in the repository (I like to store analysis results as JSON). If the test passes, the analysis results are the same! You can make this even easier with Docker Compose: have docker-compose up run pytest, and a single command gives you a completely repeatable analysis environment. You might also consider having Docker Compose build your analysis report, perhaps using Jinja and LaTeX.
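Here is a minimal sketch of such a test; run_analysis and expected_results.json are hypothetical stand-ins for your analysis entry point and the outputs committed alongside the code:

```python
import json
from pathlib import Path

# Hypothetical entry point for the analysis; assumed to return a dict of summary results.
from churn_study import run_analysis

# Results committed to the repository when the analysis was first run.
EXPECTED_PATH = Path(__file__).parent / "expected_results.json"


def test_analysis_is_repeatable():
    """Re-run the full analysis and compare against the stored results."""
    results = run_analysis()
    expected = json.loads(EXPECTED_PATH.read_text())
    # For floating-point outputs, consider pytest.approx instead of exact equality.
    assert results == expected
```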

Finally, Continuous Integration and Continuous Deployment (CI/CD) systems like those provided by GitLab and GitHub let you trigger actions whenever code is pushed or a merge request is opened. Consider building your analysis package in such a system, running the test cases to verify repeatability, and “deploying” the resulting package to a protected folder in S3 or storing it as an artifact associated with the repository.
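For the S3 option, the CI job's final step could invoke a small script along these lines; the bucket, prefix, and the idea of passing the built wheel's path on the command line are all hypothetical choices:

```python
import sys
from pathlib import Path

import boto3

# Hypothetical destination; write access would be restricted to the CI role.
BUCKET = "analysis-releases"
PREFIX = "churn-study"


def deploy_artifact(artifact_path: str) -> None:
    """Upload a built artifact (e.g. a wheel) to the protected release prefix."""
    key = f"{PREFIX}/{Path(artifact_path).name}"
    boto3.client("s3").upload_file(artifact_path, BUCKET, key)
    print(f"Uploaded {artifact_path} to s3://{BUCKET}/{key}")


if __name__ == "__main__":
    deploy_artifact(sys.argv[1])
```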

It might seem like overkill to use dependency managers, unit tests, Docker, and CI/CD for every analysis, but good templates pay dividends. I have found that integrating these repeatability principles into my analysis workflow adds very little overhead while giving me much greater confidence in the results I deliver.

