Spinning up PostgreSQL in Docker for Easy Analysis
My typical analysis workflow is to start with data in some kind of database, perhaps Redshift or Snowflake. Often I’m working with millions or even billions of rows, but modern databases excel at operating on data at that scale. Moreover, SQL is an intuitive and powerful tool for combining, filtering, and aggregating data. I’ll do as much of the work as I can in SQL, aggregating the data down as far as possible, then export it as a CSV and continue with more advanced statistical calculations in python.
But sometimes the source data is already in CSV! Pandas is a python library for working with tabular datasets. For the most part, anything you can do in SQL you can do in pandas, and there are lots of things you can do in pandas (using python) that would be hard or impossible to do in SQL. Sounds perfect, right? The problem with pandas is that simple things like joins and aggregations are way harder than they should be. The syntax is counter-intuitive. I’ve used pandas on a regular basis for five years and I still have to keep the documentation open for even simple tasks. I’m not alone! While I’ve met plenty of people who love pandas, I’ve never met anyone who is fluent in SQL and prefers pandas. To be clear, I still use pandas all the time, but if I can do something in either SQL or pandas, I’m going to use SQL.
If the source data is already in CSV, we just have to use pandas right? I mean, who has two thumbs and would be crazy enough to import the data into a database just to avoid having to use pandas?
Here’s my new analysis workflow for this situation: I spin up PostgreSQL in a Docker container, import the CSV file(s) into tables, and then proceed as normal. This seems crazy, but hear me out.
Firstly, I can do any data cleansing with SQL statements, which inevitably end up being easier for someone else to read than pandas code. Seriously, I’ve seen the code that even pandas experts write, and it isn’t pretty. The strong typing of SQL tables helps catch data corruption and missing values, and forces you to deal with them up front. I’ll often import the data into staging tables with VARCHAR columns, then transform and CAST my way into the final tables.
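As a rough sketch of that load-then-cast pattern (the table and column names here are invented for illustration), the SQL looks something like this:

-- Staging table: everything as text, so the load never fails on types
CREATE TABLE orders_staging (
    order_id   VARCHAR,
    order_date VARCHAR,
    amount     VARCHAR
);

-- Load the raw CSV (server-side COPY; use \copy from psql if the file lives on your machine)
COPY orders_staging FROM '/data/orders.csv' WITH (FORMAT csv, HEADER true);

-- Final table: strong types force bad rows to surface now, not mid-analysis
CREATE TABLE orders AS
SELECT
    CAST(order_id AS INTEGER)       AS order_id,
    CAST(order_date AS DATE)        AS order_date,
    CAST(amount AS NUMERIC(12, 2))  AS amount
FROM orders_staging;

If a value can’t be cast, the final CREATE TABLE fails loudly, which is exactly the point: the bad data surfaces before the analysis starts.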
Especially with a dataset of questionable origin, I like to scrutinize the data in a dozen different ways before I even start the main analysis. I can do this so much faster with simple SQL queries in a query runner like Postico or the command line than I can with pandas. These checks are ephemeral: unless I find something wrong, they aren’t even worth mentioning in the analysis but are nonetheless part of my due diligence. When I’m ready to do the analysis itself, I write what is usually a simple SQL query, replacing often convoluted pandas code.
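The checks themselves are nothing fancy; against the same hypothetical orders table they might be as simple as:

-- How many rows actually loaded?
SELECT COUNT(*) FROM orders;

-- Any duplicate keys?
SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1;

-- Do the dates cover the range I expect?
SELECT MIN(order_date), MAX(order_date) FROM orders;

-- How many rows are missing an amount?
SELECT COUNT(*) FROM orders WHERE amount IS NULL;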
I typically wrap my analysis code in Docker as well, so a simple

docker-compose up

spins up the Postgres container and cleans up the data, then runs the analysis in a separate container. This aids repeatability, since anyone can run that single command on any platform and verify the results are the same.
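For reference, here’s a minimal sketch of what such a docker-compose.yml can look like; the service names, image tag, and mount paths are illustrative assumptions, not a prescription. The official postgres image runs any .sql scripts mounted into /docker-entrypoint-initdb.d when the database is first initialized, which is a convenient place for the staging and cleansing statements.

# docker-compose.yml (minimal sketch; names, versions, and paths are illustrative)
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: analysis
    volumes:
      # SQL scripts here run automatically on first startup:
      # create staging tables, COPY the CSVs, build the final tables.
      - ./sql:/docker-entrypoint-initdb.d
      - ./data:/data
  analysis:
    build: ./analysis
    depends_on:
      # depends_on controls start order only; the analysis code should
      # retry its connection until Postgres is actually ready.
      - db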
Of course, this process might be overkill for some use cases. If you aren’t fluent in SQL, stick with pandas! (But really, you should master SQL.) If you aren’t comfortable with Docker, this can be a challenging approach (though I highly recommend all data scientists develop a mastery of Docker). And if the analysis is under a tight deadline, it might not be worth the time. But if you care about repeatability and clean, inspectable processes, give it a try!