Thoughts on Models with Regularization
My very first data science project was a failure. I had just started at Tinder and I had been tasked with understanding the conversion behavior of Tinder’s recently launched subscription product, Tinder Plus. So I built a model that predicted whether a person would convert based on age, gender, how many swipes they had made, their “swipe-right ratio”, etc.
The model didn’t work at all. It predicted that no one would convert. It quickly became clear why: the overall conversion rate was quite low. The algorithm could achieve very high accuracy just by predicting that no one would convert, and it was a struggle to get it to change its mind.
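The arithmetic behind that failure is easy to reproduce. Here is a minimal sketch, assuming a made-up 2% conversion rate and simulated users (neither number comes from real Tinder data):

```python
import random

# Hypothetical setup: 10,000 simulated users with an assumed 2% conversion rate.
rng = random.Random(0)
n_users = 10_000
converted = [1 if rng.random() < 0.02 else 0 for _ in range(n_users)]

# A "classifier" that predicts nobody converts.
predicted = [0] * n_users

# Accuracy: share of users whose label was predicted correctly.
accuracy = sum(p == y for p, y in zip(predicted, converted)) / n_users

# Recall: share of actual converters the classifier managed to find.
n_converters = sum(converted)
found = sum(1 for p, y in zip(predicted, converted) if p == 1 and y == 1)
recall = found / n_converters if n_converters else 0.0

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The accuracy lands near one minus the conversion rate while the recall is exactly zero, which is why accuracy is a misleading target when the positive class is rare.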
The problem was that I was treating this as a classification problem, when I should have been treating it as a regression problem. (Actually, I should have been treating it as a causal inference problem, but it would be several years before I could appreciate that.) So I built a model that predicted the probability that a person would convert, based on various characteristics of that person. I used a simple logistic regression model because that’s what I had learned in school. (I would still use a logistic regression model today, at least as a starting point, because it is fast to fit and easy to interpret. Only once I felt I had squeezed every bit of performance out of logistic regression would I turn to something like XGBoost.)
At least this model didn’t issue ridiculous predictions like “everyone has a 0% chance of converting”. But it did make roughly the same prediction for everyone: the overall average conversion rate. I had hoped it would say, “oh, this person has an 80% chance of converting, and this other person has a 0.01% chance of converting”.
It got worse when I started making plots. It said that the 80 year olds on Tinder were the most likely to convert. (That was before I learned how to clean the data; some of the ages were definitely bogus!) It took me a long time to realize that, in their simplest form (each feature entering the model linearly), logistic regression models are monotonic in each feature. If the fitted relationship between age and conversion rate is positive, so that a 28 year old is predicted to be more likely to convert than a 21 year old, then an 80 year old is predicted to be even more likely to convert!
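The monotonicity is visible directly in the formula: the log-odds is a straight line in age, and the sigmoid preserves order. A minimal sketch, with made-up coefficients that are purely illustrative:

```python
import math


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted coefficients (illustrative only, not from any real model).
INTERCEPT = -5.0
AGE_COEF = 0.03


def predicted_conversion(age):
    # Age enters the log-odds linearly, so with a positive coefficient the
    # predicted probability can only go up as age goes up.
    return sigmoid(INTERCEPT + AGE_COEF * age)


ages = [21, 28, 40, 60, 80]
rates = [predicted_conversion(age) for age in ages]
for age, rate in zip(ages, rates):
    print(f"age {age}: predicted conversion {rate:.4f}")
```

Whatever positive value the age coefficient takes, the oldest age in the data always gets the highest prediction.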
Intuitively, it seemed more plausible that the relationship between age and conversion rate was non-monotonic: it might increase as people got a bit older and had more money to burn, but then decrease past a certain point. But simple logistic regression models can’t capture this. I was lucky enough to take Rob Tibshirani’s class at Stanford which covered Generalized Additive Models, so I decided to try smoothing splines. And wow did that make a difference. The pictures looked way more plausible (conversion rate increased with age up to a certain point and then decreased – those 80 year olds weren’t converting), and more importantly the predictions were much more varied: some people were much more likely to convert than others.
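To see why the extra flexibility matters, here is a minimal sketch that swaps in a quadratic age term as a crude stand-in for a smoothing spline. This is not what a GAM actually fits, and the coefficients are invented for illustration; the point is only that once the log-odds is allowed to bend, the predicted rate can rise and then fall.

```python
import math


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (illustrative only).
INTERCEPT = -6.0
AGE_COEF = 0.2
AGE_SQ_COEF = -0.003  # negative curvature lets the curve turn back down


def predicted_conversion(age):
    log_odds = INTERCEPT + AGE_COEF * age + AGE_SQ_COEF * age ** 2
    return sigmoid(log_odds)


# The log-odds is a downward-opening parabola, so it peaks where its
# derivative AGE_COEF + 2 * AGE_SQ_COEF * age equals zero.
peak_age = -AGE_COEF / (2 * AGE_SQ_COEF)
print(f"predicted conversion peaks near age {peak_age:.1f}")

for age in [21, 28, 33, 50, 80]:
    print(f"age {age}: predicted conversion {predicted_conversion(age):.5f}")
```

A smoothing spline does the same job without committing to a parabola: it lets the data choose the shape, subject to a smoothness penalty.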
I had included state (e.g. California or Nebraska) as a feature, but the predictions for the less populated states were all over the place. I needed regularization. I found a paper co-authored by Stephen Boyd on the Network Lasso and it seemed like exactly what I was looking for (it seems like whenever I’m stuck I’m able to find something, either in the Convex Optimization book or on Prof. Boyd’s list of publications, that gets me unstuck). The basic idea is that neighboring states would have their predictions smoothed together. A state like California had enough data that whatever the observed conversion rate was, that’s what would be predicted, but a state like Nebraska would inherit the conversion rate from neighboring states. I made some really pretty choropleth maps back then!
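The Network Lasso itself is a convex optimization problem and is not reproduced here. As a much simpler stand-in that shows the same qualitative behavior, the sketch below shrinks each state's observed rate toward the average of its neighbors, with the amount of shrinkage depending on how much data the state has. The counts and the `PRIOR_STRENGTH` tuning parameter are all assumptions made up for illustration.

```python
# Hypothetical (users, conversions) counts per state; not real data.
observed = {
    "California": (200_000, 4_000),
    "Nevada": (5_000, 150),
    "Oregon": (8_000, 200),
    "Nebraska": (300, 12),
    "Iowa": (2_000, 36),
    "Kansas": (1_500, 30),
}

# A partial neighbor list, just enough for the two states of interest.
neighbors = {
    "California": ["Nevada", "Oregon"],
    "Nebraska": ["Iowa", "Kansas"],
}

# Assumed tuning parameter: how many "pseudo-users" of weight the
# neighbors' average rate receives.
PRIOR_STRENGTH = 1_000


def raw_rate(state):
    users, conversions = observed[state]
    return conversions / users


def neighbor_mean(state):
    rates = [raw_rate(s) for s in neighbors[state]]
    return sum(rates) / len(rates)


def smoothed_rate(state):
    # Weighted average of the state's own data and its neighbors' average.
    users, conversions = observed[state]
    pseudo_conversions = PRIOR_STRENGTH * neighbor_mean(state)
    return (conversions + pseudo_conversions) / (users + PRIOR_STRENGTH)


for state in ["California", "Nebraska"]:
    print(
        f"{state}: raw {raw_rate(state):.4f}, "
        f"neighbors {neighbor_mean(state):.4f}, "
        f"smoothed {smoothed_rate(state):.4f}"
    )
```

With these numbers California's smoothed rate barely moves from its observed rate, while Nebraska's is pulled most of the way toward its neighbors.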
I had started to learn more about statistical inference and I knew point estimates weren’t enough. I also needed some way of capturing the uncertainty associated with the model’s predictions. I needed confidence intervals, so I implemented the bootstrap (which I also learned about in Rob Tibshirani’s class!). I noticed that the more regularization I used, the narrower the confidence intervals became. That didn’t seem right: why should there be less uncertainty associated with the predictions just because I was using more regularization?
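The effect can be reproduced with a toy estimator. The sketch below uses a ridge-style shrunken mean as a one-parameter stand-in for a regularized model (an assumption for illustration, not the model described here) and computes percentile bootstrap intervals at several penalty levels on simulated data.

```python
import random


def shrunken_mean(sample, penalty):
    # Minimizer of sum((x - m)^2) + penalty * m^2, i.e. a ridge-style mean.
    # penalty = 0 gives the ordinary sample mean.
    return sum(sample) / (len(sample) + penalty)


def bootstrap_interval(sample, penalty, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the shrunken mean."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        shrunken_mean(rng.choices(sample, k=n), penalty) for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated data (hypothetical): 50 noisy observations with true mean 1.0.
data_rng = random.Random(1)
data = [data_rng.gauss(1.0, 2.0) for _ in range(50)]

widths = {}
for penalty in [0, 10, 100]:
    lower, upper = bootstrap_interval(data, penalty)
    widths[penalty] = upper - lower
    print(f"penalty {penalty:>3}: interval width {widths[penalty]:.3f}")
```

The intervals shrink as the penalty grows because the bootstrap measures how much the estimator varies across resamples, and a heavily regularized estimator varies less. It says nothing about the bias that the regularization introduces, which is one way to read the puzzle above.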
Notably, there were no implementations of the Network Lasso in Python at the time. Also I was just starting my career as a data scientist and was eager to prove myself. So I implemented my own. It used the Alternating Direction Method of Multipliers, as described in the paper, A Distributed Algorithm for Fitting Generalized Additive Models by E. Chu, A. Keshavarz, and S. Boyd, to fit generalized additive models with both continuous and categorical features. It was my first Python library!
I called it gamdist because it fit Generalized Additive Models in a DISTributed way. And then when I left Tinder, they agreed to open source it. I continued to work on it for a bit after I left, but I just wasn’t sure what to do with it.
I had learned a lot about Machine Learning and Convex Optimization in my Master’s Program, but not much about classical statistics. This whole project made me realize how badly I needed to learn this stuff. I started reading textbook after textbook (the first one I read was All of Statistics by Larry Wasserman; it’s a good first book!). I slowly developed a mastery of statistical hypothesis testing, power analysis, and interval estimation.
And recently I’ve been thinking about regularization again. As I wrote in my post on Empirical Bayes, it’s clear to me that regularization improves point estimates. Even in my A/B tests, I think some flavor of regularization (such as Empirical Bayes) is called for. I’m disinclined to change the way I calculate confidence intervals, but improved point estimates are welcome. To be explicit, I used to think that if I was using regularization for my point estimates, I also needed to use regularization in my confidence intervals. I no longer think that’s true. Separating the two sidesteps the problem I had previously, where my models seemed to have too little uncertainty.
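As a small illustration of the claim about point estimates, here is a sketch using the positive-part James-Stein estimator, one classic shrinkage rule with an Empirical Bayes interpretation. It is chosen for convenience and is not necessarily the method from the Empirical Bayes post; the data are simulated, with the noise level assumed known.

```python
import random

rng = random.Random(0)

# Simulated setup (hypothetical): 50 experiments, each with a true effect
# and one noisy measurement of it with known noise standard deviation.
k = 50
noise_sd = 1.0
true_effects = [rng.gauss(0.0, 1.0) for _ in range(k)]
observed = [effect + rng.gauss(0.0, noise_sd) for effect in true_effects]

grand_mean = sum(observed) / k
spread = sum((x - grand_mean) ** 2 for x in observed)

# Positive-part James-Stein shrinkage toward the grand mean. The factor is
# estimated from the data: the noisier the measurements are relative to the
# spread of the observations, the harder we shrink.
shrink = max(0.0, 1.0 - (k - 3) * noise_sd ** 2 / spread)
shrunk = [grand_mean + shrink * (x - grand_mean) for x in observed]


def mse(estimates):
    """Mean squared error against the (simulated) true effects."""
    return sum((e - t) ** 2 for e, t in zip(estimates, true_effects)) / k


print(f"shrinkage factor: {shrink:.3f}")
print(f"MSE of raw estimates:    {mse(observed):.3f}")
print(f"MSE of shrunk estimates: {mse(shrunk):.3f}")
```

In a real A/B testing setting the true effects are unknown, so this comparison is only possible in simulation.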
I also think that the sparse flavors of regularization, as discussed in Statistical Learning with Sparsity by Trevor Hastie, Robert Tibshirani, and Martin Wainwright, are especially valuable as a method of model selection. This is more or less the same question that a p-value addresses. A p-value is a disciplined way of answering the question: do I really think this feature is associated with the response? And I think for many years I was persuaded that a p-value is the One True Way of answering that question. But in high dimensions, a lasso-type estimator has strong theoretical guarantees, too. Why shouldn’t that be just as rigorous as a p-value?
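The way a lasso-type estimator performs model selection is easiest to see in the special case of an orthonormal design, where the lasso solution is just the least-squares coefficient passed through a soft-threshold. A minimal sketch under that assumption, with invented coefficients and feature names borrowed loosely from the conversion model described earlier:

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, snapping to exactly zero inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical least-squares coefficients for an orthonormal design.
ols_coefs = {
    "age": 0.80,
    "swipes": -0.45,
    "swipe_right_ratio": 0.05,
    "state": -0.02,
}

penalty = 0.10  # assumed regularization strength

# With an orthonormal design, minimizing
#   0.5 * ||y - X b||^2 + penalty * ||b||_1
# reduces to soft-thresholding each least-squares coefficient separately.
lasso_coefs = {name: soft_threshold(b, penalty) for name, b in ols_coefs.items()}
selected = [name for name, b in lasso_coefs.items() if b != 0.0]

for name, b in lasso_coefs.items():
    print(f"{name}: {b:+.2f}")
print("selected features:", selected)
```

Coefficients smaller in magnitude than the penalty are set exactly to zero, so the penalty plays a role loosely analogous to a significance threshold: it decides which features stay in the model.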
More recently my work has shifted away from traditional A/B testing. I’m incorporating more techniques from observational causal inference for things like heterogeneous treatment effect estimation (still within the context of A/B testing, just much more interesting than a simple t-test). Model selection is much more important here than in simple A/B tests. So I’m excited to get caught up on the latest developments in regularization and start playing with generalized additive models again!
I’ve started to explore the Python ecosystem for these types of models and came across yaglm. I’ve reached out to the author, Iain, who shares my enthusiasm for this type of modeling. I’m hoping to migrate the best parts of gamdist to yaglm when I get a chance.
I spent the last several years building the foundational statistics knowledge to complement the Machine Learning and Convex Optimization I learned in school. But now I’m excited to turn back to some of the modern developments in statistics, especially regarding high-dimensional models, with, of course, a causal interpretation.