Edgeworth Series in Python

I learned something cool today I wanted to jot down. I don’t have an immediate use case for it, but I also don’t want to forget it!

We often use distributions that can be reasonably approximated as Gaussian, typically due to the Central Limit Theorem. For example, if we have $n$ IID samples from some distribution $F$, then the sample average is approximately Gaussian. When $n$ is large (and the tails of the distribution are reasonable), the approximation is really good and there’s no point worrying about it. But if $n$ is modest, or even if $n$ is large but the underlying distribution is heavily skewed, the approximation may not be good enough.

If we have the cumulants of the distribution, we can use an Edgeworth series to improve the approximation. If instead we have the moments, we can convert them to cumulants and then use an Edgeworth series. If we don’t have these, we can estimate the moments, provided we can sample from the distribution or even from a related distribution. Specifically, suppose we can sample from a distribution $G$ with the property that if $Y \sim G$, then $g(Y) \sim F$, where $g$ is a known function. Then we can use a Monte Carlo technique to approximate $\mu_1 = \int x \, dF = \int g(y) \, dG$ as $\mu_1^\ast = \frac{1}{B}\sum_{i=1}^B g(y^{(i)})$, where $y^{(i)}$ is sampled from $G$. We can estimate the higher-order (central) moments as $\mu_k^\ast = \frac{1}{B}\sum_{i=1}^B \left( g(y^{(i)}) - \mu_1^\ast \right)^k$. By making $B$ sufficiently large, or by using more sophisticated Monte Carlo techniques, we can make the estimates as good as we want.
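Here’s a minimal sketch of what that looks like. I’m just making up an example where $G$ is a standard normal and $g(y) = e^y$ (so $F$ is log-normal); any sampler and transform pair works the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: G is a standard normal and g(y) = exp(y),
# so F is log-normal. Any sampler / transform pair works the same way.
def g(y):
    return np.exp(y)

B = 100_000
y = rng.standard_normal(B)   # draws from G
x = g(y)                     # transformed draws, distributed as F

mu1 = x.mean()                                            # first moment
central = [((x - mu1) ** k).mean() for k in range(2, 5)]  # mu_2, mu_3, mu_4
```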

Evgeni Burovski wrote some Python code for converting moments into cumulants, and cumulants into a distribution with the same API as the distributions in scipy.stats. Specifically, we can calculate the cdf, pdf, or survivor function. It looks like this code was merged into statsmodels at some point, but I don’t see any reference to it in the current docs!
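Here’s roughly how I’d expect to use it. Since the module isn’t documented, I’m assuming `statsmodels.distributions.edgeworth` still exposes `cumulant_from_moments` (raw moments in, one cumulant out) and `ExpandedNormal` (a cumulant sequence in, a scipy.stats-style distribution out); check the source before relying on this.

```python
from statsmodels.distributions.edgeworth import (
    cumulant_from_moments,
    ExpandedNormal,
)

# Illustrative target: a chi-square distribution with k degrees of freedom.
# Its raw moments are k, k(k+2), k(k+2)(k+4), k(k+2)(k+4)(k+6).
k = 30
raw_moments = [k, k * (k + 2), k * (k + 2) * (k + 4),
               k * (k + 2) * (k + 4) * (k + 6)]

# Convert raw moments to cumulants (assuming the j-th call returns the
# j-th cumulant), then build a scipy.stats-like distribution from them.
cumulants = [cumulant_from_moments(raw_moments, j) for j in range(1, 5)]
dist = ExpandedNormal(cumulants)

print(dist.pdf(30.0), dist.cdf(30.0), dist.sf(30.0))
```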

I came upon this while reading Theoretical Statistics by Cox and Hinkley. They discuss the Likelihood Ratio Test (LRT) at length for testing simple hypotheses. I’ll do a separate post on what I learned from this discussion, but one thing I was looking for in particular is a method for computing the power of the LRT. In simple cases, determining the distribution of the test statistic under the null and alternative hypotheses is straightforward, but they also provide guidance for more general cases. Specifically, when our data consist of IID samples $y^{(i)}$, (the log of) the test statistic is $\sum_{i = 1}^n \log \frac{f_A(y^{(i)})}{f_0(y^{(i)})}$, where $f_0$ and $f_A$ are the likelihood functions under $H_0$ and $H_A$, respectively, and large values are evidence against $H_0$. (Note that here we have specific distributions corresponding to the null and alternative hypotheses, not families of distributions, which is much more common in practice.)

In general, it may be difficult to derive the distribution of the test statistic under $H_0$ or $H_A$, but we can instead calculate or estimate the moments $\mu_{1;0} = n \int \log \frac{f_A(y)}{f_0(y)} f_0(y) \, dy$ and $\mu_{1;A} = n \int \log \frac{f_A(y)}{f_0(y)} f_A(y) \, dy$, and similarly for the higher-order moments. My intuition tells me there’s some structure here that can be exploited with more sophisticated Monte Carlo techniques, but I haven’t followed up on that.
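As a sketch, here’s the Monte Carlo moment estimation for a made-up simple-vs-simple pair: $f_0$ an exponential with rate 1 and $f_A$ an exponential with rate 2. Nothing hinges on that choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up simple-vs-simple pair: f0 = Exponential(rate 1), fA = Exponential(rate 2).
f0 = stats.expon(scale=1.0)
fA = stats.expon(scale=0.5)
B = 500_000  # Monte Carlo draws

def log_lr(y):
    """Per-observation contribution log( fA(y) / f0(y) )."""
    return fA.logpdf(y) - f0.logpdf(y)

def per_obs_moments(dist):
    """Mean and central moments (orders 2-4) of log_lr(Y) for Y ~ dist."""
    t = log_lr(dist.rvs(size=B, random_state=rng))
    m1 = t.mean()
    return [m1] + [((t - m1) ** r).mean() for r in (2, 3, 4)]

moments_H0 = per_obs_moments(f0)  # multiply by n to get mu_{1;0}, etc.
moments_HA = per_obs_moments(fA)  # multiply by n to get mu_{1;A}, etc.
```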

Given $f_0$ and $f_A$, we can estimate the moments of the distribution of the test statistic under $H_0$, convert them to cumulants, and evaluate the inverse survivor function of the resulting distribution at the desired significance level using Burovski’s code. This gives the critical value for assessing statistical significance. Or, based on the data, we can compute the observed value of the test statistic and evaluate the survivor function at that point to get a p-value. Most excitingly, we can plug the critical value into the survivor function corresponding to the distribution under the alternative hypothesis to calculate the power!
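Continuing that sketch (and still assuming the `ExpandedNormal` API described above), the critical value and power calculations might look like this. The only extra fact needed is that the cumulants of a sum of $n$ IID terms are $n$ times the per-observation cumulants.

```python
from statsmodels.distributions.edgeworth import ExpandedNormal

n = 20        # number of IID observations
alpha = 0.05  # significance level

def statistic_cumulants(per_obs, n):
    """Cumulants (orders 1-4) of the sum of n IID terms, given the
    per-observation mean and central moments [m1, mu2, mu3, mu4]."""
    m1, mu2, mu3, mu4 = per_obs
    return [n * m1, n * mu2, n * mu3, n * (mu4 - 3 * mu2 ** 2)]

null_dist = ExpandedNormal(statistic_cumulants(moments_H0, n))
alt_dist = ExpandedNormal(statistic_cumulants(moments_HA, n))

critical_value = null_dist.isf(alpha)  # reject H0 when the statistic exceeds this
power = alt_dist.sf(critical_value)    # probability of rejecting under H_A
print(critical_value, power)
```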

Since the sample size enters the moment calculations in such a clean way (the cumulants of the test statistic are just $n$ times the per-observation cumulants), it should be possible to calculate the minimum sample size needed to achieve a desired power without having to repeat the Monte Carlo procedure each time. But I haven’t explored that either.
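I haven’t verified this, but a scan over $n$ that reuses the per-observation moments from the sketch above might look something like:

```python
# Reuse the per-observation moments estimated once above; only the cheap
# Edgeworth calculations are repeated for each candidate sample size.
target_power = 0.8
for n in range(5, 201, 5):
    null_dist = ExpandedNormal(statistic_cumulants(moments_H0, n))
    alt_dist = ExpandedNormal(statistic_cumulants(moments_HA, n))
    if alt_dist.sf(null_dist.isf(alpha)) >= target_power:
        print(f"smallest n on the grid with power >= {target_power}: {n}")
        break
```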

If we use just the first two moments (mean and variance), this is really just finding the Gaussian approximation to the distribution, and in many cases that’s probably fine. But adding the skew and kurtosis could greatly improve the approximation in certain cases, and is not really any extra work!
