Monte Carlo methods are a class of statistical procedures that rely on random number generation to arrive at conclusions. There are many such methods, some involving quite elaborate simulation strategies. This post will focus on two simple, non-parametric methods and how to implement them in Python. These two methods address two of the most important questions in applied statistics:
- How certain can I be about an estimate?
- Is a difference between two estimates larger than expected by chance alone?
In addressing these two questions, I will also present Python implementations and illustrate how they can be used with pandas data frames.
The bootstrap: How certain can I be about an estimate?
The idea of the bootstrap is simple: use the empirical distribution as a replacement for the true underlying data distribution. What is the empirical distribution? It is simply the data that you observed! The bootstrap generates datasets that are similar to the observed dataset by sampling from it with replacement. The idea is that the standard deviation of estimates derived from these re-sampled datasets should be a good approximation to the true standard error of your estimator. Here is a Python function that implements a very simple form of the bootstrap:
import numpy as np

def bootstrap(data, estimator=np.mean, summary=np.std, nsamples=2000):
    # Draw nsamples bootstrap datasets (sampling observations with replacement),
    # apply the estimator to each, and summarize the variability of the results.
    n = len(data)
    samples = [estimator(data[np.random.randint(n, size=n)])
               for _ in range(nsamples)]
    return summary(samples)
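As a quick illustration on made-up data (the variable name rts and the numbers below are purely hypothetical), the default call returns a bootstrap estimate of the standard error of the mean:

rts = np.random.normal(400, 50, size=30)   # 30 simulated "reaction times" in ms
sem = bootstrap(rts)                       # bootstrap standard error of the mean
print(sem)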
This function takes a dataset as its first argument. In principle, this dataset can be anything in which observations can be indexed by integers along the first dimension: vectors, matrices, pandas data frames, ... The second argument is an estimator; this is a function that takes a dataset and returns a number that summarizes it. If no estimator is specified, the summary number will be the mean of the dataset, but it could be any other measure as well. In fact, the estimator could even produce a vector-valued output if the summary function is chosen accordingly (we won't discuss that any further here). The third argument is a function that summarizes the variability of the bootstrap samples. By default we summarize variability by the standard deviation, but another interesting choice might be based on np.percentile. Finally, nsamples specifies how many "fake" datasets will be derived from the original dataset to estimate the variability of the statistic computed by the estimator function.
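For example, a percentile-based summary turns the same function into a bootstrap confidence interval. The sketch below reuses the hypothetical rts array from above, and the helper name ci95 and the 95% interval are my own choices:

def ci95(samples):
    # lower and upper end of a 95% bootstrap percentile interval
    return np.percentile(samples, [2.5, 97.5])

lower, upper = bootstrap(rts, summary=ci95)

Because ci95 returns two numbers, this is also an example of the vector-valued output mentioned above.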
Today, data analysis in Python is usually done using pandas, and the bootstrap function above is well suited to be used with pandas' groupby-apply logic. This is illustrated in the following code snippet:
import pandas as pd

df = pd.DataFrame({
    'observer': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'iscorrect': [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
})
g = df.groupby('observer')
accuracy = pd.DataFrame(g.mean())
# .values passes a plain array, so the positional indexing inside bootstrap works
accuracy['sem'] = g.apply(lambda d: bootstrap(d['iscorrect'].values))
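Printing the resulting frame should give something like the following (the sem values are random and vary from run to run; with five 0/1 observations per observer I would expect them to land near 0.22):

print(accuracy)
#           iscorrect    sem
# observer
# 1               0.6  ~0.22
# 2               0.4  ~0.22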
Permutation tests: Is this difference larger than chance variability?
The second question mentioned above is about testing for differences. Ideally, we would like to use a similar approach here, one that doesn't require any assumptions about the true distribution of the data. Permutation tests satisfy this requirement. Say you have a vector \(x\) of \(n\) observations, where some observations were made under condition 1 and others were made under condition 2. We can summarize this association by a second vector \(C=(c_i)_{i=1}^n\), such that \(c_i = 1\) if \(x_i\) was observed under condition 1 and \(c_i = 2\) if \(x_i\) was observed under condition 2. A permutation test simply takes the vector \(C\), reshuffles it, and calculates the difference between conditions 1 and 2 for each of the reshuffled versions. This random reshuffling simulates a null hypothesis of the form: there is no difference between the two conditions. For small datasets, we can actually go through all possible permutations of \(C\) (see the exact variant sketched after the next code block), but for larger datasets that quickly becomes infeasible. Here is a Python implementation of a permutation test that simply generates a number of random permutations:
import numpy as np

def permutation_test(data, conditions, nsamples=2000):
    differences = np.empty(nsamples, 'd')
    # Observed difference between the two conditions
    observed = np.mean(data[conditions == 1]) - np.mean(data[conditions == 2])
    c = conditions.copy()
    for i in range(nsamples):
        # Reshuffle the condition labels and recompute the difference
        np.random.shuffle(c)
        differences[i] = np.mean(data[c == 1]) - np.mean(data[c == 2])
    # p value: fraction of reshuffled differences at least as large as the observed one
    return observed, np.mean(differences >= observed)
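As mentioned above, for a dataset as small as df it is feasible to enumerate every possible assignment of the condition labels instead of sampling them. Here is a sketch of such an exact variant; the name exact_permutation_test and the use of itertools.combinations are my own choices, not part of the function above:

from itertools import combinations

def exact_permutation_test(data, conditions):
    # Observed difference between the two conditions
    observed = np.mean(data[conditions == 1]) - np.mean(data[conditions == 2])
    n = len(data)
    n1 = np.sum(conditions == 1)
    count, total = 0, 0
    # Enumerate every way of choosing which observations get the condition-1 label
    for idx in combinations(range(n), n1):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        diff = np.mean(data[mask]) - np.mean(data[~mask])
        count += diff >= observed
        total += 1
    return observed, count / total

For the toy data frame above, exact_permutation_test(df['iscorrect'].values, df['observer'].values) enumerates all 252 label assignments and should return a p value of exactly 126/252 = 0.5, the value that the random version approximates.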
To apply this permutation test to the above data frame df, we can't directly use the elegance of pandas. Instead, we would simply do the following (note that we call this function with the underlying NumPy arrays, so that the condition labels can be shuffled in place and indexed as plain arrays):
permutation_test(df['iscorrect'].values, df['observer'].values)
This returns two values: the observed difference between conditions (in this case, observer 1 is 20 percentage points more accurate than observer 2) and a \(p\) value. The \(p\) value would be close to 0.5 for this example, and due to the random sampling it is subject to some small variability. Increasing the number of Monte Carlo samples (nsamples) will reduce this variability.
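A rough way to see how nsamples controls this variability: each reshuffle contributes a 0/1 indicator to the \(p\) estimate, so its Monte Carlo standard error is approximately \(\sqrt{p(1-p)/\mathrm{nsamples}}\) (treating the reshuffles as independent draws, which is a reasonable approximation here):

# Approximate Monte Carlo standard error of the estimated p value
p = 0.5                                    # roughly the true p value in this example
for nsamples in (2000, 20000):
    print(nsamples, np.sqrt(p * (1 - p) / nsamples))

With the default nsamples=2000 this is about 0.011, so the reported \(p\) value should rarely be off by more than a couple of percentage points.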