# Hypothesis testing for difference in Pearson / Spearman correlations

The Pearson and Spearman correlation coefficients measure how closely two variables are correlated. They’re useful as an evaluation metric in certain machine learning tasks, when you want the model to predict some kind of score, but the actual value of the score is arbitrary, and you only care that the model puts high-scoring items above low-scoring items.

An example of this is the STS-B task in the GLUE benchmark: the task is to rate pairs of sentences on how similar they are. The task is evaluated using Pearson and Spearman correlations against the human ground-truth. Now, if model A has Spearman correlation of 0.55 and model B has 0.51, how confident are you that model A is actually better?

Recently, the NLP research community has advocated for more significance testing (Dror et al., 2018): report a p-value when comparing two models, to distinguish between true improvements and fluctuations due to random chance. However, hypothesis testing is rarely done for Pearson and Spearman metrics — it’s not mentioned in the hitchhiker’s guide linked above, and not supported by the standard ML libraries in Python and R. In this post, I describe how to do significance testing for a difference in Pearson / Spearman correlations, and give some references to the statistics literature.

## Definitions and properties

The Pearson correlation coefficient is defined by: $r_{xy} = \frac{\sum_i^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_i^n (x_i - \bar{x})^2} \sqrt{\sum_i^n (y_i - \bar{y})^2}}$

The Spearman rank-order correlation is defined as the Pearson correlation between the ranks of the two variables, and measures the relative order between them. Both correlation coefficients range between -1 and 1.

Pearson’s correlation is simpler, has nicer statistical properties, and is default option in most software packages. However, de Winter et al. (2016) argues that Spearman’s correlation works better with non-normal data and is more robust to outliers, so is generally preferred over Pearson’s correlation.

## Significance testing

Suppose we have the predictions of model A and model B, and we wish to compute a p-value for whether their Pearson / Spearman correlation coefficients are different. Start by computing the correlation coefficients for both models against the ground truth.

Then, apply the Fisher transformation to each correlation coefficient: $z = \frac{1}{2} \log(\frac{1+r}{1-r})$

This transforms r which is between -1 and 1 into z, which ranges the whole real number line. It turns out that z is approximately normal, with nearly constant variance that only depends on N (the number of data points) and not on r.

For Pearson correlation, the standard deviation of the estimator $\hat r_p$ is given by: $\mathrm{SD}(\hat r_p) = \sqrt{\frac{1}{N-3}}$

For Spearman rank-order correlation, the standard deviation of the estimator $\hat r_s$ is given by: $\mathrm{SD}(\hat r_s) = \sqrt{\frac{1.060}{N-3}}$

Now, we can compute the p-value because the difference of the two z values follows a normal distribution with known variance.

## R implementation

The following R function computes a p-value for the two-tailed hypothesis test, given a ground truth vector and two model output vectors:

cor_significance_test <- function(truth, x1, x2, method="pearson") {
n <- length(truth)
cor1 <- cor(truth, x1, method=method)
cor2 <- cor(truth, x2, method=method)
fisher1 <- 0.5*log((1+cor1)/(1-cor1))
fisher2 <- 0.5*log((1+cor2)/(1-cor2))
if(method == "pearson") {
expected_sd <- sqrt(1/(n-3))
}
else if(method == "spearman") {
expected_sd <- sqrt(1.060/(n-3))
}
2*(1-pnorm(abs(fisher1-fisher2), sd=expected_sd))
}


Naturally, the one-tailed p-value is half of the two-sided one.

For details of other similar computations involving Pearson and Spearman correlations (eg: confidence intervals, unpaired hypothesis tests), I recommend the Handbook of Parametric and Nonparametric Statistical Procedures (Sheskin, 2000).

## Caveats and limitations

The formula for the Pearson correlation is solid and very accurate. For Spearman, the constant 1.060 has no theoretical backing and was rather derived experimentally by Fieller et al. (1957), by running simulations using variables from a bivariate normal distribution. Fieller claimed that the approximation was accurate for correlations between -0.8 and 0.8. Borkowf (2002) warns that this approximation may be off if the distribution is far from a bivariate normal.

The procedure here for Spearman correlation may not be appropriate if the correlation coefficient is very high (above 0.8) or if the data is not approximately normal. In that case, you might want to try permutation tests or bootstrapping methods — refer to Bishara and Hittner (2012) for a detailed discussion.

## References

1. Dror, Rotem, et al. “The hitchhiker’s guide to testing statistical significance in natural language processing.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.
2. de Winter, Joost CF, Samuel D. Gosling, and Jeff Potter. “Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data.” Psychological methods 21.3 (2016): 273.
3. Sheskin, David J. “Parametric and nonparametric statistical procedures.” Chapman & Hall/CRC: Boca Raton, FL (2000).
4. Fieller, Edgar C., Herman O. Hartley, and Egon S. Pearson. “Tests for rank correlation coefficients. I.” Biometrika 44.3/4 (1957): 470-481.
5. Borkowf, Craig B. “Computing the nonnull asymptotic variance and the asymptotic relative efficiency of Spearman’s rank correlation.” Computational statistics & data analysis 39.3 (2002): 271-286.
6. Bishara, Anthony J., and James B. Hittner. “Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches.” Psychological methods 17.3 (2012): 399.

# Useful properties of ROC curves, AUC scoring, and Gini Coefficients

Receiver Operating Characteristic (ROC) curves and AUC values are often used to score binary classification models in Kaggle and in papers. However, for a long time I found them fairly unintuitive and confusing. In this blog post, I will explain some basic properties of ROC curves that are useful to know for Kaggle competitions, and how you should interpret them. Above: Example of a ROC curve

First, the definitions. A ROC curve plots the performance of a binary classifier under various threshold settings; this is measured by true positive rate and false positive rate. If your classifier predicts “true” more often, it will have more true positives (good) but also more false positives (bad). If your classifier is more conservative, predicting “true” less often, it will have fewer false positives but fewer true positives as well. The ROC curve is a graphical representation of this tradeoff.

A perfect classifier has a 100% true positive rate and 0% false positive rate, so its ROC curve passes through the upper left corner of the square. A completely random classifier (ie: predicting “true” with probability p and “false” with probability 1-p for all inputs) will by random chance correctly classify proportion p of the actual true values and incorrectly classify proportion p of the false values, so its true and false positive rates are both p. Therefore, a completely random classifier’s ROC curve is a straight line through the diagonal of the plot.

The AUC (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between. The range of possible AUC values is [0, 1]. However, if your AUC is below 0.5, that means you can invert all the outputs of your classifier and get a better score, so you did something wrong.

The Gini Coefficient is 2*AUC – 1, and its purpose is to normalize the AUC so that a random classifier scores 0, and a perfect classifier scores 1. The range of possible Gini coefficient scores is [-1, 1]. If you search for “Gini Coefficient” on Google, you will find a closely related concept from economics that measures wealth inequality within a country.

Why do we care about AUC, why not just score by percentage accuracy?

AUC is good for classification problems with a class imbalance. Suppose the task is to detect dementia from speech, and 99% of people don’t have dementia and only 1% do. Then you can submit a classifier that always outputs “no dementia”, and that would achieve 99% accuracy. It would seem like your 99% accurate classifier is pretty good, when in fact it is completely useless. Using AUC scoring, your classifier would score 0.5.

In many classification problems, the cost of a false positive is different from the cost of a false negative. For example, it is worse to falsely imprison an innocent person than to let a guilty criminal get away, which is why our justice system assumes you’re innocent until proven guilty, and not the other way around. In a classification system, we would use a threshold rule, where everything above a certain probability is treated as 1, and everything below is treated as 0. However, deciding on where to draw the line requires weighing the cost of a false positive versus a false negative — this depends on external factors and has nothing to do with the classification problem.

AUC scoring lets us evaluate models independently of the threshold. This is why AUC is so popular in Kaggle: it enables competitors to focus on developing a good classifier without worrying about choosing the threshold, and let the organizers choose the threshold later.

(Note: This isn’t quite true — a classifier can sometimes be better at certain thresholds and worse at other thresholds. Sometimes it’s necessary to combine classifiers to get the best one for a particular threshold. Details in the paper linked at the end of this post.)

Next, here’s a mix of useful properties to know when working with ROC curves and AUC scoring.

AUC is not directly comparable to accuracy, precision, recall, or F1-score. If your model is achieving 0.65 AUC, it’s incorrect to interpret that as “65% accurate”. The reason is that AUC exists independently of a threshold and is immune to class imbalance, whereas accuracy / precision / recall / F1-score do require you picking a threshold, so you’re measuring two different things.

Only relative order matters for AUC score. When computing ROC AUC, we predict a probability for each data point, sort the points by predicted probability, and evaluate how close is it from a perfect ordering of the points. Therefore, AUC is invariant under scaling, or any transformation that preserves relative order. For example, predicting [0.03, 0.99, 0.05, 0.06] is the same as predicting [0.15, 0.92, 0.89, 0.91] because the relative ordering for the 4 items is the same in both cases.

A corollary of this is we can’t treat outputs of an AUC-optimized model as the likelihood that it’s true. Some models may be poorly calibrated (eg: its output is always between 0.3 and 0.32) but still achieve a good AUC score because its relative ordering is correct. This is something to look out for when blending together predictions of different models.

That’s my summary of the most important properties to know about ROC curves. There’s more that I haven’t talked about, like how to compute AUC score. If you’d like to learn more, I’d recommend reading “An introduction to ROC analysis” by Tom Fawcett.

# What if math contests were scored using Principal Component Analysis?

In many math competitions, all problems are weighted equally, even though the problems have very different difficulties. Sometimes, the harder problems are weighted more. But how should we assign weights to each problem?

Usually, the organizers make up weights based on how difficult they believe the problems are. However, they sometimes misjudge the difficulty of problems. Wouldn’t it be better if the weightings were determined from data? Let’s try Principal Component Analysis!

Principal Component Analysis (PCA) is a statistical procedure that finds a transformation of the data that maximizes the variance. In our case, the first principal component gives a relative weighting of the problems that maximizes the variance of the total scores. This makes sense because we want to separate the good and bad students in a math contest.

## IMO 2017 Data

The International Mathematics Olympiad (IMO) is an annual math competition for top high school students around the world. It consists of six problems, divided between two days: on each day, contestants are given 4.5 hours to solve three problems.

Here are the 2017 problems, if you want to try them. Above: Score distribution for IMO 2017

This year, 615 students wrote the IMO. Problems 1 and 4 were the easiest, with the majority of contestants receiving full scores. Problems 3 and 6 were the hardest: only 2 students solved the third problem. Problems 2 and 5 were somewhere in between.

This is a good dataset to play with, because the individual results show what each student scored for every problem.

## Derivation of PCA for the 1-dimensional case

Let $X$ be a matrix containing all the data, where each column represents one problem. There are 615 contestants and 6 problems so $X$ has 615 rows and 6 columns.

We wish to find a weight vector $\vec u \in \mathbb{R}^{6 \times 1}$ such that the variance of $X \vec u$ is maximized. Of course, scaling up $\vec u$ by a constant factor also increases the variance, so we need the constraint that $| \vec u | = 1$.

First, PCA requires that we center $X$ so that the mean for each of the problems is 0, so we subtract each column by its mean. This transformation shifts the total score by a constant, and doesn’t affect the relative weights of the problems.

Now, $X \vec u$ is a vector containing the total scores of all the contestants; its variance is the sum of squares of its elements, or $| X \vec u |^2$.

To maximize $|X \vec u |^2$ subject to $|\vec u| = 1$, we take the singular value decomposition of $X = U \Sigma V^T$. Then, the leftmost column of $V$ (corresponding to the largest singular value) gives $\vec u$ that maximizes $| X \vec u|^2$. This gives the first principal axis, and we are done.

## Experiments

Running PCA on the IMO 2017 data produced interesting results. After re-scaling the weights so that the minimum possible score is 0 and the maximum possible score is 42 (to match IMO’s scoring), PCA recommends the following weights:

• Problem 1: 9.15 points
• Problem 2: 9.73 points
• Problem 3: 0.15 points
• Problem 4: 15.34 points
• Problem 5: 5.59 points
• Problem 6: 2.05 points

This is the weighting that produces the highest variance. That’s right, solving the hardest problem in the history of the IMO would get you a fraction of 1 point. P4 had the highest variance of the six problems, so PCA gave it the highest weight. The scores and rankings produced by the PCA scheme are reasonably well-correlated with the original scores. Students that did well still did well, and students that did poorly still did poorly. The top students that solved the harder problems (2, 3, 5, 6) usually also solved the easier problems (1 and 2). The students that would be the unhappiest with this scheme are a small number of people who solved P3 or P6, but failed to solve P4.

Here’s a comparison of score distributions with the original and PCA scheme. There is a lot less separation between the best of the best students and the middle of the pack. It is easy to check that PCA does indeed produce higher variance than weighing all six problems equally. Now, let me comment on the strange results.

It’s clearly absurd to give 0.15 points to the hardest problem on the IMO, and make P4, a much easier problem, be worth 100 times more. But it makes sense from PCA’s perspective. About 99% of the students scored zero on P3, so its variance is very low. Given that PCA has a limited amount of weight to “spend” to increase the total variance, it would be wasteful to use much of it on P3.

The PCA score distribution has less separation between the good students and the best students. However, by giving a lot of weight to P1 and P4, it clearly separates mediocre students that solve one problem from the ones who couldn’t solve anything at all.

In summary, scoring math contests using PCA doesn’t work very well. Although it maximizes overall variance, math contests are asymmetrical as we care about differentiating between the students on the top end of the spectrum.

## Source Code

If you want to play with the data, I uploaded it as a Kaggle dataset.

The code for this analysis is available here.

# On Multiple Hypothesis Testing and the Bonferroni Correction

When you’re doing a statistical analysis, it’s easy to run into the multiple comparisons problem.

Imagine you’re analyzing a dataset. You perform a bunch of statistical tests, and one day you get a p-value of 0.02. This must be significant, right? Not so fast! If you tried a lot of tests, then you’ve fallen into the multiple comparisons fallacy — the more tests you do, the higher chance you get a p-value < 0.05 by pure chance.

Here’s an xkcd comic that illustrates this: They conducted 20 experiments and got a p-value < 0.05 on one of the experiments, thus concluding that green jelly beans cause acne. Later, other researchers will have trouble replicating their results — I wonder why?

What should they have done differently? Well, if they knew about the Bonferroni Correction, they would have divided the p-value 0.05 by the number of experiments, 20. Then, only a p-value smaller than 0.0025 is a truly significant correlation between jelly beans and acne.

Let’s dive in to explain why this division makes sense.

## Šidák Correction

Time for some basic probability. What’s the chance that the scenario in the xkcd comic would happen? In other words, if we conduct 20 experiments, each with probability 0.05 of producing a significant p-value, then how likely will at least one of the experiments produce a significant p-value? Assume all the experiments are independent.

The probability of an experiment not being significant is $1 - 0.05$, so the probability of all 20 experiments not being significant is $(1-0.05)^{20}$. Therefore the probability of at least 1 of 20 experiments being significant is $1 - (1-0.05)^{20} = 0.64$. Not too surprising now, isn’t it?

We want the probability of accidentally getting a significant p-value by chance to be 0.05, not 0.64 — the definition of p-value. So flip this around — we need to find an adjusted p-value $p_{adj}$ to give an overall p-value 0.05: $1 - (1 - p_{adj})^{20} = 0.05$

Solving for $p_{adj}$: $p_{adj} = 1 - 0.95^{1/20} \approx 0.00256$

Okay, this seems reasonably close to 0.0025. In general, if the overall p-value is $p$ and we are correcting for $N$ comparisons, then $p_{adj} = 1 - (1 - p)^{1/N}$

This is known as the Šidák Correction in literature.

## Bonferroni Correction

Šidák’s method works great, but eventually people started complaining that Šidák’s name had too many diacritics and looked for something simpler (also, it used to be difficult to compute nth roots back when they didn’t have computers). How can we approximate this formula?

Approximate? Use Taylor series, of course!

Assume $N$ is constant, and define: $f(p) = 1 - (1-p)^{1/N}$

We take the first two terms of the Taylor series of $f(p)$ centered at 0: $f(p) = f(0) + f'(0)p + O(p^2)$

Now $f(0) = 0$ and $f'(p) = \frac{1}{N} (1-p)^{-(N-1)/N}$ so $f'(0) = \frac1N$. Therefore, $f(p) = p_{adj} \approx \frac{p}{N}.$

That’s the derivation for the Bonferroni Correction.

Since we only took the first two terms of the Taylor series, this produces a $p_{adj}$ that’s slightly lower than necessary.

In the real world, $p$ is close to zero, so in practice it makes little difference whether we use the exact Šidák Correction or the Bonferroni approximation.

That’s it for now. Next time you do multiple comparisons, just remember to divide your p-value by $N$. Now you know why.

# Simple models in Kaggle competitions

This week I participated in the Porto Seguro Kaggle competition. Basically, you’re asked to predict a binary variable — whether or not an insurance claim will be filed — based on a bunch of numerical and categorical variables.

With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to beginning data scientists.

Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.

The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.

Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.

At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite being within 10% of the top submission in overall accuracy, I was still in the bottom half of the leaderboard, ranking in the 30th percentile.

The public leaderboard looked like this: Above: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. Line in green is my submission, scoring 0.265.

This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything that was not “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but afterwards, any further improvements is limited.

The top of the leaderboard is very crowded, with over 1000 teams having the score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat. Above: To run 90% as fast as Usain Bolt, you need to run 100 meters in 10.5 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().

This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, at least the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then, 5000 smart people working for 2 months can only improve it by 10%. Perhaps this is obvious to someone doing machine learning long enough, but we should look back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.

Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.

Read further discussion of this post on the Kaggle forums!

# What’s the difference between Mathematics and Statistics?

Statistics has a sort of funny and peculiar relationship with mathematics. In a lot of university departments, they’re lumped together and you have a Department of Mathematics and Statistics”. Other times, it’s grouped as a branch in applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking. Above: if pure math and statistics were like games

## Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so you know it when you see it”; things are rarely so black-and-white like in mathematics. This is born out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take for example the concept of an outlier”. Many statistical methods behave badly when the data contains outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far it is from the rest of the points, and what kind of model you’re fitting. In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.

For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B-test and find that changing a button to blue results in higher clicks, with p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There is no correct answer, you have to use your judgment.

Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and if this assumption is violated then you’ll get wrong results, so you need to use a different model. Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer, you have to use your judgment.

Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this: There is some evidence of non-linearity, but how much bendiness” can you accept before the data is definitely not roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.

I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike math, there is no universal procedure that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

• A random variable is drawn from a normal (Gaussian) distribution
• Two random variables are independent
• Two random variables satisfy a linear relationship
• Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belongs in this category.

## Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how do you be sure anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. The best known approximation algorithm for the Traveling Salesman problem has an approximation ratio of 1.5 – this means that even for the worst possible input, the algorithm gives a path that’s no more than 1.5 times longer than the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions. Above: the winning neural network architecture for ImageNet Challenge 2012. Currently, theory fails at explaining why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data, but is not concerned with how much data you need to achieve a desired accuracy rate.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.

## Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better – both approaches have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every value has a confidence interval where you can expect it to be right about 95% of the time, but we can never be 100% sure about anything. But given enough data, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data that it’s trained on. In other words, if we use a regression model with data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.

# The Power Law Distribution and the Harsh Reality of Language Learning

I’m an avid language learner, and sometimes people ask me: “how many languages do you speak?” If we’re counting all the languages in which I can have at least a basic conversation, then I can speak five languages — but can I really claim fluency in a language if I can barely read children’s books? Despite being a seemingly innocuous question, it’s not so simple to answer. In this article, I’ll try to explain why.

Let’s say you’re just starting to study Japanese. You might picture yourself being able to do the following things, after a few months or years of study:

1. Have a conversation with a Japanese person who doesn’t speak any English
2. Watch the latest episode of some anime in Japanese before the English subtitles come out
3. Overhear a conversation between two Japanese people in an elevator

After learning several languages, I discovered that the first task is a lot easier than the other two, by an order of magnitude. Whether in French or in Japanese, I would quickly learn enough of the language to talk to people, but the ability to understand movies and radio remains elusive even after years of study.

There is a fundamental difference in how language is used in one-on-one conversation versus the other two tasks. When conversing with a native speaker, it is possible for him to avoid colloquialisms, speak slower, and repeat things you didn’t understand using simpler words. On the other hand, when listening to native-level speech without the speaker adjusting for your language level, you need to be near native-level yourself to understand what’s going on.

We can justify this concept using statistics. By looking at how frequencies of English words are distributed, we show that after an initial period of rapid progress, it soon becomes exponentially harder to get better at a language. Conversely, even a small decrease in language complexity can drastically increase comprehension by non-native listeners.

## Reaching conversational level is easy

For the rest of this article, I’ll avoid using the word “fluent”, which is rather vague and misleading. Instead, I will call a “conversational” speaker someone who can conduct some level of conversation in a language, and a “near-native” speaker someone who can readily understand speech and media intended for native speakers.

It’s surprising how little of a language you actually need to know to have a decent conversation with someone. Basically, you need to know:

1. A set of about 1000-2000 very basic words (eg: person, happy, cat, slow, etc).
2. Enough grammar to form sentences (eg: present / future / past tenses; connecting words like “then”, “because”; conditionals, comparisons, etc). Grammar doesn’t need to be perfect, just close enough for the listener to understand what you’re trying to say.
3. When you want to say something but don’t know the word for it, be flexible enough to work around the issue and express it with words you do know.

For an example of English using only basic words, look at the Simple English Wikipedia. It shows that you can explain complex things using a vocabulary of only about 1000 words.

For another example, imagine that Bob, a native English speaker, is talking to Jing, an international student from China. Their conversation might go like this:

Bob: I read in the news that a baby got abducted by wolves yesterday…

Jing: Abducted? What do you mean?

Bob: He got taken away by wolves while the family was out camping.

Jing: Wow, that’s terrible! Is he okay now?

In this conversation, Jing indicates that she doesn’t understand a complex word, “abducted”, and Bob rephrases the idea using simpler words, and the conversation goes on. This pattern happens a lot when I’m conversing with native Japanese speakers.

After some time, Bob gets an intuitive feeling for what level of words Jing can understand, and naturally simplifies his speech to accommodate. This way, the two can converse without Jing explicitly interrupting and asking Bob to repeat what he said.

Consequently, reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in 3 months for a language. I think this is a reasonable amount of time for reaching conversational level.

What if you don’t have the luxury of the speaker simplifying his level of speech for you? We shall see that the task becomes much harder.

## The curse of the Power Law

Initially, I was inspired to write this article after an experience with a group of French speakers. I could talk to any of them individually in French, which is hardly remarkable given that I studied the language since grade 4 and minored in it in university. However, when they talked between themselves, I was completely lost, and could only get a vague sense of what they were talking about.

Feeling slightly embarrassed, I sought an explanation for this phenomenon. Why was it that I could produce 20-page essays for university French classes, but struggled to understand dialogue in French movies and everyday conversations between French people?

The answer lies in the distribution of word frequencies in language. It doesn’t matter if you’re looking at English or French or Japanese — every natural language follows a power law distribution, which means that the frequency of every word is inversely proportional to its rank in the frequency table. In other words, the 1000th most common word appears twice as often as the 2000th most common word, and four times as often as the 4000th most common word, and so on.

(Aside: this phenomenon is sometimes called Zipf’s Law, but refers to the same thing. It’s unclear why this occurs, but the law holds in every natural language) Above: Power law distribution in natural languages

The power law distribution exhibits the long tail property, meaning that as you advance further to the right of the distribution (by learning more vocabulary), the words become less and less common, but never drops off completely. Furthermore, rare words like “constitution” or “fallacy” convey disproportionately more meaning than common words like “the” or “you”.

This is bad news for language learners. Even if you understand 90% of the words of a text, the remaining 10% are the most important words in the passage, so you actually understand much less than 90% of the meaning. Moreover, it takes exponentially more vocabulary and effort to understand 95% or 98% or 99% of the words in the text.

I set out to experimentally test this phenomenon in English. I took the Brown Corpus, containing a million words of various English text, and computed the size of vocabulary you would need to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the corpus. By knowing 75 words, you already understand half of the words in a text! Of course, just knowing words like “the” and “it” doesn’t get you very far. Learning 2000 words is enough to have a decent conversation and understand 80% of the words in a text. However, it gets exponentially harder after that: to get from 80% to 98% comprehension, you need to learn more than 10 times as many words!

(Aside: in this analysis I’m considering conjugations like “swim” and “swimming” to be different words; if you count only the stems, you end up with lower word counts but they still follow a similar distribution)

How many words can you miss and still be able to figure out the meaning by inference? In a typical English novel, I encounter about one word per page that I’m unsure of, and a page contains about 200-250 words, so I estimate 99.5% comprehension is native level. When there are more than 5 words per page that I don’t know, then reading becomes very slow and difficult — this is about 98% comprehension.

Therefore I will consider 98% comprehension “near-native”: above this level, you can generally infer the remaining words from context. Below this level, say between 90% to 98% comprehension, you may understand generally what’s going on, but miss a lot of crucial details. Above: Perceived learning curve for a foreign language

This explains the difficulty of language learning. In the beginning, progress is fast, and in a short period of time you learn enough words to have conversations. After that, you reach a long intermediate-stage plateau where you’re learning more words, but don’t know enough to understand native-level speech, and anybody speaking to you must use a reduced vocabulary in order for you to understand. Eventually, you will know enough words to infer the rest from context, but you need a lot of work to reach this stage.

## Implications for language learners

The good news is that if you want to converse with people in a language, it’s perfectly doable in 3 to 6 months. On the other hand, to watch TV shows in the language without subtitles or understand people speaking naturally is going to take a lot more work — probably living for a few years in a country where the language is spoken.

Is there any shortcut instead of slowly learning thousands of words? I can’t say for sure, but somehow I doubt it. By nature, words are arbitrary clusters of sounds, so no amount of cleverness can help you deduce the meaning of words you’ve never seen before. And when the proportion of unknown words is above a certain threshold, it quickly becomes infeasible to try to infer meaning from context. We’ve reached the barrier imposed by the power law distribution.

Now I will briefly engage in some sociological speculation.

My university has a lot of international students. I’ve always noticed that these students tend to form social groups speaking their native non-English languages, and rarely assimilate into English-speaking social groups. At first I thought maybe this was because their English was bad — but I talked to a lot of international students in English and their English seemed okay: noticeably non-native but I didn’t feel there was a language barrier. After all, all our lectures are in English, and they get by.

However, I noticed that when I talked to international students, I subconsciously matched their rate of speaking, speaking just a little bit slower and clearer than normal. I would also avoid the usage of colloquialisms and cultural references that they might not understand.

If the same international student went out to a bar with a group of native English speakers, everyone else would be speaking at normal native speed. Even though she understands more than 90% of the words being spoken, it’s not quite enough to follow the discussion, and she doesn’t want to interrupt the conversation to clarify a word. As everything builds on what was previously said in the conversation, missing a word here and there means she is totally lost.

It’s not that immigrants don’t want to assimilate into our culture, but rather, we don’t realize how hard it is to master a language. On the surface, going from 90% to 98% comprehension looks like a small increase, but in reality, it takes an immense amount of work.