# Three books about AI / ML / Stats

In this edition of my book review blog post series, I will summarize three books I recently read about artificial intelligence. AI is a hot topic nowadays, and people inside and outside the field have very different perspectives. Even as an AI researcher, I found a lot to learn from these books.

## Emergence of the statistics discipline

Machine learning and statistics are closely related areas: ML can be viewed as statistics but with computers. Thus, to understand machine learning, it’s natural to start from the beginning and study the history of statistics.

The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century by David Salsburg

This book tells the story of how statistics emerged as a scientific discipline in the 20th century. The title of the book comes from a story where Fisher wanted to see if a lady can tell when tea or milk was added to the cup first; he comes up with a series of randomized experiments that motivate modern hypothesis testing. The book describes the lives and circumstances of the people involved, explaining the math pretty well using words, without getting too technical with equations. Some of the founding fathers of statistics:

• Karl Pearson (1857-1936) was the founder of mathematical statistics, devised methods of estimating statistical parameters from data, founded the journal Biometrika, and applied these methods to confirm Darwin’s theory of natural selection. He had a dominating personality, and his son Egon Pearson also became a famous statistician.

• William Gosset (1876-1937) discovered the Student t-distribution while working for Guinness, improving methods to brew beer. He had to publish under the pseudonym “Student” because Guinness wouldn’t let their employees publish.

• Ronald Fisher (1890-1962) was a genius that invented a lot of modern statistics including MLE for estimating parameters, ANOVA, and experimental design. He originally used these methods to study the effects of fertilizers on crop variation, and eventually became a distinguished professor. Fisher did not get along with Pearson, and also dismissed evidence that smoking caused cancer long after it was accepted by the scientific community.

• Jerzy Neyman (1894-1981) invented the standard textbook formulation of hypothesis testing against a null hypothesis, and introduced the concept of confidence interval. Fisher and many others were skeptical since it’s unclear what is the interpretation of p-value and 95% probability of the 95% confidence interval.

Statistics is now a crucial part in experiments across many scientific disciplines. Undoubtedly, statistics changed the way we do science, and this book tells the story of how it happened. I liked the first part of the book more, since it talks about the most influential figures in early statistics. By the latter half of the book, statistics had already diversified into numerous sub-disciplines, and the book jumps rapidly between a plethora of scientists.

## Causal reasoning: a limitation of ML

Book by Judea Pearl, one of the leaders of causal inference who received the 2011 Turing award for his work on Bayesian networks. Pearl points out a flaw that affects all machine learning models, from the simplest linear regression to the deepest neural networks — it’s impossible to tell the difference between causation and correlation using data alone. Every morning I hear the birds chirp before sunrise, so do the birds cause the sun to rise? Obviously not, but for a machine, this is surprisingly difficult to deduce.

The Book of Why: The New Science of Cause and Effect by Judea Pearl

Pearl gives three levels of causation, where each level can’t be built up from tools of the lower levels:

• Level 1 — Association: this is where most machine learning and statistics methods stand today. They can find correlations but can’t differentiate them from causation.

• Level 2 — Intervention: using causal diagrams and do-notation, you can tell whether X causes Y. The first step is to use this machinery to determine if a causal relation is possible from the data, then apply level 1 methods to compute the strength of the causality.

• Level 3 — Counterfactuals: given that you did X and Y happened, determine what would have happened if you did X’ instead.

The most reliable way to determine causality is through a randomized trial, but often this is impractical due to cost or ethics, and we only have observational data. A lot of scientists just control for as many variables as possible, but there are situations where this strategy is flawed. Using causal diagrams, the book explains more sophisticated techniques to determine causality, and a quick algorithm to decide if a variable should be controlled or not.

Causal inference is an active area of machine learning research, although an area that’s often ignored by mainstream ML. Judea Pearl thinks that figuring out a better representation of causation is a key missing ingredient for strong AI.

## AI in the far future

When will we have superhuman artificial general intelligence (AGI)? Well, it depends on who you ask. The media often portrays AGI as on the verge of being achieved in just a few years, but AI researchers predict it to be out of reach for several decades or even centuries.

Superintelligence: Paths, Dangers, Strategies by Nick Bostrom

Bostrom believes strong AGI has a serious possibility of being achieved in the near future, say, by 2050. And once this happens, AI is an existential threat to humanity. Once AI has initially exceeded human ability, it will rapidly improve itself or use its programming skills to develop even stronger AI, and humans will be left in the dust. It will be very difficult to keep a superintelligent AI boxed, since it can develop advanced technology and there are many ways that it can escape from its sandbox.

Depending on how the AI is programmed, it may have very different values from humans. In many of Bostrom’s hypothetical scenarios, an AI designed for some narrow task (eg: producing paperclips) decides to take over the world and unleash an army of self-replicating nanobots to turn every atom in the universe into paperclips. There are a lot of unsolved questions of how to design an agent that can only single-mindedly maximize an objective function, without risk of it doing catastrophic unintended actions to maximize its objective.

For now, there is no imminent possibility of AGI, so it’s unclear to what extent specifying the value function will actually be a problem. There are much more immediate dangers of AI technology, for example, unfair bias against certain groups, and economic consequences of automation taking over jobs. Andrew Ng famously said: “fearing a rise of killer robots is like worrying about overpopulation on Mars“. Nevertheless, Bostrom makes a valid point: the risks of superhuman AI to humanity is so great that it’s worth taking seriously and investing in further research.

# Hypothesis testing for difference in Pearson / Spearman correlations

The Pearson and Spearman correlation coefficients measure how closely two variables are correlated. They’re useful as an evaluation metric in certain machine learning tasks, when you want the model to predict some kind of score, but the actual value of the score is arbitrary, and you only care that the model puts high-scoring items above low-scoring items.

An example of this is the STS-B task in the GLUE benchmark: the task is to rate pairs of sentences on how similar they are. The task is evaluated using Pearson and Spearman correlations against the human ground-truth. Now, if model A has Spearman correlation of 0.55 and model B has 0.51, how confident are you that model A is actually better?

Recently, the NLP research community has advocated for more significance testing (Dror et al., 2018): report a p-value when comparing two models, to distinguish between true improvements and fluctuations due to random chance. However, hypothesis testing is rarely done for Pearson and Spearman metrics — it’s not mentioned in the hitchhiker’s guide linked above, and not supported by the standard ML libraries in Python and R. In this post, I describe how to do significance testing for a difference in Pearson / Spearman correlations, and give some references to the statistics literature.

## Definitions and properties

The Pearson correlation coefficient is defined by:

$r_{xy} = \frac{\sum_i^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_i^n (x_i - \bar{x})^2} \sqrt{\sum_i^n (y_i - \bar{y})^2}}$

The Spearman rank-order correlation is defined as the Pearson correlation between the ranks of the two variables, and measures the relative order between them. Both correlation coefficients range between -1 and 1.

Pearson’s correlation is simpler, has nicer statistical properties, and is default option in most software packages. However, de Winter et al. (2016) argues that Spearman’s correlation works better with non-normal data and is more robust to outliers, so is generally preferred over Pearson’s correlation.

## Significance testing

Suppose we have the predictions of model A and model B, and we wish to compute a p-value for whether their Pearson / Spearman correlation coefficients are different. Start by computing the correlation coefficients for both models against the ground truth.

Then, apply the Fisher transformation to each correlation coefficient:

$z = \frac{1}{2} \log(\frac{1+r}{1-r})$

This transforms r which is between -1 and 1 into z, which ranges the whole real number line. It turns out that z is approximately normal, with nearly constant variance that only depends on N (the number of data points) and not on r.

For Pearson correlation, the standard deviation of the estimator $\hat r_p$ is given by:

$\mathrm{SD}(\hat r_p) = \sqrt{\frac{1}{N-3}}$

For Spearman rank-order correlation, the standard deviation of the estimator $\hat r_s$ is given by:

$\mathrm{SD}(\hat r_s) = \sqrt{\frac{1.060}{N-3}}$

Now, we can compute the p-value because the difference of the two z values follows a normal distribution with known variance.

## R implementation

The following R function computes a p-value for the two-tailed hypothesis test, given a ground truth vector and two model output vectors:

cor_significance_test <- function(truth, x1, x2, method="pearson") {
n <- length(truth)
cor1 <- cor(truth, x1, method=method)
cor2 <- cor(truth, x2, method=method)
fisher1 <- 0.5*log((1+cor1)/(1-cor1))
fisher2 <- 0.5*log((1+cor2)/(1-cor2))
if(method == "pearson") {
expected_sd <- sqrt(1/(n-3))
}
else if(method == "spearman") {
expected_sd <- sqrt(1.060/(n-3))
}
2*(1-pnorm(abs(fisher1-fisher2), sd=expected_sd))
}

Naturally, the one-tailed p-value is half of the two-sided one.

For details of other similar computations involving Pearson and Spearman correlations (eg: confidence intervals, unpaired hypothesis tests), I recommend the Handbook of Parametric and Nonparametric Statistical Procedures (Sheskin, 2000).

## Caveats and limitations

The formula for the Pearson correlation is solid and very accurate. For Spearman, the constant 1.060 has no theoretical backing and was rather derived experimentally by Fieller et al. (1957), by running simulations using variables from a bivariate normal distribution. Fieller claimed that the approximation was accurate for correlations between -0.8 and 0.8. Borkowf (2002) warns that this approximation may be off if the distribution is far from a bivariate normal.

The procedure here for Spearman correlation may not be appropriate if the correlation coefficient is very high (above 0.8) or if the data is not approximately normal. In that case, you might want to try permutation tests or bootstrapping methods — refer to Bishara and Hittner (2012) for a detailed discussion.

## References

1. Dror, Rotem, et al. “The hitchhiker’s guide to testing statistical significance in natural language processing.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.
2. de Winter, Joost CF, Samuel D. Gosling, and Jeff Potter. “Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data.” Psychological methods 21.3 (2016): 273.
3. Sheskin, David J. “Parametric and nonparametric statistical procedures.” Chapman & Hall/CRC: Boca Raton, FL (2000).
4. Fieller, Edgar C., Herman O. Hartley, and Egon S. Pearson. “Tests for rank correlation coefficients. I.” Biometrika 44.3/4 (1957): 470-481.
5. Borkowf, Craig B. “Computing the nonnull asymptotic variance and the asymptotic relative efficiency of Spearman’s rank correlation.” Computational statistics & data analysis 39.3 (2002): 271-286.
6. Bishara, Anthony J., and James B. Hittner. “Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches.” Psychological methods 17.3 (2012): 399.

# Useful properties of ROC curves, AUC scoring, and Gini Coefficients

Receiver Operating Characteristic (ROC) curves and AUC values are often used to score binary classification models in Kaggle and in papers. However, for a long time I found them fairly unintuitive and confusing. In this blog post, I will explain some basic properties of ROC curves that are useful to know for Kaggle competitions, and how you should interpret them.

Above: Example of a ROC curve

First, the definitions. A ROC curve plots the performance of a binary classifier under various threshold settings; this is measured by true positive rate and false positive rate. If your classifier predicts “true” more often, it will have more true positives (good) but also more false positives (bad). If your classifier is more conservative, predicting “true” less often, it will have fewer false positives but fewer true positives as well. The ROC curve is a graphical representation of this tradeoff.

A perfect classifier has a 100% true positive rate and 0% false positive rate, so its ROC curve passes through the upper left corner of the square. A completely random classifier (ie: predicting “true” with probability p and “false” with probability 1-p for all inputs) will by random chance correctly classify proportion p of the actual true values and incorrectly classify proportion p of the false values, so its true and false positive rates are both p. Therefore, a completely random classifier’s ROC curve is a straight line through the diagonal of the plot.

The AUC (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between. The range of possible AUC values is [0, 1]. However, if your AUC is below 0.5, that means you can invert all the outputs of your classifier and get a better score, so you did something wrong.

The Gini Coefficient is 2*AUC – 1, and its purpose is to normalize the AUC so that a random classifier scores 0, and a perfect classifier scores 1. The range of possible Gini coefficient scores is [-1, 1]. If you search for “Gini Coefficient” on Google, you will find a closely related concept from economics that measures wealth inequality within a country.

Why do we care about AUC, why not just score by percentage accuracy?

AUC is good for classification problems with a class imbalance. Suppose the task is to detect dementia from speech, and 99% of people don’t have dementia and only 1% do. Then you can submit a classifier that always outputs “no dementia”, and that would achieve 99% accuracy. It would seem like your 99% accurate classifier is pretty good, when in fact it is completely useless. Using AUC scoring, your classifier would score 0.5.

In many classification problems, the cost of a false positive is different from the cost of a false negative. For example, it is worse to falsely imprison an innocent person than to let a guilty criminal get away, which is why our justice system assumes you’re innocent until proven guilty, and not the other way around. In a classification system, we would use a threshold rule, where everything above a certain probability is treated as 1, and everything below is treated as 0. However, deciding on where to draw the line requires weighing the cost of a false positive versus a false negative — this depends on external factors and has nothing to do with the classification problem.

AUC scoring lets us evaluate models independently of the threshold. This is why AUC is so popular in Kaggle: it enables competitors to focus on developing a good classifier without worrying about choosing the threshold, and let the organizers choose the threshold later.

(Note: This isn’t quite true — a classifier can sometimes be better at certain thresholds and worse at other thresholds. Sometimes it’s necessary to combine classifiers to get the best one for a particular threshold. Details in the paper linked at the end of this post.)

Next, here’s a mix of useful properties to know when working with ROC curves and AUC scoring.

AUC is not directly comparable to accuracy, precision, recall, or F1-score. If your model is achieving 0.65 AUC, it’s incorrect to interpret that as “65% accurate”. The reason is that AUC exists independently of a threshold and is immune to class imbalance, whereas accuracy / precision / recall / F1-score do require you picking a threshold, so you’re measuring two different things.

Only relative order matters for AUC score. When computing ROC AUC, we predict a probability for each data point, sort the points by predicted probability, and evaluate how close is it from a perfect ordering of the points. Therefore, AUC is invariant under scaling, or any transformation that preserves relative order. For example, predicting [0.03, 0.99, 0.05, 0.06] is the same as predicting [0.15, 0.92, 0.89, 0.91] because the relative ordering for the 4 items is the same in both cases.

A corollary of this is we can’t treat outputs of an AUC-optimized model as the likelihood that it’s true. Some models may be poorly calibrated (eg: its output is always between 0.3 and 0.32) but still achieve a good AUC score because its relative ordering is correct. This is something to look out for when blending together predictions of different models.

That’s my summary of the most important properties to know about ROC curves. There’s more that I haven’t talked about, like how to compute AUC score. If you’d like to learn more, I’d recommend reading “An introduction to ROC analysis” by Tom Fawcett.

# What if math contests were scored using Principal Component Analysis?

In many math competitions, all problems are weighted equally, even though the problems have very different difficulties. Sometimes, the harder problems are weighted more. But how should we assign weights to each problem?

Usually, the organizers make up weights based on how difficult they believe the problems are. However, they sometimes misjudge the difficulty of problems. Wouldn’t it be better if the weightings were determined from data?

Let’s try Principal Component Analysis!

Principal Component Analysis (PCA) is a statistical procedure that finds a transformation of the data that maximizes the variance. In our case, the first principal component gives a relative weighting of the problems that maximizes the variance of the total scores. This makes sense because we want to separate the good and bad students in a math contest.

## IMO 2017 Data

The International Mathematics Olympiad (IMO) is an annual math competition for top high school students around the world. It consists of six problems, divided between two days: on each day, contestants are given 4.5 hours to solve three problems.

Here are the 2017 problems, if you want to try them.

Above: Score distribution for IMO 2017

This year, 615 students wrote the IMO. Problems 1 and 4 were the easiest, with the majority of contestants receiving full scores. Problems 3 and 6 were the hardest: only 2 students solved the third problem. Problems 2 and 5 were somewhere in between.

This is a good dataset to play with, because the individual results show what each student scored for every problem.

## Derivation of PCA for the 1-dimensional case

Let $X$ be a matrix containing all the data, where each column represents one problem. There are 615 contestants and 6 problems so $X$ has 615 rows and 6 columns.

We wish to find a weight vector $\vec u \in \mathbb{R}^{6 \times 1}$ such that the variance of $X \vec u$ is maximized. Of course, scaling up $\vec u$ by a constant factor also increases the variance, so we need the constraint that $| \vec u | = 1$.

First, PCA requires that we center $X$ so that the mean for each of the problems is 0, so we subtract each column by its mean. This transformation shifts the total score by a constant, and doesn’t affect the relative weights of the problems.

Now, $X \vec u$ is a vector containing the total scores of all the contestants; its variance is the sum of squares of its elements, or $| X \vec u |^2$.

To maximize $|X \vec u |^2$ subject to $|\vec u| = 1$, we take the singular value decomposition of $X = U \Sigma V^T$. Then, the leftmost column of $V$ (corresponding to the largest singular value) gives $\vec u$ that maximizes $| X \vec u|^2$. This gives the first principal axis, and we are done.

## Experiments

Running PCA on the IMO 2017 data produced interesting results. After re-scaling the weights so that the minimum possible score is 0 and the maximum possible score is 42 (to match IMO’s scoring), PCA recommends the following weights:

• Problem 1: 9.15 points
• Problem 2: 9.73 points
• Problem 3: 0.15 points
• Problem 4: 15.34 points
• Problem 5: 5.59 points
• Problem 6: 2.05 points

This is the weighting that produces the highest variance. That’s right, solving the hardest problem in the history of the IMO would get you a fraction of 1 point. P4 had the highest variance of the six problems, so PCA gave it the highest weight.

The scores and rankings produced by the PCA scheme are reasonably well-correlated with the original scores. Students that did well still did well, and students that did poorly still did poorly. The top students that solved the harder problems (2, 3, 5, 6) usually also solved the easier problems (1 and 2). The students that would be the unhappiest with this scheme are a small number of people who solved P3 or P6, but failed to solve P4.

Here’s a comparison of score distributions with the original and PCA scheme. There is a lot less separation between the best of the best students and the middle of the pack. It is easy to check that PCA does indeed produce higher variance than weighing all six problems equally.

Now, let me comment on the strange results.

It’s clearly absurd to give 0.15 points to the hardest problem on the IMO, and make P4, a much easier problem, be worth 100 times more. But it makes sense from PCA’s perspective. About 99% of the students scored zero on P3, so its variance is very low. Given that PCA has a limited amount of weight to “spend” to increase the total variance, it would be wasteful to use much of it on P3.

The PCA score distribution has less separation between the good students and the best students. However, by giving a lot of weight to P1 and P4, it clearly separates mediocre students that solve one problem from the ones who couldn’t solve anything at all.

In summary, scoring math contests using PCA doesn’t work very well. Although it maximizes overall variance, math contests are asymmetrical as we care about differentiating between the students on the top end of the spectrum.

## Source Code

If you want to play with the data, I uploaded it as a Kaggle dataset.

The code for this analysis is available here.

# On Multiple Hypothesis Testing and the Bonferroni Correction

When you’re doing a statistical analysis, it’s easy to run into the multiple comparisons problem.

Imagine you’re analyzing a dataset. You perform a bunch of statistical tests, and one day you get a p-value of 0.02. This must be significant, right? Not so fast! If you tried a lot of tests, then you’ve fallen into the multiple comparisons fallacy — the more tests you do, the higher chance you get a p-value < 0.05 by pure chance.

Here’s an xkcd comic that illustrates this:

They conducted 20 experiments and got a p-value < 0.05 on one of the experiments, thus concluding that green jelly beans cause acne. Later, other researchers will have trouble replicating their results — I wonder why?

What should they have done differently? Well, if they knew about the Bonferroni Correction, they would have divided the p-value 0.05 by the number of experiments, 20. Then, only a p-value smaller than 0.0025 is a truly significant correlation between jelly beans and acne.

Let’s dive in to explain why this division makes sense.

## Šidák Correction

Time for some basic probability. What’s the chance that the scenario in the xkcd comic would happen? In other words, if we conduct 20 experiments, each with probability 0.05 of producing a significant p-value, then how likely will at least one of the experiments produce a significant p-value? Assume all the experiments are independent.

The probability of an experiment not being significant is $1 - 0.05$, so the probability of all 20 experiments not being significant is $(1-0.05)^{20}$. Therefore the probability of at least 1 of 20 experiments being significant is $1 - (1-0.05)^{20} = 0.64$. Not too surprising now, isn’t it?

We want the probability of accidentally getting a significant p-value by chance to be 0.05, not 0.64 — the definition of p-value. So flip this around — we need to find an adjusted p-value $p_{adj}$ to give an overall p-value 0.05:

$1 - (1 - p_{adj})^{20} = 0.05$

Solving for $p_{adj}$:

$p_{adj} = 1 - 0.95^{1/20} \approx 0.00256$

Okay, this seems reasonably close to 0.0025. In general, if the overall p-value is $p$ and we are correcting for $N$ comparisons, then

$p_{adj} = 1 - (1 - p)^{1/N}$

This is known as the Šidák Correction in literature.

## Bonferroni Correction

Šidák’s method works great, but eventually people started complaining that Šidák’s name had too many diacritics and looked for something simpler (also, it used to be difficult to compute nth roots back when they didn’t have computers). How can we approximate this formula?

Approximate? Use Taylor series, of course!

Assume $N$ is constant, and define:

$f(p) = 1 - (1-p)^{1/N}$

We take the first two terms of the Taylor series of $f(p)$ centered at 0:

$f(p) = f(0) + f'(0)p + O(p^2)$

Now $f(0) = 0$ and $f'(p) = \frac{1}{N} (1-p)^{-(N-1)/N}$ so $f'(0) = \frac1N$. Therefore,

$f(p) = p_{adj} \approx \frac{p}{N}.$

That’s the derivation for the Bonferroni Correction.

Since we only took the first two terms of the Taylor series, this produces a $p_{adj}$ that’s slightly lower than necessary.

In the real world, $p$ is close to zero, so in practice it makes little difference whether we use the exact Šidák Correction or the Bonferroni approximation.

That’s it for now. Next time you do multiple comparisons, just remember to divide your p-value by $N$. Now you know why.

# Simple models in Kaggle competitions

This week I participated in the Porto Seguro Kaggle competition. Basically, you’re asked to predict a binary variable — whether or not an insurance claim will be filed — based on a bunch of numerical and categorical variables.

With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to beginning data scientists.

Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.

The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.

Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.

At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite being within 10% of the top submission in overall accuracy, I was still in the bottom half of the leaderboard, ranking in the 30th percentile.

The public leaderboard looked like this:

Above: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. Line in green is my submission, scoring 0.265.

This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything that was not “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but afterwards, any further improvements is limited.

The top of the leaderboard is very crowded, with over 1000 teams having the score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat.

Above: To run 90% as fast as Usain Bolt, you need to run 100 meters in 10.5 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().

This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, at least the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then, 5000 smart people working for 2 months can only improve it by 10%. Perhaps this is obvious to someone doing machine learning long enough, but we should look back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.

Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.

Read further discussion of this post on the Kaggle forums!

# What’s the difference between Mathematics and Statistics?

Statistics has a sort of funny and peculiar relationship with mathematics. In a lot of university departments, they’re lumped together and you have a Department of Mathematics and Statistics”. Other times, it’s grouped as a branch in applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking.

Above: if pure math and statistics were like games

## Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so you know it when you see it”; things are rarely so black-and-white like in mathematics. This is born out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take for example the concept of an outlier”. Many statistical methods behave badly when the data contains outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far it is from the rest of the points, and what kind of model you’re fitting.

In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.

For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B-test and find that changing a button to blue results in higher clicks, with p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There is no correct answer, you have to use your judgment.

Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and if this assumption is violated then you’ll get wrong results, so you need to use a different model.

Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer, you have to use your judgment.

Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this:

There is some evidence of non-linearity, but how much bendiness” can you accept before the data is definitely not roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.

I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike math, there is no universal procedure that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

• A random variable is drawn from a normal (Gaussian) distribution
• Two random variables are independent
• Two random variables satisfy a linear relationship
• Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belongs in this category.

## Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how do you be sure anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. The best known approximation algorithm for the Traveling Salesman problem has an approximation ratio of 1.5 – this means that even for the worst possible input, the algorithm gives a path that’s no more than 1.5 times longer than the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions.

Above: the winning neural network architecture for ImageNet Challenge 2012. Currently, theory fails at explaining why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data, but is not concerned with how much data you need to achieve a desired accuracy rate.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.

## Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better – both approaches have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every value has a confidence interval where you can expect it to be right about 95% of the time, but we can never be 100% sure about anything. But given enough data, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data that it’s trained on. In other words, if we use a regression model with data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.