How to read research papers for fun and profit

One skill that I’ve learned after a year in grad school is how to effectively read research papers. Previously I had found them impenetrable, but now I find them a great source of information about cutting-edge science while it is being done and before it’s made its way into textbooks. Now I read about 4-5 of them every week.

My research area is natural language processing and machine learning, but I read papers in lots of fields, not just in AI and computer science. Papers are my go-to source for a myriad of scientific inquiries, for example: does drinking alcohol cause cancer? Are women more talkative than men? Was winter in Toronto abnormally cold this year? Etc.

Why read scientific papers?

If you try to Google questions like these, you typically end up on Wikipedia or some random article on the internet. Research papers are an underutilized resource that has several advantages over other common sources of information on the internet.

Advantages over articles on the internet: no matter the topic, you will almost certainly find articles about it on the internet. Some of these articles are excellent, but others are opinionated nonsense. Without being an expert yourself, it can be difficult to decide what information to trust. Peer-reviewed research papers are held to a much higher minimum quality standard: for every claim they make, they are expected to state their evidence, their assumptions, how they arrived at the conclusion, and their degree of confidence in the result. You can examine the paper for yourself and decide whether the assumptions are reasonable and the conclusions follow logically, rather than taking someone else’s word for it. With a bit of digging and some critical thinking, you can avoid a lot of the misinformation on the internet.

Advantages over Wikipedia: Wikipedia is a pretty reliable source of truth; in fact, it often cites scientific papers as its sources. However, Wikipedia is written to be concise, so a 30-page research paper is often summarized in one or two sentences. If you only read Wikipedia, you will miss a lot of the nuance contained in the original paper, and develop only a cursory understanding compared to going directly to the source.

Finding the right paper to read

If your professor or colleague has assigned you a specific paper to read, then you can skip this section.

A big part of the challenge of reading papers is deciding which ones to read. There are a lot of papers out there, and only a few will be relevant to you. Therefore, deciding what to read is a nontrivial skill in itself.

Research papers are the most useful when you have a specific problem or question in mind. When I first started out reading papers, I approached this the wrong way. One day, I’d suddenly decide “hmm, complexity theory is pretty interesting, let’s go on arXiv and look at some recent complexity theory papers”. Then, I’d open a few, attempt to read them, get confused, and conclude I’m not smart enough to read complexity theory papers. Why is this a bad idea? A research paper exists to answer a very specific question, so it makes no sense to pick up a random paper without the background context. What is the problem? What approaches have been tried in the past, and how have they failed? Without understanding background information like this, it’s impossible to appreciate the contribution of a specific paper.

Above: Use the forward citation and related article buttons on Google Scholar to explore relevant papers.

It’s helpful to think of each research paper as a node in a massive, interconnected graph. Rather than each paper existing as a standalone item, a paper is deeply connected to the research that came before and after it.

Google Scholar is your best friend for exploring this graph. Begin by entering a few keywords and picking a few promising hits from the first 2-3 pages. Good, this is your starting point. Here are some heuristics for traversing the paper graph:

  • To go forward in time, look at works that cited this paper. A paper being cited usually means one of two things: (1) the future paper uses some technique or result developed in the current paper for some other purpose, or (2) the future paper improves on the techniques in the current paper. Citations of the second type are more useful.
  • To go backward in time, look at the paper’s introduction and related work. This puts the paper in context of previous work. Occasionally, you find a survey paper that doesn’t contribute anything novel of its own, but summarizes a bunch of previous related work; these are really helpful when you’re beginning your research in a topic.
  • Citation count is a good indicator of a paper’s importance and merit. If the paper has under 10 citations, take its claims with a grain of salt (even more so if it’s an arXiv preprint and not a peer-reviewed paper). Over 100 citations means the paper has made a significant contribution; over 1000 citations indicates a landmark paper in the field and is probably worth reading. Citation count is not a perfect metric, especially for very recent work, but it’s a useful heuristic that’s applicable across disciplines.
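If you like thinking in code, the traversal described above is basically a breadth-first search with a citation-count filter. Here is a minimal Python sketch of that idea. The graph `CITED_BY`, the counts in `CITATIONS`, and the function `reading_list` are all made up for illustration; in practice you do this by clicking through search results, not by calling a function.

```python
from collections import deque

# Hypothetical toy citation graph: paper -> papers that cite it (forward in time).
CITED_BY = {
    "seed": ["improves_seed", "uses_seed"],
    "improves_seed": ["survey"],
    "uses_seed": [],
    "survey": [],
}

# Hypothetical citation counts for each paper.
CITATIONS = {"seed": 1200, "improves_seed": 150, "uses_seed": 4, "survey": 60}


def reading_list(start, min_citations=10, max_hops=2):
    """Walk forward from `start` breadth-first, keeping well-cited papers."""
    seen = {start}
    queue = deque([(start, 0)])
    keep = []
    while queue:
        paper, hops = queue.popleft()
        # Papers under the citation threshold are visited but not recommended.
        if CITATIONS.get(paper, 0) >= min_citations:
            keep.append(paper)
        if hops < max_hops:
            for nxt in CITED_BY.get(paper, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    # Most-cited first, following the "citation count as importance" heuristic.
    return sorted(keep, key=lambda p: CITATIONS[p], reverse=True)


print(reading_list("seed"))
```

The threshold of 10 citations mirrors the rule of thumb above; it is a heuristic, so for very recent work you would probably want to lower it.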

The first pass: High level overview

Great, you’ve decided on a paper to read. Now how to read it effectively?

Reading a paper is not like reading a novel. When you read a novel, you start at the beginning and read linearly until you reach the end. A paper, however, is read most efficiently by hopping between sections as needed, rather than reading linearly from beginning to end.

The goal of your first reading is to get a high-level overview of the paper before diving into the details. As you go through the paper, here are some good questions to ask yourself:

  • What is the problem being solved?
  • What approaches have been tried before, and what are their limitations?
  • What is this paper’s novel contribution?
  • What experiments were done, using what dataset? How successful were the results?
  • Can the method in this paper be applied to my problem?
  • If not, what assumptions are needed for this method to work?

Above: Treat each paper as a node in a massive graph of research, rather than a standalone item in a vacuum.

When I read a paper, I usually proceed in the following order:

  1. Abstract: a long paragraph that summarizes the entire paper. Read this to decide if the rest of the paper is worth reading or not.
  2. Introduction, diagrams, tables, and conclusion. Often, reading the diagrams and captions gives you a good idea of what’s going on with minimal effort.
  3. If the field is unfamiliar to you, then note down any interesting references in the introduction and related works sections to explore later. If the field is familiar, then just skim these sections.
  4. Read the main body of the paper: model, experiment, and discussion, without getting too bogged down in the details. If a section is confusing, skip it for now and come back to it on a second reading.

That’s it: you’ve finished reading a paper! Now you can either go back and read it again, focusing on the details you skimmed over on the first pass, or move on to a different paper that you’ve added to your backlog.

When reading a paper, you should not expect to understand every aspect of the paper by the time you’re done. You can always refer back to the paper at a later time, as needed. Generally, you don’t need to understand all the details, unless you’re trying to replicate or extend the paper.

Help, I’m stuck!

Sometimes, despite your best efforts, you find that a paper is impenetrable. It’s not necessarily your fault — some papers are hastily written hours before a conference deadline. What do you do now?

Look for a video or blog post explaining the paper. If you’re lucky, someone may have recorded a lecture where the author presents the paper at a conference. Maybe somebody wrote a blog post summarizing the paper (Colah’s blog has great summaries of machine learning research). These are often better at explaining things than the actual paper.

If there’s a lot of background terminology that doesn’t make sense, it may be better to consult other sources like textbooks and course lectures rather than papers. This is especially true if the research is not new (more than 10 years old). Research papers are not always the best at explaining a concept clearly: by their nature, they document research as it’s being done. Sometimes, the paper paints an incomplete picture of something that’s better understood later. Textbook writers can look back on research after it’s already done, and thereby benefit from hindsight knowledge that didn’t exist when the paper was written.

Basic statistics is useful in many experimental fields: concepts like linear and logistic regression, p-values, hypothesis testing, and common statistical distributions. Any paper that deals with experimental data will use at least some statistics, so it’s worthwhile to be comfortable with basic stats.


That’s it for my advice. The densely packed two-column pages of text may appear daunting to the uninitiated reader, but they can be conquered with a bit of practice. Whether it’s for work or for fun, you definitely don’t need a PhD to read papers.

Publishing Negative Results in Machine Learning is like Proving Dragons don’t Exist

I’ve been reading a lot of machine learning papers lately, and one thing I’ve noticed is that the vast majority of papers report positive results — “we used method X on problem Y, and beat the state-of-the-art results”. Very rarely do you see a paper that reports that something doesn’t work.

The result is publication bias — if we only publish the results of experiments that succeed, even statistically significant results could be due to random chance, rather than anything actually significant happening. Many areas of science are facing a replication crisis, where published research cannot be replicated.
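To make this concrete, here is a small Python simulation of publication bias. It is only a sketch under simplifying assumptions: every experiment compares two groups drawn from the same distribution (so there is never a real effect), the variance is known, and a result counts as "publishable" if p < 0.05. The function names and sample sizes are illustrative choices.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def run_null_experiments(n_experiments=2000, n_per_group=50, seed=0):
    """Simulate experiments in which the 'method' has no real effect."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        # z-test with known variance 1 per observation in each group
        z = diff / math.sqrt(2 / n_per_group)
        p_values.append(two_sided_p(z))
    return p_values


p_values = run_null_experiments()
# Only the "positive" results get written up.
published = [p for p in p_values if p < 0.05]
print(f"{len(published)} of {len(p_values)} null experiments look significant")
```

Roughly one in twenty of these pure-noise experiments clears the significance bar. If those are the only ones that get published, the literature fills up with statistically significant results that mean nothing.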

There is some community discussion of encouraging more negative paper submissions, but as of now, negative results are rarely publishable. If you attempt an experiment but don’t get the results you expected, your best hope is to try a bunch of variations of the experiment until you get some positive result (perhaps on a special case of the problem), after which you pretend the failed experiments never happened. With few exceptions, any positive result is better than a negative result, like “we tried method X on problem Y, and it didn’t work”.

Why publication bias is not so bad

I just described a cynical view of academia, but actually, there’s a good reason why the community prefers positive results. Negative results are simply not very useful, and contribute very little to human knowledge.

Now why is that? When a new paper beats the state-of-the-art results on a popular benchmark, that’s definite proof that the method works. The converse is not true. If your model fails to produce good results, it could be due to a number of reasons:

  • Your dataset is too small / too noisy
  • You’re using the wrong batch size / activation function / regularization
  • You’re using the wrong loss function / wrong optimizer
  • Your model is overfitting
  • You have a bug in your code

Above: Only when everything is correct will you get positive results; many things can cause a model to fail.

So if you try method X on problem Y and it doesn’t work, you gain very little information. In particular, you haven’t proved that method X cannot work. Sure, you found that your specific setup didn’t work, but have you tried making modification Z? Negative results in machine learning are rare because you can’t possibly anticipate all possible variations of your method and convince people that all of them won’t work.

Searching for dragons

Suppose we’re scientists attending the International Conference of Flying Creatures (ICFC). Somebody mentioned it would be nice if we had dragons. Dragons are useful. You could do all sorts of cool stuff with a dragon, like ride it into battle.

“But wait!” you exclaim: “Dragons don’t exist!”

I glance at you questioningly: “How come? We haven’t found one yet, but we’ll probably find one soon.”

Your intuition tells you dragons shouldn’t exist, but you can’t articulate a convincing argument why. So you go home, and you and your team of grad students labor for a few years and publish a series of papers:

  • “We looked for dragons in China and we didn’t find any”
  • “We looked for dragons in Europe and we didn’t find any”
  • “We looked for dragons in North America and we didn’t find any”

Eventually, the community is satisfied that dragons probably don’t exist, for if they did, someone would have found one by now. But a few scientists still harbor the possibility that there may be dragons lying around in a remote jungle somewhere. We just don’t know for sure.

This remains the state of things for a few years until a colleague publishes a breakthrough result:

  • “Here’s a calculation that shows that any dragon with a wing span longer than 5 meters will collapse under its own weight”

You read the paper, and indeed, the logic is impeccable. This settles the matter once and for all: dragons don’t exist (or at least the large, flying sort of dragons).

When negative results are actually publishable

The research community dislikes negative results because they don’t prove a whole lot — you can have a lot of negative results and still not be sure that the task is impossible. In order for a negative result to be valuable, it needs to present a convincing argument why the task is impossible, and not just a list of experiments that you tried that failed.

This is difficult, but it can be done. Let me give an example from computational linguistics. Recurrent neural networks (RNNs) can, in theory, compute any function defined over a sequence. In practice, however, they had difficulty remembering long-term dependencies. Attempts to train RNNs using gradient descent ran into numerical difficulties known as the vanishing / exploding gradient problem.

Then, Bengio et al. (1994) formulated a mathematical model of an RNN as an iteratively applied function. Using ideas from dynamical systems theory, they showed that as the input sequence gets longer and longer, the result is more and more sensitive to noise. The details are technical, but the gist is that, under some reasonable assumptions, gradient descent cannot effectively train an RNN to remember information over long sequences. This is a rare example of a negative result in machine learning. It’s an excellent paper and I’d recommend reading it.
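For intuition about what "iteratively applied function" means here, consider a toy version. This is not the analysis from the paper, just a sketch with a one-dimensional RNN, h_t = tanh(w * h_{t-1}), where the weight `w` and starting state `h0` are arbitrary values I picked. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step.

```python
import math


def gradient_through_time(w, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return grad


for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

Every factor in the product is smaller than `w`, so with `w = 0.9` the gradient shrinks at least geometrically with the number of steps. After a few dozen steps, the early inputs have almost no influence on the gradient, which is the vanishing gradient problem in miniature.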

Above: A Long Short-Term Memory (LSTM) network handles long-term dependencies by adding a memory cell.

Soon after the vanishing gradient problem was understood, researchers invented the LSTM (Hochreiter and Schmidhuber, 1997). Since training plain RNNs on long-term dependencies with gradient descent was hopeless, they added a ‘latching’ mechanism that allows state to persist through many iterations, thus avoiding the vanishing gradient problem. Unlike plain RNNs, LSTMs can handle long-term dependencies and can be trained with gradient descent; they are among the most widely used deep learning architectures in NLP today.


After reading the breakthrough dragon paper, you pace around your office, thinking. Large, flying dragons can’t exist after all, as they would collapse under their own weight — but what about smaller, non-flying dragons? Maybe we’ve been looking for the wrong type of dragons all along? Armed with new knowledge, you embark on a new search…

Above: Komodo Dragon, Indonesia

…and sure enough, you find one 🙂

Simple models in Kaggle competitions

This week I participated in the Porto Seguro Kaggle competition. Basically, you’re asked to predict a binary variable — whether or not an insurance claim will be filed — based on a bunch of numerical and categorical variables.

With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to beginning data scientists.

Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.


The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.
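For readers who don't use R, here is roughly what fitting a logistic regression involves under the hood. This is a from-scratch Python sketch on made-up data with a single feature, not the actual competition code, and it is far simpler than what glm() does internally. The function `fit_logistic`, the learning rate, and the toy data are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: one numeric feature and a binary "claim filed" label.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)


def predict(x):
    """Predicted probability that a claim is filed."""
    return sigmoid(w * x + b)


for x in (0.5, 2.0, 4.0):
    print(x, round(predict(x), 3))
```

The model outputs a probability rather than a hard yes/no, which is what you want for a leaderboard metric that rewards ranking the risky cases above the safe ones.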

Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.

At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite scoring within 10% of the top submission, I was still in the bottom half of the leaderboard, ranking in the 30th percentile.

The public leaderboard looked like this:

Above: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. The green line is my submission, scoring 0.265.

This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything that was not “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but afterwards, any further improvement is limited.

The top of the leaderboard is very crowded, with over 1000 teams having a score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat.

Above: To run 90% as fast as Usain Bolt, you need to run 100 meters in about 10.6 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().

This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, at least the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then, 5000 smart people working for 2 months can only improve it by 10%. Perhaps this is obvious to someone doing machine learning long enough, but we should look back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.

Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.

Read further discussion of this post on the Kaggle forums!

What’s the difference between Mathematics and Statistics?

Statistics has a funny and peculiar relationship with mathematics. In a lot of universities, the two are lumped together into a “Department of Mathematics and Statistics”. Other times, statistics is grouped as a branch of applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s “not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking.

Above: if pure math and statistics were like games

Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so “you know it when you see it”; things are rarely as black-and-white as in mathematics. This is born out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take for example the concept of an “outlier”. Many statistical methods behave badly when the data contains outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far it is from the rest of the points, and what kind of model you’re fitting.

[Figure: scatter plot in which two points are potential outliers]

In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.


For another example, consider p-values. Usually, when you get a p-value under 0.05, the result is considered statistically significant. But this threshold is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B test and find that changing a button to blue results in more clicks, with a p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There is no correct answer; you have to use your judgment.
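If you’re wondering where a p-value like this comes from, here is one common way to compute it for an A/B test: a two-proportion z-test. This is a sketch using only the Python standard library; the click and view counts are made up, and a real analysis might well use a different test.

```python
import math


def two_proportion_p_value(clicks_a, views_a, clicks_b, views_b):
    """Two-sided p-value for a difference in click rates (pooled z-test)."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Under the null hypothesis both variants share one underlying click rate.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: old button (A) vs. blue button (B).
p = two_proportion_p_value(clicks_a=200, views_a=2000, clicks_b=236, views_b=2000)
print(f"p-value: {p:.3f}")
```

With these made-up counts, the blue button gets noticeably more clicks, yet the p-value lands just above 0.05, in exactly the gray zone described above. The formula gives you a number, but not a decision.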


Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant; if this assumption is violated, you can get misleading results, and you may need to use a different model.

[Figure: scatter plot with only a few points to the left of 3.5]

Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer; you have to use your judgment.


Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this:

[Figure: scatter plot of points following a slightly curved trend]

There is some evidence of non-linearity, but how much “bendiness” can you accept before the data is definitely not “roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.


I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike in math, there is no universal procedure in statistics that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

  • A random variable is drawn from a normal (Gaussian) distribution
  • Two random variables are independent
  • Two random variables satisfy a linear relationship
  • Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: “all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belong in this category.

Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how can you be sure anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging “correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. Christofides’ algorithm, the classic approximation algorithm for the metric Traveling Salesman problem, has an approximation ratio of 1.5 – this means that even for the worst possible input, the algorithm gives a tour that’s no more than 1.5 times longer than the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions.

Above: the winning neural network architecture for ImageNet Challenge 2012. Currently, theory fails at explaining why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data, but the bounds it provides on how much data you need are usually too loose to be useful in practice.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.

Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better – both approaches have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every estimate comes with a confidence interval that you can expect to contain the true value about 95% of the time, but we can never be 100% sure about anything. But given enough data, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data that they’re trained on. In other words, if we use a regression model with data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.
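Here is a tiny Python illustration of the extrapolation problem. It is a deliberately crude sketch: I generate noise-free "observations" of an apple during its first second of falling using the textbook formula d = ½gt², fit a straight line to them by least squares, and then ask the line about a much later time. The helper names and the choice of a linear model are my own assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


G = 9.81  # standard gravity, in m/s^2


def fall_distance(t):
    """Distance fallen after t seconds, ignoring air resistance."""
    return 0.5 * G * t * t


# "Training data": the apple observed every 0.1 s during its first second.
times = [i / 10 for i in range(11)]
slope, intercept = fit_line(times, [fall_distance(t) for t in times])


def predict(t):
    return slope * t + intercept


in_range_error = max(abs(predict(t) - fall_distance(t)) for t in times)
far_error = abs(predict(10) - fall_distance(10))
print(f"worst error within the observed second: {in_range_error:.2f} m")
print(f"error when extrapolating to t = 10 s:   {far_error:.2f} m")
```

Within the range of the data, the fitted line is off by less than a meter. Ten seconds out, it is off by hundreds of meters, while the physical law that generated the data remains exact everywhere.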

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.

The Power Law Distribution and the Harsh Reality of Language Learning

I’m an avid language learner, and sometimes people ask me: “how many languages do you speak?” If we’re counting all the languages in which I can have at least a basic conversation, then I can speak five languages — but can I really claim fluency in a language if I can barely read children’s books? Despite being a seemingly innocuous question, it’s not so simple to answer. In this article, I’ll try to explain why.

Let’s say you’re just starting to study Japanese. You might picture yourself being able to do the following things, after a few months or years of study:

  1. Have a conversation with a Japanese person who doesn’t speak any English
  2. Watch the latest episode of some anime in Japanese before the English subtitles come out
  3. Overhear a conversation between two Japanese people in an elevator

After learning several languages, I discovered that the first task is a lot easier than the other two, by an order of magnitude. Whether in French or in Japanese, I would quickly learn enough of the language to talk to people, but the ability to understand movies and radio remains elusive even after years of study.

There is a fundamental difference in how language is used in one-on-one conversation versus the other two tasks. When conversing with a native speaker, they can avoid colloquialisms, speak more slowly, and repeat things you didn’t understand using simpler words. On the other hand, when listening to native-level speech without the speaker adjusting for your language level, you need to be near native-level yourself to understand what’s going on.

We can justify this concept using statistics. By looking at how frequencies of English words are distributed, we show that after an initial period of rapid progress, it soon becomes exponentially harder to get better at a language. Conversely, even a small decrease in language complexity can drastically increase comprehension by non-native listeners.

Reaching conversational level is easy

For the rest of this article, I’ll avoid using the word “fluent”, which is rather vague and misleading. Instead, I will call a “conversational” speaker someone who can conduct some level of conversation in a language, and a “near-native” speaker someone who can readily understand speech and media intended for native speakers.

It’s surprising how little of a language you actually need to know to have a decent conversation with someone. Basically, you need to know:

  1. A set of about 1000-2000 very basic words (eg: person, happy, cat, slow, etc).
  2. Enough grammar to form sentences (eg: present / future / past tenses; connecting words like “then”, “because”; conditionals, comparisons, etc). Grammar doesn’t need to be perfect, just close enough for the listener to understand what you’re trying to say.
  3. When you want to say something but don’t know the word for it, be flexible enough to work around the issue and express it with words you do know.

For an example of English using only basic words, look at the Simple English Wikipedia. It shows that you can explain complex things using a vocabulary of only about 1000 words.

For another example, imagine that Bob, a native English speaker, is talking to Jing, an international student from China. Their conversation might go like this:

Bob: I read in the news that a baby got abducted by wolves yesterday…

Jing: Abducted? What do you mean?

Bob: He got taken away by wolves while the family was out camping.

Jing: Wow, that’s terrible! Is he okay now?

In this conversation, Jing indicates that she doesn’t understand a complex word, “abducted”, and Bob rephrases the idea using simpler words, and the conversation goes on. This pattern happens a lot when I’m conversing with native Japanese speakers.

After some time, Bob gets an intuitive feeling for what level of words Jing can understand, and naturally simplifies his speech to accommodate. This way, the two can converse without Jing explicitly interrupting and asking Bob to repeat what he said.

Consequently, reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in a language in 3 months. I think this is a reasonable amount of time for reaching conversational level.

What if you don’t have the luxury of the speaker simplifying his level of speech for you? We shall see that the task becomes much harder.

The curse of the Power Law

Initially, I was inspired to write this article after an experience with a group of French speakers. I could talk to any of them individually in French, which is hardly remarkable given that I had studied the language since grade 4 and minored in it in university. However, when they talked among themselves, I was completely lost, and could only get a vague sense of what they were talking about.

Feeling slightly embarrassed, I sought an explanation for this phenomenon. Why was it that I could produce 20-page essays for university French classes, but struggled to understand dialogue in French movies and everyday conversations between French people?

The answer lies in the distribution of word frequencies in language. It doesn’t matter if you’re looking at English or French or Japanese — every natural language follows a power law distribution, which means that the frequency of every word is inversely proportional to its rank in the frequency table. In other words, the 1000th most common word appears twice as often as the 2000th most common word, and four times as often as the 4000th most common word, and so on.

(Aside: this phenomenon is known as Zipf’s Law. It’s unclear why it occurs, but the law holds in every natural language)
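The inverse relationship between frequency and rank can be sketched in a few lines. This is an idealized model with an exponent of exactly 1, which real corpora only follow approximately; the function name and parameters are my own.

```python
def zipf_frequency(rank, top_frequency=1.0, exponent=1.0):
    """Idealized frequency of the word at a given rank (1 = most common).

    With exponent=1.0, frequency is inversely proportional to rank.
    """
    return top_frequency / rank ** exponent

# Compare the 1000th most common word with the 2000th and the 4000th.
ratio_2000 = zipf_frequency(1000) / zipf_frequency(2000)
ratio_4000 = zipf_frequency(1000) / zipf_frequency(4000)
print(ratio_2000, ratio_4000)
```

Doubling the rank halves the frequency, which is where the "twice as often" and "four times as often" figures come from.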

[Figure: Power law distribution in natural languages]

The power law distribution exhibits the long tail property, meaning that as you advance further to the right of the distribution (by learning more vocabulary), the words become less and less common, but their frequency never drops off completely. Furthermore, rare words like “constitution” or “fallacy” convey disproportionately more meaning than common words like “the” or “you”.

This is bad news for language learners. Even if you understand 90% of the words of a text, the remaining 10% are the most important words in the passage, so you actually understand much less than 90% of the meaning. Moreover, it takes exponentially more vocabulary and effort to understand 95% or 98% or 99% of the words in the text.

I set out to experimentally test this phenomenon in English. I took the Brown Corpus, containing a million words of various English text, and computed the size of vocabulary you would need to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the corpus.

[Figure: vocabulary size needed to understand each percentage of the words in the Brown Corpus]

By knowing 75 words, you already understand half of the words in a text! Of course, just knowing words like “the” and “it” doesn’t get you very far. Learning 2000 words is enough to have a decent conversation and understand 80% of the words in a text. However, it gets exponentially harder after that: to get from 80% to 98% comprehension, you need to learn more than 10 times as many words!

(Aside: in this analysis I’m considering conjugations like “swim” and “swimming” to be different words; if you count only the stems, you end up with lower word counts but they still follow a similar distribution)
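The calculation described above can be sketched roughly as follows. This is not the exact script I used: the function name is hypothetical, and the tiny made-up token list only stands in for the Brown Corpus. As in the aside, every distinct surface form counts as its own word.

```python
from collections import Counter

def vocabulary_needed(tokens, targets):
    """For each target coverage (a fraction between 0 and 1), return how many
    of the most frequent distinct words are needed to cover that fraction
    of all tokens."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    remaining = sorted(targets)
    needed = {}
    covered = 0
    # Learn words from most to least frequent, tracking cumulative coverage.
    for vocab_size, count in enumerate(counts, start=1):
        covered += count
        while remaining and covered / total >= remaining[0]:
            needed[remaining.pop(0)] = vocab_size
    return needed

# Made-up toy corpus (an assumption for illustration, not real data).
toy_tokens = (["the"] * 50 + ["cat"] * 25 + ["sat"] * 15
              + ["mat"] * 6 + ["abducted"] * 4)
coverage = vocabulary_needed(toy_tokens, [0.5, 0.8, 0.9, 0.95])
print(coverage)
```

If NLTK and its Brown Corpus data are installed, passing `nltk.corpus.brown.words()` as `tokens` should give numbers in the same ballpark as mine, though the exact figures depend on choices like lowercasing and whether punctuation is counted.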

How many words can you miss and still be able to figure out the meaning by inference? In a typical English novel, I encounter about one word per page that I’m unsure of, and a page contains about 200-250 words, so I estimate 99.5% comprehension is native level. When there are more than 5 words per page that I don’t know, then reading becomes very slow and difficult — this is about 98% comprehension.

Therefore I will consider 98% comprehension “near-native”: above this level, you can generally infer the remaining words from context. Below this level, say between 90% and 98% comprehension, you may understand generally what’s going on, but miss a lot of crucial details.
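The arithmetic behind these two thresholds is simple enough to check directly. The helper below is just my own back-of-the-envelope calculation, using the page lengths mentioned above.

```python
def comprehension(unknown_per_page, words_per_page):
    """Fraction of the words on a page that the reader already knows."""
    return 1 - unknown_per_page / words_per_page

print(f"{comprehension(1, 200):.1%}")  # 99.5%
print(f"{comprehension(5, 250):.1%}")  # 98.0%
```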

[Figure: Perceived learning curve for a foreign language]

This explains the difficulty of language learning. In the beginning, progress is fast, and in a short period of time you learn enough words to have conversations. After that, you reach a long intermediate-stage plateau where you’re learning more words, but don’t know enough to understand native-level speech, and anybody speaking to you must use a reduced vocabulary in order for you to understand. Eventually, you will know enough words to infer the rest from context, but you need a lot of work to reach this stage.

Implications for language learners

The good news is that if you want to converse with people in a language, it’s perfectly doable in 3 to 6 months. On the other hand, to watch TV shows in the language without subtitles or understand people speaking naturally is going to take a lot more work — probably living for a few years in a country where the language is spoken.

Is there any shortcut instead of slowly learning thousands of words? I can’t say for sure, but somehow I doubt it. By nature, words are arbitrary clusters of sounds, so no amount of cleverness can help you deduce the meaning of words you’ve never seen before. And when the proportion of unknown words is above a certain threshold, it quickly becomes infeasible to try to infer meaning from context. We’ve reached the barrier imposed by the power law distribution.


Now I will briefly engage in some sociological speculation.

My university has a lot of international students. I’ve always noticed that these students tend to form social groups speaking their native non-English languages, and rarely assimilate into English-speaking social groups. At first I thought maybe this was because their English was bad — but I talked to a lot of international students in English and their English seemed okay: noticeably non-native but I didn’t feel there was a language barrier. After all, all our lectures are in English, and they get by.

However, I noticed that when I talked to international students, I subconsciously matched their rate of speaking, speaking just a little bit slower and clearer than normal. I would also avoid the usage of colloquialisms and cultural references that they might not understand.

If the same international student went out to a bar with a group of native English speakers, everyone else would be speaking at normal native speed. Even though she understands more than 90% of the words being spoken, it’s not quite enough to follow the discussion, and she doesn’t want to interrupt the conversation to clarify a word. As everything builds on what was previously said in the conversation, missing a word here and there means she is totally lost.

It’s not that immigrants don’t want to assimilate into our culture, but rather, we don’t realize how hard it is to master a language. On the surface, going from 90% to 98% comprehension looks like a small increase, but in reality, it takes an immense amount of work.

Read further discussion of this article on /r/languagelearning!

Applying to Graduate School in Computer Science

So you’re thinking of grad school. Four years of undergrad is not enough for you and you’re craving more knowledge. Or perhaps you want to delay your entry into the “real world” for a couple more years. Well, grad school is the right place!

About a year ago, I decided I wanted to do grad school. However, most of my peers were on track to work in the industry after graduation. The process of applying for grad school is daunting, especially since it varies from country to country and from one subject to another. This is why I am writing a guide to grad school applications in Computer Science, for Canada and the USA: a compilation of things I wish I knew a year ago.

Why grad school?

People go to grad school for different reasons. Most people I know in computer science and software engineering plan to start working full-time — a reasonable choice, given the high salaries in the industry right now. I figured that I had the rest of my life to code for a company; there’s no rush to start working immediately.

Graduate programs come in three broad categories:

  1. Coursework Master’s. Typically about 1 year, this is basically an extension of undergrad. You take a bunch of graduate courses, which are more in-depth than undergrad classes, but you don’t do any research. This is useful for gaining more knowledge before going to work in the industry.
  2. Thesis Master’s. About 1-2 years, depending on the school. At first, you take some courses like in a coursework master’s, but the primary goal is to produce a master’s thesis, which is equivalent to about one conference paper of original research. This is a good way to get some research experience, without the time commitment of a Ph. D. (and is valuable as a stepping stone if you do decide to get one).
  3. Ph.D. A longer commitment of 4-5 years. In a Ph. D., you produce an extensive amount of original research; by the time you write your thesis, you will be a world expert on your specific topic. I like this illustrated explanation of what it’s like to do a Ph. D.

There are hybrid programs too: for example, a thesis master’s often transitions directly into a Ph. D. There are also regional differences in how these programs work (more on this later).

Can I get into grad school?

As you may expect, top grad school programs are very competitive, and a typical grad student is a top student in his undergraduate class. So what do schools look for in their grad school admissions process?

Grades are a big factor: generally, an undergrad GPA of 85% or higher is good for grad school (with 80% being the bare minimum). However, even more important than GPA is your research experience. Publishing papers in conferences would be ideal, and research experience can make up for a lackluster transcript.

Unfortunately, Waterloo students are at a disadvantage here: with the co-op program, most people spend their undergrad years focusing on internships rather than research, and internship experience is considered less valuable than research experience. Don’t be discouraged though: my only research experience was through two part-time URAs, and I have zero publications, but I still got into a good grad school.

Picking a research area

In grad school, you specialize in a specific area of computer science, for example, machine learning, or databases, or theory, or programming languages. You have to indicate roughly what area you want to study in your application, but it’s okay to not know exactly what you want to research.

For me, I wanted to do something involving artificial intelligence or data science or machine learning. Eventually I decided on natural language processing (NLP), since it’s an area of machine learning, and I like studying languages.

Some people have a specific professor that they want to work with, in which case it’s helpful to reach out to them beforehand and mention it in your statement of purpose. Otherwise, as in my case, you don’t need to explicitly contact potential advisers if you have nothing to say; you get to indicate your adviser preferences in your application.

Choosing your schools

The most important thing to look for in a grad school is the quality of the research group. You may be tempted to look at overall computer science rankings, but this can be misleading because different schools have strengths in different research areas. There are other factors to consider, like location, city environment (big city or college town), and social life.

It’s a good idea to apply to a variety of schools of different levels of competitiveness. However, each application costs about $100, so it can be expensive to apply to too many — 5-10 applications is a good balance.

I decided to apply to five schools: two in Canada and three in the USA. My main criteria were (1), a reputable research program in NLP, and (2), I wanted to live in a big city. After some deliberation, I decided to apply to the following:

  • Ph. D. at University of Washington
  • Ph. D. at UC Berkeley
  • Ph. D. at Columbia University
  • M. Sc. at University of Toronto
  • M. Sc. at University of British Columbia

I didn’t apply to the University of Waterloo, where I’m doing my undergrad, despite it being pretty highly ranked in Canada — after studying here for five years, I needed a change of scenery.

Differences between Canada and USA

You might have noticed that my three applications in the USA were all Ph. D. programs, while my two in Canada were master’s. Graduate school works quite differently in Canada vs in the USA. In Canada, most students do a master’s after undergrad and then go on to do a Ph. D., but in the USA, you enter a Ph. D. program directly after undergrad, skipping the master’s.

There are master’s programs in the USA too, but they are almost exclusively coursework master’s, and are very expensive ($50k+ tuition per year). In contrast, thesis master’s programs in Canada and Ph. D. programs in the USA are fully funded, so you get paid a stipend of around $20k-30k a year.

A big reason to do a master’s in the USA is for visa purposes: for Chinese and Indian citizens, getting the H-1B is much easier with a master’s in the country, so the investment can be worth it. Otherwise, it’s probably not worth getting a master’s in the USA; studying in Canada is much cheaper.

If you go to grad school in Canada, you can apply for the CGS-M and OGS government scholarships for master’s students. Unfortunately, Canadian citizens who study in the USA are ineligible for most scholarships.

Taking the GRE

Another difference is that the Graduate Record Exam (GRE) is required for all grad school admissions in the USA. This is a 4-hour-long computer-administered test with reading, writing, and math components. If you’ve taken the SAT, this test is very similar. For grad school applications in computer science, only the general exam is required, and not the computer science subject test.

The GRE plays a fairly minor role in admissions: a terrible score will hurt your chances, but a perfect score will not help you that much. The quantitative and verbal sections are each scored from 130 to 170, and for science and engineering programs, a good score is around 165+ for quantitative and 160+ for verbal.

The quantitative (math) section is a cakewalk for any computer science major, but the verbal section can be challenging if English is not your native language. It does require some preparation (1-6 months is recommended). I studied for a month and did quite well.

Applications are due on December 15 for most schools, so you should take the GRE in October at the latest (and earlier if you plan on taking it multiple times).

Letters of Recommendation

Most grad school and scholarship applications require three letters of recommendation; out of all requirements, this one requires the most planning. The ideal recommendation comes from a professor that you have done research with. If you go to Waterloo and are considering grad school, doing a part-time URA (undergraduate research assistantship) is a good way to secure a few recommendation letters.

It may be difficult to find three professors that you’ve worked with, so the next best thing is a weaker letter from a professor whose class you did well in. As a last resort, at most one letter may come from a non-academic source (like your co-op supervisor). I was lucky that one of my research projects was co-supervised by two different professors, so I got two letters that way.

Statement of Purpose

The statement of purpose is a two-page essay where you describe your academic history and research interests, and convince the admissions committee that you are the ideal candidate to admit. If you have internship experience, talk about what you learned and why it’s relevant for research.

Chances are that the first revision of your statement of purpose will suck (this was certainly the case for me), so get friends and professors to proofread it. After several revisions, here’s my final statement of purpose.

Offers of Admission

That’s about it — after you hit submit on all your applications by December 15, you can sit back and enjoy your final semester. With any luck, you will receive this in your inbox around the beginning of February:

[Image: offer of admission email]

In the end, I got accepted to both master’s programs in Canada (UBC and UofT), but got rejected from all three Ph. D. programs in the USA (Washington, Berkeley, and Columbia). I chose to accept the UofT offer, where I will study NLP starting this fall.

Hopefully this guide has been helpful, and good luck with your applications!

Are programming competitions a good use of time?

There are 10 minutes remaining in the contest, but you’re still a few points short of advancing. Armed with your mighty coding powers, you made quick work of the first three problems, but problem 4 is proving a tough nut to crack. After four incorrect attempts, your time is running short, and you’re searching desperately for an off-by-one error, an edge case you haven’t considered.

You read the problem statement one more time, and at last, you find it. An integer overflow bug. With a wide grin, you fix the bug with a few quick keystrokes. You upload the source code file…


Accepted! You sit back, feeling both relieved and elated. With the addition of 25 points, your advancement to the next round is now guaranteed.

It’s not hard to see why programming contests are so appealing — this is programming distilled to its essence. No need to figure out how to integrate an unfamiliar API, or refactor code to be unit testable, or make sense of vague business requirements. In competitive programming, each problem has a self-contained, algorithmic solution, and an automated judge instantly determines if it’s correct or not.

[Figure: Programming contests contain the same types of problems as technical interviews at companies like Google.]

Competitive programming promises even more glory. Win enough contests, and you get an interview at Facebook or Google, where they ask you… you guessed it… more algorithm coding questions!

By doing programming contests, you gain an intimate understanding of data structures and algorithms and their complexities. While your colleagues vaguely know the difference between a depth-first and a breadth-first search, you develop a much deeper intuition. You will never forget that one contest where you used DFS instead of BFS, causing your solution to time out.
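Here is a minimal sketch of one way the choice matters. It is not the problem from that contest, and the tiny graph is made up: in an unweighted graph, BFS reaches the goal by the fewest edges, while DFS returns whichever path it happens to find first.

```python
from collections import deque

def bfs_path_length(graph, start, goal):
    """Fewest edges from start to goal in an unweighted graph, or None."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None

def dfs_path_length(graph, start, goal):
    """Length of the first path a depth-first search finds, or None.

    This path is not guaranteed to be the shortest one.
    """
    stack = [(start, 0)]
    visited = set()
    while stack:
        node, depth = stack.pop()
        if node == goal:
            return depth
        if node in visited:
            continue
        visited.add(node)
        # Push in reverse so neighbors are explored in their listed order.
        for neighbor in reversed(graph[node]):
            if neighbor not in visited:
                stack.append((neighbor, depth + 1))
    return None

# Hypothetical graph: A reaches D directly, or the long way via B and C.
graph = {"A": ["B", "D"], "B": ["C"], "C": ["D"], "D": []}
bfs_result = bfs_path_length(graph, "A", "D")
dfs_result = dfs_path_length(graph, "A", "D")
print(bfs_result, dfs_result)
```

On this graph the DFS dives down the A, B, C branch first and reports a longer path than the BFS does. On large inputs, a search that keeps chasing long detours like this is one way to end up with a wrong answer or a timeout.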

Unlike the real world, competitive programming offers an arena of pure meritocracy. As long as you solve difficult problems fast, you will surely rise through the ranks. In the Google Code Jam, thousands of programmers try their luck in the qualifying round, but this number is reduced to 3000 by Round 2, and 500 by Round 3. Then for the grand finale, the top 25 elite coders are flown in to compete on-site in the world finals.

[Figure: 25 of the world’s best compete in the Google Code Jam world finals.]

I used to look up in awe at red coders (the highest rated users have red usernames). By the time I solved the first problem, they would have not only solved it in 10 minutes, but also solved 2-3 even harder ones. What was it like to think at that level, to possess that much coding wizardry?

And for some time, I strove to be like them. I studied my dynamic programming and graph algorithms in my spare time, and practiced on SPOJ and Hackerrank and Codeforces. I competed in my university’s ACM tryouts, and three times failed to qualify for the ACM team. So it goes.

In the last few years, I got to talk to a lot of competitive programmers, most of whom were far better than myself. Perhaps I was searching for some magical quality that gave them coding superpowers, but none was to be found. Instead, the key to a high rating was simply many years of dedication and hard work.

It’s illuminating to read the answers on this Quora question: “How does it feel to finally be red at TopCoder after years of hard work?” The short answer: nothing much really.

[Figure: Rating graph of Codeforces user netman. Getting to red takes years of practice.]

Given the amount of time it takes to master competitive programming, one naturally wonders: is this really a good use of time? In a contest, you are ultimately solving problems that other people have solved already, so nothing new is being produced. Although solving a contest problem is satisfying, I find it a lot more rewarding to build projects or apps with my novel ideas.

Recently, Google found that among its engineers, being good at programming competitions is negatively correlated with being good at software engineering.

In a video discussing this finding, Peter Norvig notes that competitive programmers are “used to going really fast, cranking the answer out and moving to the next thing, but you do better if you’re more reflective and go slowly and get things right”.

Ironically, the same thing that makes programming contests so attractive is also their downfall. Contests focus on data structures and algorithms, which are just a small part of software engineering. Other skills like UI design, databases, network architecture, distributed systems, etc, are not touched in programming contests.

Even if you only look at algorithmic problems, competitive programming is still not representative of reality. Due to limitations in automated judging, contest problems are limited to exact, deterministic algorithms that have a single correct answer. This rules out entire classes of randomized and approximate algorithms. Algorithms now rely more and more on data and machine learning, and less on combinatorial approaches, which further renders competitive programming less relevant.

Now, are programming contests useful? Yes, but only up to a point. Solving contest problems is an excellent way to familiarize yourself with a programming language and its data structures, as well as get better at converting procedural ideas to code. These are very useful skills for a coding interview. However, even the most difficult Facebook/Google interview questions are maybe around a Codeforces Div2 C (or Div1 A) difficulty, which is a long way from the hardest contest problems.

[Figure: Beyond a certain point, skills learned in programming contests are only useful for programming contests.]

I would put the inflection point at about 1700 Codeforces rating (enough to solve Div2 A and B consistently). Beyond that, you continue to improve, but be aware that you’ll be studying things solely for contests that have little relevance anywhere else, for example, Fenwick trees, max flow algorithms, bitmask DP, and other increasingly obscure topics.

So far, I’ve been fairly critical of competitive programming, but rather than deride it as a waste of time, I think it’s best to view it as a sport. As with soccer or basketball, its function in society is to inspire excellence, and above all, to be fun. Terry Tao wrote a similar article on math competitions; I’d agree with him.

My advice to you: do programming contests if you find them fun and you enjoy tackling hard problems. But don’t take them too seriously: it takes years of dedicated effort to get extremely good at them, dedication that very few people have. Unless you’re at or aiming to compete at the World Finals level, you definitely shouldn’t be spending a significant amount of time getting better at contests. Your time is better spent studying machine learning, or statistics, or compilers, or distributed systems, or just about anything else in computer science.

I have accounts on Hackerrank and Codeforces if you want to follow me. As per my own advice, I’m no longer actively trying to get better, but I still do problems for fun occasionally.

Edit: This article has some interesting discussion on Reddit and Codeforces.