Simple models in Kaggle competitions

This week I participated in the Porto Seguro Kaggle competition. Basically, you’re asked to predict a binary variable — whether or not an insurance claim will be filed — based on a bunch of numerical and categorical variables.

With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to newcomers to data science.

Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.


The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.

Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.
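
For the curious, both models really are just a few lines of R. Here’s a minimal sketch, not my actual submission: the train, test, and target names are placeholders (the real competition columns differ), and the gbm hyperparameters are illustrative.

```r
# Minimal sketch: `train`, `test`, and `target` are placeholder names,
# not the actual Porto Seguro column names.
library(gbm)

# Logistic regression
logit_fit <- glm(target ~ ., data = train, family = binomial)
logit_pred <- predict(logit_fit, newdata = test, type = "response")

# Gradient boosted trees: same formula interface, different model
gbm_fit <- gbm(target ~ ., data = train, distribution = "bernoulli",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
gbm_pred <- predict(gbm_fit, newdata = test, n.trees = 500, type = "response")
```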

At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite being within 10% of the top submission in overall accuracy, I was still in the bottom half of the leaderboard, ranking in the 30th percentile.

The public leaderboard looked like this:

Above: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. The green line is my submission, scoring 0.265.

This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything that was not “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but beyond that, further improvement is hard to come by.

The top of the leaderboard is very crowded, with over 1000 teams sitting at a score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat.

Above: To run 90% as fast as Usain Bolt, you need to run 100 meters in 10.5 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().

This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, landing within the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then 5000 smart people working for 2 months can only improve on it by 10%. Perhaps this is obvious to anyone who has been doing machine learning long enough, but we should step back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.

Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.

Read further discussion of this post on the Kaggle forums!

What’s the difference between Mathematics and Statistics?

Statistics has a funny, peculiar relationship with mathematics. In a lot of universities, they’re lumped together and you have a “Department of Mathematics and Statistics”. Other times, statistics is grouped as a branch of applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s “not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking.

mathstats.png

Above: if pure math and statistics were like games


Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so that “you know it when you see it”; things are rarely as black-and-white as in mathematics. This is born out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take, for example, the concept of an “outlier”. Many statistical methods behave badly when the data contains outliers, so it’s common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far the point is from the rest, and what kind of model you’re fitting.

Above: a scatterplot with two potential outliers.

In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.


For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B test and find that changing a button to blue results in more clicks, with a p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There is no correct answer, and you have to use your judgment.
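
To make this concrete, here’s what that A/B test might look like in R. The click counts are invented for illustration:

```r
# Hypothetical A/B test: 10,000 visitors per arm
clicks   <- c(520, 581)    # control vs. blue button
visitors <- c(10000, 10000)

# Two-sample test for equality of proportions (base R)
prop.test(clicks, visitors)
# The p-value comes out around 0.06: suggestive, but above the usual
# 0.05 threshold. The test reports a number; the call is still yours.
```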


Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and if this assumption is violated then you’ll get wrong results, so you need to use a different model.


Above: a scatterplot where the variance looks uneven.

Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer, and you have to use your judgment.
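
There are diagnostics that inform this judgment, even if none of them settles it. A common one is to fit the linear model anyway and plot residuals against fitted values; here mydata, x, and y are placeholders for your own data:

```r
# A funnel shape (spread growing with the fitted values) in this plot
# is the classic visual sign of heteroscedasticity
fit <- lm(y ~ x, data = mydata)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should scatter evenly around zero
```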


Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this:

Above: a scatterplot with a slight but noticeable bend.

There is some evidence of non-linearity, but how much “bendiness” can you accept before the data is definitely not “roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.


I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike math, there is no universal procedure that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

  • A random variable is drawn from a normal (Gaussian) distribution
  • Two random variables are independent
  • Two random variables satisfy a linear relationship
  • Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: “all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.
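
R offers tools that inform the judgment without making it for you. A sketch (x stands in for whatever values you’re checking):

```r
x <- rnorm(100)  # stand-in for the values you're checking

# Visual check: points far off the line hint at non-normality
qqnorm(x)
qqline(x)

# A formal test exists, but with large samples it rejects even tiny,
# harmless deviations from normality, so judgment is still required
shapiro.test(x)
```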

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belong in this category.


Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how can you be sure that anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging “correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. The best known approximation algorithm for the metric Traveling Salesman Problem (Christofides’ algorithm) has an approximation ratio of 1.5 – this means that even for the worst possible input, the algorithm gives a path that’s no more than 1.5 times longer than the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions.
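
As a sketch of what that comparison looks like in R (df and label are placeholder names; e1071 and rpart are common packages for SVMs and decision trees):

```r
library(e1071)  # provides svm()
library(rpart)  # provides rpart()

# `df` is a placeholder data frame with a factor column `label`.
# Hold out 20% of the data for evaluation.
n <- nrow(df)
train_idx <- sample(n, size = 0.8 * n)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

svm_fit  <- svm(label ~ ., data = train)
tree_fit <- rpart(label ~ ., data = train, method = "class")

# Whichever predicts the held-out labels better wins; no theorem needed
mean(predict(svm_fit, test) == test$label)                  # SVM accuracy
mean(predict(tree_fit, test, type = "class") == test$label) # tree accuracy
```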

Above: the winning neural network architecture for the ImageNet Challenge 2012. Currently, theory fails to explain why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data, but is not concerned with how much data you need to achieve a desired accuracy rate.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.


Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better: both have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every estimate comes with a confidence interval that captures the true value about 95% of the time, but we can never be 100% sure about anything. Given enough data, though, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data they’re trained on. In other words, if we use a regression model with data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.

The Power Law Distribution and the Harsh Reality of Language Learning

I’m an avid language learner, and sometimes people ask me: “how many languages do you speak?” If we’re counting all the languages in which I can have at least a basic conversation, then I can speak five languages — but can I really claim fluency in a language if I can barely read children’s books? Despite being a seemingly innocuous question, it’s not so simple to answer. In this article, I’ll try to explain why.

Let’s say you’re just starting to study Japanese. You might picture yourself being able to do the following things, after a few months or years of study:

  1. Have a conversation with a Japanese person who doesn’t speak any English
  2. Watch the latest episode of some anime in Japanese before the English subtitles come out
  3. Overhear a conversation between two Japanese people in an elevator

After learning several languages, I discovered that the first task is a lot easier than the other two, by an order of magnitude. Whether in French or in Japanese, I would quickly learn enough of the language to talk to people, but the ability to understand movies and radio remains elusive even after years of study.

There is a fundamental difference in how language is used in one-on-one conversation versus the other two tasks. When conversing with a native speaker, it is possible for him to avoid colloquialisms, speak more slowly, and repeat things you didn’t understand using simpler words. On the other hand, when listening to native-level speech without the speaker adjusting for your language level, you need to be near native-level yourself to understand what’s going on.

We can justify this concept using statistics. By looking at how frequencies of English words are distributed, we show that after an initial period of rapid progress, it soon becomes exponentially harder to get better at a language. Conversely, even a small decrease in language complexity can drastically increase comprehension by non-native listeners.

Reaching conversational level is easy

For the rest of this article, I’ll avoid using the word “fluent”, which is rather vague and misleading. Instead, I will call a “conversational” speaker someone who can conduct some level of conversation in a language, and a “near-native” speaker someone who can readily understand speech and media intended for native speakers.

It’s surprising how little of a language you actually need to know to have a decent conversation with someone. Basically, you need to know:

  1. A set of about 1000-2000 very basic words (eg: person, happy, cat, slow, etc).
  2. Enough grammar to form sentences (eg: present / future / past tenses; connecting words like “then”, “because”; conditionals, comparisons, etc). Grammar doesn’t need to be perfect, just close enough for the listener to understand what you’re trying to say.
  3. Enough flexibility to work around gaps: when you want to say something but don’t know the word for it, express it with words you do know.

For an example of English using only basic words, look at the Simple English Wikipedia. It shows that you can explain complex things using a vocabulary of only about 1000 words.

For another example, imagine that Bob, a native English speaker, is talking to Jing, an international student from China. Their conversation might go like this:

Bob: I read in the news that a baby got abducted by wolves yesterday…

Jing: Abducted? What do you mean?

Bob: He got taken away by wolves while the family was out camping.

Jing: Wow, that’s terrible! Is he okay now?

In this conversation, Jing indicates that she doesn’t understand a complex word, “abducted”, and Bob rephrases the idea using simpler words, and the conversation goes on. This pattern happens a lot when I’m conversing with native Japanese speakers.

After some time, Bob gets an intuitive feeling for what level of words Jing can understand, and naturally simplifies his speech to accommodate. This way, the two can converse without Jing explicitly interrupting and asking Bob to repeat what he said.

Consequently, reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in a language in 3 months. I think this is a reasonable amount of time for reaching conversational level.

What if you don’t have the luxury of the speaker simplifying his level of speech for you? We shall see that the task becomes much harder.

The curse of the Power Law

Initially, I was inspired to write this article after an experience with a group of French speakers. I could talk to any of them individually in French, which is hardly remarkable given that I had studied the language since grade 4 and minored in it in university. However, when they talked among themselves, I was completely lost, and could only get a vague sense of what they were talking about.

Feeling slightly embarrassed, I sought an explanation for this phenomenon. Why was it that I could produce 20-page essays for university French classes, but struggled to understand dialogue in French movies and everyday conversations between French people?

The answer lies in the distribution of word frequencies in language. It doesn’t matter if you’re looking at English or French or Japanese — every natural language follows a power law distribution, which means that the frequency of every word is inversely proportional to its rank in the frequency table. In other words, the 1000th most common word appears twice as often as the 2000th most common word, and four times as often as the 4000th most common word, and so on.

(Aside: this phenomenon is also known as Zipf’s Law; the two names refer to the same thing. It’s unclear why it occurs, but the law holds in every natural language.)

Above: Power law distribution in natural languages.

The power law distribution exhibits the long tail property: as you advance further to the right of the distribution (by learning more vocabulary), the words become less and less common, but never drop off completely. Furthermore, rare words like “constitution” or “fallacy” convey disproportionately more meaning than common words like “the” or “you”.

This is bad news for language learners. Even if you understand 90% of the words of a text, the remaining 10% are the most important words in the passage, so you actually understand much less than 90% of the meaning. Moreover, it takes exponentially more vocabulary and effort to understand 95% or 98% or 99% of the words in the text.

I set out to experimentally test this phenomenon in English. I took the Brown Corpus, containing a million words of various English text, and computed the size of vocabulary you would need to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the corpus.
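
The computation itself is short. A sketch in R, assuming words is a character vector holding the corpus tokens (loading the Brown Corpus itself is left out):

```r
# Word frequencies, most common first
freq <- sort(table(words), decreasing = TRUE)

# Fraction of the corpus covered by the top k words, for every k
coverage <- cumsum(freq) / sum(freq)

# Smallest vocabulary reaching each comprehension level
levels <- c(0.5, 0.8, 0.9, 0.95, 0.98, 0.99, 0.995)
sapply(levels, function(p) which(coverage >= p)[1])
```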

Above: vocabulary size needed to reach each level of comprehension in the Brown Corpus.

By knowing 75 words, you already understand half of the words in a text! Of course, just knowing words like “the” and “it” doesn’t get you very far. Learning 2000 words is enough to have a decent conversation and understand 80% of the words in a text. However, it gets exponentially harder after that: to get from 80% to 98% comprehension, you need to learn more than 10 times as many words!

(Aside: in this analysis I’m considering conjugations like “swim” and “swimming” to be different words; if you count only the stems, you end up with lower word counts but they still follow a similar distribution)

How many words can you miss and still be able to figure out the meaning by inference? In a typical English novel, I encounter about one word per page that I’m unsure of, and a page contains about 200-250 words, so I estimate 99.5% comprehension is native level. When there are more than 5 words per page that I don’t know, then reading becomes very slow and difficult — this is about 98% comprehension.

Therefore I will consider 98% comprehension “near-native”: above this level, you can generally infer the remaining words from context. Below this level, say between 90% to 98% comprehension, you may understand generally what’s going on, but miss a lot of crucial details.

Above: Perceived learning curve for a foreign language.

This explains the difficulty of language learning. In the beginning, progress is fast, and in a short period of time you learn enough words to have conversations. After that, you reach a long intermediate-stage plateau where you’re learning more words, but don’t know enough to understand native-level speech, and anybody speaking to you must use a reduced vocabulary in order for you to understand. Eventually, you will know enough words to infer the rest from context, but you need a lot of work to reach this stage.

Implications for language learners

The good news is that if you want to converse with people in a language, it’s perfectly doable in 3 to 6 months. On the other hand, to watch TV shows in the language without subtitles or understand people speaking naturally is going to take a lot more work — probably living for a few years in a country where the language is spoken.

Is there any shortcut instead of slowly learning thousands of words? I can’t say for sure, but somehow I doubt it. By nature, words are arbitrary clusters of sounds, so no amount of cleverness can help you deduce the meaning of words you’ve never seen before. And when the proportion of unknown words is above a certain threshold, it quickly becomes infeasible to try to infer meaning from context. We’ve reached the barrier imposed by the power law distribution.


Now I will briefly engage in some sociological speculation.

My university has a lot of international students. I’ve always noticed that these students tend to form social groups speaking their native non-English languages, and rarely assimilate into English-speaking social groups. At first I thought maybe this was because their English was bad — but I talked to a lot of international students in English and their English seemed okay: noticeably non-native but I didn’t feel there was a language barrier. After all, all our lectures are in English, and they get by.

However, I noticed that when I talked to international students, I subconsciously matched their rate of speaking, speaking just a little bit slower and clearer than normal. I would also avoid the usage of colloquialisms and cultural references that they might not understand.

If the same international student went out to a bar with a group of native English speakers, everyone else would be speaking at normal native speed. Even though she understands more than 90% of the words being spoken, it’s not quite enough to follow the discussion, and she doesn’t want to interrupt the conversation to clarify a word. As everything builds on what was previously said in the conversation, missing a word here and there means she is totally lost.

It’s not that immigrants don’t want to assimilate into our culture, but rather, we don’t realize how hard it is to master a language. On the surface, going from 90% to 98% comprehension looks like a small increase, but in reality, it takes an immense amount of work.

Read further discussion of this article on /r/languagelearning!

Applying to Graduate School in Computer Science

So you’re thinking of grad school. Four years of undergrad is not enough for you, and you’re craving more knowledge. Or perhaps you want to delay your entry into the “real world” for a couple more years. Well, grad school is the right place!

About a year ago, I decided I wanted to do grad school. However, most of my peers were on track to work in the industry after graduation. The process of applying for grad school is daunting, especially since it varies from country to country and from one subject to another. This is why I am writing a guide to grad school applications in Computer Science, for Canada and the USA: a compilation of things I wish I knew a year ago.

Why grad school?

People go to grad school for different reasons. Most people I know in computer science and software engineering plan to start working full-time — a reasonable choice, given the high salaries in the industry right now. I figured that I had the rest of my life to code for a company; there’s no rush to start working immediately.

Graduate programs come in three broad categories:

  1. Coursework Master’s. Typically about 1 year, this is basically an extension of undergrad. You take a bunch of graduate courses, which are more in-depth than undergrad classes, but you don’t do any research. This is useful for gaining more knowledge before going to work in the industry.
  2. Thesis Master’s. About 1-2 years, depending on the school. At first, you take some courses like in a coursework master’s, but the primary goal is to produce a master’s thesis, which is equivalent to about one conference paper of original research. This is a good way to get some research experience without the time commitment of a Ph.D. (and is valuable as a stepping stone if you do decide to get one).
  3. Ph.D. A longer commitment of 4-5 years. In a Ph.D., you produce an extensive amount of original research; by the time you write your thesis, you will be a world expert on your specific topic. I like this illustrated explanation of what it’s like to do a Ph.D.

There are hybrid programs too: thesis master’s programs often transition directly into a Ph.D. There are also regional differences in how these programs work (more on this later).

Can I get into grad school?

As you may expect, top grad school programs are very competitive, and a typical grad student is a top student in his undergraduate class. So what do schools look for in their grad school admissions process?

Grades are a big factor: generally, an undergrad GPA of 85% or higher is good for grad school (with 80% being the bare minimum). However, even more important than GPA is your research experience. Publishing papers in conferences would be ideal, and research experience can make up for a lackluster transcript.

Unfortunately, Waterloo students are at a disadvantage here: with the co-op program, most people spend their undergrad years focusing on internships rather than research, and internship experience is considered less valuable. Don’t be discouraged though: my only research experience was through two part-time URAs, and I have zero publications, but I still got into a good grad school.

Picking a research area

In grad school, you specialize in a specific area of computer science, for example, machine learning, or databases, or theory, or programming languages. You have to indicate roughly what area you want to study in your application, but it’s okay to not know exactly what you want to research.

For me, I wanted to do something involving artificial intelligence or data science or machine learning. Eventually I decided on natural language processing (NLP), since it’s an area of machine learning, and I like studying languages.

Some people have a specific professor that they want to work with, in which case it’s helpful to reach out to them beforehand and mention it in your statement of purpose. Otherwise, as in my case, you don’t need to explicitly contact potential advisers if you have nothing to say; you get to indicate your adviser preferences in your application.

Choosing your schools

The most important thing to look for in a grad school is the quality of the research group. You may be tempted to look at overall computer science rankings, but this can be misleading because different schools have strengths in different research areas. There are other factors to consider, like location, city environment (big city or college town), and social life.

It’s a good idea to apply to a variety of schools of different levels of competitiveness. However, each application costs about $100, so it can be expensive to apply to too many — 5-10 applications is a good balance.

I decided to apply to five schools: two in Canada and three in the USA. My main criteria were (1) a reputable research program in NLP, and (2) a big city to live in. After some deliberation, I decided to apply to the following:

  • Ph.D. at University of Washington
  • Ph.D. at UC Berkeley
  • Ph.D. at Columbia University
  • M.Sc. at University of Toronto
  • M.Sc. at University of British Columbia

I didn’t apply to the University of Waterloo, where I’m doing my undergrad, despite it being pretty highly ranked in Canada — after studying here for five years, I needed a change of scenery.

Differences between Canada and USA

You might have noticed that my three applications in the USA were all Ph.D. programs, while my two in Canada were master’s. Graduate school works quite differently in Canada than in the USA. In Canada, most students do a master’s after undergrad and then go on to do a Ph.D., but in the USA, you enter a Ph.D. program directly after undergrad, skipping the master’s.

There are master’s programs in the USA too, but they are almost exclusively coursework master’s, and are very expensive ($50k+ tuition per year). In contrast, thesis master’s programs in Canada and Ph.D. programs in the USA are fully funded, so you get paid a stipend of around $20k-30k a year.

A big reason to do a master’s in the USA is for visa purposes: for Chinese and Indian citizens, getting the H-1B is much easier with a master’s degree from a US school, so the investment can be worth it. Otherwise, it’s probably not worth getting a master’s in the USA; studying in Canada is much cheaper.

If you go to grad school in Canada, you can apply for the CGS-M and OGS government scholarships for master’s students. Unfortunately, Canadian citizens are ineligible for most scholarships if you study in the USA.

Taking the GRE

Another difference: the Graduate Record Examination (GRE) is required for all grad school admissions in the USA. This is a 4-hour computer-administered test with reading, writing, and math components. If you’ve taken the SAT, this test is very similar. For grad school applications in computer science, only the general exam is required, not the computer science subject test.

The GRE plays a fairly minor role in admissions: a terrible score will hurt your chances, but a perfect score will not help you that much. The quantitative and verbal sections are each scored from 130 to 170, and for science and engineering programs, a good score is around 165+ for quantitative and 160+ for verbal.

The quantitative (math) section is a cakewalk for any computer science major, but the verbal section can be challenging if English is not your native language. It does require some preparation (1-6 months is recommended). I studied for a month and did quite well.

Applications are due on December 15 for most schools, so you should take the GRE in October at the latest (and earlier if you plan on taking it multiple times).

Letters of Recommendation

Most grad school and scholarship applications require three letters of recommendation; out of all requirements, this one requires the most planning. The ideal recommendation comes from a professor that you have done research with. If you go to Waterloo and are considering grad school, doing a part-time URA (undergraduate research assistantship) is a good way to secure a few recommendation letters.

It may be difficult to find three professors that you’ve worked with, so the next best thing is a weaker letter from a professor whose class you did well in. As a last resort, at most one letter may come from a non-academic source (like your co-op supervisor). I was lucky that one of my research projects was co-supervised by two different professors, so I got two letters that way.

Statement of Purpose

The statement of purpose is a two-page essay where you describe your academic history and research interests, and convince the admissions committee that you are the ideal candidate to admit. If you have internship experience, talk about what you learned and why it’s relevant for research.

Chances are that the first revision of your statement of purpose will suck (this was certainly the case for me), so get friends and professors to proofread it. After several revisions, here’s my final statement of purpose.

Offers of Admission

That’s about it — after you hit submit on all your applications by December 15, you can sit back and enjoy your final semester. With any luck, you will receive this in your inbox around the beginning of February:

Above: screenshot of an offer of admission email.

In the end, I got accepted to both master’s programs in Canada (UBC and UofT), but got rejected from all three Ph.D. programs in the USA (Washington, Berkeley, and Columbia). I chose to accept the UofT offer, where I will study NLP starting this fall.

Hopefully this guide has been helpful, and good luck with your applications!

Are programming competitions a good use of time?

10 minutes remaining in the contest, but you’re still a few points short of advancing. Armed with your mighty coding powers, the first three problems fall quickly, but problem 4 is proving a tough nut to crack. After four incorrect attempts, your time is running short, and you’re searching desperately for an off-by-one error, an edge case you haven’t considered.

You read the problem statement one more time, and at last, you find it. An integer overflow bug. With a wide grin, you fix the bug with a few quick keystrokes. You upload the source code file…


Accepted! You sit back, feeling both relieved and elated. With the addition of 25 points, your advancement to the next round is now guaranteed.

It’s not hard to see why programming contests are so appealing — this is programming distilled to its essence. No need to figure out how to integrate an unfamiliar API, or refactor code to be unit testable, or make sense of vague business requirements. In competitive programming, each problem has a self-contained, algorithmic solution, and an automated judge instantly determines if it’s correct or not.

Above: Programming contests contain the same types of problems as technical interviews at companies like Google.

Competitive programming promises even more glory. Win enough contests, and you get an interview at Facebook or Google, where they ask you… you guessed it… more algorithm coding questions!

By doing programming contests, you gain an intimate understanding of data structures and algorithms and their complexities. While your colleagues vaguely know the difference between a depth-first and a breadth-first search, you develop a much deeper intuition. You will never forget that one contest where you used DFS instead of BFS, causing your solution to time out.

Unlike the real world, competitive programming offers an arena of pure meritocracy. As long as you solve difficult problems fast, you will surely rise through the ranks. In the Google Code Jam, thousands of programmers try their luck in the qualifying round, but this number is reduced to 3000 by Round 2, and 500 by Round 3. Then for the grand finale, the top 25 elite coders are flown in to compete on-site in the world finals.

Above: 25 of the world’s best compete in the Google Code Jam world finals.

I used to look up in awe at red coders (the highest rated users have red usernames). By the time I solved the first problem, they would have not only solved it in 10 minutes, but also solved 2-3 even harder ones. What was it like to think at that level, to possess that much coding wizardry?

And for some time, I strove to be like them. I studied my dynamic programming and graph algorithms in my spare time, and practiced on SPOJ and Hackerrank and Codeforces. I competed in my university’s ACM tryouts, and three times failed to qualify for the ACM team. So it goes.

In the last few years, I got to talk to a lot of competitive programmers, most of whom were far better than myself. Perhaps I was searching for some magical quality that gave them coding superpowers, but none was to be found. Instead, the key to a high rating was simply many years of dedication and hard work.

It’s illuminating to read the answers on this Quora question: “How does it feel to finally be red at TopCoder after years of hard work?” The short answer: nothing much really.

Above: Rating graph of Codeforces user netman. Getting to red takes years of practice.

Given the amount of time it takes to master competitive programming, one naturally wonders: is this really a good use of time? In a contest, you are ultimately solving problems that other people have solved already, so nothing new is being produced. Although solving a contest problem is satisfying, I find it a lot more rewarding to build projects or apps with my novel ideas.

Recently, Google found that among its engineers, being good at programming competitions is negatively correlated with being good at software engineering.

In the video, Peter Norvig notes that competitive programmers are “used to going really fast, cranking the answer out and moving to the next thing, but you do better if you’re more reflective and go slowly and get things right”.

Ironically, the same thing that makes programming contests so attractive is also their downfall. Contests focus on data structures and algorithms, which are just a small part of software engineering. Other skills, like UI design, databases, network architecture, and distributed systems, are not touched on in programming contests.

Even if you only look at algorithmic problems, competitive programming is still not representative of reality. Due to limitations in automated judging, contest problems are limited to exact, deterministic algorithms that have a single correct answer. This rules out entire classes of randomized and approximate algorithms. Algorithms now rely more and more on data and machine learning, and less on combinatorial approaches, which further renders competitive programming less relevant.

Now, are programming contests useful? Yes, but only up to a point. Solving contest problems is an excellent way to familiarize yourself with a programming language and its data structures, as well as get better at converting procedural ideas to code. These are very useful skills for a coding interview. However, even the most difficult Facebook/Google interview questions are maybe around a Codeforces Div2 C (or Div1 A) difficulty, which is a long way from the hardest contest problems.

Above: Beyond a certain point, skills learned in programming contests are only useful for programming contests.

I would put the inflection point at about a 1700 Codeforces rating (enough to solve Div2 A and B consistently). Beyond that, you continue to improve, but be aware that you’ll be studying things useful solely for contests, with little relevance anywhere else: Fenwick trees, max flow algorithms, bitmask DP, and other increasingly obscure topics.

So far, I’ve been fairly critical of competitive programming, but rather than deride it as a waste of time, I think it’s best to view it as a sport. Like soccer or basketball, the function of sports in society is to inspire excellence, and above all, to have fun. Terry Tao wrote a similar article on math competitions; I’d agree with him.

My advice to you: do programming contests if you find them fun and you enjoy tackling hard problems. But don’t take it too seriously: it takes years of dedicated effort to get extremely good at it, dedication that very few people have. Unless you’re competing (or aiming to compete) at the World Finals level, you definitely shouldn’t be spending a significant amount of time getting better at contests. Your time is better spent studying machine learning, or statistics, or compilers, or distributed systems, or just about anything else in computer science.

I have accounts on Hackerrank and Codeforces if you want to follow me. As per my own advice, I’m no longer actively trying to get better, but I still do problems for fun occasionally.

Edit: This article has some interesting discussion on Reddit and Codeforces.

How to succeed in your first tech internship

Congratulations, you’ve just landed your first software engineering internship! You’ve passed a round or two of interviews, signed an offer letter, and you’re slated to start next month. What now? You might be a bit excited, a bit apprehensive, wondering what startup life is like and whether you’re even smart enough to do the work they give you…

I felt all these things when I started my first internship three years ago. Now, I’ve completed four internships and I’m halfway through my fifth one; I’m sort of a veteran intern by now. In these five internships, I’ve learned a good deal about what it takes to succeed in an internship, things that are not obvious to those just starting out. Hopefully by sharing this, others can avoid some of the mistakes I made.

Your first week at [startup]

Chances are you’ve coded assignments for school, and maybe a few side projects for fun. Work is a bit different: you’re working with a massive codebase that you didn’t write, one that probably no single engineer in the company fully understands. Facing a codebase of this complexity, you might feel overwhelmed, struggling even to find the right file to start in. You feel uneasy that a small change is taking you hours, afraid that your boss thinks you’re underperforming.

Relax, you’re doing fine. If you got the job, it means they have faith in your abilities to learn and to succeed. I’ve talked to hundreds of Waterloo interns, and I’ve never heard of anyone getting dismissed for underperforming. The first few weeks will be rough as you come to terms with the codebase and technology stack, but trust me, it gets much, much easier afterwards.

Asking for help

As an intern, you’re not expected to know everything, and often you will be asking for help from more experienced, full-time engineers.

Before asking for help, you should spend a minute or so searching Google, or Stack Overflow, or the company wiki. Most general questions (not relating to company specific code) can be answered with Google, and you save everyone’s time this way.

When you do ask for help, be aware that they might be working on a completely different project, so they don’t have the same mental context as you. Rather than jump straight into the intricate technical details of your problem, you should describe at a high level what you’re trying to accomplish, and what you tried, and only then delve into the exact technical details.

An example of a poorly phrased question would be: “hey, how do I invalidate a FooBarWindow object if its parent is not visible?” You’re likely to get some confused stares — this might make perfect sense to you, but they’re wondering what is FooBarWindow and why are you trying to invalidate it at all.

A better way to phrase it would be something like: “hey, I’m working on X feature, and I’m encountering a problem where the buttons stop working after you press the back button. After looking a bit, I discovered my component should have been invalidated when its parent is no longer visible, but that’s not happening…” This time, you’ve done a much better job of describing your problem.

It’s always helpful to take notes, so you never ask the same question again. How do you commit your code to Git? How do you deploy the app to stage? If you don’t write it down, you’re going to forget.

At the start, you’re going to be asking 5 questions an hour, which is okay. Soon you will find yourself needing to ask less and less, and eventually you’ll only ask a handful per day.

Taking charge of your own learning

Like it or not, software engineering is a rapidly shifting field, where a new Javascript framework comes out every six months. You have to be continuously learning things, or your skills will become obsolete. Learning is even more important when you’re an intern, still learning the ropes. Fortunately, a tech internship is a great opportunity to learn quickly.

Not all software engineers are equal — at some point, you get to choose what you want to do: frontend, backend, or full stack? Web, iOS, or Android? Become an expert in Django or Ruby on Rails? Depending on the company, you often get considerable say on what team you’re on, and what project you work on within your team. Use this as an opportunity to get paid to learn new, interesting stuff!

A good technology to learn should satisfy two criteria: it should be something you’re interested in, and it should be widely used in the industry. That is to say, it’s more useful to know a popular web framework than an internal company-specific framework that does the same thing.

When you get to pick what project to take next, it might be tempting to pick something familiar, where you already know how to do everything. But you learn a lot more by working on something new; in my experience, employers have always been accommodating to my desire to work on a variety of different things.

You will overhear people talk passionately, with phrases like, “oh, it’s running Nginx inside Docker and fetches the data from a Cassandra cluster…” If you’ve never heard of these technologies, this sentence would be nonsensical to you. It’s well worth the time to spend 10 minutes reading about each technology that you hear mentioned, not to become an expert, but just to have a passing understanding of what each of these things does. With a few minutes of research, you’d be able to answer: “when should you use Cassandra over MySQL?”

Learning is valuable even when it’s not immediately relevant to you. Occasionally, you’ll find yourself in meetings where you don’t have a clue what’s going on, say with business managers or projects you’re not involved in. Rather than zone out and browse Reddit for the duration of the meeting, listen in and learn as much as you can, and take notes if you begin to fall asleep! The human brain has near infinite capacity for learning new things, and at no point will it reach “capacity” like a hard drive.

Take responsibility and deliver results

A common misconception is that programmers are paid to write code. Wrong: as a programmer, your job is to deliver results and provide value to your company; part of this job involves writing code, but a lot of the work is communicating with managers, designers, and other engineers to figure out what code to write.

When you’re assigned a project, you own it and you’re in charge of any tasks required to push it through to completion. What if something is broken in an API owned by another team? You might be tempted to hand in your code and proclaim, “my code works fine, so my job here is done, I can show you that their API is broken, so it’s their fault.” No, if your feature is broken then you need to fix it one way or another. So go and ping the engineer responsible, schedule a meeting with him, anything to get your project completed.

Sometimes you run into problems that seem insurmountable, so complex that you feel compelled to put down your sword, give up, and tell yourself, “this is too hard for an intern”. This is a bad idea: you should never expect a full-time engineer to come in, take over, and bail you out of the situation. Your mentors are not superhuman — it’s not like they can instantly conjure a solution; they have to work through the problem one piece at a time, just like you. There’s no reason you can’t do the same.

The product you deliver is what ultimately matters, so don’t worry about secondary measures of productivity, like how many lines of code you commit, or how many story points you rack up on Jira. There’s an apocryphal tale of a programmer who disagreed with management measuring productivity by lines of code, and wrote “-2000” because he made the code simpler. Likewise, you aren’t being judged if you come in 30 minutes after your manager does, or leave 30 minutes before he does, or just feel like taking a mid-day stroll in the park, as long as you’re consistently delivering quality features.

Many interns suffer from “intern mentality” and consider themselves fundamentally different from full-times in some way. This is an irrational belief — your skills are probably on par with those of a junior engineer (or will be in a few weeks). This means you should behave like any other full-time engineer (albeit minus interview and on-call duties); the only difference is you’re leaving in a few months. Don’t be afraid to contribute your insights and ideas and consider them less valuable because you’re “just an intern”.

Other tips

What should you learn to prepare for an internship if you have spare time? Learn Git! Git is a version control system used in most companies, and is both non-trivial to pick up, and used more or less the same way everywhere. Other stuff is less useful to pre-learn because they’re either easy to pick up, or can be used in lots of different ways so it’s more efficient to learn on the job.

Internships are a great way to travel places, if that interests you. I picked 5 internships in 4 different cities for this reason. Unlike school, you don’t have to think about work during weekends, which leaves you lots of time to travel to nearby destinations.

I’ve only talked about what happens during work. If your internship is in the USA, the Unofficial Waterloo USA Intern Guide was super helpful in answering all my logistical questions. Also, some of my friends have written about crafting a resume, and how to ace the coding interview.

Four life lessons learned by playing Hearthstone

I’ve played Hearthstone on and off for a few years, since it first came out. As I played more and more, I began to notice parallels between my decision making processes in Hearthstone and in real life. This is a self-reflective post, and my first attempt to describe the core features of my mentality and decision making process. Because this has been part of my personality for so long, I found it difficult to put my ideas into words, but here goes.

There are two reasons why Hearthstone is a good representation of real life:

  • First, it’s a game of imperfect information and chance, so you must take risks and deal with uncertainty. Real life situations are usually like this. Games of perfect information (like chess) lack this probabilistic aspect and behave very differently.
  • Second, Hearthstone is a game about decision making skills, rather than mechanics. Every game has some element of decision making, but many games require doing some mechanical action (eg. last hitting) better than your opponent. Mechanical skills are confined to the specific game and are less likely to be relevant in real life.

By playing Hearthstone, I developed a general internal model for making decisions in uncertain situations.

Lesson 1: There is always a correct decision, and it’s your job to find it

The goal in Hearthstone is to reduce your opponent’s life to zero. How do you accomplish this? You make a plan, perhaps flooding the board with minions, perhaps unleashing a deadly combination of spells.

For our purposes, it doesn’t matter what your strategy is. At the start of the turn, you look at the cards in your hand, the state of the board, what cards your opponent played before. Call this information the game state. You ponder for a bit and come up with an action that best improves your position.

You execute your action on the board, but you still don’t know what happens next with certainty. There are many things you cannot control, which I will call RNG. RNG is short for Random Number Generator, and I will use it to mean anything you don’t have control over.

I use the term RNG for lack of a better term, but I’m not just talking about random game mechanics. RNG includes any state hidden from you, like your opponent’s hand and strategy.  Think of it as a random variable with a known distribution (eg. you play a card that destroys a random minion, which minion will it hit?) or with an unknown distribution (eg. what is the probability your opponent has two flamestrikes in his deck?). Even if the information is known to your opponent, it’s simpler to treat it as a random variable.

Here’s the model summarized in a diagram:

In any game state, there must be one “correct” action that gives you the highest chance of winning the game. The decision-making player aims to consider all possible actions and choose the best, “correct” one.

As a corollary, decision making should be rational and be a function of things I can observe. Otherwise, if my decision engine generates two different actions depending on my emotional state of mind, they cannot both be correct.

A second corollary is that actions should always be justifiable through fundamental values. It’s unacceptable to do things by habit, or because other people are doing it — everything I do should have a positive expected value on the things I want to accomplish.

For me, one of my “meta” goals in life is to make correct decisions as much as possible. This is not to say that I behave like a robot — I still experience emotions like everyone else — but I try to eliminate emotions from my decision making process.

In Hearthstone, doing so gives you the highest chance of winning the game. It makes sense then, by extrapolation, that correct decision making gives you the best shot of getting what you want out of life.

Lesson 2: Information is valuable, treat information gathering as a subgoal

One rule of thumb in Hearthstone is “RNG first”. If you are going to play a sequence of cards, one of which has a random effect, it’s better to play the random effect first. This way you extract information out of the RNG pool of unknowns, and with this extra information you might be able to make a better play.

Another useful thing is to keep track of enemy secrets. Imagine you have this on the board:

You want to play a giant, but you’re worried that the secret is “mirror entity”, which summons a copy of the next minion you play.

Without any other information, you’re in a tough spot. But what if you played a minion last turn and the secret did not activate? Then you know that the secret isn’t mirror entity, and confidently play the giant.

Alternatively, suppose that you don’t have this information handy. One tactic is to “test out” the secret by playing a small minion and seeing whether the secret activates. You pay a price by making a normally inferior move, but the information you obtain is valuable for future decisions.

Under this framework, we can view flirting as an example of an information gathering strategy. You’re at a party and you see a cute girl walk by. At first, you make a few playful comments, and observe her reaction and assess if she is interested in you. Flirting isn’t just an arbitrary social custom, but a means to obtain information.

While information isn’t the final end-goal by itself, even a little information can greatly improve decision making, by eliminating vast swathes of possibilities that no longer need to be considered. Whether it be playing a giant, making a big purchase, or asking someone out on a date, gathering information is a useful subgoal.

Lesson 3: Focus on things you have control over, RNG evens out in the long run

Often in Hearthstone, luck is just not on your side. Have you ever seen your opponent topdeck the pyroblast and instantly win the game? Or that mad bomber that hits you three times in the face? How do you feel?

It’s natural to feel angry when this happens to you, especially if it ends up losing you the game. But eventually I realized how pointless it was to get upset at unlucky RNG. What’s the use of worrying about things you have no control over?

I see this all the time — people getting visibly upset when the bus is late, or when a player on their team goes AFK in a game of League. I try to adopt the opposite mindset: worry about my own decision making, and simply accept random events beyond my control.

Let me give you an example. Last term, during an important phone interview, my phone stopped working during the middle of the interview. Calmly I got up and notified the CECA front desk, and waited as they spent the next 20 minutes troubleshooting the problem. Most people would be stressed out at this point, but I didn’t feel stressed at all. Rather than getting upset, my mind was relaxed, because I took comfort in knowing that I did everything that could be done; whatever happens next was out of my control.

The law of large numbers says that when you repeat a random event many times, the average outcome converges to the expected value. Hearthstone is so random that a legend player will beat a rank 5 player no more than 55% of the time. Any single game is close to a coin flip, just marginally in favor of the stronger player. But over the long run, it’s a mathematical guarantee that the better player will end up on top.
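
A quick simulation in R makes this vivid, using the 55% figure above:

```r
set.seed(1)
p_win <- 0.55  # legend player's win rate against a rank 5 player

for (n in c(10, 100, 1000, 10000)) {
  wins <- rbinom(1, size = n, prob = p_win)
  cat(n, "games: won", round(100 * wins / n, 1), "%\n")
}
# Over 10 games, anything can happen; over 10,000, the win rate is
# pinned near 55% and the better player is unmistakably on top.
```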

Lesson 4: Separate the outcome of a decision from the decision itself

In real life and in Hearthstone, you can’t directly tell if a decision was good or not. You only know the outcome, and you can decide if the outcome is good or bad. But the outcome is a function of the decision and RNG, which adds noise to the process.

In other words, the correct decision does not always produce a good outcome, and sometimes a bad decision produces a good outcome. It would be a mistake to retroactively label a decision as “correct” simply because you got lucky.

Here’s a Hearthstone example:

Your opponent is a mage, and on turn 6 you flood the board with a lot of small minions. If he has flamestrike, playing it deals 4 damage to each of your minions, instantly killing your whole board.

Turn 7 comes and it turns out he doesn’t have flamestrike, so you win the game easily. You conclude that playing all your minions was a great idea because he didn’t have flamestrike.

This logic is fallacious: it fails to separate decision from outcome. A correct action is the one that maximizes the win probability, given the information available at the time. Therefore it makes no sense to look at the outcome and retroactively judge the correctness of the initial decision.

So in this example, playing all these minions was a mistake because there’s a high chance the mage has flamestrike. It doesn’t matter if he actually has flamestrike or not, the mistake is equally bad. (A better play would be to play fewer minions, thus mitigating the risk).
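
The reasoning here is just an expected-value calculation. The numbers below are invented purely to illustrate the logic; I’m not claiming they’re real Hearthstone win rates:

```r
p_flamestrike <- 0.4  # invented: chance the mage is holding flamestrike

# Invented win probabilities for each action in each possible world
win_flood <- p_flamestrike * 0.2 + (1 - p_flamestrike) * 0.8  # = 0.56
win_hold  <- p_flamestrike * 0.6 + (1 - p_flamestrike) * 0.7  # = 0.66

# Holding some minions back wins more often in expectation, and that
# stays true whether or not the flamestrike actually comes down.
```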

Now here’s a real life example. Last term, I had multiple job offers for software engineering internships and I had trouble deciding which one to accept. So I tried to negotiate: I picked one of the companies, told them about my other offers, and asked for a 20% raise in salary. My request was denied.

Does this mean that negotiating was a waste of time? Absolutely not. I know friends who successfully negotiated a higher salary by doing something similar. My particular outcome was not successful, but this doesn’t indicate my attempt was a mistake; if I found myself again with multiple offers, I would do the same thing.

Let’s end the article with a quote by Alfred Tennyson:

‘Tis better to have loved and lost

Than never to have loved at all

In both Hearthstone and romance, you can end up losing, even if you do everything correctly. Tennyson also realizes the need to separate the outcome (to have loved and lost) from the decision to pursue the relationship.

There’s a lot more I could talk about, but this post is getting quite long so I’ll stop here. Whether you agree or disagree with my view of the world, please leave a comment!