Learning R as a Computer Scientist

If you’re into statistics and data science, you’ve probably heard of the R programming language. It’s a language designed for statistical computing that has gained a lot of popularity in recent years, and it comes with an environment specifically built to be good at exploring data, plotting visualizations, and fitting models.

R is not like most programming languages. It’s quite different from any other language I’ve worked with: it was developed by statisticians, who think differently from programmers. In this blog post, I describe some of the pitfalls I ran into while learning R with a computer science background. I used R extensively in two stats courses in university, then for a bunch of data analysis projects, and only now am I starting to feel comfortable and efficient with it.

Why a statistical programming language?

When I encountered R for the first time, my first reaction was: “why do we need a new language to do stats? Can’t we just use Python and import some statistical libraries?”

Sure, you can, but R is far more streamlined for this workflow. In Python, you would need something like scipy for fitting models and something like matplotlib to display things on screen. R, by contrast, is very much batteries-included: within RStudio, a complete environment, you can parse the data, run statistics on it, and visualize the results in very few lines of code.

Aside: RStudio is an IDE for R. Although it’s possible to run R standalone from the command line, in practice almost everyone uses RStudio.

I’ll do a quick demo of fitting a linear regression on a dataset to demonstrate how easy it is to do in R. First, let’s load the CSV file:

df <- read.csv("fossum.csv")

This reads a dataset containing body length measurements for a bunch of possums. (Don’t ask why; it’s from a stats course I took.) R parses the CSV file into a data frame and automatically figures out the dimensions, variable names, and types.

Next, we fit a linear regression model of the total length of the possum versus the head length:

model <- lm(totlngth ~ hdlngth, df)

It’s one line of code with the lm function. What’s more, fitting linear models is so common in R that the syntax is baked into the language.

Aside: Here, we did totlngth ~ hdlngth to perform a single-variable linear regression, but the notation allows fancier stuff. For example, if we did lm(totlngth ~ (hdlngth + age)^2), then we would get a model including both variables and the second-order interaction effect. This is called Wilkinson-Rogers notation, if you want to read more about it.

We want to know how the model is doing, so we run the summary command:

> summary(model)

Call:
lm(formula = totlngth ~ hdlngth, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-7.275 -1.611  0.136  1.882  5.250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -28.722     14.655  -1.960   0.0568 .
hdlngth        1.266      0.159   7.961  7.5e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.653 on 41 degrees of freedom
Multiple R-squared:  0.6072,	Adjusted R-squared:  0.5976
F-statistic: 63.38 on 1 and 41 DF,  p-value: 7.501e-10

Don’t worry if this doesn’t mean anything to you; it’s just dumping the parameters of the model it fit, along with a bunch of tests of how significant the model is.

Lastly, let’s visualize the regression with a scatterplot:

plot(df$hdlngth, df$totlngth)

And R gives us a nice plot:


All of this only took 4 lines of R code! Hopefully I’ve piqued your interest by now — R is great for quickly trying out a lot of different models on your data without too much effort.

That being said, R has a somewhat steep learning curve as a lot of things don’t work the way you’d expect. Next, I’ll mention some pitfalls I came across.

Don’t worry about the type system

As computer scientists, we’re used to thinking about type systems, type casting rules, variable scoping rules, closures, stuff like that. These details form the backbone of any programming language, or so I thought. Not the case with R.

R is designed by statisticians, and statisticians are more interested in doing statistics than worrying about intricacies of their programming language. Types do exist, but it’s not worth your time to worry about the difference between a list and a vector; most likely, your code will just work on both.

The most fundamental object in R is the data frame, which stores rows of data. Data frames are as ubiquitous in R as objects are in Java. They don’t have a close equivalent in most programming languages; the nearest analogues are a SQL table or an Excel spreadsheet.

Use dplyr for data wrangling

The base library in R is not the most well-designed package in the world. It’s full of inconsistencies and arbitrary design decisions, and common operations are needlessly unintuitive. Fortunately, R has an excellent ecosystem of packages that make up for the shortcomings of the base system.

In particular, I highly recommend the dplyr and tidyr packages over the base package for data wrangling tasks: the operations you perform to get data into a certain form, like sorting by a variable, or grouping by a set of variables and computing an aggregate over each group. dplyr and tidyr provide a consistent set of functions that make this easy. I won’t go into too much detail, but you can see this page for a comparison between dplyr and base R on some common data wrangling tasks.

Use ggplot2 for plotting

Plotting is another domain where the base package falls short. The functions are inconsistent and worse, you’re often forced to hardcode arbitrary constants in your code. Stupid things like plot(..., pch=19) where 19 is the constant for “solid circle” and 17 means “solid triangle”.

There’s no reason to learn the base plotting system — ggplot2 is a much better alternative. Its functions allow you to build graphs piece by piece in a consistent manner (and they look nicer by default). I won’t go into the comparison in detail, but here’s a blog post that describes the advantages of ggplot2 over base graphics.

It’s unfortunate that R’s base package falls short in these two areas. But with the package manager, it’s super easy to install better alternatives. Both ggplot2 and dplyr are widely used (currently, both are in the top 5 most downloaded R packages).

How to self-study R

First off, check out Swirl. It’s a package for teaching beginners the basics of R, interactively within RStudio itself. It guides you through its courses on topics like regression modelling and dplyr, and only takes a few hours to complete.

At some point, read through the tidyverse style guide to get up to speed on the best practices on naming files and variables and stuff like that.

Now go and analyze data! One major difference between R and other languages is that you need a dataset to do anything interesting. There are many public datasets out there; Kaggle provides a sizable repository.

For me, it’s a lot more motivating to analyze data I care about. Analyze your bank statement history, or data from your phone’s pedometer app, or your university’s enrollment statistics to find which electives have the most girls. Turn it into a mini data-analysis project: fit some regression models and draw a few graphs with R. It’s a great way to learn.

The best thing about R is the number of packages out there. If you read about a statistical model, chances are that someone’s written an R package for it. You can download it and be up and running in minutes.

It takes a while to get used to, but learning R is definitely a worthwhile investment for any aspiring data scientist.

AI Project: Harmonizing Pop Melodies using Hidden Markov Models

It is often said that all pop music “sounds the same”. If this is the case, then a computer program should be able to compose pop music, right?

This is the problem we studied for the AI final project (CS486) — kind of. We restricted ourselves to the simpler problem of harmonizing an existing melody: in other words, given a fixed melody, find a chord progression that matches it.

We formulated the task as a probabilistic optimization problem, and devised an algorithm to solve it using Hidden Markov Models (HMMs). Then, we used the Music21 library to generate MIDI files from the output so that we could listen to it.

In the experiment, we chose the melody from If Only by JJ Lin. Here’s a demo:

Music Theory Model

In our model, pop music consists of two parts: a melody line consisting of a sequence of individual notes, and a harmony line of a sequence of chords. Generally, the melody line is the notes sung by the lead vocalist, and the harmony line is played by accompanying instruments (piano, guitar, digital synthesizers). Conventionally, the chord progression is written as a sequence of chord names above the melody; the exact notes in a chord progression are up to the performer’s discretion.

Above: Example of a two-part song with melody and chord progression

It is hard to quantify exactly what makes a chord progression “good”, since music is inherently subjective and depends on an individual’s musical taste. However, we capture the notion of “pleasant” using music theory, by assigning a penalty to musical dissonance between a note in the melody and a note in the chord.

According to music theory, the minor second, major second, and tritone (augmented fourth) intervals are dissonant (including the minor and major seventh inversions). Therefore, our program tries to avoid forming these intervals by assigning a penalty.

Above: List of dissonant intervals. All other intervals are consonant.

We assume that all notes in the melody lie on either a major or minor scale in some fixed key. The set of permissible chords for a scale are the major and minor chords where all of its constituent notes are in the scale. For example, in the scale of G major, the permissible chords are {G, Am, Bm, C, D, Em}.
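This rule is easy to check mechanically. Here’s a small Python sketch (the pitch-class encoding is my own illustration, not part of our original implementation); since every chord note must lie in the scale, it’s enough to try roots on the scale itself:

```python
# Pitch classes (semitones above C) for the notes of the G major scale.
PITCH = {"C": 0, "D": 2, "E": 4, "F#": 6, "G": 7, "A": 9, "B": 11}
G_MAJOR_SCALE = set(PITCH.values())

def permissible_chords(scale):
    """Major/minor triads whose notes all lie in the given scale."""
    chords = []
    for name, root in PITCH.items():
        major = {root % 12, (root + 4) % 12, (root + 7) % 12}
        minor = {root % 12, (root + 3) % 12, (root + 7) % 12}
        if major <= scale:
            chords.append(name)        # e.g. "G"
        if minor <= scale:
            chords.append(name + "m")  # e.g. "Em"
    return chords

print(sorted(permissible_chords(G_MAJOR_SCALE)))
# ['Am', 'Bm', 'C', 'D', 'Em', 'G']
```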

This is a vastly simplified model that does not capture more nuanced chords that appear in pop music. However, even the four-chord progression G – D – Em – C is sufficient for many top-40 pop songs.

First attempt: Harmonize each bar independently

The first thing we tried was looking at each bar by itself, and searching through all of the permissible chords for the chord that harmonizes best with that bar.

In other words, we define the penalty of a chord for a melody bar as the number of times a melody note forms a dissonant interval with a chord note, weighted by the duration of the note. Then, we assign to each bar the chord with the lowest penalty.
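In code, the naive per-bar scoring looks something like this Python sketch (the data layout is illustrative, not our actual project code):

```python
# Dissonant intervals in semitones: m2, M2, tritone, plus the m7/M7 inversions.
DISSONANT = {1, 2, 6, 10, 11}

def penalty(bar, chord):
    """bar: list of (pitch_class, duration) pairs; chord: set of pitch classes.
    Sum the durations of melody notes that clash with some chord note."""
    total = 0.0
    for pitch, duration in bar:
        for chord_pitch in chord:
            if (pitch - chord_pitch) % 12 in DISSONANT:
                total += duration
    return total

def best_chord(bar, chords):
    # Naive harmonization: pick the lowest-penalty chord for this bar alone.
    return min(chords, key=lambda name: penalty(bar, chords[name]))

chords = {"G": {7, 11, 2}, "Am": {9, 0, 4}, "C": {0, 4, 7}}
bar = [(7, 1.0), (11, 0.5)]    # a G and a B in the melody
print(best_chord(bar, chords))  # G
```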

Here’s what we get:

Above: Naive harmonization. Notice that the Am chord is repeated 3 times.

This assigns a reasonable-sounding chord for each bar on its own, but doesn’t account for the transition probabilities between chords. We end up with the A minor chord repeated 3 times consecutively, which we’d like to avoid.

How do we account for chord transitions, while still harmonizing with the melody in each bar? This is where we bring in Hidden Markov Models.

Second attempt: HMMs and the Viterbi Algorithm

Hidden Markov Models are a type of probabilistic graphical model that assumes an unobservable hidden state influences the observed output variable. At each timestep, the hidden state transitions to another state, with probabilities given by a transition matrix.

A standard problem with HMMs is: given a sequence of observations, find the most likely sequence of hidden states that produced it. In our problem, the hidden states are the unknown sequence of chords, and the observed sequence is the bars of our melody. Now, solving the most likely sequence problem yields the chord progression that best harmonizes with the melody.


To encode the rule that we don’t like having the same chord consecutively for multiple bars, we assign the transitions so that the probability of transition from any chord to itself is low. The transition probability to every other chord is equal.

Above: Transition probabilities for chords (only 3 shown here, but this generalizes easily). A chord is equally likely to transition to any other chord, but unlikely to transition to itself.

The only thing left to do is solve the most likely sequence problem. We used a modified version of the Viterbi algorithm to do this — it solves the problem quickly using dynamic programming.
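A minimal log-space version of the recurrence can be sketched in Python. This is the textbook algorithm, not our actual project code; the transition and emission matrices are whatever you plug in, and the initial distribution is assumed uniform:

```python
import math

def viterbi(trans, emit):
    """trans[i][j]: P(state j | state i); emit[t][i]: score of state i at bar t.
    Returns the most likely state (chord) sequence."""
    n_states = len(trans)
    n_steps = len(emit)
    # best[t][i]: log-prob of the best sequence ending in state i at step t
    best = [[-math.inf] * n_states for _ in range(n_steps)]
    back = [[0] * n_states for _ in range(n_steps)]
    for i in range(n_states):
        best[0][i] = math.log(emit[0][i])
    for t in range(1, n_steps):
        for j in range(n_states):
            for i in range(n_states):
                score = best[t - 1][i] + math.log(trans[i][j]) + math.log(emit[t][j])
                if score > best[t][j]:
                    best[t][j] = score
                    back[t][j] = i
    # Trace the best path backwards from the best final state.
    path = [max(range(n_states), key=lambda i: best[-1][i])]
    for t in range(n_steps - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Two chords; staying on the same chord is unlikely (0.1) vs switching (0.9).
trans = [[0.1, 0.9], [0.9, 0.1]]
# Chord 0 fits every bar slightly better, but repeating it is penalized...
emit = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
print(viterbi(trans, emit))  # ...so the best path alternates: [0, 1, 0]
```

With emission scores derived from the per-bar penalties, the best path avoids repeating a chord unless the melody strongly demands it.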

Running the Viterbi algorithm on our test melody:

Above: Using HMMs to eliminate repeated chords

And there we go — no more repeated chords! It doesn’t produce the same chord sequence as my manual harmonization, but sounds quite natural (and doesn’t violate any musical constraints).

Results and Future Directions

Using HMMs, we were able to harmonize a short melody segment. By looking at the entire melody as a whole, the HMM algorithm is able to produce a better chord progression than by optimizing every bar locally. Our implementation uses a greatly simplified musical model: it assumes the melody is in G major, and only considers 6 basic major and minor chords, but it successfully demonstrates a proof of concept.

When implementing this project, we thought this was new research (existing literature on harmonization focused mostly on classical music, not pop music). Alas, after some more literature review, it turns out that a Microsoft research team developed the same idea of using HMMs for melody harmonization, and published a series of papers in 2008. Oh well. Not quite publishable research but still a cool undergraduate class project.

The source code for this project is available on GitHub. Thanks to Michael Tu for collaborating with me!

If you enjoyed this post, check out my previous work on algorithmic counterpoint!

Neat Ideas in Coding Theory

It’s been a while since I’ve written anything mathematical on this blog, but one of my courses this term was really neat: Coding Theory (CO331). I liked it because it lies in the intersection between pure math and computer science: it applies abstract algebra to the concrete problem of sending messages.

Coding theory deals with encoding a message so that even if some errors occur during transmission and a few digits get flipped, the original message can still be recovered. This is not the same as cryptography — we’re not trying to hide the message, only correct errors.

What is a code?

How would you encode a message (assume it’s in binary), so that if one bit is flipped, you can still recover the message?

Well, an obvious way is to send each bit 3 times, so 0110 becomes 000111111000. To decode, we take the majority in each block of 3; as long as each block has at most one error, we will always decode correctly. This is an example of a code, with the codewords being {000, 111}.


In general, a code is a set of strings (codewords) of the same length. We send a codeword through the channel, where a few bits get flipped. Then, the decoder corrects the received word to the closest word in the code (nearest neighbor decoding).

The distance of a code is the minimum number of bit flips to transform one codeword to a different one. In order for a code to be able to correct a lot of errors, the distance should be high. In fact, if the distance is d, then the code can correct \lfloor \frac{d-1}{2} \rfloor errors.

We can visualize a code as a collection of spheres, centered around the codewords. Each sphere corresponds to the words that get corrected to the codeword in the middle.


A desirable code needs to have a high distance and a large number of codewords. This way, it can correct a lot of errors, without being too redundant. The rate of a code is

R = \frac{\log_2(\text{number of codewords})}{\text{length of codeword}}

This represents the efficiency: how much we’re paying to achieve the desired error correcting capability. The triplication code in the example has 2 codewords, each of length 3, so the rate is \frac{1}{3}.

Also, in coding theory, we’re not really concerned with how to transform a plaintext message into a stream of codewords. Given a code consisting of a set of codewords and a procedure to correct a received word to the nearest codeword, the rest is “just an engineering problem”.

These definitions set the scene for what’s coming next. As I said before, a good code should have a high distance as well as a large number of codewords. It also needs to be efficient to decode — with general codes, there is no better way to decode than to check every codeword to find the closest one to your received word — clearly, this is no good. To do better, we need to construct codes with algebraic structure.

Linear Codes

A linear code is a code where a linear combination of codewords is another codeword.

This is useful because now we can apply a lot of linear algebra machinery. The set of codewords now form a vector space, and from linear algebra, every vector space is spanned by a set of basis vectors. Then, to specify a code, we can write the basis vectors as rows of a matrix — this is called a generator matrix of a code.

Let’s do a concrete example. Consider a code where each word is 5 digits (each between 0 and 9), where the 5th digit is a “checksum”: the sum of the previous 4 digits modulo 10. For example, 98524 is valid because 9+8+5+2 = 24, and 24 mod 10 = 4. This code cannot correct any mistakes, but it can detect when a single typo has been made.

A generator matrix for this code is:

G = \left[\begin{array}{ccccc}1&0&0&0&1\\ 0&1&0&0&1 \\ 0&0&1&0&1 \\ 0&0&0&1&1 \end{array}\right]

If we take any linear combination xG of the rows of G (remembering to do arithmetic modulo 10), we get a codeword. For example, if x = [9,8,5,2], then x G = [9,8,5,2,4].

For any linear code, we can find a parity-check matrix H, which lets us quickly check if a word is in the code or not: given a word r, if Hr^T = 0, then r is a codeword. In our case, a parity-check matrix is:

H = \left[ \begin{array}{ccccc} 1&1&1&1&-1 \end{array} \right]
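Here’s the round trip for the worked example in Python (a sketch; all arithmetic is mod 10, as in the checksum code above):

```python
# Generator and parity-check matrices for the 5-digit checksum code.
G = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
]
H = [1, 1, 1, 1, -1]

def encode(x):
    # xG mod 10: the message followed by its checksum digit.
    return [sum(x[i] * G[i][j] for i in range(len(G))) % 10
            for j in range(len(G[0]))]

def is_codeword(r):
    # r is a codeword iff H r^T = 0 (mod 10).
    return sum(h * d for h, d in zip(H, r)) % 10 == 0

print(encode([9, 8, 5, 2]))          # [9, 8, 5, 2, 4]
print(is_codeword([9, 8, 5, 2, 4]))  # True
print(is_codeword([9, 8, 5, 2, 5]))  # False: a single typo is detected
```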

Using the parity-check matrix, we can easily detect errors, but correcting to the nearest codeword is still NP-hard for general linear codes. However, certain classes of linear codes have tractable decoding algorithms:

  • Hamming Codes: class of single-error correcting codes with distance 3.
  • Binary Golay Code: block length 24 and distance 8, so it can correct 3 errors.

Cyclic Codes

A cyclic code is a linear code where a rotation of any codeword is another codeword. (Since all cyclic codes are linear codes, any linear combination of codewords is still a codeword).

Cyclic codes are closely connected to polynomials, and we can write a codeword as coefficients of a polynomial: for example, 703001 is 7 + 3x^2 + x^5. A cyclic code is then the set of polynomial multiples of a generator polynomial g(x):

\langle g(x) \rangle = \{a(x) g(x)\}

Furthermore, g(x) must have some additional properties:

  • g(x) is the unique monic polynomial of lowest degree in the code, and every codeword is a multiple of it
  • g(x) | x^n-1, where n is the block length. Since we work modulo x^n-1, we have x^n = 1, so we can always replace higher powers with lower ones; in particular, multiplying by x wraps the top coefficient around to the constant term, which is exactly a rotation.

For example, consider a binary cyclic code of block length 3, with generator polynomial g(x) = 1+x. Then the codewords are:

  • 0 (1+x) = 0 \to 000
  • 1 (1+x) = 1+x \to 110
  • x (1+x) = x+x^2 \to 011
  • (1+x) (1+x) = 1+x^2 \to 101

The codewords are {000, 110, 011, 101}. A rotation of a word to the right is the same as multiplying the corresponding polynomial by x, so cyclic codes are modeled nicely by polynomials.
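We can reproduce this little code in a few lines of Python, representing polynomials as coefficient lists over GF(2) and reducing x^n to 1 (a sketch of the math above, not course code):

```python
n = 3          # block length
g = [1, 1, 0]  # g(x) = 1 + x, as coefficients of 1, x, x^2

def multiply(a, b):
    # Multiply two binary polynomials, wrapping x^n around to 1.
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if ai and bj:
                out[(i + j) % n] ^= 1  # coefficients are mod 2
    return out

# Every codeword is a(x) g(x) for some polynomial a(x) with deg a(x) < n.
messages = [[i, j, k] for i in range(2) for j in range(2) for k in range(2)]
codewords = {"".join(map(str, multiply(m, g))) for m in messages}
print(sorted(codewords))  # ['000', '011', '101', '110']
```

Checking that a right-rotation of any codeword stays in the set is exactly the cyclic property.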

Cyclic codes are very good at correcting burst errors, where the errors occur consecutively rather than being distributed evenly across the word. However, cyclic codes do not always have high distance. Some special classes of cyclic codes do, and are more suitable for general error correction:

  • BCH Codes: cyclic codes constructed to have high distance. They can be decoded efficiently, but the algorithm is a bit complicated.
  • Reed-Solomon Codes: special class of BCH codes suitable for use in practice. These are widely used for error correction in CDs, DVDs, and QR codes.

Applying to Graduate School in Computer Science

So you’re thinking of grad school. Four years of undergrad is not enough for you and you’re craving more knowledge. Or perhaps you want to delay your entry into the “real world” for a couple more years. Well, grad school is the right place!

About a year ago, I decided I wanted to do grad school. However, most of my peers were on track to work in the industry after graduation. The process of applying for grad school is daunting, especially since it varies from country to country and from one subject to another. This is why I am writing a guide to grad school applications in Computer Science, for Canada and the USA: a compilation of things I wish I knew a year ago.

Why grad school?

People go to grad school for different reasons. Most people I know in computer science and software engineering plan to start working full-time — a reasonable choice, given the high salaries in the industry right now. I figured that I had the rest of my life to code for a company; there’s no rush to start working immediately.

Graduate programs come in three broad categories:

  1. Coursework Master’s. Typically about 1 year, this is basically an extension of undergrad. You take a bunch of graduate courses, which are more in-depth than undergrad classes, but you don’t do any research. This is useful for gaining more knowledge before going to work in the industry.
  2. Thesis Master’s. About 1-2 years, depending on the school. At first, you take some courses like in coursework master’s, but the primary goal is to produce a master’s thesis, which is equivalent to about one conference paper of original research. This is a good way to get some research experience, without the time commitment of a Ph.D (and is valuable as a stepping stone if you do decide to get one).
  3. Ph.D. A longer commitment of 4-5 years. In a Ph.D., you produce an extensive amount of original research; by the time you write your thesis, you will be a world expert on your specific topic. I like this illustrated explanation of what it’s like to do a Ph.D.

There are hybrid programs too: a thesis master’s often transitions directly into a Ph.D. There are also regional differences in how these programs work (more on this later).

Can I get into grad school?

As you may expect, top grad school programs are very competitive, and a typical grad student was a top student in their undergraduate class. So what do schools look for in the admissions process?

Grades are a big factor: generally, an undergrad GPA of 85% or higher is good for grad school (with 80% being the bare minimum). However, even more important than GPA is your research experience. Publishing papers in conferences would be ideal, and research experience can make up for a lackluster transcript.

Unfortunately, Waterloo students are at a disadvantage here: with the co-op program, most people spend their undergrad years focusing on internships rather than research, and internships count for less in admissions. Don’t be discouraged though: my only research experience was through two part-time URAs, and I have zero publications, but I still got into a good grad school.

Picking a research area

In grad school, you specialize in a specific area of computer science: for example, machine learning, databases, theory, or programming languages. You have to indicate roughly what area you want to study in your application, but it’s okay not to know exactly what you want to research.

For me, I wanted to do something involving artificial intelligence or data science or machine learning. Eventually I decided on natural language processing (NLP), since it’s an area of machine learning, and I like studying languages.

Some people have a professor in mind that they want to work with, but I didn’t really have a strong preference. It’s not necessary to explicitly contact potential advisers if you have nothing to say; you get to indicate your adviser preferences in your application.

Choosing your schools

The most important thing to look for in a grad school is the quality of the research group. You may be tempted to look at overall computer science rankings, but this can be misleading because different schools have strengths in different research areas. There are other factors to consider, like location, city environment (big city or college town), and social life.

It’s a good idea to apply to a variety of schools of different levels of competitiveness. However, each application costs about $100, so it can be expensive to apply to too many — 5-10 applications is a good balance.

I decided to apply to five schools: two in Canada and three in the USA. My main criteria were (1) a reputable research program in NLP and (2) a location in a big city. After some deliberation, I decided to apply to the following:

  • Ph.D. at University of Washington
  • Ph.D. at UC Berkeley
  • Ph.D. at Columbia University
  • M.Sc. at University of Toronto
  • M.Sc. at University of British Columbia

I didn’t apply to the University of Waterloo, where I’m doing my undergrad, despite it being pretty highly ranked in Canada — after studying here for five years, I needed a change of scenery.

Differences between Canada and USA

You might have noticed that my three applications in the USA were all Ph.D. programs, while my two in Canada were master’s. Graduate school works quite differently in Canada than in the USA. In Canada, most students do a master’s after undergrad and then go on to a Ph.D.; in the USA, you enter a Ph.D. program directly after undergrad, skipping the master’s.

There are master’s programs in the USA too, but they are almost exclusively coursework master’s, and are very expensive ($50k+ tuition per year). In contrast, thesis master’s programs in Canada and Ph. D. programs in the USA are fully funded, so you get paid a stipend of around $20k-30k a year.

A big reason to do a master’s in the USA is for visa purposes: for Chinese and Indian citizens, getting the H-1B is much easier with a master’s earned in the country, so the investment can be worth it. Otherwise, it’s probably not worth getting a master’s in the USA; studying in Canada is much cheaper.

If you go to grad school in Canada, you can apply for the CGS-M and OGS government scholarships for master’s students. Unfortunately, Canadian citizens are ineligible for most scholarships if you study in the USA.

Taking the GRE

Another difference: the Graduate Record Examinations (GRE) is required for grad school admissions in the USA. It’s a 4-hour computer-administered test with reading, writing, and math components. If you’ve taken the SAT, this test is very similar. For grad school applications in computer science, only the general test is required, not the computer science subject test.

The GRE plays a fairly minor role in admissions: a terrible score will hurt your chances, but a perfect score will not help you that much. The quantitative and verbal sections are each scored on a scale of 130 to 170; for science and engineering programs, a good score is around 165+ for quantitative and 160+ for verbal.

The quantitative (math) section is a cakewalk for any computer science major, but the verbal section can be challenging if English is not your native language. It does require some preparation (1-6 months is recommended). I studied for a month and did quite well.

Applications are due on December 15 for most schools, so you should take the GRE in October at the latest (and earlier if you plan on taking it multiple times).

Letters of Recommendation

Most grad school and scholarship applications require three letters of recommendation; out of all requirements, this one requires the most planning. The ideal recommendation comes from a professor that you have done research with. If you go to Waterloo and are considering grad school, doing a part-time URA (undergraduate research assistantship) is a good way to secure a few recommendation letters.

It may be difficult to find three professors that you’ve worked with, so the next best thing is a weaker letter from a professor whose class you did well in. As a last resort, at most one letter may come from a non-academic source (like your co-op supervisor). I was lucky that one of my research projects was co-supervised by two different professors, so I got two letters that way.

Statement of Purpose

The statement of purpose is a two-page essay where you describe your academic history and research interests, and convince the admissions committee that you are the ideal candidate to admit. If you have internship experience, talk about what you learned and why it’s relevant for research.

Chances are that the first revision of your statement of purpose will suck (this was certainly the case for me), so get friends and professors to proofread it. After several revisions, here’s my final statement of purpose.

Offers of Admission

That’s about it — after you hit submit on all your applications by December 15, you can sit back and enjoy your final semester. With any luck, you will receive this in your inbox around the beginning of February:


In the end, I got accepted to both master’s programs in Canada (UBC and UofT), but got rejected from all three Ph. D. programs in the USA (Washington, Berkeley, and Columbia). I chose to accept the UofT offer, where I will study NLP starting this fall.

Hopefully this guide has been helpful, and good luck with your applications!

Reading Week Adventure: Backpacking through the Yucatan, Mexico

What do you do during reading week? Relax at home in Toronto, or study for midterms at Waterloo? Why not book yourself a solo flight to Mexico to see the Mayan ruins?

I didn’t originally plan on traveling solo. A friend was going to come with me, but he had to cancel. I decided to go anyways just by myself, and now I’m glad I did it.

My flight landed in Cancun, where I spent three nights. After that, I backpacked across the peninsula to Valladolid, where I stayed for a night, and then on to Merida for two nights. Finally, I took a bus back to Cancun to catch a flight the next day.


Above: Map of places I visited

Beaches of Cancun

Cancun is one of the most popular tourist destinations in Central America, with direct flights to many cities in Canada and the US. Flights here are relatively cheap, with a round trip from Toronto costing about $450 USD.

Landing in Cancun was the worst airport experience of my life. When my flight arrived at 10pm, there were hundreds of tourists from multiple flights coming in, and only two officials processing immigration and customs, which created a terrible bottleneck. By the time I exited the airport it was past 1am; it took everyone over 2.5 hours to get through customs. Getting out at the end of the trip was smooth, though.

The city of Cancun is quite literally built around tourism. The city is less than 50 years old, built in the 1970s when the Mexican government decided to develop its tourism infrastructure.

Above: Tropical island of Isla Mujeres

Many people come to Cancun for the beaches, and indeed, the beaches here are among the best in the world. But for a solo traveler like me, going to the beach is somewhat awkward. You can still do it, but you have to constantly keep an eye on your belongings when going for a swim.

Rather than an all-inclusive resort by the beach, I opted for a much cheaper room in a hostel downtown. Most of the beaches belong to private resorts, but there are many publicly accessible ones. I took a ferry to Isla Mujeres, a small island off the coast with beaches and snorkeling activities.

Mayan Ruins

Climbing through the ancient Mayan ruins was the highlight of my trip. The Mayan civilization thrived over a thousand years ago in present-day Mexico and Guatemala, and their society collapsed around 900AD for reasons that are still unclear. They left behind majestic temples in the jungle that you can explore today.

Above: Chichen Itza and Tulum

I went to four different Mayan ruins: Chichen Itza, Tulum, Ek Balam, and Uxmal. Chichen Itza and Tulum were the most touristy, with hundreds of organized tour buses bringing tens of thousands of tourists every day. Because of this massive volume, you’re not allowed to climb any of the ruins, only look from afar.

The ruins of Ek Balam and Uxmal were much nicer: with few organized tours, there were no massive crowds, only a handful of tourists. I would highly recommend visiting one of these less visited ruins, because you can climb freely onto the Mayan pyramids. Standing on top of the temple where Mayan rulers once stood, overlooking the ruins of the ancient city in the jungle — it feels like traveling back in time.

Above: Ek Balam and Uxmal

Swimming in Cenotes

Another awesome activity to do in the area is swimming in the open groundwater caves, called cenotes. There are no rivers in the Yucatan peninsula, so cenotes were an important source of fresh water for the Mayans, who built cities around them.

Above: Cenote Zaci in Valladolid

Today, the cenotes serve as natural swimming pools. The water is cool and fresh, and it’s a lot nicer to swim in than the salty ocean water.

Language Barrier

Spanish is the main language spoken in the Yucatan peninsula. Some people speak English, especially around Cancun and among those who work with tourists.

I’d studied Spanish for a few months, so my level is far from fluent, but I can hold simple conversations. You can get by without knowing Spanish, but knowing some made my trip a lot more interesting.

I met a lot of Mexican nationals: either locals or Mexican tourists visiting from other parts of the country. They generally don’t speak English, but they’re very friendly and happy to let me practice my Spanish. On the bus, I chatted with a girl from Chetumal spending her weekend in Cancun; the next day, I met a guy from northern Mexico in my hostel and we went out for drinks together.

The immersive environment is really helpful for learning a language. I’d pick up 10-20 new words every day just by talking to people.

Occasionally I heard the Yucatec Maya language spoken as well; the indigenous Maya people still make up a large part of the population here. It’s easy to identify Mayan by its plosive consonants and its abundant “sh” sound, which doesn’t exist in Spanish.

It’s certainly not the case that everyone speaks English — in one of my hostels, the front desk attendant spoke no English at all. Overall, I’d estimate I used Spanish in 60% of my interactions on this trip.

Merida, The White City

Merida is a colonial city a few hundred kilometers to the west of Cancun. It feels very different from Cancun, with a much older history and fewer tourists.


Above: Central plaza of Merida

At the center of the city is a massive cathedral, next to a plaza. There are many plazas around the city where people gather. The narrow streets are lined with colorful houses, and old cars fill the air with a faint scent of petrol.

Being friendly with the locals can backfire. Apparently in Merida, it’s unusual to see a random Asian guy in a restaurant speaking Spanish. So unusual, in fact, that a local family I talked to in a restaurant invited me to stay at their house for the night. I already had a hostel, but I mentioned that I was on my way to visit a Mayan site outside the city, so they offered to go there with me. I agreed. Then, rather than take me there, they gave me a tour of the city, their house, and the mall. They introduced me to their daughter studying in Mexico City and insisted we keep in touch on Facebook. In the end, I never got to the Mayan site, but it was quite an interesting experience.

Mexico Travel Logistics

In this section, all prices are in USD.

Around Cancun, the bus system is pretty good, with a two-hour air-conditioned bus ride costing about $7. There are also white vans called colectivos that pick up and drop off passengers along the highway; these are faster and cheaper, but less comfortable.

In Merida, the buses were sketchier. I knew I was in Mexico’s backwaters when I accidentally took the second-class bus from Chichen Itza to Merida, which stopped at every little Mayan village along the way and took twice as long as usual. The bus to Uxmal ran once every three hours, and its schedule was nowhere to be found on the internet; you have to ask the locals.

Driving would’ve been faster. I wasn’t comfortable driving alone in a foreign country, so I took the bus, but you could see a lot more if you rented a car.

For accommodation, I slept in hostels, which cost about $20 a night for a private room and $10 for a bunk in a shared dorm. I did a mixture of the two, but it’s easier to sleep in a private room, so for me it’s worth the extra cost.

Most things in Mexico cost about half or one-third as much as the equivalent in the US: for example, I can get a decent meal for $5-10. On the other hand, I lost about 10% converting money into pesos through ATM and bank fees.

I got a SIM card at a bus station, which cost $20 for 1.5 GB of data. Some people say going offline gives a more authentic travel experience, but it’s useful to be able to look up information rather than ask people where the nearest bank is.

I spent about $70 a day in Mexico, including hotels, food, taxis, entrance fees, etc. You can get by with less, but I wasn’t super frugal, and being a tourist, I probably overpaid for things.

Traveling alone is quite fun! With nobody else to account for, I can do as much or as little as I feel like. I never really felt lonely either, since it’s easy to find people to talk to in hostels.

That’s all I have to say for now. I have four midterms next week — great way to return to reality after a journey to another world. Better start studying.

Are programming competitions a good use of time?

10 minutes remaining in the contest, and you’re still a few points short of advancing. Armed with your mighty coding powers, you dispatched the first three problems quickly, but problem 4 is proving a tough nut to crack. After four incorrect attempts, time is running short, and you’re searching desperately for an off-by-one error, an edge case you haven’t considered.

You read the problem statement one more time, and at last, you find it. An integer overflow bug. With a wide grin, you fix the bug with a few quick keystrokes. You upload the source code file…


Accepted! You sit back, feeling both relieved and elated. With the addition of 25 points, your advancement to the next round is now guaranteed.

It’s not hard to see why programming contests are so appealing — this is programming distilled to its essence. No need to figure out how to integrate an unfamiliar API, or refactor code to be unit testable, or make sense of vague business requirements. In competitive programming, each problem has a self-contained, algorithmic solution, and an automated judge instantly determines if it’s correct or not.

Above: Programming contests contain the same types of problems as technical interviews at companies like Google.

Competitive programming promises even more glory. Win enough contests, and you get an interview at Facebook or Google, where they ask you… you guessed it… more algorithm coding questions!

By doing programming contests, you gain an intimate understanding of data structures and algorithms and their complexities. While your colleagues only vaguely know the difference between a depth-first and a breadth-first search, you develop a much deeper intuition. You will never forget that one contest where you used DFS instead of BFS, causing your solution to time out.
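That intuition is easy to demonstrate: in an unweighted graph, BFS explores nodes in order of distance, so it finds shortest paths, while DFS merely finds *a* path. A minimal sketch (the example graph in the comments is made up):

```cpp
#include <queue>
#include <vector>

// BFS from `start` over an adjacency list: returns the minimum number
// of edges to every node, or -1 if unreachable. DFS on the same graph
// gives you some path, but not necessarily the shortest one.
std::vector<int> bfsDistances(const std::vector<std::vector<int>>& adj,
                              int start) {
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> q;
    dist[start] = 0;
    q.push(start);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v : adj[u]) {
            if (dist[v] == -1) {       // first visit is via a shortest path
                dist[v] = dist[u] + 1;
                q.push(v);
            }
        }
    }
    return dist;
}
```

For example, on the 4-cycle 0-1-2-3-0, `bfsDistances(adj, 0)` reports node 2 at distance 2, whereas a DFS that happens to walk 0→1→2→3 would reach node 3 via a path of length 3 instead of the direct edge.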

Unlike the real world, competitive programming offers an arena of pure meritocracy. As long as you solve difficult problems fast, you will surely rise through the ranks. In the Google Code Jam, thousands of programmers try their luck in the qualifying round, but this number is reduced to 3000 by Round 2, and 500 by Round 3. Then for the grand finale, the top 25 elite coders are flown in to compete on-site in the world finals.

Above: 25 of the world’s best compete in the Google Code Jam world finals.

I used to look up in awe at red coders (the highest-rated users have red usernames). By the time I solved the first problem, they would not only have solved it within 10 minutes, but also polished off 2-3 even harder ones. What was it like to think at that level, to possess that much coding wizardry?

And for some time, I strove to be like them. I studied my dynamic programming and graph algorithms in my spare time, and practiced on SPOJ and Hackerrank and Codeforces. I competed in my university’s ACM tryouts, and three times failed to qualify for the ACM team. So it goes.

In the last few years, I got to talk to a lot of competitive programmers, most of whom were far better than I was. Perhaps I was searching for some magical quality that gave them coding superpowers, but none was to be found. Instead, the key to a high rating was simply many years of dedication and hard work.

It’s illuminating to read the answers on this Quora question: “How does it feel to finally be red at TopCoder after years of hard work?” The short answer: nothing much really.

Above: Rating graph of Codeforces user netman. Getting to red takes years of practice.

Given the amount of time it takes to master competitive programming, one naturally wonders: is this really a good use of time? In a contest, you are ultimately solving problems that other people have already solved, so nothing new is produced. Although solving a contest problem is satisfying, I find it far more rewarding to build projects and apps around my own ideas.

Recently, Google found that among its engineers, being good at programming competitions is negatively correlated with being good at software engineering.

In the video, Peter Norvig notes that competitive programmers are “used to going really fast, cranking the answer out and moving to the next thing, but you do better if you’re more reflective and go slowly and get things right”.

Ironically, the very thing that makes programming contests so attractive is also their downfall. Contests focus on data structures and algorithms, which are just a small part of software engineering. Other skills, like UI design, databases, network architecture, and distributed systems, go untouched in programming contests.

Even if you only look at algorithmic problems, competitive programming is still not representative of reality. Due to limitations in automated judging, contest problems are limited to exact, deterministic algorithms that have a single correct answer. This rules out entire classes of randomized and approximate algorithms. Algorithms now rely more and more on data and machine learning, and less on combinatorial approaches, which further renders competitive programming less relevant.

Now, are programming contests useful? Yes, but only up to a point. Solving contest problems is an excellent way to familiarize yourself with a programming language and its data structures, as well as get better at converting procedural ideas to code. These are very useful skills for a coding interview. However, even the most difficult Facebook/Google interview questions are maybe around a Codeforces Div2 C (or Div1 A) difficulty, which is a long way from the hardest contest problems.

Above: Beyond a certain point, skills learned in programming contests are only useful for programming contests.

I would put the inflection point at about 1700 Codeforces rating (enough to solve Div2 A and B consistently). Beyond that, you continue to improve, but be aware that you’ll be studying topics useful solely for contests, with little relevance anywhere else: Fenwick trees, max-flow algorithms, bitmask DP, and other increasingly obscure techniques.
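To give a flavor of what sits at that end of the spectrum, here is a minimal sketch of a Fenwick tree (binary indexed tree): it supports point updates and prefix sums in O(log n), shows up constantly in contests, and almost never in day-to-day work.

```cpp
#include <vector>

// Fenwick tree (binary indexed tree) over an array of n zeros.
// Both operations run in O(log n); `i & -i` isolates the lowest
// set bit, which is what drives the tree traversal.
class Fenwick {
    std::vector<long long> tree;  // 1-indexed internally
public:
    explicit Fenwick(int n) : tree(n + 1, 0) {}

    // a[i] += delta (i is 0-indexed for the caller)
    void add(int i, long long delta) {
        for (++i; i < (int)tree.size(); i += i & -i)
            tree[i] += delta;
    }

    // sum of a[0..i] inclusive
    long long prefixSum(int i) const {
        long long s = 0;
        for (++i; i > 0; i -= i & -i)
            s += tree[i];
        return s;
    }
};
```

For example, after `add(0, 3)`, `add(2, 5)`, and `add(4, 2)`, `prefixSum(2)` returns 8 and `prefixSum(4)` returns 10. Elegant, certainly, but outside of contests you would almost always reach for a plain array or a database index instead.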

So far, I’ve been fairly critical of competitive programming, but rather than deride it as a waste of time, I think it’s best to view it as a sport. Like soccer or basketball, the function of sports in society is to inspire excellence, and above all, to have fun. Terry Tao wrote a similar article on math competitions; I’d agree with him.

My advice to you: do programming contests if you find them fun and you enjoy tackling hard problems. But don’t take it too seriously: it takes years of dedicated effort to get extremely good, dedication that very few people have. Unless you’re competing at or aiming for the World Finals level, you shouldn’t be spending a significant amount of time getting better at contests. Your time is better spent studying machine learning, or statistics, or compilers, or distributed systems, or just about anything else in computer science.

I have accounts on Hackerrank and Codeforces if you want to follow me. As per my own advice, I’m no longer actively trying to get better, but I still do problems for fun occasionally.

Edit: This article has some interesting discussion on Reddit and Codeforces.

Side Project: Conversation Player

Here’s a side project I’ve been working on: a widget that plays back conversations in real time. Below is a video demo:

The idea is to share text message snippets in a more digestible manner. The usual way to share a chat log with a friend would be to send them a transcript of the messages, timestamps and all. The information is there, but it’s mentally taxing to read through, and it’s especially difficult to make sense of the timestamps.

We figured it would be much easier to watch the conversation unfold in real time. This way, you see the conversation one message at a time, instead of all of it at once.


In the future, this will be a widget that you can embed in a blog or website. The reader can toggle the playback speed and use the slider to jump anywhere in the conversation.

We also built a set of utilities to import conversations from Facebook Messenger and Skype into a JSON format that our app can understand. Right now, these utilities are a bit clumsy to work with, so we’re holding back on releasing the project.

Conversation Player is built with React. It’s my first time using React, so it was good coding practice. I also had to learn the whole modern JavaScript setup with Babel, Browserify, Webpack, and so on. Everything is buzzword compliant now. Yay!