Using Waveform Plots to Improve your Accent, and a Dive into English Phonology

I was born in China and immigrated to Canada when I was 4 years old. After living in Canada for 18 years, I consider myself a native speaker for most purposes, but I still retain a noticeable non-native accent when speaking.

This post has a video that contains me speaking, if you want to hear what my accent sounds like.

It’s often considered very difficult or impossible to change your accent once you reach adulthood. I don’t know if this is true or not, but it sounds like a self-fulfilling prophecy: the more you think it’s impossible, the less you try, so of course your accent will not get any better. Impossible or not, it’s worth a try.

The first step is identifying what errors you’re making. This can be quite difficult if you’re not a trained linguist — native English speakers will detect that you have an accent, but they can’t really pinpoint exactly what’s wrong with your speech — it just sounds wrong to them.

One accent reduction strategy is the following: listen to a native speaker saying a sentence (for example, in a movie or on the radio), and repeat the same sentence, mimicking the intonation as closely as possible. Record both sentences, and play them side by side. This way, with all the other confounding factors gone, it’s much easier to identify the differences between your pronunciation and the native one.

When I tried doing this using Audacity, I noticed something interesting. Oftentimes, it was easier to spot differences in the waveform plot (that Audacity shows automatically) than to hear the differences between the audio samples. When you’re used to speaking a certain way all your life, your ears “tune out” the differences.

Here’s an example. The phrase is “figure out how to sell it for less” (Soundcloud):

Above: waveform plots of my recording and the native speaker’s recording of the phrase

The difference is clear in the waveform plot. In my audio sample, there are two spikes corresponding to the “t” sound that don’t appear in the native speaker’s sample.
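Spikes like these can also be located programmatically rather than by eye. The following is only a minimal sketch, not what Audacity does and not how the plot above was made: it assumes the recording has already been decoded into a plain list of samples, and the names `frame_peaks` and `find_spikes`, the frame size, and the threshold ratio are all hypothetical choices for illustration.

```python
def frame_peaks(samples, frame_size):
    """Return the peak absolute amplitude of each consecutive frame."""
    return [
        max(abs(s) for s in samples[i:i + frame_size])
        for i in range(0, len(samples), frame_size)
    ]


def find_spikes(samples, frame_size=4, ratio=3.0):
    """Return indices of frames whose peak exceeds `ratio` times the median frame peak."""
    peaks = frame_peaks(samples, frame_size)
    median = sorted(peaks)[len(peaks) // 2]
    return [i for i, p in enumerate(peaks) if p > ratio * median]


# Synthetic stand-in for a recording: a quiet signal with two short bursts
quiet = [0.1, -0.1, 0.1, -0.1]
burst = [0.9, -0.8, 0.7, -0.6]
signal = quiet * 3 + burst + quiet * 3 + burst + quiet * 2
spikes = find_spikes(signal)
```

Running the same function on both recordings and comparing the spike positions would flag bursts that appear in one sample but not the other. Real speech would need a much larger frame size and some tuning of the threshold.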

For vowels, the spectrogram works better than the waveform plot. Here’s the words “said” and “sad”, which differ in only the vowel:

Above: spectrograms of the words “said” and “sad”

Again, if you find it difficult to hear the difference, it helps to have a visual representation to look at.


I was surprised to find out that I’d been pronouncing the “t” consonant incorrectly all my life. In English, the letter “t” represents an aspirated alveolar stop (IPA /tʰ/), which is what I’m doing, right? Well, no. The letter “t” does produce the sound /tʰ/ at the beginning of a word, but in American English, the “t” at the final position of a word can get de-aspirated so that there’s no audible release. It can also turn into a glottal stop (IPA /ʔ/) in some dialects, but native speakers rarely pronounce /tʰ/, except in careful speech.

This is a phonological rule, and there are many instances of this. Here’s a simple experiment: put your hand in front of your mouth and say the word “pin”. You should feel a puff of air in your palm. Now say the word “spin”, and there is no puff of air. This is because in English, the /p/ sound is de-aspirated when it follows the /s/ sound: it is pronounced as an unaspirated [p] rather than [pʰ].

Now this got me curious and I wondered: exactly what are the rules governing sound changes in English consonants? Can I learn them so I don’t make this mistake again? Native English speakers don’t know these rules (consciously at least), and even ESL materials don’t go into much detail about subtle aspects of pronunciation. The best resources for this would be linguistics textbooks on English phonology.

I consulted a textbook called “Gimson’s Pronunciation of English” [1]. For just the rules regarding sound changes of the /t/ sound at the word-final position, the book lists 6 rules. Here’s a summary of the first 3:

  • No audible release in syllable-final positions, especially before a pause. Examples: mat, map, robe, road. To distinguish /t/ from /d/, the preceding vowel is lengthened for /d/ and shortened for /t/.
  • In stop clusters like “white post” (t + p) or “good boy” (d + b), there is no audible release for the first consonant.
  • When a plosive consonant is followed by a nasal consonant that is homorganic (articulated in the same place), then the air is released out of the nose instead of the mouth (eg: topmost, submerge). However, this doesn’t happen if the nasal consonant is articulated in a different place (eg: big man, cheap nuts).

As you can see, the rules are quite complicated. The book is somewhat challenging for non-linguists — these are just the rules for /t/ at the word-final position; the book goes on to spend hundreds of pages to cover all kinds of vowel changes that occur in stressed and unstressed syllables, when combined with other words, and so on. For a summary, take a look at the Wikipedia article on English Phonology.

What’s really amazing is how native speakers learn all these patterns, perfectly, as babies. Native speakers may make orthographic mistakes like mixing up “their, they’re, there”, but they never make phonological mistakes like forgetting to de-aspirate the /p/ in “spin” — they simply get it right every time, without even realizing it!


Some of my friends immigrated to Canada at a similar age to me or later, and learned English with no noticeable accent. Therefore, people sometimes find it strange that I still have an accent. Even more interesting is the fact that although my pronunciation is non-native, I don’t make non-native grammatical mistakes. In other words, I can intuitively judge which sentences are grammatical or ungrammatical just as well as a native speaker. Does that make me a linguistic anomaly? Intrigued, I dug deeper into academic research.

In 1999, Flege et al. conducted a study of Korean-American immigrants who moved to the USA at an early age [2]. Each participant was given two tasks. In the first task, the participant was asked to speak a series of English sentences, and native speakers judged how much of a foreign accent was present on a scale from 1 to 9. In the second task, the participant was given a list of English sentences, some grammatical and some not, and picked which ones were grammatical.

Linguists hypothesize that during first language acquisition, babies learn the phonology of their language long before they start to speak; grammatical structure is acquired much later. The Korean-American study seems to support this hypothesis. For the phonological task, immigrants who arrived as young as age 3 sometimes retained a non-native accent into adulthood.

Above: Scores for the phonological task decrease as age of arrival increases, but even very early arrivals retain a non-native accent.

Basically, arriving before age 6 or so increases the chance of the child developing a native-like accent, but by no means does it guarantee it.

On the other hand, the window for learning grammar is much longer:

Above: Scores for the grammatical task only start to decrease after about age 7.

Age of arrival is a large factor, but does not explain everything. Some people are just naturally better at acquiring languages than others. The study also looked at the effect of other factors like musical ability and perceived importance of English on the phonological score, but the connection is a lot weaker.

Language is so easy that every baby picks it up, yet so complex that linguists write hundreds of pages to describe it. Even today, language acquisition is poorly understood, and there are many unresolved questions about how it works.


References

  1. Cruttenden, Alan. “Gimson’s Pronunciation of English, 8th Edition”. Routledge, 2014.
  2. Flege, James Emil et al. “Age Constraints on Second Language Acquisition”. Journal of Memory and Language, vol. 41, 1999.

The Power Law Distribution and the Harsh Reality of Language Learning

I’m an avid language learner, and sometimes people ask me: “how many languages do you speak?” If we’re counting all the languages in which I can have at least a basic conversation, then I can speak five languages — but can I really claim fluency in a language if I can barely read children’s books? Despite being a seemingly innocuous question, it’s not so simple to answer. In this article, I’ll try to explain why.

Let’s say you’re just starting to study Japanese. You might picture yourself being able to do the following things, after a few months or years of study:

  1. Have a conversation with a Japanese person who doesn’t speak any English
  2. Watch the latest episode of some anime in Japanese before the English subtitles come out
  3. Overhear a conversation between two Japanese people in an elevator

After learning several languages, I discovered that the first task is a lot easier than the other two, by an order of magnitude. Whether in French or in Japanese, I would quickly learn enough of the language to talk to people, but the ability to understand movies and radio remains elusive even after years of study.

There is a fundamental difference in how language is used in one-on-one conversation versus the other two tasks. When conversing with a native speaker, it is possible for him to avoid colloquialisms, speak slower, and repeat things you didn’t understand using simpler words. On the other hand, when listening to native-level speech without the speaker adjusting for your language level, you need to be near native-level yourself to understand what’s going on.

We can justify this concept using statistics. By looking at how frequencies of English words are distributed, we show that after an initial period of rapid progress, it soon becomes exponentially harder to get better at a language. Conversely, even a small decrease in language complexity can drastically increase comprehension by non-native listeners.

Reaching conversational level is easy

For the rest of this article, I’ll avoid using the word “fluent”, which is rather vague and misleading. Instead, I will call a “conversational” speaker someone who can conduct some level of conversation in a language, and a “near-native” speaker someone who can readily understand speech and media intended for native speakers.

It’s surprising how little of a language you actually need to know to have a decent conversation with someone. Basically, you need to know:

  1. A set of about 1000-2000 very basic words (eg: person, happy, cat, slow, etc).
  2. Enough grammar to form sentences (eg: present / future / past tenses; connecting words like “then”, “because”; conditionals, comparisons, etc). Grammar doesn’t need to be perfect, just close enough for the listener to understand what you’re trying to say.
  3. When you want to say something but don’t know the word for it, be flexible enough to work around the issue and express it with words you do know.

For an example of English using only basic words, look at the Simple English Wikipedia. It shows that you can explain complex things using a vocabulary of only about 1000 words.

For another example, imagine that Bob, a native English speaker, is talking to Jing, an international student from China. Their conversation might go like this:

Bob: I read in the news that a baby got abducted by wolves yesterday…

Jing: Abducted? What do you mean?

Bob: He got taken away by wolves while the family was out camping.

Jing: Wow, that’s terrible! Is he okay now?

In this conversation, Jing indicates that she doesn’t understand a complex word, “abducted”, and Bob rephrases the idea using simpler words, and the conversation goes on. This pattern happens a lot when I’m conversing with native Japanese speakers.

After some time, Bob gets an intuitive feeling for what level of words Jing can understand, and naturally simplifies his speech to accommodate. This way, the two can converse without Jing explicitly interrupting and asking Bob to repeat what he said.

Consequently, reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in a language in 3 months. I think this is a reasonable amount of time for reaching conversational level.

What if you don’t have the luxury of the speaker simplifying his level of speech for you? We shall see that the task becomes much harder.

The curse of the Power Law

I was inspired to write this article after an experience with a group of French speakers. I could talk to any of them individually in French, which is hardly remarkable given that I had studied the language since grade 4 and minored in it in university. However, when they talked among themselves, I was completely lost, and could only get a vague sense of what they were talking about.

Feeling slightly embarrassed, I sought an explanation for this phenomenon. Why was it that I could produce 20-page essays for university French classes, but struggled to understand dialogue in French movies and everyday conversations between French people?

The answer lies in the distribution of word frequencies in language. It doesn’t matter if you’re looking at English or French or Japanese — every natural language follows a power law distribution, which means that the frequency of every word is inversely proportional to its rank in the frequency table. In other words, the 1000th most common word appears twice as often as the 2000th most common word, and four times as often as the 4000th most common word, and so on.

(Aside: this phenomenon is known as Zipf’s Law. It’s unclear why it occurs, but the law holds approximately in every natural language.)

Above: Power law distribution in natural languages

The power law distribution exhibits the long tail property, meaning that as you advance further to the right of the distribution (by learning more vocabulary), the words become less and less common, but never drop off completely. Furthermore, rare words like “constitution” or “fallacy” convey disproportionately more meaning than common words like “the” or “you”.

This is bad news for language learners. Even if you understand 90% of the words of a text, the remaining 10% are the most important words in the passage, so you actually understand much less than 90% of the meaning. Moreover, it takes exponentially more vocabulary and effort to understand 95% or 98% or 99% of the words in the text.
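To see where the “exponentially more vocabulary” comes from, here is a minimal sketch under an idealized assumption: word frequency is exactly proportional to 1 / rank, and the language has a fixed number of distinct words. Both function names and the figure of 50,000 distinct words are hypothetical and chosen only for illustration; real corpora deviate from the ideal curve. Under this assumption, the text covered by the top k words is a ratio of harmonic sums, which grows roughly like the logarithm of k, so each extra point of coverage costs multiplicatively more words.

```python
def zipf_coverage(k, total_types):
    """Fraction of running text covered by the k most frequent words,
    assuming word frequency is exactly proportional to 1 / rank."""
    top = sum(1.0 / r for r in range(1, k + 1))
    everything = sum(1.0 / r for r in range(1, total_types + 1))
    return top / everything


def vocab_needed(target, total_types):
    """Smallest k whose coverage reaches `target` (a fraction between 0 and 1)."""
    everything = sum(1.0 / r for r in range(1, total_types + 1))
    running = 0.0
    for k in range(1, total_types + 1):
        running += 1.0 / k
        if running / everything >= target:
            return k
    return total_types


# Hypothetical language with 50,000 distinct words
for target in (0.5, 0.8, 0.9, 0.95, 0.98):
    print(target, vocab_needed(target, 50_000))
```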

I set out to experimentally test this phenomenon in English. I took the Brown Corpus, containing a million words of various English text, and computed the size of vocabulary you would need to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the corpus.

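Above: vocabulary size needed to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the Brown Corpus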

By knowing 75 words, you already understand half of the words in a text! Of course, just knowing words like “the” and “it” doesn’t get you very far. Learning 2000 words is enough to have a decent conversation and understand 80% of the words in a text. However, it gets exponentially harder after that: to get from 80% to 98% comprehension, you need to learn more than 10 times as many words!

(Aside: in this analysis I’m considering conjugations like “swim” and “swimming” to be different words; if you count only the stems, you end up with lower word counts but they still follow a similar distribution)
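The computation behind these numbers can be sketched in a few lines. This is a reconstruction under assumptions, not the original script: the function name `vocab_for_coverage` is hypothetical, tokenization is a plain whitespace split, and a toy sentence stands in for the Brown Corpus. As in the aside above, different inflected forms count as different words.

```python
from collections import Counter


def vocab_for_coverage(tokens, targets):
    """For each target fraction, return how many of the most frequent
    distinct words are needed to cover that fraction of all tokens."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    result = {}
    for target in sorted(targets):
        covered = 0
        for size, count in enumerate(counts, start=1):
            covered += count
            if covered >= target * total:
                result[target] = size
                break
    return result


# Toy stand-in for a real corpus such as the Brown Corpus
text = "the cat sat on the mat and the dog sat on the log"
coverage = vocab_for_coverage(text.lower().split(), [0.5, 0.8, 1.0])
```

With a real corpus, `tokens` would be the full list of lowercased words, and the targets would be the percentages listed above.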

How many words can you miss and still be able to figure out the meaning by inference? In a typical English novel, I encounter about one word per page that I’m unsure of, and a page contains about 200-250 words, so I estimate 99.5% comprehension is native level. When there are more than 5 words per page that I don’t know, then reading becomes very slow and difficult — this is about 98% comprehension.

Therefore I will consider 98% comprehension “near-native”: above this level, you can generally infer the remaining words from context. Below this level, say between 90% and 98% comprehension, you may understand generally what’s going on, but miss a lot of crucial details.
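The two thresholds come from simple arithmetic on unknown words per page. As a quick check, using the page lengths mentioned above (the function name `comprehension` is just a label for this illustration):

```python
def comprehension(unknown_per_page, words_per_page):
    """Fraction of words on a page that the reader already knows."""
    return 1 - unknown_per_page / words_per_page


native = comprehension(1, 200)       # one unknown word on a 200-word page
near_native = comprehension(5, 250)  # five unknown words on a 250-word page
```

These work out to 99.5% and 98% respectively, matching the estimates in the text.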

Above: Perceived learning curve for a foreign language

This explains the difficulty of language learning. In the beginning, progress is fast, and in a short period of time you learn enough words to have conversations. After that, you reach a long intermediate-stage plateau where you’re learning more words, but don’t know enough to understand native-level speech, and anybody speaking to you must use a reduced vocabulary in order for you to understand. Eventually, you will know enough words to infer the rest from context, but you need a lot of work to reach this stage.

Implications for language learners

The good news is that if you want to converse with people in a language, it’s perfectly doable in 3 to 6 months. On the other hand, to watch TV shows in the language without subtitles or understand people speaking naturally is going to take a lot more work — probably living for a few years in a country where the language is spoken.

Is there any shortcut instead of slowly learning thousands of words? I can’t say for sure, but somehow I doubt it. By nature, words are arbitrary clusters of sounds, so no amount of cleverness can help you deduce the meaning of words you’ve never seen before. And when the proportion of unknown words is above a certain threshold, it quickly becomes infeasible to try to infer meaning from context. We’ve reached the barrier imposed by the power law distribution.


Now I will briefly engage in some sociological speculation.

My university has a lot of international students. I’ve always noticed that these students tend to form social groups speaking their native non-English languages, and rarely assimilate into English-speaking social groups. At first I thought maybe this was because their English was bad — but I talked to a lot of international students in English and their English seemed okay: noticeably non-native but I didn’t feel there was a language barrier. After all, all our lectures are in English, and they get by.

However, I noticed that when I talked to international students, I subconsciously matched their rate of speaking, speaking just a little bit slower and clearer than normal. I would also avoid the usage of colloquialisms and cultural references that they might not understand.

If the same international student went out to a bar with a group of native English speakers, everyone else would be speaking at normal native speed. Even though she understands more than 90% of the words being spoken, it’s not quite enough to follow the discussion, and she doesn’t want to interrupt the conversation to clarify a word. As everything builds on what was previously said in the conversation, missing a word here and there means she is totally lost.

It’s not that immigrants don’t want to assimilate into our culture, but rather, we don’t realize how hard it is to master a language. On the surface, going from 90% to 98% comprehension looks like a small increase, but in reality, it takes an immense amount of work.

Read further discussion of this article on /r/languagelearning!

How a simple trick decreased my elevator waiting time by 33%

Last month, when I traveled to Hong Kong, I stayed at a guesthouse in a place called the Chungking Mansions. Located in Tsim Sha Tsui, it’s one of the most crowded, sketchiest, and cheapest places to stay in Hong Kong.

Above: Chungking Mansions in Tsim Sha Tsui

Of the 17 floors, the first few are teeming with Indian and African restaurants and various questionable businesses. The rest of the floors are guesthouses and private residences. One thing that’s unusual about the building is the structure of its elevators.

The building is partitioned into five disjoint blocks, and each block has two elevators. One of the elevators only goes to the odd numbered floors, and the other elevator only goes to the even numbered floors. Neither elevator goes to the second floor because there are stairs.

Above: Elevator schematic of Chungking Mansions

I lived on the 14th floor, and man, those elevators were slow! Because of the crazy population density of the building, the elevator would stop on several floors on the way up and down. What’s more, people often carried furniture on the elevators, which took a long time to load and unload.

To pass the time, I timed exactly how long it took from arriving at the elevator on the ground floor, through waiting for the elevator to come and riding it up, to getting off at the 14th floor. After several trials, the average time came out to be about 4 minutes. Clearly, 4 minutes is too long, especially when waiting in 35-degree weather without air conditioning, so I started to look for optimizations.

The bulk of the time is spent waiting for the elevator to come. The best case is when the elevator is on your floor and you get in, then the waiting time is zero. The worst case is when the elevator has just left and you have to wait a full cycle before you can get in. After you get in, it takes a fairly constant amount of time to reach your floor. Therefore, your travel time is determined by your luck with the elevator cycle. Assuming that the elevator takes 4 minutes to make a complete cycle (and you live on the top floor), the best case total elevator time is 2 minutes, the worst case is 6 minutes, and the average case is 4 minutes.

It occurred to me that just because I lived on the 14th floor, I didn’t necessarily have to take the even numbered elevator! Instead, if the odd numbered elevator arrives first, it’s actually faster to take that elevator to the 13th floor and climb the stairs to the 14th floor. Compared to the time spent waiting for the elevator, the time to climb one floor is negligible. I started doing this trick and timed how long it took. Empirically, this optimization seemed to cut my time by about 1 minute on average.

Being a mathematician at heart, I was unsatisfied with empirical results. Theoretically, exactly how big is this improvement?


Let us model the two elevators as random variables X_1 and X_2, both independently drawn from the uniform distribution on [0,1]. The random variables represent the waiting time as a fraction of a full elevator cycle, with 0 being the best case and 1 being the worst case.

With the naive strategy of taking the even numbered elevator, our waiting time is X_1 with expected value E[X_1] = \frac{1}{2}. Using the improved strategy, our waiting time is \min(X_1, X_2). What is the expected value of this random variable?

For two elevators, the solution is straightforward: consider every possible value of X_1 and X_2 and find the average of \min(X_1, X_2). In other words, the expected value of \min(X_1, X_2) is

{\displaystyle \int_0^1 \int_0^1 \min(x_1, x_2) \mathrm{d} x_1 \mathrm{d} x_2}

Geometrically, this is equivalent to calculating the volume of the square pyramid with vertices at (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), and (1, 1, 1). Recall from geometry that the volume of a square pyramid with known base and height is \frac{1}{3} bh = \frac{1}{3}.

Above: the square pyramid under the surface z = \min(x_1, x_2)

Therefore, the expected value of \min(X_1, X_2) is \frac{1}{3}, which is a 33% improvement over the naive strategy with expected value \frac{1}{2}.
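To translate these fractions back into minutes, here is a small sketch using the figures assumed earlier: a 4-minute elevator cycle and a 2-minute ride (implied by the 2-minute best case). The constant and function names are hypothetical, and the result inherits the model's assumption that the two elevators are independent.

```python
from fractions import Fraction

CYCLE = 4   # minutes for one full elevator cycle (figure assumed in the text)
RIDE = 2    # minutes to ride up once inside (implied by the 2-minute best case)


def expected_total(expected_wait_fraction):
    """Expected door-to-door time in minutes, given the expected wait
    expressed as a fraction of a full cycle."""
    return RIDE + CYCLE * expected_wait_fraction


naive = expected_total(Fraction(1, 2))     # wait for the even elevator only
improved = expected_total(Fraction(1, 3))  # take whichever elevator comes first
saving = naive - improved
```

Under these assumptions the expected trip drops from 4 minutes to 3 minutes 20 seconds, a saving of 40 seconds, which is in the same ballpark as the roughly 1 minute I measured.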


Forget about elevators for now; let’s generalize!

We know that the expected value of the minimum of two uniform [0,1] random variables is \frac{1}{3}, but what if we have n random variables? What is the expected value of the minimum of all of them?

I coded a quick simulation and it seemed that the expected value of the minimum of n random variables is \frac{1}{n+1}, but I couldn’t find a simple proof of this. Searching online, I found proofs here and here. The proof isn’t too hard, so I’ll summarize it here.
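The simulation itself isn't shown here, so the following is only a sketch of what such a simulation might look like. The function name `simulate_min`, the number of trials, and the fixed seed are all arbitrary choices for illustration.

```python
import random


def simulate_min(n, trials=100_000, seed=0):
    """Estimate E[min(X_1, ..., X_n)] for i.i.d. uniform [0, 1] variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(n))
    return total / trials


# Compare the simulated estimate with the conjectured value 1 / (n + 1)
for n in range(1, 6):
    print(n, round(simulate_min(n), 3), round(1 / (n + 1), 3))
```

The two columns should agree to within sampling error, which is what suggests the \frac{1}{n+1} pattern.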

Lemma: Let M_n(x) be the c.d.f for \min(X_1, \cdots, X_n), where each X_i is i.i.d with uniform distribution [0,1]. Then the formula for M_n(x) is

M_n(x) = 1 - (1-x)^n

Proof:

\begin{array}{rl} M_n(x) & = P(\min(X_1, \cdots, X_n) < x) \\ & = 1 - P(X_1 \geq x, \cdots, X_n \geq x) \\ & = 1 - (1-x)^n \; \; \; \square \end{array}

Now to prove the main claim:

Claim: The expected value of \min(X_1, \cdots, X_n) is \frac{1}{n+1}

Proof:

Let m_n(x) be the p.d.f of \min(X_1, \cdots, X_n), so m_n(x) = M'_n(x) = n(1-x)^{n-1}. From this, the expected value is

\begin{array}{rl} {\displaystyle \int_0^1 x m_n(x) \mathrm{d}x} & = {\displaystyle \int_0^1 x n (1-x)^{n-1} \mathrm{d} x} \\ & = {\displaystyle \frac{1}{n+1}} \end{array}

This concludes the proof. I skipped a bunch of steps in the evaluation of the integral because Wolfram Alpha did it for me.
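For completeness, the skipped steps are short. One way to evaluate the integral by hand (a sketch, not necessarily the route Wolfram Alpha takes) is to substitute u = 1 - x, so that x = 1 - u and the limits swap:

```latex
\begin{array}{rl}
{\displaystyle \int_0^1 x n (1-x)^{n-1} \mathrm{d} x}
  & = {\displaystyle \int_0^1 (1-u) n u^{n-1} \mathrm{d} u} \\
  & = {\displaystyle n \int_0^1 \left( u^{n-1} - u^n \right) \mathrm{d} u} \\
  & = {\displaystyle n \left( \frac{1}{n} - \frac{1}{n+1} \right)} \\
  & = {\displaystyle \frac{1}{n+1}}
\end{array}
```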


For some people, this sort of travel frustration would lead to complaining and an angry Yelp review, but it led me down this mathematical rabbit hole. Life is interesting, isn’t it?

I’m not sure if the locals employ this trick or not: it was pretty obvious to me, but on the other hand I didn’t witness anybody else doing it during my stay. Anyhow, useful trick to know if you’re staying in the Chungking Mansions!

Read further discussion of this post on Reddit!

Four Weeks in Tokyo: My Japanese Homestay Experience

During the month of June, I enrolled in a Japanese language class in Tokyo, and stayed with a Japanese family in a homestay program. This was my first time in Japan, and I hoped to improve my Japanese language skills and learn about the culture.

A month is an unusually long time to spend in one place for most tourists. Normally you’d stay for a few days, see all the pretty sights, and move on, but by doing a monthlong homestay, I got a better idea of what life is really like in Japan. Instead of rushing from one tourist attraction to another, I had time to explore at a more relaxed pace, and also code some side projects in the evenings.

Above: My homestay family got me a cake on the first day. “Welcome, Bai-kun”

The language school I enrolled at was called Coto Language Academy. They provide various levels of small group Japanese language classes for foreigners, and they also helped me arrange my homestay. I’m not sure how effective the actual classes were: their style is focused on drilling rigid grammatical rules, which is not the best way to learn a foreign language. Nevertheless, attending classes for a few hours a day added some structure to my life, and it was a good way to meet other foreigners who were also learning Japanese.

For the rest of this article, I’ll list some observations about Japan. Some things were more or less what I expected, but a lot of things were quite different.

Things in Japan that went as expected

1. People are very polite. Everybody speaks quietly; there is no angry yelling on the streets. Waiting for the subway, walking up the escalator, crossing the street: everything is very orderly; you never see people cut the line or cut you off on the road. It’s as though everyone is hyper-aware of their surroundings and tries not to do anything that might inconvenience others.

2. Trains are always on time, down to the minute. If the train is supposed to depart at 5:27, it will depart at 5:27, not a minute sooner, not a minute later. Any delay more than five minutes is considered late, and apologies are issued over the speakers.

The Japanese punctuality extends to daily life. When my homestay family said we’d go out for dinner at 7pm, they really meant 7pm. In Canada, 7pm usually meant you’d actually leave the house at 7:10pm or 7:15pm. Not in Japan: by 7:02pm, they were already waiting in the car.

Above: Typical rush hour in Japan

Trains can get very, very crowded during rush hour. Although I haven’t seen any pushers like the ones in that picture, I had the pleasant experience of touching the bodies of 5 strangers at the same time in a subway car.

3. Vending machines are everywhere, on every other street corner. They all sell the same variety of tea and coffee drinks, though; I haven’t seen anything weird.

Above: Vending machines in a Chiba suburb

4. Cat cafes and anime shops. Cat cafes are a Japanese invention where you pet cats for stress relief.

Me at a cat cafe. 100% as cute as you’d expect.

Akihabara is the place to go for any kind of anime-related paraphernalia.

Above: A Pikachu slot machine

Anime culture is much less prevalent in Japan than I expected. Outside of Japan, anime is considered a big part of Japanese culture, but in Japan, it’s fairly niche. The anime stores are all clustered in the Akihabara district, and otherwise you don’t encounter it that much.

Things that surprised me about Japan

1. People work a lot. Employees are expected to work many hours of overtime, far beyond the usual 40 hours a week in Canada. The evening rush hour starts at about 5pm and lasts well into the night: even at 11pm, the trains are still packed with salarymen who just got off work. My homestay family’s dad often did not come home until after midnight, and we usually ate dinner without him.

I don’t know how anyone can still be productive after working so much. Right now, I can’t really comment because I haven’t been inside of a Japanese corporation. It’s enough of a problem in Japan’s society that they have a word “karoshi”, meaning death from working too much.

Aside from the long working hours, I was also surprised that Japanese salarymen typically work for one company for decades, sometimes their entire life. This is very rare in Canada, where software engineers seldom stay more than 5 years at a company. When technology shifts, companies in Japan train their existing employees to do the new tasks they need, rather than lay off workers and hire new ones. The culture is quite different.

2. Streets are much quieter and less crowded than expected. Before coming to Japan, I had spent a month in China and had grown accustomed to bustling streets with horns blaring and everybody jaywalking haphazardly. I expected Tokyo, with a population of 30 million, to be more of the same. I was pleasantly surprised to find out that this is not the case.

Above: Shinjuku at night, one of the busiest districts of Tokyo

A few places were very crowded for sure, like Shinjuku and Shibuya. Everywhere else is not a lot different from the suburbs, except the buildings are a bit taller.

3. Tokyo is huge. With a metro-area population roughly equal to that of Canada, its sheer size is massive. It’s hard to get a sense of scale from looking at Google Maps: it takes me about an hour to commute from my homestay in Urayasu to my language school in Chiyoda. Even with a system of super efficient trains going 100 km/h, it still takes over two hours to get to attractions on the other side of the city. Two hours of commute time each way is common for the Japanese, if you work downtown and live in one of the outlying suburbs.

4. Japanese food is a lot more than sushi. In Canada, Japanese restaurants mostly focus on sushi, but sushi is not that common in Japan. Only maybe one in ten restaurants here serves sushi. The others serve all kinds of Japanese food I never knew existed.

The sushi restaurants do not disappoint though. Even in cheaper restaurants, with rotating plates costing no more than 100-200 yen each, the sushi is better than anything I’ve had in Canada.

Above: Rotating sushi restaurant

One particular Japanese delicacy is natto: a slimy mixture of fermented beans that they like to mix with rice. Most foreigners don’t like it. I tried it once. Now, when somebody asks me what I’d like to eat, I reply, “fine with anything but natto”.

5. Temples and shrines are everywhere. You will run into a shrine every few blocks in the city, and Nikko is full of them.

Above: UNESCO world heritage site of Nikko, with over 100 temples and shrines

Architecturally they’re quite similar, but temples (tera) are for Buddhism and shrines (jinja) are for the Shinto religion.


Besides feeding me and suffering through my broken Japanese for a month, my homestay family also taught me how to play Shogi.

Above: Learning how to play Shogi

Shogi is the Japanese version of chess. A lot of tactics are similar to chess, but the games are a bit longer, and it never reaches an endgame because captured pieces get dropped back on the board.

Above: Me with my homestay family. I will miss you guys!

That’s it for my month in Tokyo. Next stop: Kyoto!

Learning R as a Computer Scientist

If you’re into statistics and data science, you’ve probably heard of the R programming language. It’s a statistical programming language that has gained much popularity lately. It comes with an environment specifically designed to be good at exploring data, plotting visualizations, and fitting models.

R is not like most programming languages. It’s quite different from any other language I’ve worked with: it was developed by statisticians, who think differently from programmers. In this blog post, I describe some of the pitfalls that I ran into learning R with a computer science background. I used R extensively in two stats courses in university, and afterwards for a bunch of data analysis projects, and I’m only now starting to feel comfortable and efficient with it.

Why a statistical programming language?

When I encountered R for the first time, my first reaction was: “why do we need a new language to do stats? Can’t we just use Python and import some statistical libraries?”

Sure, you can, but R is very streamlined for it. In Python, you would need something like scipy for fitting models, and something like matplotlib to display things on screen. With R, you get RStudio, a complete environment, and it’s very much batteries-included. In RStudio, you can parse the data, run statistics on it, and visualize results with very few lines of code.

Aside: RStudio is an IDE for R. Although it’s possible to run R standalone from the command line, in practice almost everyone uses RStudio.

I’ll do a quick demo of fitting a linear regression on a dataset to demonstrate how easy it is to do in R. First, let’s load the CSV file:

df <- read.csv("fossum.csv")

This reads a dataset containing body length measurements for a bunch of possums. Don’t ask why, it was used in a stats course I took. R parses the CSV file into a data frame and automatically figures out the dimensions and variable names and types.

Next, we fit a linear regression model of the total length of the possum versus the head length:

model <- lm(totlngth ~ hdlngth, df)

It’s one line of code with the lm function. What’s more, fitting linear models is so common in R that the syntax is baked into the language.

Aside: Here, we did totlngth ~ hdlngth to perform a single-variable linear regression, but the notation allows fancier stuff. For example, if we did lm(totlngth ~ (hdlngth + age)^2, df), then we would get a model including both variables and their second-order interaction. This is called Wilkinson-Rogers notation, if you want to read more about it.

We want to know how the model is doing, so we run the summary command:

> summary(model)

Call:
lm(formula = totlngth ~ hdlngth, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-7.275 -1.611  0.136  1.882  5.250 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -28.722     14.655  -1.960   0.0568 .
hdlngth        1.266      0.159   7.961  7.5e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.653 on 41 degrees of freedom
Multiple R-squared:  0.6072,	Adjusted R-squared:  0.5976
F-statistic: 63.38 on 1 and 41 DF,  p-value: 7.501e-10

Don’t worry if this doesn’t mean anything to you. It just dumps the parameters of the model it fit, along with the results of a bunch of tests that determine how significant the model is.

Lastly, let’s visualize the regression with a scatterplot:

plot(df$hdlngth, df$totlngth)
abline(model)

And R gives us a nice plot:

1.png

All of this only took 4 lines of R code! Hopefully I’ve piqued your interest by now — R is great for quickly trying out a lot of different models on your data without too much effort.

That being said, R has a somewhat steep learning curve as a lot of things don’t work the way you’d expect. Next, I’ll mention some pitfalls I came across.

Don’t worry about the type system

As computer scientists, we’re used to thinking about type systems, type casting rules, variable scoping rules, closures, stuff like that. These details form the backbone of any programming language, or so I thought. Not the case with R.

R was designed by statisticians, and statisticians are more interested in doing statistics than worrying about the intricacies of their programming language. Types do exist, but it’s not worth your time to worry about the difference between a list and a vector; most likely, your code will just work on both.

The most fundamental object in R is the data frame, which stores rows of data. Data frames are as ubiquitous in R as objects are in Java. They don’t have a close equivalent in most programming languages; the nearest analogues are a SQL table or an Excel spreadsheet.

Use dplyr for data wrangling

The base library in R is not the most well-designed package in the world. It has many inconsistencies and arbitrary design decisions, and common operations are needlessly unintuitive. Fortunately, R has an excellent ecosystem of packages that make up for the shortcomings of the base system.

In particular, I highly recommend using the packages dplyr and tidyr instead of the base package for data wrangling tasks. I’m talking about operations you do to data to get it into a certain form, like sorting by a variable, or grouping by a set of variables and computing the aggregate sum over each group. Dplyr and tidyr provide a consistent set of functions that make this easy. I won’t go into too much detail, but you can see this page for a comparison between dplyr and base R for some common data wrangling tasks.

Use ggplot2 for plotting

Plotting is another domain where the base package falls short. The functions are inconsistent and, worse, you’re often forced to hardcode arbitrary constants in your code. Stupid things like plot(..., pch=19), where 19 is the constant for “solid circle” and 17 means “solid triangle”.

There’s no reason to learn the base plotting system — ggplot2 is a much better alternative. Its functions allow you to build graphs piece by piece in a consistent manner (and they look nicer by default). I won’t go into the comparison in detail, but here’s a blog post that describes the advantages of ggplot2 over base graphics.

It’s unfortunate that R’s base package falls short in these two areas. But with the package manager, it’s super easy to install better alternatives. Both ggplot2 and dplyr are widely used (currently, both are in the top 5 most downloaded R packages).

How to self-study R

First off, check out Swirl. It’s a package for teaching beginners the basics of R, interactively within RStudio itself. It guides you through its courses on topics like regression modelling and dplyr, and only takes a few hours to complete.

At some point, read through the tidyverse style guide to get up to speed on the best practices on naming files and variables and stuff like that.

Now go and analyze data! One major difference between R and other languages is that you need a dataset to do anything interesting. There are many public datasets out there; Kaggle provides a sizable repository.

For me, it’s a lot more motivating to analyze data I care about. Analyze your bank statement history, or data on your phone’s pedometer app, or your university’s enrollment statistics data to find which electives have the most girls. Turn it into a mini data-analysis project. Fitting some regression models and drawing a few graphs with R is a great way to learn.

The best thing about R is the number of packages out there. If you read about a statistical model, chances are that someone’s written an R package for it. You can download it and be up and running in minutes.

It takes a while to get used to, but learning R is definitely a worthwhile investment for any aspiring data scientist.

AI Project: Harmonizing Pop Melodies using Hidden Markov Models

It is often said that all pop music “sounds the same”. If this is the case, then a computer program should be able to compose pop music, right?

This is the problem we studied for the AI final project (CS486), kind of. We restricted ourselves to the simpler problem of harmonizing an existing melody: in other words, given a fixed melody, find a chord progression that matches it.

We formulated the task as a probabilistic optimization problem, and devised an algorithm to solve it using Hidden Markov Models (HMMs). Then, we used the Music21 library to generate MIDI files from the output so that we could listen to it.

In the experiment, we chose the melody from If Only by JJ Lin. Here’s a demo:

Music Theory Model

In our model, pop music consists of two parts: a melody line consisting of a sequence of individual notes, and a harmony line consisting of a sequence of chords. Generally, the melody line is the notes sung by the lead vocalist, and the harmony line is played by accompanying instruments (piano, guitar, digital synthesizers). Conventionally, the chord progression is written as a sequence of chord names above the melody; the exact notes played for each chord are up to the performer’s discretion.

Above: Example of a two-part song with melody and chord progression

It is hard to quantify exactly what makes a chord progression “good”, since music is inherently subjective and depends on an individual’s musical tastes. However, we capture the notion of “pleasant” using music theory, by assigning a penalty to musical dissonance between a note in the melody and a note in the chord.

According to music theory, the minor second, major second, and tritone (augmented fourth) intervals are dissonant, as are the inversions of the seconds: the major and minor sevenths. Therefore, our program tries to avoid forming these intervals by assigning a penalty to them.

Above: List of dissonant intervals. All other intervals are consonant.
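To make the dissonance rule concrete, here is a minimal Python sketch of the check. It is not our project code: the pitch-class numbering (C = 0 up to B = 11) and the function name is_dissonant are assumptions made for illustration, and octaves are ignored.

```python
# Pitch classes: C=0, C#=1, ..., B=11 (a common convention, assumed here).
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

# Dissonant intervals in semitones: minor second (1), major second (2),
# tritone (6), and the seventh inversions (10 and 11).
DISSONANT = {1, 2, 6, 10, 11}


def is_dissonant(note_a, note_b):
    """Return True if the two note names form a dissonant interval."""
    interval = (NOTE_TO_PC[note_a] - NOTE_TO_PC[note_b]) % 12
    return interval in DISSONANT


print(is_dissonant("B", "C"))  # B against C is a semitone clash
print(is_dissonant("C", "G"))  # C against G is a perfect fifth/fourth
```

Working modulo 12 means an interval and its inversion are treated the same way, which is why the sevenths show up in the set alongside the seconds.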

We assume that all notes in the melody lie on either a major or minor scale in some fixed key. The set of permissible chords for a scale consists of the major and minor chords whose constituent notes all lie in the scale. For example, in the scale of G major, the permissible chords are {G, Am, Bm, C, D, Em}.

This is a vastly simplified model that does not capture more nuanced chords that appear in pop music. However, even the four-chord progression G – D – Em – C is sufficient for many top-40 pop songs.
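As a sanity check on the permissible-chord rule, here is a rough Python sketch that enumerates the major and minor triads fitting inside a major scale. The names NOTES, MAJOR_SCALE_STEPS and permissible_chords are made up for this illustration, and it only handles major keys spelled with sharps.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitones above the tonic


def permissible_chords(tonic):
    """Major and minor triads whose notes all lie in the major scale of `tonic`."""
    tonic_pc = NOTES.index(tonic)
    scale = {(tonic_pc + step) % 12 for step in MAJOR_SCALE_STEPS}
    chords = []
    for step in MAJOR_SCALE_STEPS:
        root = (tonic_pc + step) % 12
        major_triad = {root, (root + 4) % 12, (root + 7) % 12}
        minor_triad = {root, (root + 3) % 12, (root + 7) % 12}
        if major_triad <= scale:
            chords.append(NOTES[root])
        elif minor_triad <= scale:
            chords.append(NOTES[root] + "m")
        # Otherwise (e.g. the diminished triad on the 7th degree) skip it.
    return chords


print(permissible_chords("G"))  # ['G', 'Am', 'Bm', 'C', 'D', 'Em']
```

The seventh scale degree (F# in G major) drops out because neither its major nor its minor triad stays inside the scale, which leaves exactly the six chords listed above.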

First attempt: Harmonize each bar independently

The first thing we tried was to look at each bar by itself, and search through all of the permissible chords to find the chord that harmonizes best with the bar.

In other words, we define the penalty of a chord for a melody bar as the number of times a melody note forms a dissonant interval with a chord note, weighted by the duration of the melody note. Then, we assign to each bar the chord with the lowest penalty.

Here’s what we get:

Above: Naive harmonization. Notice that the Am chord is repeated 3 times.

This assigns a reasonable-sounding chord for each bar on its own, but doesn’t account for the transition probabilities between chords. We end up with the A minor chord repeated 3 times consecutively, which we’d like to avoid.
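The per-bar approach can be sketched in a few lines of Python. This is a simplified illustration rather than our actual implementation: chords are assumed to be plain three-note triads, a bar is assumed to be a list of (note, duration in beats) pairs, and the two-bar melody is made up for the example.

```python
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
DISSONANT = {1, 2, 6, 10, 11}  # seconds, tritone, sevenths (in semitones)

# The six permissible chords in G major, spelled as triads (an assumption).
CHORDS = {
    "G": ["G", "B", "D"], "Am": ["A", "C", "E"], "Bm": ["B", "D", "F#"],
    "C": ["C", "E", "G"], "D": ["D", "F#", "A"], "Em": ["E", "G", "B"],
}


def penalty(chord, bar):
    """Duration-weighted count of dissonant (melody note, chord note) pairs."""
    total = 0.0
    for note, duration in bar:
        for chord_note in CHORDS[chord]:
            interval = (NOTE_TO_PC[note] - NOTE_TO_PC[chord_note]) % 12
            if interval in DISSONANT:
                total += duration
    return total


def harmonize_naive(bars):
    """Pick the lowest-penalty chord for each bar independently."""
    return [min(CHORDS, key=lambda chord: penalty(chord, bar)) for bar in bars]


# Hypothetical melody, not the one from the song.
melody = [
    [("G", 2), ("B", 1), ("D", 1)],
    [("A", 2), ("C", 1), ("E", 1)],
]
print(harmonize_naive(melody))  # ['G', 'Am']
```

Because each bar is scored in isolation, nothing stops this function from returning the same chord several bars in a row, which is exactly the weakness described above.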

How can we account for chord transitions, while still harmonizing with the melody in each bar? This is where we bring in Hidden Markov Models.

Second attempt: HMMs and the Viterbi Algorithm

Hidden Markov Models are a type of probabilistic graphical model that assumes there is an unobservable hidden state that influences the output variable. At each timestep, the hidden state transitions to a new state (possibly the same one), with probabilities given by a transition matrix.

A standard problem with HMMs is: given a sequence of observations, find the most likely sequence of hidden states that produced it. In our problem, the hidden states are the unknown sequence of chords, and the observed sequence is the bars of our melody. Now, solving the most likely sequence problem yields the chord progression that best harmonizes with the melody.

4.png

To encode the rule that we don’t like having the same chord consecutively for multiple bars, we assign the transitions so that the probability of transitioning from any chord to itself is low. The transition probability to every other chord is equal.

Above: Transition probabilities for chords (only 3 shown here, but this generalizes easily). A chord is equally likely to transition to any other chord, but unlikely to transition to itself.
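A transition table with this shape is easy to build. The sketch below is only an illustration: the self-transition probability of 0.05 is an assumed placeholder, not the value we used, and transition_matrix is a hypothetical helper name.

```python
def transition_matrix(chords, p_self=0.05):
    """Low probability of staying on a chord; the rest is split evenly.

    p_self is an assumed value chosen for illustration only.
    """
    p_other = (1.0 - p_self) / (len(chords) - 1)
    return {
        a: {b: (p_self if a == b else p_other) for b in chords}
        for a in chords
    }


T = transition_matrix(["G", "Am", "C"])
for chord, row in T.items():
    print(chord, row)
```

Each row sums to 1, so the table is a valid set of transition probabilities for any number of chords.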

The only thing left to do is solve the most likely sequence problem. We used a modified version of the Viterbi algorithm to do this — it solves the problem quickly using dynamic programming.

Running the Viterbi algorithm on our test melody:

Above: Using HMMs to eliminate repeated chords

And there we go — no more repeated chords! It doesn’t produce the same chord sequence as my manual harmonization, but sounds quite natural (and doesn’t violate any musical constraints).
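For readers who want to see the dynamic program, here is a compact Python sketch of the textbook Viterbi recurrence applied to this setup. It is not our modified version. It assumes an emission model in which the log-likelihood of a bar given a chord is minus the dissonance penalty, a self-transition probability of 0.05, and a made-up melody; all three are illustrative choices.

```python
import math

NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
DISSONANT = {1, 2, 6, 10, 11}
CHORDS = {
    "G": ["G", "B", "D"], "Am": ["A", "C", "E"], "Bm": ["B", "D", "F#"],
    "C": ["C", "E", "G"], "D": ["D", "F#", "A"], "Em": ["E", "G", "B"],
}


def penalty(chord, bar):
    """Duration-weighted count of dissonant (melody note, chord note) pairs."""
    return sum(duration
               for note, duration in bar
               for chord_note in CHORDS[chord]
               if (NOTE_TO_PC[note] - NOTE_TO_PC[chord_note]) % 12 in DISSONANT)


def viterbi(bars, p_self=0.05):
    """Most likely chord sequence under the assumed emission/transition model."""
    names = list(CHORDS)
    log_self = math.log(p_self)
    log_other = math.log((1.0 - p_self) / (len(names) - 1))

    def log_trans(prev, cur):
        return log_self if prev == cur else log_other

    # best[c]: best log-score of any chord sequence ending in chord c.
    best = {c: -penalty(c, bars[0]) for c in names}
    back = []  # back[t][c]: best predecessor of chord c at bar t + 1
    for bar in bars[1:]:
        new_best, pointers = {}, {}
        for c in names:
            prev = max(names, key=lambda p: best[p] + log_trans(p, c))
            pointers[c] = prev
            new_best[c] = best[prev] + log_trans(prev, c) - penalty(c, bar)
        best = new_best
        back.append(pointers)

    # Follow the back-pointers from the best final chord.
    path = [max(names, key=lambda c: best[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical melody: three identical bars that fit both Am and C.
melody = [[("E", 2), ("C", 2)]] * 3
print(viterbi(melody))
```

On this toy melody, choosing the best chord for each bar independently would give the same chord three times, while the Viterbi path alternates between the two chords that fit, because staying on a chord costs more than moving to a different one.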

Results and Future Directions

Using HMMs, we were able to harmonize a short melody segment. By looking at the entire melody as a whole, the HMM algorithm is able to produce a better chord progression than by optimizing every bar locally. Our implementation uses a greatly simplified musical model: it assumes the melody is in G major, and only considers 6 basic major and minor chords, but it successfully demonstrates a proof of concept.

When implementing this project, we thought this was new research (existing literature on harmonization focused mostly on classical music, not pop music). Alas, after some more literature review, it turns out that a Microsoft research team developed the same idea of using HMMs for melody harmonization, and published a series of papers in 2008. Oh well. Not quite publishable research but still a cool undergraduate class project.

The source code for this project is available on GitHub. Thanks to Michael Tu for collaborating with me!

If you enjoyed this post, check out my previous work on algorithmic counterpoint!