My First Research Paper: State Complexity of Overlap Assembly

My first research paper is complete and has been uploaded to arXiv! It’s titled “State Complexity of Overlap Assembly”, by Janusz Brzozowski, Lila Kari, myself, and Marek Szykuła. It’s in the area of formal language theory, which is a branch of theoretical computer science. I worked on it as a part-time URA research project for two terms during my undergrad at Waterloo.

The contents of the paper are fairly technical. In this blog post, I will explain the background and motivation for the problem, and give a statement of the main result that we proved.

What is Overlap Assembly?

The subject of this paper is a formal language operation called overlap assembly, which models annealing of DNA strands. When you cool a solution containing a lot of short DNA strands, they tend to stick together in a predictable manner: they stick together to form longer strands, but only if the ends “match”.

We can view this as a formal operation on strings. If we have two strings a and b such that some suffix of a matches some prefix of b, then the overlap assembly is the string we get by combining them together:

Above: Example of overlap assembly of two strings

In some cases, there might be more than one way of combining the strings together, for example, a=CATA and b=ATAG — then both CATATAG and CATAG are possible overlaps. Therefore, overlap assembly actually produces a set of strings.

We can extend this idea to languages in the obvious way. The overlap assembly of two languages is the set of all the strings you get by overlapping any two strings in the respective languages. For example, if L1={ab, abab, ababab, …} and L2={ba, baba, bababa, …}, then the overlap language is {aba, ababa, …}.
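To make the definition concrete, here is a minimal Python sketch of the operation on strings and on finite sets of strings. The function names are my own, and I'm assuming the overlapping part must be non-empty; the paper has the precise definition. Real regular languages are usually infinite, so this only works on finite samples of them.

```python
def overlap_assembly(a: str, b: str) -> set[str]:
    """Return every string formed by gluing a and b along a non-empty overlap.

    An overlap of length k exists when the last k characters of a equal the
    first k characters of b; the result keeps the shared part only once.
    """
    results = set()
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            results.add(a + b[k:])
    return results


def overlap_assembly_of_languages(lang1: set[str], lang2: set[str]) -> set[str]:
    """Overlap every string of lang1 with every string of lang2 (finite sets only)."""
    return {w for a in lang1 for b in lang2 for w in overlap_assembly(a, b)}


print(sorted(overlap_assembly("CATA", "ATAG")))  # ['CATAG', 'CATATAG']
print(sorted(overlap_assembly_of_languages({"ab", "abab"}, {"ba", "baba"})))
```

The second call uses only the two shortest strings from each of the example languages, so it produces only the first few strings of the overlap language.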

It turns out that if we start with two regular languages, then the overlap assembly language will always be regular too. I won’t go into the details, but it suffices to construct an NFA that recognizes the overlap language, given two DFAs recognizing the two input languages.

Above: Example of construction of an overlap assembly NFA (Figure 2 of our paper)

What is State Complexity?

Given that regular languages are closed under overlap assembly, a natural question to ask is: how “complex” is the regular language that gets produced? One measure of complexity of regular languages is state complexity: the number of states in the smallest DFA that recognizes the language.

State complexity was first studied in 1994 by Sheng Yu et al. Some operations do not increase state complexity very much: if two regular languages have state complexities m and n, then their union has state complexity at most mn. On the other hand, the reversal operation can blow up state complexity exponentially — it’s possible for a language to have state complexity n but its reversal to have state complexity 2^n.

Here’s a table of the state complexities of a few regular language operations:


Over the years, state complexity has been studied for a wide range of other regular language operations. Overlap assembly is another such operation — the paper studies the state complexity of this operation.

Main Result

In our paper, we proved that the state complexity of overlap assembly (for two languages with state complexities m and n) is at most:

2(m-1) 3^{n-1} + 2^n

Further, we constructed a family of DFAs that achieve this bound, so the bound is tight.
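To get a feel for how fast this bound grows, here is a small Python sketch that evaluates it next to the mn bound for union mentioned earlier. The function names are just for illustration.

```python
def overlap_assembly_bound(m: int, n: int) -> int:
    """Upper bound from the paper: 2(m-1) 3^(n-1) + 2^n."""
    return 2 * (m - 1) * 3 ** (n - 1) + 2 ** n


def union_bound(m: int, n: int) -> int:
    """Upper bound on the state complexity of union: mn."""
    return m * n


for m, n in [(3, 3), (4, 4), (10, 10)]:
    print(m, n, union_bound(m, n), overlap_assembly_bound(m, n))
```

The bound is linear in m but exponential in n, so the size of the second DFA matters far more than the size of the first.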

That’s it for my not-too-technical summary of this paper. I glossed over a lot of the details, so check out the paper for the full story!

EulerCoin: Earn digital tokens by solving difficult mathematical Project Euler problems!

Today I’m going to show you something I built during the ETHWaterloo hackathon this weekend.

ETHWaterloo is a hackathon themed around the cryptocurrency Ethereum, which is gaining a lot of traction lately. You have to build something using Ethereum technology, and you can listen to many talks given by experts in the cryptocurrency community. It was a great way to learn more about this new blockchain technology.

Above: Ethereum founder Vitalik Buterin speaking in the opening ceremony

The hackathon opened with a series of talks. I used to be good friends with Vitalik back in first year of university, but haven’t talked to him much after he dropped out of university to work on Ethereum. It was good to see him in person again at this hackathon.

After the opening ceremony, we went to a few tutorials on how to set up the Solidity environment, then brainstormed ideas for what to build. Eventually, we came up with this:

EulerCoin: our new Ethereum-based ICO

We often hear the media describe Bitcoin as “computers solving hard math problems to generate money”. Well, we made a new cryptocurrency, EulerCoin, where tokens are created by correctly submitting the answers to Project Euler problems!

So how does this work?

  • EulerCoin is implemented as an Ethereum smart contract that runs on the blockchain. A user submits an answer to the smart contract, which then tries to submit it to Project Euler. If it’s correct, the user is awarded some EulerCoin.
  • Correct answers are recorded on the blockchain forever, so anyone can use them to cheat on Project Euler and access the solution forums.
  • For each problem, only the first submitter gets awarded EulerCoin. The amount is larger if the problem has fewer solvers (meaning it’s a harder problem).
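The rules above can be sketched in a few lines of Python. This is not our Solidity contract, just a toy model of the same logic: the class name, the reward formula, and the lookup table standing in for the Project Euler check are all made up for illustration.

```python
class EulerCoinSketch:
    """Toy model of the award rules; not the actual smart contract."""

    def __init__(self, answers: dict[int, str], solver_counts: dict[int, int]):
        self.answers = answers              # mock answer checker (assumption)
        self.solver_counts = solver_counts  # number of solvers per problem
        self.solved: dict[int, tuple[str, str]] = {}  # problem -> (user, answer)
        self.balances: dict[str, int] = {}

    def reward_for(self, problem: int) -> int:
        # Hypothetical formula: fewer solvers means a larger reward.
        return 1_000_000 // (self.solver_counts[problem] + 1)

    def submit(self, user: str, problem: int, answer: str) -> int:
        """Return the number of tokens awarded for this submission."""
        if problem in self.solved:
            return 0  # only the first correct submitter is paid
        if self.answers.get(problem) != answer:
            return 0  # wrong answer
        reward = self.reward_for(problem)
        self.solved[problem] = (user, answer)  # recorded permanently
        self.balances[user] = self.balances.get(user, 0) + reward
        return reward


coin = EulerCoinSketch(answers={1: "answer-one", 2: "answer-two"},
                       solver_counts={1: 999, 2: 9})
print(coin.submit("alice", 1, "answer-one"))  # first correct answer is rewarded
print(coin.submit("bob", 1, "answer-one"))    # 0: already solved
```

On the real blockchain, `solved` would be public contract storage, which is why recorded answers are visible to everyone.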

I’ve maintained a list of Project Euler answers since 2009, but recently it’s been hard to convince people to post their answers publicly. Something like EulerCoin would incentivize people to contribute their answers.

After working on it for a day or so, we got a prototype working:

Above: Screenshot of EulerCoin prototype

Here’s a simple web UI, where clicking “Submit” opens up Metamask to initiate an Ethereum transaction. To avoid actually spamming Project Euler for a hackathon project, we mocked up the answer-checking component, but everything else works.

This was the first time programming in Ethereum / Solidity for all of us, so it was a novel experience. After doing this hackathon, I have a much better understanding of what Ethereum does. Bitcoin is a digital currency that can be transferred and not much else. Ethereum is a whole platform where you can run your own custom, Turing-complete smart contracts, all on the blockchain.

Ethereum also has some major limitations that may hinder widespread adoption. Here are a few facts about Ethereum that I was surprised to learn:

  • All Solidity code is run once by everybody in the blockchain. This is a big scalability problem — imagine that when you buy some coffee, everyone in the world from Canada to France to Vietnam has to process this fact! To sync an Ethereum node, you have to have downloaded a blockchain containing every Ethereum transaction from the beginning of time — too much data for my laptop to handle. This also means that storing even small amounts of data on the blockchain can get incredibly expensive.
  • All data stored in the blockchain is visible to everybody. Because of this, it’s quite difficult to maintain any kind of privacy, which was a problem we ran into for one of our ideas. You can try encryption, but where do you store the keys? If you put the keys on the blockchain, then anyone can see them, and you’re back to square one. There are some ideas of using fancy cryptography like ring signatures to keep privacy in limited situations, but these ideas are not very mature yet.

I’m curious to see how Ethereum will solve these big problems in the upcoming years.

Above: Team EulerCoin (Left to Right: Andrei Danciulescu, Chuyi Liu, Shala Chen, Me)

Our submission didn’t make the final round of judging, but we had a good time. Before the closing ceremony, our team gathered for a group photo.

That’s it for now. I’m not sure if I want to invest my own money into Ethereum just yet, but certainly I will read more about cryptocurrencies. Bubble or not, the mathematical ideas and technologies are worth looking into.

Paper Review: Linguistic Features to Identify Alzheimer’s Disease

Today I’m going to be sharing a paper I’ve been looking at, related to my research: “Linguistic Features Identify Alzheimer’s Disease in Narrative Speech” by Katie Fraser, Jed Meltzer, and my adviser Frank Rudzicz. The paper was published in 2016 in the Journal of Alzheimer’s Disease. It uses NLP to automatically diagnose patients with Alzheimer’s disease, given a sample of their speech.

Alzheimer’s disease is a disease that you might have heard of, but it doesn’t get much attention in the media, unlike cancer and stroke. It is a neurodegenerative disease that mostly affects elderly people. 5 million Americans are living with Alzheimer’s, including 1 in 9 over the age of 65, and 1 in 3 over the age of 85.

Alzheimer’s is also the most expensive disease in America. After diagnosis, patients may continue to live for over 10 years, and during much of this time, they are unable to care for themselves and require a constant caregiver. In 2017, Medicare and Medicaid paid for roughly 68% of the cost of caring for patients with Alzheimer’s, and this spending is expected to increase as the elderly population grows.

Despite a lot of recent advances in our understanding of the disease, there is currently no cure for Alzheimer’s. Since the disease is so prevalent and harmful, research in this direction is highly impactful.

Previous tests to diagnose Alzheimer’s

One of the early signs of Alzheimer’s is having difficulty remembering things, including words, leading to a decrease in vocabulary. A reliable way to test for this is a retrieval question like the following (Monsch et al., 1992):

In the next 60 seconds, name as many items as possible that can be found in a supermarket.

A healthy person can rattle off about 20-30 items in a minute, whereas someone with Alzheimer’s can only produce about 10. By setting the threshold at 16 items, the researchers could classify even mild cases of Alzheimer’s with about 92% accuracy.

This doesn’t quite capture the signs of Alzheimer’s disease though. Patients with Alzheimer’s tend to be rambly and incoherent. This can be tested with a picture description task, where the patient is given a picture and asked to describe it with as much detail as possible (Giles, Patterson, Hodges, 1994).

Above: Boston Cookie Theft picture used for picture description task

There was no time limit: the patients talked until they indicated they had nothing more to say, or until they had been silent for 15 seconds.

Patients with Alzheimer’s disease produced descriptions with varying degrees of incoherence. Here’s an example transcript, from the above paper:

Experimenter: Tell me everything you see going on in this picture

Patient: oh yes there’s some washing up going on / (laughs) yes / …… oh and the other / ….. this little one is taking down the cookie jar / and this little girl is waiting for it to come down so she’ll have it / ………. er this girl has got a good old splash / she’s left the taps on (laughs) she’s gone splash all down there / um …… she’s got splash all down there

You can clearly tell that something’s off, but it’s hard to put a finger on exactly what the problem is. Well, time to apply some machine learning!

Results of Paper

Fraser’s 2016 paper uses data from the DementiaBank corpus, consisting of 240 narrative samples from patients with Alzheimer’s, and 233 from a healthy control group. The two groups were matched to have similar age, gender, and education levels. Each participant was asked to describe the Boston Cookie Theft picture above.

Fraser’s analysis used both the original audio data, as well as a detailed computer-readable transcript. She looked at 370 different features covering all sorts of linguistic metrics, like ratios of different parts of speech, syntactic structures, vocabulary richness, and repetition. Then, she performed a factor analysis and identified a set of 35 features that achieves about 81% accuracy in distinguishing between Alzheimer’s patients and controls.

According to the analysis, a few of the most important distinguishing features are:

  • Pronoun to noun ratio. Alzheimer’s patients produce vague statements and tend to substitute pronouns like “he” for nouns like “the boy”. This also applies to adverbial constructions like “the boy is reaching up there” rather than “the boy is reaching into the cupboard”.
  • Usage of high frequency words. Alzheimer’s patients have difficulty remembering specific words and replace them with more general, therefore higher frequency words.
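As a rough illustration of the first feature, here is a Python sketch that computes a pronoun-to-noun ratio from tokens that have already been part-of-speech tagged. This is not the paper's feature extraction code: the function name is mine, the tags are assumed to follow the Penn Treebank convention, and the tagged sentences are hand-written examples.

```python
# Penn Treebank tags assumed: PRP / PRP$ for pronouns, NN* for nouns.
PRONOUN_TAGS = {"PRP", "PRP$"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def pronoun_noun_ratio(tagged_tokens: list[tuple[str, str]]) -> float:
    """Number of pronouns divided by number of nouns in a tagged transcript."""
    pronouns = sum(1 for _, tag in tagged_tokens if tag in PRONOUN_TAGS)
    nouns = sum(1 for _, tag in tagged_tokens if tag in NOUN_TAGS)
    if nouns == 0:
        return float("inf")  # no nouns at all: ratio is unbounded
    return pronouns / nouns


# Hand-tagged toy sentences (not taken from DementiaBank).
vague = [("he", "PRP"), ("is", "VBZ"), ("taking", "VBG"), ("it", "PRP"),
         ("from", "IN"), ("the", "DT"), ("jar", "NN")]
specific = [("the", "DT"), ("boy", "NN"), ("is", "VBZ"), ("reaching", "VBG"),
            ("into", "IN"), ("the", "DT"), ("cupboard", "NN")]

print(pronoun_noun_ratio(vague))     # 2.0
print(pronoun_noun_ratio(specific))  # 0.0
```

In practice the tags would come from an automatic tagger, and this ratio would be just one of the 370 features fed to the classifier.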

Future directions

Shortly after this research was published, my adviser Frank Rudzicz co-founded WinterLight Labs, a company that’s working on turning this proof-of-concept into an actual usable product. It also diagnoses various other cognitive disorders like Primary Progressive Aphasia.

A few other grad students in my research group are working on Talk2Me, which is a large longitudinal study to collect more data from patients with various neurodegenerative disorders. More data is always helpful for future research.

So this is the starting point for my research. Stay tuned for updates!

What’s the difference between Mathematics and Statistics?

Statistics has a sort of funny and peculiar relationship with mathematics. In a lot of university departments, they’re lumped together and you have a “Department of Mathematics and Statistics”. Other times, it’s grouped as a branch of applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s “not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking.


Above: if pure math and statistics were like games


Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so “you know it when you see it”; things are rarely as black-and-white as in mathematics. This is born out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take for example the concept of an “outlier”. Many statistical methods behave badly when the data contains outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far it is from the rest of the points, and what kind of model you’re fitting.


In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.

For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B test and find that changing a button to blue results in more clicks, with a p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There’s no correct answer, and you have to use your judgment.
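For reference, here is one way the p-value in such an A/B test might be computed: a two-sided two-proportion z-test, sketched in plain Python. The function name and the click counts are invented for illustration, and the normal approximation is itself an assumption that only holds for reasonably large samples.

```python
import math


def two_proportion_p_value(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two click-through rates."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both buttons perform equally.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up counts: 1000 visitors per variant.
print(two_proportion_p_value(100, 1000, 150, 1000))
print(two_proportion_p_value(100, 1000, 110, 1000))
```

The code hands you a number; deciding whether that number justifies changing the button is still up to you.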

Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and if this assumption is violated then you’ll get wrong results, so you need to use a different model.



Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer, and you have to use your judgment.

Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this:


There is some evidence of non-linearity, but how much “bendiness” can you accept before the data is definitely not “roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.

I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike in math, in statistics there is no universal procedure that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

  • A random variable is drawn from a normal (Gaussian) distribution
  • Two random variables are independent
  • Two random variables satisfy a linear relationship
  • Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: “all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belong in this category.


Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how can you be sure anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging “correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. Christofides’ algorithm, long the best known approximation algorithm for the metric Traveling Salesman problem, has an approximation ratio of 1.5: even for the worst possible input, the algorithm gives a tour that’s no more than 1.5 times as long as the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions.

Above: the winning neural network architecture for ImageNet Challenge 2012. Currently, theory fails at explaining why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data, but is not concerned with how much data you need to achieve a desired accuracy rate.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.


Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better – both approaches have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every value has a confidence interval where you can expect it to be right about 95% of the time, but we can never be 100% sure about anything. But given enough data, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data that they’re trained on. In other words, if we use a regression model with data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.
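Here is a toy Python sketch of the extrapolation problem. It fits a straight line by least squares to noiseless "apple" data (distance fallen during the first second, using the textbook formula d = ½gt²) and then asks the fitted line about a time far outside the training range. The setup and names are my own simplification, not a serious physical model.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


G = 9.81  # gravitational acceleration in m/s^2


def fall_distance(t: float) -> float:
    """The 'mathematical law': distance fallen after t seconds."""
    return 0.5 * G * t * t


# Training data: observations between 0 and 1 second only.
train_t = [i / 10 for i in range(11)]
train_d = [fall_distance(t) for t in train_t]
slope, intercept = fit_line(train_t, train_d)


def predict(t: float) -> float:
    return slope * t + intercept


inside_error = abs(predict(0.5) - fall_distance(0.5))     # within training range
outside_error = abs(predict(10.0) - fall_distance(10.0))  # far outside it
print(inside_error, outside_error)
```

The line is off by well under a metre inside the range it was fitted on, and by hundreds of metres at ten seconds, while the formula is exactly as good at ten seconds as at half a second.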

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.

Great Solo Asian Trip Part 2: Languages of East Asia

This is the second blog post in my two-part series on my 4-month trip to Asia. Here is part one. In this second blog post, I will focus on the languages I encountered in Asia and my attempts at learning them.

I’ve always enjoyed learning languages (here is a video of me speaking a bunch of them), and Asia is a very linguistically diverse place compared to North America, with almost every country speaking a different language. So in every country I visited, I tried to learn the language as best as I could. Realistically, it’s not possible to go from zero to fluency in the span of a vacation, but you can actually learn a decent amount in a week or two. Travelling in a foreign country is a great motivator for learning languages, and I found myself learning new words much faster than I did when studying at home.

I went to five countries on this trip, in chronological order: China, Japan, South Korea, Vietnam, and Malaysia.

China


In the first month of my trip, I went to a bunch of cities in China with my mom and sister. For the most part, there wasn’t much language learning, as I already spoke Mandarin fluently.

One of the regions we went to was Xishuangbanna, in southern Yunnan province. Xishuangbanna is a special autonomous prefecture, designated by the Chinese government for the Dai ethnic minority. The outer fringes of China are home to various non-Han minority groups, each with its own unique culture and language. Home to 25 officially recognized ethnic groups and countless more unrecognized ones, Yunnan is one of the most linguistically diverse places in the world.

Above: Bilingual signs in Chinese and Dai in Jinghong

In practice, recent migration of the Chinese into the region meant that even in Xishuangbanna, the Han Chinese outnumber the local Dai people, and Mandarin is spoken everywhere. In the streets of Jinghong, you can see bilingual signs written in Mandarin and the Dai language (a language related to Thai). Their language is written in the Tai Lue script, which looks pretty cool, but I never got a chance to learn it.

Next stop on my trip was Hong Kong. The local language here is Cantonese, which shares a lot of vocabulary and grammatical structure with my native Mandarin, since both descend from Middle Chinese, spoken about 1500 years ago. However, more than a millennium of sound changes means that today, Mandarin and Cantonese are quite different languages and are not at all mutually intelligible.

I was eager to practice my Cantonese in my two days in Hong Kong, but found that whenever I said something incorrect, they would give me a weird look and immediately switch to Mandarin or English. Indeed, learning a language is very difficult when everybody in that country is fluent in English. Oh well.

Japan


A lot of travellers complain that the locals speak no English; you don’t often hear of complaints that their English is too good! Well, Japan did not leave me disappointed. Although everyone studies English in school, most people have little practice actually using it, so Japan is ranked near the bottom in English proficiency among developed nations. Perfect!

Before coming to Japan, I already knew a decent amount of Japanese, mostly from watching lots of anime. However, there are very few Japanese people in Canada, so I didn’t have much practice actually speaking it.

I was in Japan for one and a half months, longer than in any other country on this trip. In order to accelerate my Japanese learning process, I enrolled in classes at a Japanese language school and stayed with a Japanese homestay family. This way, I learned formal grammatical structures in school and got conversation practice at home. I wrote a more detailed blog post here about this part of the trip.

Phonologically, Japanese is an easy language to pronounce because it has a relatively small number of consonants and only five vowels. There are no tones, and every syllable has form CV (consonant followed by a vowel). Therefore, an English speaker will have a much easier time pronouncing Japanese correctly than the other way around.

Grammatically, Japanese has a few oddities that take some time to get used to. First, the subject of a sentence is usually omitted, so the same phrase can mean “I have an apple” or “he has an apple”. Second, every time you use a verb, you have to decide between the casual form (used between friends and family) or the polite form (used when talking to strangers). Think of verb conjugations, but instead of verb endings differing by subject, they’re conjugated based on politeness level.

The word order of Japanese is also quite different from English, and so is the way words are built. Japanese is an agglutinative language, so you can form really long words by attaching various suffixes to verbs. For example:

  • iku: (I/you/he) goes
  • ikanai: (I/you/he) doesn’t go
  • ikitai: (I/you/he) wants to go
  • ikitakunai: (I/you/he) doesn’t want to go
  • ikanakatta: (I/you/he) didn’t go
  • ikitakunakatta: (I/you/he) didn’t want to go
  • etc…
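The forms in this list can be generated mechanically by gluing suffixes together, which is what "agglutinative" means in practice. Below is a toy Python sketch for this one verb. The two stems are hard-coded, the function names are mine, and the rules rely on the fact that the -nai and -tai forms conjugate like i-adjectives; it is not a general Japanese conjugator.

```python
# Hard-coded stems of "iku" (to go); other verbs form their stems differently.
NEGATIVE_STEM = "ika"
CONTINUATIVE_STEM = "iki"


def negative(stem: str) -> str:
    """Attach the negative suffix -nai to the negative stem."""
    return stem + "nai"


def desiderative(stem: str) -> str:
    """Attach the 'want to' suffix -tai to the continuative stem."""
    return stem + "tai"


def adjective_negative(form: str) -> str:
    """Negate a form ending in -i (like -tai): drop the -i, add -kunai."""
    return form[:-1] + "kunai"


def adjective_past(form: str) -> str:
    """Past tense of a form ending in -i (like -nai): drop the -i, add -katta."""
    return form[:-1] + "katta"


doesnt_go = negative(NEGATIVE_STEM)                 # ikanai
wants_to_go = desiderative(CONTINUATIVE_STEM)       # ikitai
doesnt_want_to_go = adjective_negative(wants_to_go)  # ikitakunai
didnt_go = adjective_past(doesnt_go)                # ikanakatta
didnt_want_to_go = adjective_past(doesnt_want_to_go)  # ikitakunakatta
print(doesnt_go, wants_to_go, doesnt_want_to_go, didnt_go, didnt_want_to_go)
```

Each suffix produces a new word that can take further suffixes, so the longest form is just three functions applied one after another.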

None of this makes Japanese fundamentally hard, just different from a lot of other languages. This also explains why Google Translate sucks so much at Japanese. When translating Japanese to English, the subjects of sentences are implicit in Japanese but must be explicit in English; when translating English to Japanese, the politeness level is implicit in English but must be explicit in Japanese.

One more thing to beware of is the Japanese pitch accent. Although it’s nowhere close to a full tonal system like Chinese, stressed syllables have a slightly higher pitch. For example, the word “kirei” (pretty) has a pitch accent on the first syllable: “KI-rei”. Once I messed this up and put the accent on the second syllable instead: “ki-REI”, but unbeknownst to me, to native Japanese this sounds like “kirai” (to hate), which has the accent on the second syllable. So I meant to say “nihon wa kirei desu” (Japan is pretty) but it sounded more like “nihon wa kirai desu” (I hate Japan)!


That was quite an awkward moment.

When I headed west from Tokyo into the Kansai region of Kyoto and Osaka, I noticed a bit of dialectal variation. The “u” in “desu” is a lot more drawn out, and the copula “da” was replaced with “ya”, so on the streets of Kyoto I’d hear a lot of “yakedo” instead of “dakedo” in Tokyo. I got to practice my Japanese with my Kyoto Airbnb host every night, and picked up a few words of the Kansai dialect. For example:

  • ookini: thank you (Tokyo dialect: arigatou)
  • akan: no good (Tokyo dialect: dame)
  • okan: mother (Tokyo dialect: okaasan)

The writing system of Japanese is quite unique and deserves a mention. It actually has three writing systems: the Hiragana syllabary for grammatical particles, the Katakana syllabary for foreign loanwords, and Kanji, logographic characters borrowed from Chinese. A Kanji character can be read in several different ways. Typically, when you have two or more Kanji together, it’s a loanword from Chinese read using a Chinese-like pronunciation (eg: novel, 小説 is read shousetsu) but when you have a single Kanji character followed by a bunch of Hiragana, it’s a Japanese word that means the same thing but sounds nothing like the Chinese word (eg: small, 小さい is read chiisai).

The logographic nature of Kanji is immensely helpful for Chinese people learning Japanese. You get the etymology of every Chinese loanword, and you get to “read” texts well above your level as you know the meaning of most words (although it gives you no information on how the word is pronounced).

My Japanese improved a lot during my 6 weeks in the country. By the time I got to Fukuoka, at the western end of Japan, I had no problems holding a conversation for 30 minutes with locals in a restaurant (provided they spoke slowly, of course). It’s been one of my most rewarding language learning experiences to date.

South Korea

From Fukuoka, I traveled across the sea for a mere three hours, on a boat going at a speed slower than a car on a freeway, and landed in a new country. Suddenly, the script on the signs was different, and the language on the street was once again strange and unfamiliar. You can’t get the same satisfaction arriving in an airplane.

Above: Busan, my first stop in Korea

Of course, I was in the city of Busan, in South Korea. I was a bit nervous coming here, since it was the first time in my life that I’d been in a country where I wasn’t at least conversationally proficient in the language. Indeed, procuring a SIM card on my first day entailed a combination of me trying to speak broken Korean, them trying to speak broken English, hand gesturing, and (shamefully) Google Translate.

Before coming to Korea, I knew how to read Hangul (the Korean writing system) and a couple dozen words and phrases I picked up from Kpop and my university’s Korean language club. I also tried taking Korean lessons on italki (a language learning website) and working through various textbooks, but the language never really “clicked” for me, and even now I can’t hold a conversation in Korean for very long.

I suspect the reason has to do with passive knowledge: I’ve had a lot of exposure to Japanese from hundreds of hours of watching anime, but nowhere near as much exposure to Korean. Passive knowledge is important because humans learn language from data, and given enough data, we pick up on a lot of grammatical patterns without explicitly learning them.

Also, studying Kpop song lyrics is not a very effective way to learn Korean. The word distribution in song lyrics is sufficiently different from the word distribution in conversation that studying song lyrics would likely make you better at understanding other songs but not that much better at speaking Korean.

Grammatically, Japanese and Korean are very similar: they have nearly identical word order, and their grammatical particles have an almost one-to-one correspondence. They both conjugate verbs differently based on politeness, and form complex words by gluing suffixes onto the end of verbs. The grammars of the two languages are so similar that you can almost translate Japanese to Korean just by translating each morpheme without changing the order. Both are very different from Chinese, the other major language spoken in the region.

Phonologically, Korean is a lot more complex than Japanese, which is bad news for language learners. Korean has about twice as many different vowels as Japanese, and a few more consonants as well. What’s more, Korean maintains a three-way distinction for many consonants: for example, the ‘b’ sound has a plain version (불: bul), an aspirated version (풀: pul), and a tense version (뿔: ppul). I had a lot of difficulty telling these sounds apart, and often had to guess many combinations to find a word in the dictionary.

Unlike Chinese and Japanese, Korean does not use a logographic writing system. In Hangul, each word is spelled the way it sounds, and the system is quite regular. On one hand, this means that Hangul can be learned in a day, but on the other hand, it’s not terribly useful to be able to sound out Korean text without knowing what anything means. I actually prefer the Japanese logographic system, since it makes the Chinese cognates a lot clearer. In fact, about 60% of Korean’s vocabulary consists of Chinese loanwords, but with a phonetic writing system, it’s not always easy to identify what they are.


Vietnam

The next country on my trip was Vietnam. I learned a few phrases from a Pimsleur audio course, but apart from that, I knew very little about the Vietnamese language coming in. The places I stayed were sufficiently touristy that most people spoke enough English to get by, but not so fluently as to make learning the language pointless.

Vietnamese is a tonal language, like Mandarin and Cantonese. It has 6 tones, but they’re quite different from the tones in Mandarin (which has 4-5). At a casual glance, Vietnamese may sound similar to Chinese, but the two languages belong to different language families; what they share is mostly vocabulary that Vietnamese borrowed from Chinese, and those loanwords are often hard to recognize by ear.

Above: Comparison between Mandarin tones (above) and Vietnamese tones (below)

Vietnamese syllables have a wide variety of distinct vowel diphthongs; multiplied with the number of tones, this means that there are a huge number of distinct syllables. By the laws of information theory, this also means that one Vietnamese syllable contains a lot of information — I was often surprised at words that were one syllable in Vietnamese but two syllables in Mandarin.
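To put a rough number on the information theory point: if a language has N distinct syllables that are all equally likely, each syllable carries at most log2(N) bits. The sketch below is only a back-of-the-envelope illustration. The tone counts (6 for Vietnamese, 4 for Mandarin) come from the text above, but the number of toneless base syllables is a made-up placeholder, and real syllables are far from equally likely, so treat the results as upper bounds rather than measurements.

```python
import math

def max_bits_per_syllable(base_syllables: int, tones: int) -> float:
    """Upper bound on information per syllable, assuming every
    (base syllable, tone) combination exists and is equally likely."""
    return math.log2(base_syllables * tones)

# HYPOTHETICAL inventory size, chosen only for illustration.
# The same value is used for both languages to isolate the effect of tones.
BASE = 400

vietnamese = max_bits_per_syllable(BASE, tones=6)
mandarin = max_bits_per_syllable(BASE, tones=4)

# Extra bits contributed by the tones alone: log2(6) - log2(4) = log2(1.5)
tone_gain = vietnamese - mandarin

print(f"Vietnamese: {vietnamese:.2f} bits, Mandarin: {mandarin:.2f} bits")
print(f"Extra bits from two more tones: {tone_gain:.2f}")
```

Under these assumptions, the two extra tones add a bit more than half a bit per syllable, independent of the placeholder inventory size. A larger set of vowel combinations would widen the gap further.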

My Vietnamese pronunciation must have sounded very strange to the locals: often, when I said something, they would understand what I said, but then they’d burst out laughing. Inevitably, they’d follow up by asking if I was overseas Vietnamese.

Vietnamese grammar is a bit like Chinese, with a subject-verb-object word order and a lack of verb conjugations. So in Vietnamese, if you string together a bunch of words in a reasonable order, there’s a good chance it would be correct (and close to zero chance in Japanese or Korean). One notable difference is that in Vietnamese, the adjective comes after the noun, whereas it comes before the noun in Chinese.

One language peculiarity is that Vietnamese doesn’t have pronouns for “I” or “you”. Instead, you must determine your social relationship to the other party to determine what words to use. If I’m talking to an older man, then I refer to him as anh (literally: older brother) and I refer to myself as em (literally: younger sibling). These words would change if I were talking to a young woman, or much older woman, etc. You can imagine that this system is quite confusing for foreigners, so it’s acceptable to use Tôi which unambiguously means “I”, although native speakers don’t often use this word.

Written Vietnamese uses the Latin alphabet (kind of like Chinese Pinyin), and closely reflects the spoken language. Most letters are pronounced more or less the way you’d expect, but there are some exceptions, for example, ‘gi’, ‘di’, and ‘ri’ are all pronounced like ‘zi’.

In two weeks in Vietnam, I didn’t learn enough of the language to have much of a conversation, but I knew enough for most of the common situations you encounter as a tourist, and could haggle prices with fruit vendors and motorcycle taxi drivers. I also learned how to tell apart the northern Hanoi dialect and the southern Saigon dialect (they’re mutually intelligible but have a few differences).


Malaysia

The final country on my trip was Malaysia. Malaysia is a very culturally diverse country, with ethnic Malays, Chinese, and Indians living side by side. The Malay language is frequently used for interethnic communication. I learned a few phrases of the language, but didn’t need to use it much, because everybody I met spoke either English or Mandarin fluently.

Malaysia is a very multilingual country. The Malaysian-Chinese people speak a southern Chinese dialect (one of Hokkien, Hakka, or Cantonese), Mandarin, Malay, and English. In Canada, it’s common to speak one or two languages, but we can only dream of speaking 4-5 languages fluently, as many Malaysians do.

Rate of Language Learning

I kept a journal of new words I learned in all my languages. Whenever somebody said a word I didn’t recognize, I would make a note of it, look it up later, and record it in my journal. When I wanted to say something but didn’t know the word for it, I would also add it to my journal. This way, I learned new words in a natural way, without having to memorize lists of words.

Above: Tally of words learned in various languages

On average, I picked up 3-5 new words for every day I spent in a foreign country. At this rate, I should be able to read Harry Potter (~5000 unique words) after about 3 years.
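As a quick sanity check on that estimate, here is a minimal sketch of the arithmetic. It assumes a constant learning rate every single day, no forgetting, and starting from zero vocabulary, none of which is realistic, so it’s a rough illustration rather than a prediction. The function name is just something I made up for this example.

```python
def years_to_learn(target_words: int, words_per_day: float) -> float:
    """Years needed to reach target_words at a constant daily rate."""
    return target_words / words_per_day / 365

TARGET = 5000  # rough unique-word count quoted above

# Try the low end, midpoint, and high end of the 3-5 words/day range.
for rate in (3, 4, 5):
    print(f"{rate} words/day -> {years_to_learn(TARGET, rate):.1f} years")
```

At the midpoint of 4 words per day this works out to about 3.4 years, with the full range running from about 2.7 years (5 words per day) to about 4.6 years (3 words per day).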

That’s all for now. In September, I will be starting my master’s in Computational Linguistics; hopefully, studying all these random languages will be of some use.

With so much linguistic diversity, and with most people speaking little English, Asia is a great vacation spot for language nerds and aspiring polyglots!

Great Solo Asian Trip Part 1: General Thoughts

This is the first part of my two-part series on my 4-month trip to Asia. In this post, I will talk about my thoughts related to the trip and travel in general; in the second post I will go into detail about my experience learning the languages of all the countries I went to.

During my 4 months of travel, I visited 5 countries: China, Japan, South Korea, Vietnam, and Malaysia. Now I will admit that I’m not a very skilled photographer, nor am I able to write succulent descriptions of exotic foods. Instead of boring the reader with a play-by-play itinerary of the whole 4 months, I’ve grouped my thoughts by theme rather than in chronological order.

Trip Planning

For the first month of my trip, I travelled around China with my mom and sister. Most of my relatives live there, and it’s obligatory to pay them a visit every 5 years or so. After China, Japan and South Korea were natural countries to visit next, since I’d wanted to go there for a long time but never got the chance to.

Above: Me at Yangshuo, Guangxi province, China

That left me another month of travel time, and initially I wasn’t sure where to go. There were a number of places I wanted to go in Southeast Asia, like Vietnam, Thailand, Cambodia, and Myanmar, but it was July and much of Southeast Asia was in the middle of monsoon season. I figured it would suck to go somewhere and have it rain nonstop for days, so I looked at climatology maps and picked two places that were relatively dry at that time of year: Central Vietnam and Malaysia.

Travel and Productivity

During my trip, I spent roughly half my time doing touristy activities and the other half of my time sitting in cafes, working on various projects. One might wonder why you’d want to travel to Vietnam just to code in a cafe — but I found that after visiting tourist attractions every day for a week, I’d start to feel overwhelmed. Travelling is physically tiring, so it was important to pace myself for such a long trip.

I found cafes to be reasonably productive environments. Not only do they have better wifi than my hotel room, they also have nice ambiance and I get to try all sorts of snacks and drinks. Here are some of the things I did:

  • Wrote a few blog posts about math and language learning
  • Did some data analysis on my school’s student enrolment statistics, and got a lot of practice using R in the process
  • Built some NLP modules for Snaptravel, a friend’s startup
  • Watched some deep learning lectures and set up an AWS GPU instance to play with some deep learning models
  • Spent about 30 minutes a day learning the country’s language

Making Friends

I found that when travelling alone, I’m a lot more likely to talk to random strangers than when I’m travelling with other people — you’re much less approachable if you’re already engaged in an excited conversation with your travelmate. When travelling solo, it’s easy to start talking to the person next to you at a restaurant or on a train.

In all the countries I visited, I found that older people were universally more eager to talk to me than younger people. I imagine that young people are busy with their own problems and generally have better things to do than sit around and chat. One caveat is that older people usually don’t speak English. In Japan, this was a great way to practice my Japanese, but it was a problem in Korea and Vietnam, where I don’t speak the language very well. They still tried to talk to me, but I just didn’t know enough of their language to have a conversation for very long. When I got to Malaysia, where Mandarin is widely spoken, it was nice to be able to talk to the locals again.

Some of my friends like to use apps like Meetup to find events to meet people. I never tried it — it seemed like too much effort to actively make friends when I’m only in a city for a few days. Throughout the trip, I kept in contact with a bunch of relatives in China and friends in North America, so I never really felt lonely.

Cultural Diversity and Food

Asia is a very culturally diverse place, with more or less every country having its own ethnic group and its own language.

At the same time, each individual country is not a very diverse place. With the exception of Malaysia, every country I visited had a very homogeneous society. In Vietnam, everybody who wasn’t a tourist was Vietnamese. Currently, fewer than 1 in 1000 residents of Vietnam are foreign-born, compared to about 1 in 5 for Canada. Essentially, if you’re in Da Nang, you can get amazing Pho or Cao Lau, but if you’re craving an authentic Mexican burrito? Tough luck.

Above: Delicious food from around Asia. Don’t ask me what they are, because I don’t remember either.

In Canada, we’re used to attending a lecture taught by an Indian professor, having lunch at a Chinese restaurant, then going to the doctor and seeing a black physician. Multiculturalism is something we take for granted, but it is simply not a thing in most parts of the world.

Vietnam and Malaysia are good countries for eating. There aren’t that many must-see tourist attractions, so you can relax and enjoy the scenery and the food. It’s nice to be able to order a full meal, complete with appetizers, drinks, and dessert, and still have the bill be less than $10.

Travel and History

Above: North Korea from across the Demilitarized Zone and Hiroshima Atom Bomb Dome

Travel made me develop a greater appreciation for recent history and geopolitics, to understand how these countries became the way they are today. Two of the countries I visited were battlefields of the Cold War, the effects of which are still apparent to this day.

When I went to Hong Kong, I was amazed by the wealth of this tiny island. Despite having a population of just 7 million, it ranks 6th in the world in number of billionaires. How did this city on the outskirts of China, with no natural resources, become so wealthy? To find out, I started reading about the British Opium Wars, and soon found myself learning about all sorts of interconnected topics like the rise of Chinese Communism and the Cultural Revolution.

Previously, I often found history to be a rather dry topic as presented in textbooks. It’s very different to visit the relevant countries and experience history firsthand.

Wifi: The Good and Bad

For me, wifi is probably the most important feature when booking hotels, as I’m heavily dependent on it for all my work, information, and communication. Unfortunately, there’s no reliable way to determine if a hotel has good wifi before actually going there (there do exist websites that collect speed test results, but their data only covers a small fraction of hotels in a city).

As an experiment, I classified each hotel I stayed in as “decent wifi” or “terrible wifi”. A wifi connection passes the bar for “decent” if I can browse the usual websites without it being noticeably slow, and watch YouTube at 480p without buffering (this requires a connection speed of about 1.0 Mbps); otherwise, I classify it as “terrible”. Here are my results:


So about half of the hotels passed this bar. China had the worst wifi and Japan had the best, but even in countries purported to have amazing internet like South Korea, you can end up with terrible hotel wifi. I suppose I should be glad that all of the places had at least some sort of wifi; this wasn’t the case when I visited China in 2013.

Airbnb has gotten rather deceptive lately. Previously it was a good way to live in a person’s house and get to know the city from the perspective of a local. Now, a large number of listings are hotels, and even properties described as “homestays” are very much commercial ventures; most of the time, I never met my “host”. I found the best way to get an actual host is to look for hosts with only a single listing, but there is no way to filter for that automatically in the app.

Final Thoughts

Above: Me at Fushimi Inari Taisha, Kyoto

I’m very grateful to have had the opportunity to take 4 months off to travel around Asia. The timing worked out naturally (I graduated in April, leaving a 4-month gap before graduate school starts in September), and I had accumulated enough internship savings to afford it.

Now, having had experience with long-term travel, I don’t want to make this my permanent lifestyle. The most obvious issue is money, which I’ll run out of eventually, so I’d need to find some part-time remote work on the side. More crucially, there are certain advantages of staying in one place: it’s easier to build a social network, advance in your career, and find a significant other, all of which are difficult if you’re moving every week.

Although I don’t want to travel permanently, I really enjoyed my 4-month travel adventure. Maybe I’ll do it again someday, in a different part of the world — Europe maybe? We’ll see.