Lessons learned after 6 months of building a language learning startup

For the past six months, I’ve been working on LevelText, a startup that helps you learn languages by reading authentic native material. Actually, calling it a startup is a bit of a stretch, because our app never produced any revenue and attracted only a handful of users.

Nevertheless, I worked on it seriously for about half a year with a cofounder, joined an incubator, incorporated a company, iterated on an idea and launched several times. During this process, I learned a lot about the startup ecosystem, web development, talking to users, and many aspects of running a business. In this article, I will tell the story of how we got here and what we learned.

Building a language learning app had been on my mind long before LevelText. I had always been passionate about languages: I had studied several languages, including French, Spanish, and Japanese, and I was a PhD student researching linguistics and natural language processing. During my idle time, I liked to brainstorm ideas for improving language learning. Surely language learning could be improved with my AI and NLP expertise, right?

There were already a lot of language learning apps out there, the most famous being Duolingo, along with many alternatives including Babbel, Rosetta Stone, Busuu, and dozens of others. Most people in the language learning community liked to criticize these apps: they offered similar features like language games and flashcards that taught you words and phrases, but you could use them for a long time without achieving much fluency in the language.

According to many language acquisition researchers and educators, a more effective method of learning languages is “comprehensible input”: essentially, getting a lot of exposure to authentic material (books, articles, TV shows, podcasts, etc.). The material should ideally be at a level slightly above what you can easily understand, but not so difficult that it’s overwhelming. This was in stark contrast to all the apps that taught language through unnatural games and exercises. Thus, despite the crowded marketplace, I was confident that teaching languages using authentic reading material would be a revolutionary app idea.

Iteration 1: Learning languages through reading

Personally, I like to learn languages by reading books. My process was the following: while reading, I would mark all the words that I didn’t understand, then I would look them up in a dictionary and add them to my collection to be reviewed later. I was improving my Chinese at the time, and found this process quite tedious because inputting unknown logographic characters into a dictionary was nontrivial.

So my first idea was an app where you take a picture of a printed page on which you had highlighted some words, and it would use computer vision to look those words up in a dictionary for you. Then, if you liked, you could use this information to create Anki flashcards to review later. We called the tool VocabAssist, and my wife helped me implement the prototype.

At this time, I started to interview all of my friends who were into language learning. A few people agreed with me that reading in the target language was a pain point, but most did not have physical books lying around and preferred reading on their phone or computer. And people who owned books wanted to keep them clean rather than write all over them.

Therefore, we redesigned the concept: the user would instead paste text into the app. The app would then use some NLP to decide which words they probably didn’t know, given their language level, and look up the definitions in a dictionary for them. Yet a problem still remained: how were users supposed to find material to read? Making them copy and paste text from web pages would have been too clumsy.
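For the curious, here’s a minimal sketch of what that “which words do you probably not know” step could look like, using corpus word frequency as a crude proxy for difficulty. The wordfreq library and the Zipf threshold are my assumptions for illustration, not the exact logic we used:

    from wordfreq import zipf_frequency  # corpus word frequencies on a log (Zipf) scale

    def likely_unknown_words(text, lang="fr", level_zipf=4.0):
        # level_zipf is a rough stand-in for the learner's level: a beginner might
        # only know very common words (Zipf >= 5), an advanced learner much rarer ones.
        words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
        return sorted(w for w in words if w and zipf_frequency(w, lang) < level_zipf)

The words flagged this way would then be looked up in a dictionary and shown to the learner alongside the text.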

After doing a bit of market research, we found two existing apps with a similar concept: LingQ and Readlang. These two apps did a decent job of helping learners study from a text, with fully featured dictionaries and flashcards, and they even supported video. However, the user had to already have some material on hand; both apps offered libraries of user-contributed content, but these were small, and you were unlikely to find what you were looking for if you had niche interests. It would’ve been much better if they could leverage all of the content available on the Internet.

Both my wife and I had other full-time responsibilities, so progress was slow. Neither of us knew much about web development (I was a decent backend engineer but knew very little frontend), so the app was barely functional. As a result, we never launched the app to the public and only demoed it to some friends. Before we could launch it for real, I would need a cofounder with more experience building websites, who could probably handle business responsibilities as well.

Iteration 2: Search engine for French learners

At the end of 2021, I had just moved to Vancouver and was wrapping up my PhD; I had more free time and was ready to give this project a more serious effort. So I set up a profile on the YCombinator cofounder matching platform, and soon I matched with Jonny Kalambay. He was working as a software engineer at Yelp and also had a passion for language learning. We worked on a toy project as a trial of working together and then decided to become cofounders.

Jonny was a talented full-stack engineer who could build and launch prototypes within days. Following the advice of the books “Running Lean” and “The Mom Test”, we started to interview users in our target demographic who were interested in language learning, asking them how they learned languages, what their biggest struggles were, and what tools they wished existed.

It turned out that finding interesting reading material of an appropriate difficulty was a fairly common problem. Thus, LevelText was born. We decided to build a search engine tool for French learners: we would take the user’s search query, translate it into French, search the internet for articles about the topic, and finally, return a sorted list of the articles ranked by estimated reading difficulty.
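The pipeline itself was conceptually simple. Here’s a rough sketch of the flow; the translate and search_web callables are hypothetical placeholders for whatever machine translation and web search services you plug in, and the difficulty score is just mean word rarity, not our exact ranking formula:

    from wordfreq import zipf_frequency

    def estimate_difficulty(text, lang="fr"):
        # Crude reading difficulty: average word rarity (higher = harder to read).
        words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
        words = [w for w in words if w]
        if not words:
            return 0.0
        return sum(8.0 - zipf_frequency(w, lang) for w in words) / len(words)

    def find_reading_material(query_en, translate, search_web, lang="fr"):
        # translate and search_web are supplied by external services (placeholders here).
        query_fr = translate(query_en, source="en", target=lang)   # e.g. "cycling" -> "cyclisme"
        articles = search_web(query_fr)                            # list of (url, text) pairs
        return sorted(articles, key=lambda a: estimate_difficulty(a[1], lang))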

Our MVP assumed the user spoke English and was studying French. We decided to support only one language in our initial launch, since this would allow us to test out the concept with minimal development. We picked French because it was the most widely studied language in Canada and both of us were familiar with it.

We launched it on ProductHunt and conducted user tests with some French learners, but the response was lukewarm. The tool was novel and cool, but didn’t provide enough value for them to keep coming back to the site. Intermediate and advanced learners already had lists of places to find reading material, and most articles on the internet were too difficult for beginners. We didn’t address the problem of motivation, one of the biggest struggles for language learners, and depended on the user remembering to use our site.

From the business side, the idea was difficult to monetize. There are several options for making money from a consumer-facing software product. Placing advertisements is the lowest effort option, but requires a massive amount of traffic to produce sufficient revenue: possible for a social network, but usually not for language learning apps. Most language learning apps make money by charging monthly or annual subscriptions. But we would need to provide a lot more value to justify charging a subscription – a simple tool wouldn’t cut it.

Iteration 3: Daily bite-sized reading

Our idea to fix the motivation problem: instead of requiring the user to make searches, why not send articles straight to their email? They would only need to indicate their interests once during sign-up, and then they would receive a lesson every day in their inbox. With the lesson prepared and ready to go, the friction needed to start studying would be much lower.

Each lesson consisted of a paragraph of simple-to-read French text accessible to beginners, and a multiple-choice question to test their comprehension. We started prototyping this concept, using the GPT-3 model from OpenAI to simplify French articles down to a simple paragraph, and manually writing the multiple choice questions. We then enrolled a handful of French learners from social language learning sites to validate the idea.
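For the simplification step, a call to GPT-3 might look something like the sketch below, written against the pre-1.0 openai Python client. The model name, prompt, and parameters here are illustrative assumptions rather than exactly what we ran:

    import openai  # pre-1.0 openai-python client

    openai.api_key = "YOUR_API_KEY"

    def simplify_french(article_text):
        # Ask GPT-3 to rewrite a French article as one short, beginner-level paragraph.
        prompt = (
            "Réécris ce texte en un seul paragraphe de français simple, "
            "adapté à un apprenant débutant :\n\n"
            + article_text
            + "\n\nTexte simplifié :"
        )
        response = openai.Completion.create(
            engine="text-davinci-002",  # assumed GPT-3-era completion engine
            prompt=prompt,
            max_tokens=200,
            temperature=0.3,
        )
        return response["choices"][0]["text"].strip()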

The results were promising: most of the learners opened the email and answered the question three days in a row. However, we never tried to charge money for it (the first real test of product-market fit), and the amount of manual work needed to prepare lessons meant that we could not experiment with a large sample of users. Also, our mini-lessons could be completed in under a minute, which we felt was too little content for people to pay for.

Founder troubles

It was around this time that we were invited to interview with YCombinator. This was a big deal because getting in meant receiving $500,000 USD of investment and connections to a network of experienced founders and mentors. We didn’t expect to get this far since our product hadn’t demonstrated traction yet; I guess it was our founder credentials that landed us the interview. We spent a day or two rehearsing our pitch and practicing answering questions for the rapid-fire 10-minute interview.

The next day, we received a rejection letter from YCombinator, basically saying that many great founders have tried to break into the language learning app market, but it is hard to build a business in this space: language learning (especially for non-English languages) is a hobby for most people, with no urgent need behind it. They urged us to consider solving a must-have problem rather than a nice-to-have one.

Our team morale took a hit after the rejection, and truth be told, we had gotten lots of hints at this point that language learning was not a great business direction. However, language learning was what we had been doing from the beginning, and we had no idea what else we could pivot to. Jonny suggested switching to making software for language teachers, because this seemed like a less competitive space than language learning apps. But I wasn’t thrilled with this direction: neither of us had connections or experience in education, so we had no good understanding of the problems that teachers faced and the realities of their daily working environment.

We started to fight more and more over product and business decisions: what level of language learners to aim at, how much to rely on automation versus manual content curation, how much we should care about monetization, etc. Neither of us was more experienced than the other in business, and neither of us trusted the other to take the lead on it. About two weeks after the YCombinator rejection, Jonny decided to leave LevelText and start his own company making tools for language teachers. To be fair, he agreed to leave me with full ownership of the company and all its intellectual property (i.e., the code that we wrote), but I was on my own.

Iteration 4: Personalized courses from native content

Up to this point, we had talked to many users and run some experiments, but we had never actually tried to charge money for our service. In this iteration, my plan was to take what I had learned thus far and build a product that at least had a chance of succeeding. By “success”, I mean making at least minimum wage, or roughly $3000 CAD/month.

Occasionally, you hear of startups raising investment on the founders’ credentials alone, without any revenue, but we talked to some investors and found this to be mostly not the case. Only incubators would invest on the order of $100k in early-stage startups with no revenue, and we thought this was too little to unlock any options we didn’t already have, like hiring engineers. The bigger investors all expected to see monthly recurring revenue of at least $5-10k before a request for a $500k seed round would be taken seriously. So investment or no investment, I would need to start making that much revenue in the near future.

My business model was to bundle lessons into courses, which students could buy at a fixed cost. Several successful companies sold educational content bundled into courses, such as LinkedIn Learning and Chessable: the idea being that if most users drop off after a few weeks, we make more money by charging a one-time course purchase cost at the beginning than from a subscription (which they would quickly cancel). Another advantage of the course model was that I only needed to produce a fixed amount of content, whereas if I sold subscriptions, I would need to continuously generate new content for as long as they are subscribed, an unattractive proposition given that the app was not fully automated and a lot of work still needed to be done manually.

With a price point of $20 per course, I would need to sell just over 100 courses per month to make minimum wage: an optimistic, but not an impossible quantity. I built a landing page and a form where users input which topics they wanted to read about. The student then receives one trial lesson for free, at which point they would need to buy a course to receive more lessons.

To prepare a lesson, I would pick a French article of around B1 difficulty on the student’s desired topic, and run an algorithm to guess the most difficult French vocabulary in the article and provide English translations. Additionally, the lesson included a computer text-to-speech reading and an English translation of the full article. For $20, I would prepare seven daily French lessons matching the student’s interests.

I launched the platform and bought some traffic from Google Adwords, but to my disbelief, nobody bought the courses. To figure out what was wrong, I conducted another round of user interviews. Again, the responses revealed that we were far from product-market fit.

Everybody complained that the pricing was too expensive for what we provided: other digital services like Spotify and Netflix cost a similar amount or less, and you got a lot more. People also found the pricing model confusing and assumed it was a $20/month subscription. They also complained that our app only helped you practice reading, with no exercises or anything to help you with speaking, listening, or writing. Worst of all, users didn’t find much value in personalized lessons, which were supposed to be our unique selling point: they rated lessons catered to their personal interests as about as interesting as lessons on the default topics.

Reflecting on this feedback, I think what happened makes a lot of sense. Most users, when asked to provide a list of their interests, gave fairly generic responses like “sports” or “politics”: not very useful for personalization. I wanted to make at least minimum wage to justify all the effort I was putting into this project, so I worked backwards from the $3000/month revenue target and designed the pricing scheme to achieve this goal. But my users didn’t care about me or how much money I wanted to make; they only saw that I didn’t offer much for $20, so of course they didn’t buy.

I gathered the user feedback and sketched out some ideas of how to improve the product for the next iteration. But honestly, I was running out of steam – after countless iterations and little to show for my effort, it became difficult to gather the focus and willpower to implement more features. Being a solo founder is not much fun: nobody else understood my problems or cared about my language learning startup. So I decided it was a good time to quit.

Why language learning is a poor business

Despite my passion for it, language learning is ultimately a weak business, especially for software-based apps. For English speakers, learning a language is mainly something done for fun; it is seldom a must-have. Learners typically get bored and quit after only a few weeks, when their initial motivation runs out and studying becomes a chore. This makes it difficult to make a subscription-based model work (many language learning apps get around this by asking for several months or a whole year of payment upfront).

Most language learners don’t spend much money on their hobby – out of all the people I interviewed, hardly anybody had ever paid for an app. It’s hard to blame them – there are tons of free language learning resources on the internet, so there is rarely a compelling reason to justify taking out your credit card. Even when people do pay to learn a language, it is usually for human tutoring or physical books (which people expect to pay for), not software products.

Finally, the market is saturated with hundreds of language learning products offering similar features. There is no good way to measure results of language learning, so you have no way to prove that your method is better than the others in terms of educational efficacy. With no better alternative, most companies optimize for user retention, inevitably leading to an app filled with gamification features.

Was founding a startup the wrong choice for me?

There are several reasons why being a founder in this space was probably not the right choice for me personally. First of all, my main expertise is in AI: while useful, it is far from the biggest priority in the early stages of most consumer products. Language learning apps needed to be engaging and fun to use, and AI was at most a minor component. Before product-market fit, the main tasks in a startup are developing quick prototypes of web or mobile apps, networking on social media to find users, talking to those users, and aligning the product with their needs. As for AI, it is usually best to throw something together quickly with GPT-3 and improve it later if there is demand for it, because any more sophisticated machine learning would take too long relative to the benefits.

While I could probably learn web development, sales, and product management given enough time, I had no experience in any of these skills, meaning I had no competitive advantage over the thousands of other aspiring entrepreneurs, and would be at a disadvantage compared to founders who already had them.

Another factor is personal risk tolerance: how long are you willing to work full-time on your startup without any revenue or investment? I was willing to put in six months. This short timeline drove me to consider only business models that would generate a couple thousand dollars a month of revenue right away, and to kill the idea if this goal could not be reached. But in reality, it is common to iterate for several years before achieving this goal (e.g., Readlang took about three years to reach $1000/month in revenue). Since I wasn’t prepared to risk several years, potentially with zero payout at the end, it would probably have been better to work on the idea on the side while holding a regular job, instead of going all-in full-time with an overly optimistic timeline.

There is a common belief in the startup community that persistence in your startup is “good” and giving up is “bad”. Part of this is observation bias: you only read about successful companies that raise millions and get acquired, after a lot of failure and persistence, and never hear about the majority of startups that fail. This has led many talented young people to pursue the startup dream, often willing to fail at it for years. I don’t think this is a particularly noble act: there is nothing magical that happens when you incorporate a company and declare yourself to be a startup founder. Your “company” is only worth something if you can provide something of sufficient value to customers that they pay for it; if you have no revenue then you are basically no different from being unemployed (even if you write a lot of code and get a few people to use it).

Even though I didn’t succeed in my venture, this was a valuable learning experience. You learn much more by actually attempting a startup than by just reading about them; the life of a founder is not as glamorous as portrayed in the media. Now I know what I don’t know, and I have a better idea of what skills I’d need to have a better chance next time. That’s the end of my rather long post; I hope it will be useful to readers thinking about building a startup or a language learning app.

Why do polysynthetic languages all have very few speakers?

Polysynthetic languages can express in one word complex ideas that in most languages would require a whole sentence. For example, in Inuktitut:

qangatasuukkuvimmuuriaqalaaqtunga

“I’ll have to go to the airport”

There’s no widely accepted definition of a polysynthetic language. Generally, polysynthetic languages have noun incorporation (where noun arguments are expressed as affixes of a verb) and serial verb constructions (where a single word contains multiple verbs). They are considered some of the most grammatically complex languages in the world.

Polysynthetic languages are most commonly found among the indigenous languages of North America. Only a few such languages have more than 100k speakers: Nahuatl (1.5m speakers), Navajo (170k speakers), and Cree (110k speakers). Most polysynthetic languages are spoken by a very small number of people and many are in danger of becoming extinct.

Why aren’t there more polysynthetic languages — major national languages with millions of speakers? Is it mere coincidence that the most complex languages have few speakers? According to Wray (2007), it’s not just coincidence; rather, languages spoken within a small, close-knit community with little outside contact tend to develop grammatical complexity, while languages with lots of external contact and adult learners tend to be more simplified and regular.

It’s well known that children are better language learners than adults. L1 and L2 language acquisition processes work very differently, so that children and adults have different needs when learning a language. Adult learners prefer regularity and expressions that can be decomposed into smaller parts. Anyone who has studied a foreign language has seen tables of verb conjugations like these:

[French and Korean verb conjugation tables]

For adult learners, the ideal language is predictable and has few exceptions. In such a language, the number 12 would be pronounced “ten-two” rather than “twelve”, and a doctor who treats your teeth would be a “tooth-doctor” rather than a “dentist”. Exceptions give the adult learner difficulties, since they have to be individually memorized. An example of a very predictable language is the constructed language Esperanto, designed to have as few exceptions as possible and to be easy to learn for native speakers of any European language.

Children learn languages differently. At the age of 12 months (the holophrastic stage), children start producing single words that can represent complex ideas. Even though they are multiple words in the adult language, the child initially treats them as a single unit:

whasat (what’s that)

gimme (give me)

Once they reach 18-24 months of age, children pick up morphology and start using multiple words at a time. Children learn whole phrases first and only later analyze them into parts on an as-needed basis; thus they have no difficulty with opaque idioms and irregular forms. They don’t really benefit from regularity either: when children learn Esperanto as a native language, they introduce irregularities, even though the input language is perfectly regular.

We see evidence of this process in English. Native speakers frequently make mistakes like using “could of” instead of “could’ve“, or using “your” instead of “you’re“. This is evidence that native English speakers think of them as a single unit, and don’t naturally analyze them into their sub-components: “could+have” and “you+are“.

According to the theory, in languages spoken in isolated communities, where few adults try to learn the language, it ends up with complex and irregular words. When lots of grown-ups try to learn the language, they struggle with the grammatical complexity and simplify it. Over time, these simplifications eventually become a standard part of the language.

Among the world’s languages, various studies have found correlations between grammatical complexity and smaller population size, supporting this theory. However, the theory is not without its problems. As with any observational study, correlation doesn’t imply causation. The European conquest of the Americas decimated the native population, and consequently, speakers of indigenous languages have declined drastically in the last few centuries. Framing it this way, the answer to “why aren’t there more polysynthetic languages with millions of speakers” is simply: “they all died of smallpox or got culturally assimilated”.

If instead, Native Americans had sailed across the ocean and colonized Europe, would more of us be speaking polysynthetic languages now? Until we can go back in time and rewrite history, we’ll never know the answer for sure.

Further reading

  • Atkinson, Mark David. “Sociocultural determination of linguistic complexity.” (2016). PhD Thesis. Chapter 1 provides a good overview of how languages are affected by social structure.
  • Kelly, Barbara, et al. “The acquisition of polysynthetic languages.” Language and Linguistics Compass 8.2 (2014): 51-64.
  • Trudgill, Peter. Sociolinguistic typology: Social determinants of linguistic complexity. Oxford University Press, 2011.
  • Wray, Alison, and George W. Grace. “The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form.” Lingua 117.3 (2007): 543-578. This paper proposes the theory and explains it in detail.

Explaining chain-shift tone sandhi in Min Nan Chinese

In my previous post on the Teochew dialect, I noted that Teochew has a complex system of tone sandhi. The last syllable of a word keeps its citation (base) form, while all preceding syllables undergo sandhi. For example:

gu5 (cow) -> gu1 nek5 (cow-meat = beef)

seng52 (play) -> seng35 iu3 hi1 (play a game)

The sandhi system is quite regular — for instance, if a word’s base tone is 52 (falling tone), then its sandhi tone will be 35 (rising tone), across many words:

toin52 (see) -> toin35 dze3 (see-book = read)

mang52 (mosquito) -> mang35 iu5 (mosquito-oil)

We can represent this relationship as an edge in a directed graph 52 -> 35. Similarly, words with base tone 5 have sandhi tone 1, so we have an edge 5 -> 1. In Teochew, the sandhi graph of the six non-checked tones looks like this:


Above: Teochew tone sandhi, Jieyang dialect, adapted from Xu (2007). For simplicity, we ignore checked tones (ending in -p, -t, -k), which have different sandhi patterns.

This type of pattern is not unique to Teochew but exists in many dialects of Min Nan. Other dialects have different tones but a similar system. It’s called right-dominant chain-shift, because the rightmost syllable of a word keeps its base tone; it’s also called a “tone circle” when the graph has a cycle. Most notably, the chain-shift pattern, where A -> B and B -> C yet A does not go to C, is quite rare cross-linguistically and does not occur in any Chinese dialect outside the Min family.
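To make the right-dominant rule concrete, here’s a tiny sketch of how sandhi application works. It uses only the two edges named in the examples above; the real Teochew graph has more edges (see the sandhi graph caption above):

    # Partial sandhi graph: base tone -> sandhi tone (only the edges from the examples).
    SANDHI = {"52": "35", "5": "1"}

    def apply_sandhi(syllables):
        # Right-dominant: the last syllable keeps its base tone,
        # every preceding syllable takes its sandhi tone.
        out = []
        for i, (segment, tone) in enumerate(syllables):
            if i < len(syllables) - 1:
                tone = SANDHI.get(tone, tone)
            out.append(segment + tone)
        return " ".join(out)

    print(apply_sandhi([("gu", "5"), ("nek", "5")]))                 # gu1 nek5 (beef)
    print(apply_sandhi([("seng", "52"), ("iu", "3"), ("hi", "1")]))  # seng35 iu3 hi1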

Is there any explanation for this unusual tone sandhi system? In this blog post, I give an overview of some attempts at an explanation from theoretical phonology and historical linguistics.

Xiamen tone circle and Optimality Theory

The Xiamen / Amoy dialect is perhaps the most studied variety of Min Nan. Its sandhi system looks like this:


Barrie (2006) and Thomas (2008) attempt to explain this system with Optimality Theory (OT). In modern theoretical phonology, OT is a framework that describes how underlying phonemes are mapped to output phonemes, not with rules, but with a set of constraints. The constraints dictate what kinds of patterns are considered “bad” in the language, but some violations are worse than others, so the constraints are ranked in a hierarchy. The output is then the candidate that is “least bad” according to the ranking.

To explain the Xiamen tone circle sandhi, Thomas begins by introducing the following OT constraints:

  • *RISE: incur a penalty for every sandhi tone that has a rising contour.
  • *MERGE: incur a penalty when two citation tones are mapped to the same sandhi tone.
  • DIFFER: incur a penalty when a base tone is mapped to itself as a sandhi tone.

Without any constraints, there are 5^5 = 3125 possible sandhi systems in a 5-tone language. With these constraints, most of the hypothetical systems are eliminated — for example, the null system (where every tone is mapped to itself) incurs 5 violations of the DIFFER constraint.

These three constraints aren’t quite enough to fully explain the Xiamen tone system: there are still 84 hypothetical systems that are just as good as the actual one. With the aid of a Perl script, Thomas then introduces more constraints until only one system (the actual observed one) emerges as the best under the ranking.
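Here’s a rough reconstruction of what that enumeration looks like (in Python rather than Perl, and with made-up tone labels). It generates all 3125 candidate mappings for an abstract five-tone inventory and evaluates them against the three constraints above; comparing the violation tuples lexicographically stands in for an OT constraint ranking:

    from itertools import product

    TONES = ["53", "44", "24", "21", "22"]   # made-up 5-tone inventory; "24" is the rising tone
    RISING = {"24"}

    def violation_profile(mapping):
        # One violation count per constraint, in an assumed ranking order.
        rise = sum(1 for s in mapping.values() if s in RISING)       # *RISE
        merge = len(mapping) - len(set(mapping.values()))            # *MERGE
        differ = sum(1 for b, s in mapping.items() if b == s)        # DIFFER
        return (rise, merge, differ)

    # Every way of mapping each of 5 base tones to one of 5 sandhi tones: 5^5 = 3125.
    candidates = [dict(zip(TONES, outputs)) for outputs in product(TONES, repeat=5)]

    best = min(violation_profile(m) for m in candidates)
    winners = [m for m in candidates if violation_profile(m) == best]
    print(len(candidates), len(winners))  # with only three constraints, many candidates tie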

Problems with the OT explanation

There are several reasons why I didn’t find this explanation very satisfying. First, it’s not falsifiable: if your constraints don’t generate the right result, you can keep adding more and more constraints, and tweak the ranking, until they produce the result you want.

Second, the constraints are fairly arbitrary and lack cognitive or linguistic motivation. You can explain the *MERGE constraint as trying to preserve contrasts, which makes sense from an information theory point of view, but what about DIFFER? It’s unclear why a base tone shouldn’t be allowed to keep its own value as its sandhi tone, especially since many languages (like Cantonese) manage fine with no sandhi at all.

Even in Teochew, which is closely related to the Xiamen dialect, all three constraints are violated. I’m not aware of any OT analysis of Teochew sandhi, and it would be interesting to see one, but surely it would need a very different set of constraints from the Xiamen system.

Nevertheless, OT has been an extremely successful framework in modern phonology. In some cases, OT can describe a pattern very cleanly where a rule-based account would need very complicated rules. In such cases, the set of OT constraints is a good explanation for the pattern.

Also, if the same constraint shows up in a lot of languages, that increases the credibility that it’s a true cross-language tendency rather than just a made-up rule to explain the data. For example, if the *RISE constraint shows up in OT grammars for many languages, then you could claim that there’s a general tendency for languages to prefer falling tones over rising tones.

Evidence from Middle Chinese

Chen (2000) gives a different perspective. Essentially, he claims that it’s impossible to make sense of the data in any particular modern-day dialect. Instead, we should compare multiple dialects together in the context of historical sound changes.

The evidence he gives is from the Zhangzhou dialect, located about 40km inland from Xiamen. The Zhangzhou dialect has a similar tone circle as Xiamen, but with different values!


It’s not obvious how the two systems are related, until you consider the mapping to Middle Chinese tone categories:


The Roman numerals I, II, III denote tones of Middle Chinese, spoken around 600 AD. Middle Chinese had four tones, but after centuries of tone splits and merges, none of the present-day Chinese dialects retain this system. In many dialects, a Middle Chinese tone split into two tones depending on whether the initial was voiced or voiceless. When comparing tones from different dialects, it’s often useful to refer to historical tone categories like “IIIa”, which roughly means “syllables that had tone III in Middle Chinese and a voiceless initial consonant”.

It’s unlikely that both Xiamen and Zhangzhou coincidentally developed sandhi patterns that map to the same Middle Chinese tone categories. It’s far more likely that the tone circle developed in a common ancestral language, then their phonetic values diverged afterwards in the respective present-day dialects.

That still leaves open the question: how exactly did the tone circle develop in the first place? It’s likely that we’ll never know for sure; the details are lost to time, and the processes driving historical tone change are not well understood.

In summary, theoretical phonology and historical linguistics offer complementary insights into the chain-shift sandhi patterns in Min Nan languages. Optimality Theory proposes tendencies for languages to prefer certain structures over others, which partially explains the pattern; the rest appears to be historical accident.

References

  1. Barrie, Michael. “Tone circles and contrast preservation.” Linguistic Inquiry 37.1 (2006): 131-141.
  2. Chen, Matthew Y. Tone sandhi: Patterns across Chinese dialects. Vol. 92. Cambridge University Press, 2000. Pages 38-49.
  3. Thomas, Guillaume. “An analysis of Xiamen tone circle.” Proceedings of the 27th West Coast Conference on Formal Linguistics. Cascadilla Proceedings Project, Somerville, MA. 2008.
  4. Xu, Hui Ling. “Aspect of Chaozhou grammar: a synchronic description of the Jieyang variety.” (2007).

Learning the Teochew (Chaozhou) Dialect

Lately I’ve been learning my girlfriend’s dialect of Chinese, called the Teochew dialect.  Teochew is spoken in the eastern part of the Guangdong province by about 15 million people, including the cities of Chaozhou, Shantou, and Jieyang. It is part of the Min Nan (闽南) branch of Chinese languages.


Above: Map of major dialect groups of Chinese, with Teochew circled. Teochew is part of the Min branch of Chinese. Source: Wikipedia.

Although the different varieties of Chinese are usually referred to as “dialects”, linguists consider them different languages, as they are not mutually intelligible. Teochew is not intelligible to either Mandarin or Cantonese speakers. Teochew and Mandarin diverged about 2000 years ago, so today they are about as similar as French is to Portuguese. Interestingly, linguists consider Teochew one of the most conservative Chinese dialects, preserving many archaic words and features from Old Chinese.

Above: Sample of Teochew speech from entrepreneur Li Ka-shing.

Since I like learning languages, naturally I started learning my girlfriend’s native tongue soon after we started dating. It helped that I spoke Mandarin, but Teochew is not close enough to simply pick up by osmosis; it still requires deliberate study. Compared to other languages I’ve learned, Teochew is challenging because very few people try to learn it as a foreign language, so there are few language-learning resources for it.

Writing System

The first hurdle is that Teochew is primarily spoken, not written, and does not have a standard writing system. This is the case with most Chinese dialects. Almost all Teochews are bilingual in Standard Chinese, which they are taught in school to read and write.

Sometimes people try to write Teochew using Chinese characters by finding the equivalent Standard Chinese cognates, but there are many dialectal words which don’t have any Mandarin equivalent. In these cases, you can invent new characters or substitute similar sounding characters, but there’s no standard way of doing this.

Still, I needed a way to write Teochew, to take notes on new vocabulary and grammar. At first, I used IPA, but as I became more familiar with the language, I devised my own romanization system that captured the sound differences.

Cognates with Mandarin

Note (Jul 2020): People in the comments have pointed out that some of these examples are incorrect. I’ll keep this section the way it is because I think the high-level point still stands, but these are not great examples.

Knowing Mandarin was very helpful for learning Teochew, since there are lots of cognates. Some cognates are obviously recognizable:

  • Teochew: kai shim, happy. Cognate to Mandarin: kai xin, 开心.
  • Teochew: ing ui, because. Cognate to Mandarin: ying wei, 因为

Some words have cognates in Mandarin, but mean something slightly different, or aren’t commonly used:

  • Teochew: ou, black. Cognate to Mandarin: wu, 乌 (dark). The usual Mandarin word is hei, 黑 (black).
  • Teochew: dze, book. Cognate to Mandarin: ce, 册 (booklet). The usual Mandarin word is shu, 书 (book).

Sometimes, a word has a cognate in Mandarin but sounds quite different due to centuries of sound change:

  • Teochew: hak hau, school. Cognate to Mandarin: xue xiao, 学校.
  • Teochew: de, pig. Cognate to Mandarin: zhu, 猪.
  • Teochew: dung, center. Cognate to Mandarin: zhong, 中.

In the last two examples, we see a fairly common sound change, where a dental stop initial (d- and t-) in Teochew corresponds to an affricate (zh- or ch-) in Mandarin. It’s not usually enough to guess the word, but serves as a useful memory aid.

Finally, a lot of dialectal Teochew words (I’d estimate about 30%) don’t have any recognizable cognate in Mandarin. Examples:

  • da bo: man
  • no gya: child
  • ge lai: home

Grammatical Differences

Generally, I found Teochew grammar to be fairly similar to Mandarin, with only minor differences. Most grammatical constructions can transfer cognate by cognate and still make sense in the other language.

One significant difference in Teochew is its many fused negation markers: single syllables beginning with b- or m- that combine negation with another element. Some examples:

  • bo: not have
  • boi: will not
  • bue: not yet
  • mm: not
  • mai: not want
  • ming: not have to

Phonology and Tone Sandhi

The sound structure of Teochew is not too different from Mandarin, and I didn’t find it difficult to pronounce. The biggest difference is that syllables may end in a stop (-p, -t, -k) or in the nasal -m, whereas Mandarin syllables can only end in a vowel or the nasals -n and -ng. The characteristic feature of a Teochew accent in Mandarin is replacing /f/ with /h/, and indeed there is no /f/ sound in Teochew.

The hardest part of learning Teochew for me was the tones. Teochew has either six or eight tones depending on how you count them, and they aren’t difficult to produce in isolation. However, Teochew has a complex system of tone sandhi rules, where the tone of each syllable changes depending on the tone of the following syllable. Mandarin has tone sandhi to some extent (for example, the third-tone sandhi rule where nǐ + hǎo is pronounced níhǎo rather than nǐhǎo), but Teochew takes this to a whole new level, where nearly every syllable undergoes contextual tone change.

Some examples (the numbers are Chao tone numerals, with 1 meaning lowest and 5 meaning highest tone):

  • gu5: cow
  • gu1 nek5: beef

Another example, where a falling tone changes to a rising tone:

  • seng52: to play
  • seng35 iu3 hi1: to play a game

There are tables of tone sandhi rules describing in detail how each tone gets converted to what other tone, but this process is not entirely regular and there are exceptions. As a result, I frequently get the tone wrong by mistake.

Update: In this blog post, I explore Teochew tone sandhi in more detail.

Resources for Learning Teochew

Teochew is seldom studied as a foreign language, so there aren’t many language learning resources for it. Even dictionaries are hard to find. One helpful dictionary is Wiktionary, which has the Teochew pronunciation for most Chinese characters.

Also helpful were formal linguistic grammars:

  1. Xu, Huiling. “Aspects of Chaoshan grammar: A synchronic description of the Jieyang dialect.” Monograph Series Journal of Chinese Linguistics 22 (2007).
  2. Yeo, Pamela Yu Hui. “A sketch grammar of Singapore Teochew.” (2011).

The first is a massively detailed, 300-page description of Teochew grammar, while the second is a shorter grammar sketch on a similar variety spoken in Singapore. They require some linguistics background to read. Of course, the best resource is my girlfriend, a native speaker of Teochew.

Visiting the Chaoshan Region

After practicing my Teochew for a few months with my girlfriend, we paid a visit to her hometown and relatives in the Chaoshan region; more specifically, to Raoping County, located on the border between Guangdong and Fujian provinces.

 

Left: Chaoshan railway station, China. Right: Me learning the Gongfu tea ceremony, an essential aspect of Teochew culture.

Teochew people are traditional and family-oriented, very much unlike the individualistic Western culture that I’m used to. In Raoping and Guangzhou, we attended large family gatherings in the afternoon, chatting and gossiping while drinking tea. Although they are Han Chinese, the Teochew consider themselves a distinct subgroup, with their own unique culture and language. The Teochew are especially proud of their language, which they consider extremely hard for outsiders to learn. Essentially, speaking Teochew is what separates “ga gi nang” (roughly, “our people”) from the countless other Chinese.

My Teochew is not great. Sometimes I struggle to get the tones right and make myself understood. But at a large family gathering, a relative asked me why I was learning Teochew, and I was able to reply, albeit with a Mandarin accent: “I want to learn Teochew so that I can be part of your family”.


Above: Me, Elaine, and her grandfather, on a quiet early morning excursion to visit the sea. Raoping County, Guangdong Province, China.

Thanks to my girlfriend Elaine Ye for helping me write this post. Elaine is fluent in Teochew, Mandarin, Cantonese, and English.

Great Solo Asian Trip Part 2: Languages of East Asia

This is the second blog post in my two-part series on my 4-month trip to Asia. Here is part one. In this second blog post, I will focus on the languages I encountered in Asia and my attempts at learning them.

I’ve always enjoyed learning languages (here is a video of me speaking a bunch of them) — and Asia is a very linguistically diverse place compared to North America, with almost every country speaking a different language. So in every country I visited, I tried to learn the language as best as I could. Realistically, it’s not possible to go from zero to fluency in the span of a vacation, but you can learn a decent amount in a week or two. Travelling in a foreign country is a great motivator for learning languages, and I found myself picking up new words much faster than I did studying at home.

I went to five countries on this trip, in chronological order: China, Japan, South Korea, Vietnam, and Malaysia.

China

In the first month of my trip, I went to a bunch of cities in China with my mom and sister. For the most part, there wasn’t much language learning, as I already spoke Mandarin fluently.

One of the regions we went to was Xishuangbanna, in southern Yunnan province. Xishuangbanna is a special autonomous prefecture, designated by the Chinese government for the Dai ethnic minority. The outer fringes of China are filled with various groups of non-Chinese minority groups, each with their own unique culture and language. Home to 25 officially recognized ethnic groups and countless more unrecognized ones, Yunnan is one of the most linguistically diverse places in the world.

Above: Bilingual signs in Chinese and Dai in Jinghong

In practice, recent Han Chinese migration into the region means that even in Xishuangbanna, Han Chinese outnumber the local Dai people, and Mandarin is spoken everywhere. In the streets of Jinghong, you can see bilingual signs written in Mandarin and the Dai language (a language related to Thai). Dai is written in the Tai Lue script, which looks pretty cool, but I never got a chance to learn it.


Next stop on my trip was Hong Kong. The local language here is Cantonese, which shares a lot of vocabulary and grammatical structure with my native Mandarin, since both descended from Middle Chinese about 1500 years ago. However, a millennium of sound changes means that today, Mandarin and Cantonese are quite different languages and are not at all mutually intelligible.

I was eager to practice my Cantonese during my two days in Hong Kong, but found that whenever I said something incorrectly, people would give me a weird look and immediately switch to Mandarin or English. Indeed, learning a language is very difficult when everybody is fluent in English. Oh well.

Japan

A lot of travellers complain that the locals speak no English; you don’t often hear of complaints that their English is too good! Well, Japan did not leave me disappointed. Although everyone studies English in school, most people have little practice actually using it, so Japan is ranked near the bottom in English proficiency among developed nations. Perfect!

Before coming to Japan, I already knew a decent amount of Japanese, mostly from watching lots of anime. However, there are very few Japanese people in Canada, so I didn’t have much practice actually speaking it.

I was in Japan for one and a half months, the most of any single country of this trip. In order to accelerate my Japanese learning process, I enrolled in classes at a Japanese language school and stayed with a Japanese homestay family. This way, I learned formal grammatical structures in school and got conversation practice at home. I wrote a more detailed blog post here about this part of the trip.


Phonologically, Japanese is an easy language to pronounce because it has a relatively small number of consonants and only five vowels. There are no tones, and every syllable has the form CV (a consonant followed by a vowel). Therefore, an English speaker will have a much easier time pronouncing Japanese correctly than the other way around.

Grammatically, Japanese has a few oddities that take some time to get used to. First, the subject of a sentence is usually omitted, so the same phrase can mean “I have an apple” or “he has an apple”. Second, every time you use a verb, you have to decide between the casual form (used between friends and family) or the polite form (used when talking to strangers). Think of verb conjugations, but instead of verb endings differing by subject, they’re conjugated based on politeness level.

The word order of Japanese is also quite different from English. Japanese is an agglutinative language, so you can form really long words by attaching various suffixes to verbs. For example:

  • iku: (I/you/he) goes
  • ikanai: (I/you/he) doesn’t go
  • ikitai: (I/you/he) wants to go
  • ikitakunai: (I/you/he) doesn’t want to go
  • ikanakatta: (I/you/he) didn’t go
  • ikitakunakatta: (I/you/he) didn’t want to go
  • etc…

None of this makes Japanese fundamentally hard, just different from a lot of other languages. This also explains why Google Translate sucks so much at Japanese. When translating Japanese to English, the subjects of sentences are implicit in Japanese but must be explicit in English; when translating English to Japanese, the politeness level is implicit in English but must be explicit in Japanese.

One more thing to beware of is the Japanese pitch accent. Although it’s nowhere close to a full tonal system like Chinese, accented syllables have a slightly higher pitch. For example, the word “kirei” (pretty) has a pitch accent on the first syllable: “KI-rei”. Once I messed this up and put the accent on the second syllable instead: “ki-REI”. Unbeknownst to me, to native Japanese speakers this sounds like “kirai” (to hate), which has the accent on the second syllable. So I meant to say “nihon wa kirei desu” (Japan is pretty), but it sounded more like “nihon wa kirai desu” (I hate Japan)!


That was quite an awkward moment.

When I headed west from Tokyo into the Kansai region of Kyoto and Osaka, I noticed a bit of dialectal variation. The “u” in “desu” is a lot more drawn out, and the copula “da” is replaced with “ya”, so on the streets of Kyoto I’d hear a lot of “yakedo” instead of the Tokyo “dakedo”. I got to practice my Japanese with my Kyoto Airbnb host every night, and picked up a few words of the Kansai dialect. For example:

  • ookini: thank you (Tokyo dialect: arigatou)
  • akan: no good (Tokyo dialect: dame)
  • okan: mother (Tokyo dialect: okaasan)

The writing system of Japanese is quite unique and deserves a mention. It actually has three writing systems: the Hiragana syllabary for grammatical particles, the Katakana syllabary for foreign loanwords, and Kanji, logographic characters borrowed from Chinese. A Kanji character can be read in several different ways. Typically, when you have two or more Kanji together, it’s a loanword from Chinese read using a Chinese-like pronunciation (eg: novel, 小説 is read shousetsu) but when you have a single Kanji character followed by a bunch of Hiragana, it’s a Japanese word that means the same thing but sounds nothing like the Chinese word (eg: small, 小さい is read chiisai).

The logographic nature of Kanji is immensely helpful for Chinese people learning Japanese. You get the etymology of every Chinese loanword, and you get to “read” texts well above your level as you know the meaning of most words (although it gives you no information on how the word is pronounced).

My Japanese improved a lot during my 6 weeks in the country. By the time I got to Fukuoka, at the western end of Japan, I had no problems holding a conversation for 30 minutes with locals in a restaurant (provided they speak slowly, of course). It’s been one of my most rewarding language learning experiences to date.

South Korea

From Fukuoka, I traveled across the sea for a mere three hours, on a boat going slower than a car on a freeway, and landed in a new country. Suddenly, the script on the signs was different, and the language on the street was once again strange and unfamiliar. You can’t get the same satisfaction arriving in an airplane.

Above: Busan, my first stop in Korea

Of course, I was in the city of Busan, in South Korea. I was a bit nervous coming here, since it was the first time in my life that I’d been in a country where I wasn’t at least conversationally proficient in the language. Indeed, procuring a SIM card on my first day entailed a combination of me trying to speak broken Korean, them trying to speak broken English, hand gesturing, and (shamefully) Google Translate.

Before coming to Korea, I knew how to read Hangul (the Korean writing system) and a couple dozen words and phrases I picked up from Kpop and my university’s Korean language club. I also tried taking Korean lessons on italki (a language learning website) and various textbooks, but the language never really “clicked” for me, and now I still can’t hold a conversation in Korean for very long.

I suspect the reason has to do with passive knowledge: I’ve had a lot of exposure to Japanese from hundreds of hours of watching anime, but nowhere near as much exposure to Korean. Passive knowledge is important because humans learn language from data, and given enough data, we pick up on a lot of grammatical patterns without explicitly learning them.

Also, studying Kpop song lyrics is not a very effective way to learn Korean. The word distribution in song lyrics is sufficiently different from the word distribution in conversation that studying song lyrics would likely make you better at understanding other songs but not that much better at speaking Korean.


Grammatically, Japanese and Korean are very similar: they have nearly identical word order, and grammatical particles almost have a one-to-one correspondence. They both conjugate verbs differently based on politeness, and form complex words by gluing together suffixes to the end of verbs. The grammar of the two languages are so similar that you can almost translate Japanese to Korean just by translating each morpheme and without changing the order — and both are very different from Chinese, the other major language spoken in the region.

Phonologically, Korean is a lot more complex than Japanese, which is bad news for language learners. Korean has about twice as many vowels as Japanese, and a few more consonants as well. What’s more, Korean maintains a three-way distinction for many consonants: for example, the “b” sound has a plain version (불: bul), an aspirated version (풀: pul), and a tense version (뿔: ppul). I had a lot of difficulty telling these sounds apart, and often had to guess many combinations to find a word in the dictionary.

Unlike Chinese and Japanese, Korean does not use a logographic writing system. In Hangul, each word is spelled out phonetically, and the system is quite regular. On one hand, this means that Hangul can be learned in a day; on the other hand, it’s not terribly useful to be able to sound out Korean text without knowing what anything means. I actually prefer the Japanese logographic system, since it makes the Chinese cognates a lot clearer. In fact, about 60% of Korean vocabulary consists of Chinese loanwords, but with a phonetic writing system, it’s not always easy to identify what they are.

Vietnam

The next country on my trip was Vietnam. I learned a few phrases from a Pimsleur audio course, but apart from that, I knew very little about the Vietnamese language coming in. The places I stayed were sufficiently touristy that most people spoke enough English to get by, but not so fluently as to make learning the language pointless.

Vietnamese is a tonal language, like Mandarin and Cantonese. It has 6 tones, but they’re quite different from the tones in Mandarin (which has 4-5). At a casual glance, Vietnamese may sound similar to Chinese, but the languages are unrelated and there is little shared vocabulary.

Above: Comparison between Mandarin tones (above) and Vietnamese tones (below)

Vietnamese syllables have a wide variety of distinct vowel diphthongs; multiplied by the number of tones, this means there is a huge number of distinct syllables. By the laws of information theory, this also means that one Vietnamese syllable carries a lot of information — I was often surprised at words that were one syllable in Vietnamese but two syllables in Mandarin.

My Vietnamese pronunciation must have sounded very strange to the locals: often, when I said something, they would understand what I said, but then they’d burst out laughing. Inevitably, they’d follow by asking if I was overseas Vietnamese.

Vietnamese grammar is a bit like Chinese, with a subject-verb-object word order and lack of verb conjugations. So in Vietnamese, if you string together a bunch of words in a reasonable order, there’s a good chance it would be correct (and close to zero chance in Japanese or Korean). One notable difference is in Vietnamese, the adjective comes after the noun, whereas it comes before the noun in Chinese.

One peculiarity is that Vietnamese doesn’t have neutral pronouns for “I” or “you”. Instead, you must determine your social relationship to the other party to decide which words to use. If I’m talking to an older man, I refer to him as anh (literally: older brother) and to myself as em (literally: younger sibling). These words change if I’m talking to a young woman, or a much older woman, and so on. You can imagine that this system is quite confusing for foreigners, so it’s acceptable to use tôi, which unambiguously means “I”, although native speakers don’t often use this word.

Written Vietnamese uses the Latin alphabet (kind of like Chinese Pinyin), and closely reflects the spoken language. Most letters are pronounced more or less the way you’d expect, but there are some exceptions, for example, ‘gi’, ‘di’, and ‘ri’ are all pronounced like ‘zi’.

In two weeks in Vietnam, I didn’t learn enough of the language to have much of a conversation, but I knew enough for most of the common situations you encounter as a tourist, and could haggle prices with fruit vendors and motorcycle taxi drivers. I also learned to tell apart the northern Hanoi dialect and the southern Saigon dialect (they’re mutually intelligible but have a few differences).

Malaysia

The final country on my trip was Malaysia. Malaysia is culturally a very diverse country, with ethnic Malays, Chinese, and Indians living in the same country. The Malay language is frequently used for interethnic communication. I learned a few phrases of the language, but didn’t need to use it much, because everybody I met spoke either English or Mandarin fluently.

Malaysia is a very multilingual country. The Malaysian-Chinese people speak a southern Chinese dialect (one of Hokkien, Hakka, or Cantonese), Mandarin, Malay, and English. In Canada, it’s common to speak one or two languages, but we can only dream of speaking 4-5 languages fluently, as many Malaysians do.

Rate of Language Learning

I kept a journal of new words I learned in all my languages. Whenever somebody said a word I didn’t recognize, I would make a note of it, look it up later, and record it in my journal. When I wanted to say something but didn’t know the word for it, I would also add it to my journal. This way, I learned new words in a natural way, without having to memorize lists of words.

Above: Tally of words learned in various languages

On average, I picked up 3-5 new words for every day I spent in a foreign country. At this rate, I should be able to read Harry Potter (~5000 unique words) after about 3 years.
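For reference, the back-of-the-envelope arithmetic behind that estimate, taking 4 words per day as a midpoint:

    4 \times 365 \approx 1460 \ \text{words per year}, \qquad \frac{5000}{1460} \approx 3.4 \ \text{years}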


That’s all for now. In September, I will be starting my master’s in Computational Linguistics; hopefully, studying all these random languages will turn out to be of some use.

With so much linguistic diversity, and with most people speaking little English, Asia is a great vacation spot for language nerds and aspiring polyglots!

Further discussion of this article on /r/languagelearning.

Using Waveform Plots to Improve your Accent, and a Dive into English Phonology

I was born in China and immigrated to Canada when I was 4 years old. After living in Canada for 18 years, I consider myself a native speaker for most purposes, but I still retain a noticeable non-native accent when speaking.

This post includes a video of me speaking, if you want to hear what my accent sounds like.

It’s often considered very difficult or impossible to change your accent once you reach adulthood. I don’t know if this is true, but it sounds like a self-fulfilling prophecy: the more you believe it’s impossible, the less you try, so of course your accent never gets any better. Impossible or not, it’s worth giving it a try.

The first step is identifying what errors you’re making. This can be quite difficult if you’re not a trained linguist — native English speakers will detect that you have an accent, but they can’t really pinpoint exactly what’s wrong with your speech — it just sounds wrong to them.

One accent reduction strategy is the following: listen to a native speaker saying a sentence (for example, in a movie or on the radio), and repeat the same sentence, mimicking the intonation as closely as possible. Record both sentences, and play them side by side. This way, with all the other confounding factors gone, it’s much easier to identify the differences between your pronunciation and the native one.

When I tried doing this using Audacity, I noticed something interesting. Oftentimes, it was easier to spot differences in the waveform plot (that Audacity shows automatically) than to hear the differences between the audio samples. When you’re used to speaking a certain way all your life, your ears “tune out” the differences.
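If you’d rather script this than eyeball it in Audacity, here’s a minimal sketch in Python (my own alternative, not the workflow described above) that plots two recordings one above the other with matplotlib and SciPy. The file names “native.wav” and “mine.wav” are hypothetical placeholders for your exported clips.

    # Plot two recordings' waveforms one above the other for visual comparison.
    # "native.wav" and "mine.wav" are hypothetical file names; export your own clips.
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.io import wavfile

    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(10, 4))
    for ax, path, title in zip(axes, ["native.wav", "mine.wav"], ["Native speaker", "Me"]):
        rate, samples = wavfile.read(path)        # sample rate (Hz) and amplitude array
        if samples.ndim > 1:                      # mix stereo down to mono
            samples = samples.mean(axis=1)
        t = np.arange(len(samples)) / rate        # time axis in seconds
        ax.plot(t, samples, linewidth=0.5)
        ax.set_title(title)
    axes[-1].set_xlabel("Time (s)")
    plt.tight_layout()
    plt.show()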

Here’s an example. The phrase is “figure out how to sell it for less” (Soundcloud):


The difference is clear in the waveform plot. In my audio sample, there are two spikes corresponding to the “t” sound that don’t appear in the native speaker’s sample.

For vowels, the spectrogram works better than the waveform plot. Here are the words “said” and “sad”, which differ only in the vowel:


Again, if you find it difficult to hear the difference, it helps to have a visual representation to look at.
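For the spectrogram view, matplotlib’s specgram does the heavy lifting. Again a rough sketch under the same assumptions, with “said.wav” as a hypothetical recording of one of the words:

    # Spectrogram of a single recording (better than the waveform for comparing vowels)
    import matplotlib.pyplot as plt
    from scipy.io import wavfile

    rate, samples = wavfile.read("said.wav")      # hypothetical file name
    if samples.ndim > 1:
        samples = samples.mean(axis=1)            # mix stereo down to mono
    plt.specgram(samples, Fs=rate, NFFT=1024, noverlap=512)
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()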


I was surprised to find out that I’d been pronouncing the “t” consonant incorrectly all my life. In English, the letter “t” represents an aspirated alveolar stop (IPA /tʰ/), which is what I’m doing, right? Well, no. The letter “t” does produce the sound /tʰ/ at the beginning of a word, but in American English, a “t” at the end of a word is usually de-aspirated so that there’s no audible release. It can also turn into a glottal stop (IPA /ʔ/) in some dialects; either way, native speakers rarely pronounce a fully released /tʰ/ there, except in careful speech.

This is a phonological rule, and there are many others like it. Here’s a simple experiment: put your hand in front of your mouth and say the word “pin”. You should feel a puff of air in your palm. Now say the word “spin” — and there is no puff of air. This is because in English, the /p/ sound is de-aspirated when it follows /s/, which makes it sound closer to /b/.

Now this got me curious and I wondered: exactly what are the rules governing sound changes in English consonants? Can I learn them so I don’t make this mistake again? Native English speakers don’t know these rules (consciously at least), and even ESL materials don’t go into much detail about subtle aspects of pronunciation. The best resources for this would be linguistics textbooks on English phonology.

I consulted a textbook called “Gimson’s Pronunciation of English” [1]. For just the rules regarding sound changes of the /t/ sound at the word-final position, the book lists 6 rules. Here’s a summary of the first 3:

  • No audible release in syllable-final positions, especially before a pause. Examples: mat, map, robe, road. To distinguish /t/ from /d/, the preceding vowel is lengthened for /d/ and shortened for /t/.
  • In stop clusters like “white post” (t + p) or “good boy” (d + b), there is no audible release for the first consonant.
  • When a plosive consonant is followed by a nasal consonant that is homorganic (articulated in the same place), then the air is released out of the nose instead of the mouth (eg: topmost, submerge). However, this doesn’t happen if the nasal consonant is articulated in a different place (eg: big man, cheap nuts).

As you can see, the rules are quite complicated, and the book is somewhat challenging for non-linguists. These are just the rules for /t/ in word-final position; the book goes on to spend hundreds of pages covering all the vowel changes that occur in stressed and unstressed syllables, when words combine with other words, and so on. For a summary, take a look at the Wikipedia article on English Phonology.

What’s really amazing is how native speakers learn all these patterns, perfectly, as babies. Native speakers may make orthographic mistakes like mixing up “their, they’re, there”, but they never make phonological mistakes like forgetting to de-aspirate the /p/ in “spin” — they simply get it right every time, without even realizing it!


Some of my friends immigrated to Canada at a similar age to me, or later, and learned English with no noticeable accent, so people sometimes find it strange that I still have one. Even more interesting is the fact that although my pronunciation is non-native, I don’t make non-native grammatical mistakes. In other words, I can intuitively judge which sentences are grammatical or ungrammatical just as well as a native speaker can. Does that make me a linguistic anomaly? Intrigued, I dug deeper into the academic research.

In 1999, Flege et al. conducted a study of Korean-American immigrants who had moved to the USA at an early age [2]. Each participant was given two tasks. In the first task, the participant spoke a series of English sentences, and native speakers judged how strong a foreign accent was on a scale from 1 to 9. In the second task, the participant was given a list of English sentences, some grammatical and some not, and picked out which ones were grammatical.

Linguists hypothesize that during first language acquisition, babies learn the phonology of their language long before they start to speak; grammatical structure is acquired much later. The Korean-American study seems to support this hypothesis. For the phonological task, immigrants who arrived as young as age 3 sometimes retained a non-native accent into adulthood.

Above: Scores for the phonological task decrease as age of arrival increases, but even very early arrivals can retain a non-native accent.

Basically, arriving before age 6 or so increases the chance of the child developing a native-like accent, but by no means does it guarantee it.

On the other hand, the window for learning grammar is much longer:

Above: Scores for the grammatical task only start to decrease after about age 7.

Age of arrival is a large factor, but does not explain everything. Some people are just naturally better at acquiring languages than others. The study also looked at the effect of other factors like musical ability and perceived importance of English on the phonological score, but the connection is a lot weaker.

Language is so easy that every baby picks it up, yet so complex that linguists write hundreds of pages to describe it. Even today, language acquisition is poorly understood, and there are many unresolved questions about how it works.


References

  1. Cruttenden, Alan. “Gimson’s Pronunciation of English, 8th Edition”. Routledge, 2014.
  2. Flege, James Emil, et al. “Age Constraints on Second-Language Acquisition”. Journal of Memory and Language, 41, 1999.

The Power Law Distribution and the Harsh Reality of Language Learning

I’m an avid language learner, and sometimes people ask me: “how many languages do you speak?” If we’re counting all the languages in which I can have at least a basic conversation, then I can speak five languages — but can I really claim fluency in a language if I can barely read children’s books? Despite being a seemingly innocuous question, it’s not so simple to answer. In this article, I’ll try to explain why.

Let’s say you’re just starting to study Japanese. You might picture yourself being able to do the following things, after a few months or years of study:

  1. Have a conversation with a Japanese person who doesn’t speak any English
  2. Watch the latest episode of some anime in Japanese before the English subtitles come out
  3. Overhear a conversation between two Japanese people in an elevator

After learning several languages, I discovered that the first task is a lot easier than the other two, by an order of magnitude. Whether in French or in Japanese, I would quickly learn enough of the language to talk to people, but the ability to understand movies and radio remains elusive even after years of study.

There is a fundamental difference in how language is used in one-on-one conversation versus the other two tasks. When conversing with a native speaker, it is possible for him to avoid colloquialisms, speak slower, and repeat things you didn’t understand using simpler words. On the other hand, when listening to native-level speech without the speaker adjusting for your language level, you need to be near native-level yourself to understand what’s going on.

We can justify this concept using statistics. By looking at how frequencies of English words are distributed, we show that after an initial period of rapid progress, it soon becomes exponentially harder to get better at a language. Conversely, even a small decrease in language complexity can drastically increase comprehension by non-native listeners.

Reaching conversational level is easy

For the rest of this article, I’ll avoid using the word “fluent”, which is rather vague and misleading. Instead, I will call a “conversational” speaker someone who can conduct some level of conversation in a language, and a “near-native” speaker someone who can readily understand speech and media intended for native speakers.

It’s surprising how little of a language you actually need to know to have a decent conversation with someone. Basically, you need to know:

  1. A set of about 1000-2000 very basic words (eg: person, happy, cat, slow, etc).
  2. Enough grammar to form sentences (eg: present / future / past tenses; connecting words like “then”, “because”; conditionals, comparisons, etc). Grammar doesn’t need to be perfect, just close enough for the listener to understand what you’re trying to say.
  3. The flexibility to work around gaps: when you want to say something but don’t know the word for it, express it with words you do know.

For an example of English using only basic words, look at the Simple English Wikipedia. It shows that you can explain complex things using a vocabulary of only about 1000 words.

For another example, imagine that Bob, a native English speaker, is talking to Jing, an international student from China. Their conversation might go like this:

Bob: I read in the news that a baby got abducted by wolves yesterday…

Jing: Abducted? What do you mean?

Bob: He got taken away by wolves while the family was out camping.

Jing: Wow, that’s terrible! Is he okay now?

In this conversation, Jing indicates that she doesn’t understand a complex word, “abducted”, and Bob rephrases the idea using simpler words, and the conversation goes on. This pattern happens a lot when I’m conversing with native Japanese speakers.

After some time, Bob gets an intuitive feeling for what level of words Jing can understand, and naturally simplifies his speech to accommodate. This way, the two can converse without Jing explicitly interrupting and asking Bob to repeat what he said.

Consequently, reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in a language in 3 months; I think that’s a reasonable amount of time for reaching conversational level.

What if you don’t have the luxury of the speaker simplifying his level of speech for you? We shall see that the task becomes much harder.

The curse of the Power Law

Initially, I was inspired to write this article after an experience with a group of French speakers. I could talk to any of them individually in French, which is hardly remarkable given that I had studied the language since grade 4 and minored in it in university. However, when they talked among themselves, I was completely lost and could only get a vague sense of what they were talking about.

Feeling slightly embarrassed, I sought an explanation for this phenomenon. Why was it that I could produce 20-page essays for university French classes, but struggled to understand dialogue in French movies and everyday conversations between French people?

The answer lies in the distribution of word frequencies in language. It doesn’t matter if you’re looking at English or French or Japanese — every natural language follows a power law distribution, which means that the frequency of every word is inversely proportional to its rank in the frequency table. In other words, the 1000th most common word appears twice as often as the 2000th most common word, and four times as often as the 4000th most common word, and so on.

(Aside: this phenomenon is also known as Zipf’s Law. It’s unclear why it occurs, but the law holds in every natural language.)
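Written out as a formula, this is just the inverse-proportionality claim above (with C an arbitrary constant, not a fitted model):

    f(r) \approx \frac{C}{r} \quad\Rightarrow\quad \frac{f(1000)}{f(2000)} = \frac{C/1000}{C/2000} = 2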

Above: Power law distribution in natural languages

The power law distribution exhibits the long tail property: as you advance further to the right of the distribution (by learning more vocabulary), the words become less and less common, but their frequency never drops to zero. Furthermore, rare words like “constitution” or “fallacy” convey disproportionately more meaning than common words like “the” or “you”.

This is bad news for language learners. Even if you understand 90% of the words of a text, the remaining 10% are the most important words in the passage, so you actually understand much less than 90% of the meaning. Moreover, it takes exponentially more vocabulary and effort to understand 95% or 98% or 99% of the words in the text.

I set out to experimentally test this phenomenon in English. I took the Brown Corpus, containing a million words of various English text, and computed the size of vocabulary you would need to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the corpus.
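Here’s a minimal sketch of that computation using NLTK’s copy of the Brown Corpus. The tokenization choices (lowercasing, dropping non-alphabetic tokens) are my assumptions, so the exact counts can differ a little from the numbers I quote below.

    # For each target coverage level, find how many of the most frequent word forms
    # you need to know to cover that fraction of all word tokens in the Brown Corpus.
    from collections import Counter

    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)

    counts = Counter(w.lower() for w in brown.words() if w.isalpha())
    total = sum(counts.values())

    targets = [0.50, 0.80, 0.90, 0.95, 0.98, 0.99, 0.995]
    covered, idx = 0, 0
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        while idx < len(targets) and covered / total >= targets[idx]:
            print(f"{targets[idx]:.1%} coverage: top {rank} words")
            idx += 1
        if idx == len(targets):
            break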


By knowing 75 words, you already understand half of the words in a text! Of course, just knowing words like “the” and “it” doesn’t get you very far. Learning 2000 words is enough to have a decent conversation and understand 80% of the words in a text. However, it gets exponentially harder after that: to get from 80% to 98% comprehension, you need to learn more than 10 times as many words!

(Aside: in this analysis I’m considering conjugations like “swim” and “swimming” to be different words; if you count only the stems, you end up with lower word counts but they still follow a similar distribution)

How many words can you miss and still be able to figure out the meaning by inference? In a typical English novel, I encounter about one word per page that I’m unsure of, and a page contains about 200-250 words, so I estimate 99.5% comprehension is native level. When there are more than 5 words per page that I don’t know, then reading becomes very slow and difficult — this is about 98% comprehension.
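Spelled out, the arithmetic behind those two thresholds (taking roughly 225 words per page) is:

    1 - \frac{1}{225} \approx 99.6\%, \qquad 1 - \frac{5}{225} \approx 97.8\%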

Therefore I will consider 98% comprehension “near-native”: above this level, you can generally infer the remaining words from context. Below this level, say between 90% to 98% comprehension, you may understand generally what’s going on, but miss a lot of crucial details.

Above: Perceived learning curve for a foreign language

This explains the difficulty of language learning. In the beginning, progress is fast, and in a short period of time you learn enough words to have conversations. After that, you reach a long intermediate-stage plateau where you’re learning more words, but don’t know enough to understand native-level speech, and anybody speaking to you must use a reduced vocabulary in order for you to understand. Eventually, you will know enough words to infer the rest from context, but you need a lot of work to reach this stage.

Implications for language learners

The good news is that if you want to converse with people in a language, it’s perfectly doable in 3 to 6 months. On the other hand, to watch TV shows in the language without subtitles or understand people speaking naturally is going to take a lot more work — probably living for a few years in a country where the language is spoken.

For most of us, living abroad for several years is not an option. Fortunately, there is a lot of material on the Internet in any language imaginable. I built a tool called LevelText to help you find things to read in your target language (it currently works for French and Spanish). It’s basically a search engine optimized for finding web and news articles at your reading level, and it can turn web articles into mini language lessons.

Is there any shortcut instead of slowly learning thousands of words? I can’t say for sure, but somehow I doubt it. By nature, words are arbitrary clusters of sounds, so no amount of cleverness can help you deduce the meaning of words you’ve never seen before. And when the proportion of unknown words is above a certain threshold, it quickly becomes infeasible to try to infer meaning from context. We’ve reached the barrier imposed by the power law distribution.


Now I will briefly engage in some sociological speculation.

My university has a lot of international students. I’ve always noticed that these students tend to form social groups speaking their native non-English languages, and rarely assimilate into English-speaking social groups. At first I thought maybe this was because their English was bad — but I talked to a lot of international students in English and their English seemed okay: noticeably non-native but I didn’t feel there was a language barrier. After all, all our lectures are in English, and they get by.

However, I noticed that when I talked to international students, I subconsciously matched their rate of speech, speaking just a little more slowly and clearly than normal. I would also avoid colloquialisms and cultural references that they might not understand.

If the same international student went out to a bar with a group of native English speakers, everyone else would be speaking at normal native speed. Even though she understands more than 90% of the words being spoken, it’s not quite enough to follow the discussion, and she doesn’t want to interrupt the conversation to clarify a word. As everything builds on what was previously said in the conversation, missing a word here and there means she is totally lost.

It’s not that immigrants don’t want to assimilate into our culture, but rather, we don’t realize how hard it is to master a language. On the surface, going from 90% to 98% comprehension looks like a small increase, but in reality, it takes an immense amount of work.

Read further discussion of this article on /r/languagelearning!