The Power Law Distribution and the Harsh Reality of Language Learning

I’m an avid language learner, and sometimes people ask me: “how many languages do you speak?” If we’re counting all the languages in which I can have at least a basic conversation, then I can speak five languages — but can I really claim fluency in a language if I can barely read children’s books? Despite being a seemingly innocuous question, it’s not so simple to answer. In this article, I’ll try to explain why.

Let’s say you’re just starting to study Japanese. You might picture yourself being able to do the following things, after a few months or years of study:

  1. Have a conversation with a Japanese person who doesn’t speak any English
  2. Watch the latest episode of some anime in Japanese before the English subtitles come out
  3. Overhear a conversation between two Japanese people in an elevator

After learning several languages, I discovered that the first task is a lot easier than the other two, by an order of magnitude. Whether in French or in Japanese, I quickly learned enough of the language to talk to people, but the ability to understand movies and radio remained elusive even after years of study.

There is a fundamental difference in how language is used in one-on-one conversation versus the other two tasks. When conversing with a native speaker, it is possible for him to avoid colloquialisms, speak slower, and repeat things you didn’t understand using simpler words. On the other hand, when listening to native-level speech without the speaker adjusting for your language level, you need to be near native-level yourself to understand what’s going on.

We can justify this concept using statistics. By looking at how frequencies of English words are distributed, we show that after an initial period of rapid progress, it soon becomes exponentially harder to get better at a language. Conversely, even a small decrease in language complexity can drastically increase comprehension by non-native listeners.

Reaching conversational level is easy

For the rest of this article, I’ll avoid using the word “fluent”, which is rather vague and misleading. Instead, I will call a “conversational” speaker someone who can conduct some level of conversation in a language, and a “near-native” speaker someone who can readily understand speech and media intended for native speakers.

It’s surprising how little of a language you actually need to know to have a decent conversation with someone. Basically, you need to know:

  1. A set of about 1000-2000 very basic words (eg: person, happy, cat, slow, etc).
  2. Enough grammar to form sentences (eg: present / future / past tenses; connecting words like “then”, “because”; conditionals, comparisons, etc). Grammar doesn’t need to be perfect, just close enough for the listener to understand what you’re trying to say.
  3. When you want to say something but don’t know the word for it, be flexible enough to work around the issue and express it with words you do know.

For an example of English using only basic words, look at the Simple English Wikipedia. It shows that you can explain complex things using a vocabulary of only about 1000 words.

For another example, imagine that Bob, a native English speaker, is talking to Jing, an international student from China. Their conversation might go like this:

Bob: I read in the news that a baby got abducted by wolves yesterday…

Jing: Abducted? What do you mean?

Bob: He got taken away by wolves while the family was out camping.

Jing: Wow, that’s terrible! Is he okay now?

In this conversation, Jing indicates that she doesn’t understand a complex word, “abducted”, and Bob rephrases the idea using simpler words, and the conversation goes on. This pattern happens a lot when I’m conversing with native Japanese speakers.

After some time, Bob gets an intuitive feeling for what level of words Jing can understand, and naturally simplifies his speech to accommodate. This way, the two can converse without Jing explicitly interrupting and asking Bob to repeat what he said.

Consequently, reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in a language in 3 months. I think this is a reasonable amount of time for reaching conversational level.

What if you don’t have the luxury of the speaker simplifying his level of speech for you? We shall see that the task becomes much harder.

The curse of the Power Law

I was originally inspired to write this article after an experience with a group of French speakers. I could talk to any of them individually in French, which is hardly remarkable given that I had studied the language since grade 4 and minored in it in university. However, when they talked among themselves, I was completely lost, and could only get a vague sense of what they were talking about.

Feeling slightly embarrassed, I sought an explanation for this phenomenon. Why was it that I could produce 20-page essays for university French classes, but struggled to understand dialogue in French movies and everyday conversations between French people?

The answer lies in the distribution of word frequencies in language. It doesn’t matter if you’re looking at English or French or Japanese: word frequencies in every natural language roughly follow a power law distribution, which means that the frequency of a word is approximately inversely proportional to its rank in the frequency table. In other words, the 1000th most common word appears about twice as often as the 2000th most common word, and about four times as often as the 4000th most common word, and so on.

(Aside: this phenomenon is known as Zipf’s Law. It’s unclear why it occurs, but the law holds approximately in every natural language.)
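
To make the "inversely proportional to rank" relationship concrete, here is a minimal Python sketch of an idealized Zipf distribution. The function name `zipf_relative_frequency` and the `exponent` parameter are illustrative choices, not something from the analysis in this article, and real word counts only follow this curve approximately.

```python
def zipf_relative_frequency(rank: int, exponent: float = 1.0) -> float:
    """Unnormalized frequency of the word at `rank` under an ideal Zipf law.

    With exponent 1.0 the frequency is exactly inversely proportional to rank.
    """
    if rank < 1:
        raise ValueError("rank starts at 1 for the most common word")
    return 1.0 / rank ** exponent


# Compare the 1000th most common word with the 2000th and the 4000th.
ratio_2000 = zipf_relative_frequency(1000) / zipf_relative_frequency(2000)
ratio_4000 = zipf_relative_frequency(1000) / zipf_relative_frequency(4000)
print(ratio_2000, ratio_4000)  # approximately 2 and 4
```

Doubling the rank halves the frequency, which is the "twice as often" and "four times as often" pattern described above.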

[Figure: Power law distribution in natural languages]

The power law distribution exhibits the long tail property, meaning that as you advance further to the right of the distribution (by learning more vocabulary), the words become less and less common, but their frequency never drops off completely. Furthermore, rare words like “constitution” or “fallacy” convey disproportionately more meaning than common words like “the” or “you”.

This is bad news for language learners. Even if you understand 90% of the words of a text, the remaining 10% are the most important words in the passage, so you actually understand much less than 90% of the meaning. Moreover, it takes exponentially more vocabulary and effort to understand 95% or 98% or 99% of the words in the text.

I set out to experimentally test this phenomenon in English. I took the Brown Corpus, containing a million words of various English text, and computed the size of vocabulary you would need to understand 50%, 80%, 90%, 95%, 98%, 99%, and 99.5% of the words in the corpus.
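
A minimal sketch of how such a computation could be set up in Python, not necessarily how the numbers in this article were produced. The function name `vocab_size_for_coverage` is hypothetical, lowercasing the tokens is an assumption, and the toy sentence stands in for a real corpus.

```python
from collections import Counter
from typing import Dict, Iterable, List


def vocab_size_for_coverage(tokens: Iterable[str], targets: List[float]) -> Dict[float, int]:
    """For each target fraction, return how many of the most frequent distinct
    words are needed to account for at least that fraction of all tokens."""
    # Assumption: case is ignored, but inflected forms stay separate words.
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens given")

    remaining = sorted(targets)
    result: Dict[float, int] = {}
    covered = 0
    next_target = 0
    # Walk down the frequency table, most common word first.
    for vocab_size, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        while next_target < len(remaining) and covered >= remaining[next_target] * total:
            result[remaining[next_target]] = vocab_size
            next_target += 1
    return result


# Toy input; a real run could pass the tokens of a corpus instead, for
# example nltk.corpus.brown.words() once NLTK and its Brown data are installed.
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(vocab_size_for_coverage(toy_tokens, [0.3, 0.6, 1.0]))
```

On the toy sentence, one word ("the") already covers 30% of the tokens, while covering everything requires all 8 distinct words.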

[Figure: size of vocabulary needed to understand 50% to 99.5% of the words in the Brown Corpus]

By knowing 75 words, you already understand half of the words in a text! Of course, just knowing words like “the” and “it” doesn’t get you very far. Learning 2000 words is enough to have a decent conversation and understand 80% of the words in a text. However, it gets exponentially harder after that: to get from 80% to 98% comprehension, you need to learn more than 10 times as many words!

(Aside: in this analysis I’m considering conjugations like “swim” and “swimming” to be different words; if you count only the stems, you end up with lower word counts but they still follow a similar distribution)
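
To see why coverage gets so expensive, it can help to look at an idealized model rather than real data. The sketch below assumes a perfect Zipf law over a hypothetical vocabulary of 50,000 words; both the function name `zipf_vocab_for_coverage` and the vocabulary size are illustrative assumptions, and the numbers it prints will not match the Brown Corpus figures.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List


def zipf_vocab_for_coverage(targets: List[float], vocab_total: int = 50_000) -> List[int]:
    """Under an ideal Zipf law (frequency proportional to 1 / rank) over
    `vocab_total` distinct words, return for each target fraction the smallest
    number of top-ranked words whose combined frequency reaches that fraction."""
    # cumulative[k - 1] is the k-th harmonic number: 1 + 1/2 + ... + 1/k.
    cumulative = list(accumulate(1.0 / rank for rank in range(1, vocab_total + 1)))
    total = cumulative[-1]
    sizes = []
    for target in targets:
        index = bisect_left(cumulative, target * total)
        sizes.append(min(index + 1, vocab_total))
    return sizes


# vocab_total = 50_000 is an arbitrary illustrative choice, not a measured value.
coverage_targets = [0.5, 0.8, 0.9, 0.95, 0.98]
for target, size in zip(coverage_targets, zipf_vocab_for_coverage(coverage_targets)):
    print(f"{target:.0%} coverage needs about {size} words in this toy model")
```

In this model the covered fraction grows roughly like the logarithm of the vocabulary size, so each additional slice of coverage costs a multiplicative increase in the number of words.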

How many words can you miss and still be able to figure out the meaning by inference? In a typical English novel, I encounter about one word per page that I’m unsure of, and a page contains about 200-250 words, so I estimate 99.5% comprehension is native level. When there are more than 5 words per page that I don’t know, then reading becomes very slow and difficult — this is about 98% comprehension.

Therefore I will consider 98% comprehension “near-native”: above this level, you can generally infer the remaining words from context. Below this level, say between 90% and 98% comprehension, you may understand generally what’s going on, but miss a lot of crucial details.
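
The arithmetic behind the 99.5% and 98% figures can be checked with a few lines of Python. The function name `comprehension_rate` is just an illustrative label for "one minus the share of unknown words".

```python
def comprehension_rate(unknown_per_page: float, words_per_page: float) -> float:
    """Fraction of words on a page that are known."""
    if words_per_page <= 0:
        raise ValueError("words_per_page must be positive")
    return 1.0 - unknown_per_page / words_per_page


# One or five unknown words on a page of 200 to 250 words.
for unknown in (1, 5):
    for page_length in (200, 250):
        rate = comprehension_rate(unknown, page_length)
        print(f"{unknown} unknown / {page_length} words: {rate:.1%}")
```

One unknown word per page works out to 99.5% to 99.6% comprehension, and five unknown words per page to 97.5% to 98.0%, which is where the rounded thresholds come from.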

[Figure: Perceived learning curve for a foreign language]

This explains the difficulty of language learning. In the beginning, progress is fast, and in a short period of time you learn enough words to have conversations. After that, you reach a long intermediate-stage plateau where you’re learning more words, but don’t know enough to understand native-level speech, and anybody speaking to you must use a reduced vocabulary in order for you to understand. Eventually, you will know enough words to infer the rest from context, but you need a lot of work to reach this stage.

Implications for language learners

The good news is that if you want to converse with people in a language, it’s perfectly doable in 3 to 6 months. On the other hand, watching TV shows in the language without subtitles or understanding people speaking naturally is going to take a lot more work, probably a few years of living in a country where the language is spoken.

Is there any shortcut instead of slowly learning thousands of words? I can’t say for sure, but somehow I doubt it. By nature, words are arbitrary clusters of sounds, so no amount of cleverness can help you deduce the meaning of words you’ve never seen before. And when the proportion of unknown words is above a certain threshold, it quickly becomes infeasible to try to infer meaning from context. We’ve reached the barrier imposed by the power law distribution.


Now I will briefly engage in some sociological speculation.

My university has a lot of international students. I’ve always noticed that these students tend to form social groups speaking their native non-English languages, and rarely assimilate into English-speaking social groups. At first I thought maybe this was because their English was bad — but I talked to a lot of international students in English and their English seemed okay: noticeably non-native but I didn’t feel there was a language barrier. After all, all our lectures are in English, and they get by.

However, I noticed that when I talked to international students, I subconsciously matched their rate of speaking, speaking just a little bit slower and clearer than normal. I would also avoid the usage of colloquialisms and cultural references that they might not understand.

If the same international student went out to a bar with a group of native English speakers, everyone else would be speaking at normal native speed. Even though she understands more than 90% of the words being spoken, it’s not quite enough to follow the discussion, and she doesn’t want to interrupt the conversation to clarify a word. As everything builds on what was previously said in the conversation, missing a word here and there means she is totally lost.

It’s not that immigrants don’t want to assimilate into our culture, but rather, we don’t realize how hard it is to master a language. On the surface, going from 90% to 98% comprehension looks like a small increase, but in reality, it takes an immense amount of work.

Read further discussion of this article on /r/languagelearning!

23 thoughts on “The Power Law Distribution and the Harsh Reality of Language Learning”

  1. Such an excellent article. I particularly liked the final paragraphs that explain why foreigners are left aside.
    It’s time for us to be more understanding about foreigners in our countries.

  2. Thanks for the article. As a language learner in México, I have been frustrated that my progress isn’t what I hoped it would be after 5 months in this country. The law of diminishing returns is certainly at play here. Now I know firsthand what it means to be a foreigner.

  3. That’s a great article, and a very clear exemplification of that invisible barrier that separates “language learning” from being near-native.

    I’m a native language teacher (Spanish) myself, now living in the US. I would say I have a good understanding of languages and learning (graduated in Linguistics). And yet, I came to the US for the first time as an adult, and noticed that there’s a huge gap that was never covered in any course or teaching material before. There’s just so much shared underlying knowledge, in a way that native speakers are not really conscious of.

    But what made me comment is the thought that it’s not just the words. It’s the values, the feelings, etc. Quick example: “commute” is not a word in Spanish. We can say “the daily trip from home to work”, but there’s so much more behind that word, and you don’t really understand it until you see it used by people.

    Anyway, forgive my rant! I just got excited from your thoughts. Good job! I’ll dig up a bit more into your blog.

  4. This is a superb article! As an active language learner and user (German and Spanish as mother tongues, English near-native, French near-native, Russian close to near-native and Arabic in the abyss between conversational and near-native) I have experienced the phenomenon that you describe several times. Although I have learned French for several years and studied in France for a year, although I understand most of the news and talk in films, it is still difficult to understand native speakers when they have a natural conversation. Or let’s take my Russian for example (I recently passed the B2 exam and the testing person told me, I should’ve passed C1 instead): although I can have a conversation about virtually everything, about the most complex issues, I will most certainly get lost in a simple conversation about the newest hot music and performers when my Russian friends have a conversation between themselves. Thanks for writing the article and sharing it! Hope to read more from you!

  5. Very nice post! Thanks a lot! Two addenda (for further consideration) are advisable, though. First, words do not “convey meanings” but rather contribute to disambiguation (or discrimination) of meaning that is distributed in an utterance or, furthermore, discourse. In other words, words help in lowering uncertainty about what the communication is about. An example that I like to use in my talks is to think of three simple sentences:
    Walk me through your blogpost on Language Learning.
    I’ll walk you to the train station.
    They walk on clouds.
    Hardcore corpus-based approach would say that the word “walk” is quite frequent here (so in your framework not very ‘meaningful’, i.e., ‘informative’). But, in fact, those are very different “walks”, don’t you think?
    Second, and in relation to my previous point, real ‘hardness’ hits when one needs to pick up contextual regularities (or if you prefer probabilities, or likelihoods etc.); i.e., co-occurrences of words in communication (personally, I prefer the term ‘syntagmatic regularities’ but it could be just more technical). Formally, they still show a similar power-law (or exponential? 😉) property as words, but combinatorial explosion kicks in and learners struggle long and suffer hard to figure out why some combinations are typical, or possible, or odd, or ‘never-ever’.

    1. I don’t really understand what you’re saying. It seems that “conveying meaning” and “disambiguating meaning” is the same thing? If you don’t understand a single word in a sentence, then it could be anything in the universe of speakable sentences, but each time you understand a word, it narrows down the range of possibilities for what the speaker might have said.

      1. Yes, exactly: “each time you understand a word, it narrows down the range of possibilities for what the speaker might have said”. Beautifully stated! My concern was only that someone could conclude that you are actually claiming that words are discrete units of meaning, which they are not and which you never said. But that’s all less important as I find your post amazing and I share it shamelessly… ☺️

  6. Huh?? You say, “Reaching conversational level in a language is not very hard. Some people claim you can achieve “fluency” in 3 months for a language. I think this is a reasonable amount of time for reaching conversational level.” That’s a very vague statement, in any language. Three months of what? A 101-102 college course? A one-hour per day course for 3 months? A five-hour per day course for 3 months? Or 24/7 concentrated study-immersion for 3 months among solely native speakers in the country where L2 is spoken? In any case, what use is what you say in your article? All your stats and chit chat about word frequencies, etc., doesn’t help one learn a second language any faster? Instead of wasting my time reading your article, I now wish I’d spent my time reading or listening to some passage in L2 I’m trying to learn. BTW: Have you noticed that so-called “linguists” spend all their time talking and writing in their own native language about L2s rather than showing and demonstrating how well they have supposedly learned an L2 themselves?

    1. It really depends on the person, some people are better at languages than others (and the language too, Spanish is much easier for English speakers than Chinese). I’m just saying that it seems reasonable for a talented language learner to achieve conversational level in a language after 3 months of study, with a moderate amount of effort.

      Many linguists do speak the L2s they are studying.

    2. I don’t know about that particular linguist, and for sure, there are plenty of monolingual linguists. But if you do linguistic fieldwork, and write 10 papers about Navajo noun classes in that time, I’d imagine you’d be decent at Navajo by then.

  7. Great article!
    One way of looking at it is to estimate the time it takes to acquire this vocabulary size. A quick calculation: learning 15,000 words at a pace of 10 new words per day requires a little over 4 years of work…
    Another thing is that vocabulary, especially “down the long tail”, tends to be very topic dependent. Even if you are “fluent”, when you start reading about a new topic, you might still be at a loss as you need to acquire the vocabulary of the field.

  11. This was a really interesting read. I also do research in linguistics–specifically, second language acquisition– but my focus is not lexicon or morphology and so yours is a perspective that I hadn’t extensively considered before. Vocabulary is certainly important, and you begin to touch on a few points that I think are perhaps equally important in explaining the difficulty in improving in a language and making the “jump” between having an everyday conversation in a language and, say, understanding a conversation between two native speakers of that language or a news broadcast.

    One is negotiation of meaning. As you observed, when speaking to a non-native speaker you tend to adjust your speech for them, and vice-versa when someone speaks to you in your non-native language. Moreover, in normal interactions people can ask each other clarification questions, or make requests for repetition, which may lead to paraphrase or further explanations, much as you illustrated in your example with Jing and Bob. Obviously, that is something you cannot do when you’re merely overhearing a conversation between two other people (unless you are especially bold of heart!). However, causes for difficulty in understanding are not restricted to unfamiliar vocabulary: rate of speech, accents, cultural references, grammar, and idiolects can all act as potential sources of “trouble” that may be resolved through negotiation of meaning. And the ability to resolve communication breakdowns itself is considered another type of language-related competence, i.e. strategic competence or interactional competence.

    There are also other resources that are present in face-to-face conversations that may not be available when one is a bystander or audience member. Contextual cues, including physical ones, and shared history and past interactions are all key sources of information for interpreting someone’s message. Another factor that doesn’t get as much attention is affect; I’m probably going to be much more invested in trying to figure out what you are saying to me if it’s somehow consequential for me, than understanding the governor election results in Sapporo (an extreme example, but you get the point).

    As for the difficulty for non-native speakers in assimilating to the host/majority culture or community, I think listening comprehension is just one aspect of the story, albeit an important one. Every language learner has their own reasons and goals for learning a language. It might very well be the case that an international student studying in America does not want to assimilate into American culture; perhaps they don’t see living in America as something in their future, and so being able to understand lectures and get by academically is enough, and that is perfectly valid. Not everyone wants to become highly proficient (dare I say, “fluent”) in their non-native language(s), and how they view their relationship with a language and the community will in turn affect their desire to use and improve in that language, which is another reason why many foreign language learners will plateau after reaching a certain point. That being said, I agree that the linguistic challenges that non-native speakers must face when assimilating to a culture are often overlooked.

    TL;DR, I would say that vocabulary knowledge, while undoubtedly a source of difficulty in advancement in language learning, is by no means the only or even primary explanation 🙂

    1. For sure, there are numerous factors that make languages difficult to learn (and also what makes language learning a fun challenge!). I think vocabulary size is a reasonable proxy for linguistic competence because (1) it’s the most basic, in the sense that you have to know a certain amount of vocabulary before any of the other stuff starts to matter, and (2) it’s easier to quantify than other dimensions of competence.

      Negotiation of meaning seems like a meta-linguistic skill that’s transferable to learning other foreign languages, unlike vocabulary which is tied to a specific language. This is partly why learning the first foreign language is usually the most difficult, and learning the second foreign language is easier.

      I’m also a researcher (in NLP); I’m familiar with phonology / morphology / syntax, but don’t know much about the pragmatics or sociolinguistics fields that you mentioned. I’d be interested to know if the patterns I casually observed have been attested in the literature or if there are any aspects where the literature disagrees.

  12. Very interesting article. I am from South Africa and can understand all 10 official languages except 1 (Afrikaans). I think it was my upbringing that contributed to that. Basically, from birth I was raised in an area where three languages were spoken (the area where my father was working). At five we moved to another province (where my father was born), and in that area three different languages were common; I would frequently visit these areas on holidays, back and forth. At school there were kids who spoke a different language, which I learnt too, and my sister got married to a man who spoke another language, so I learnt his language (mostly from the kids)… The story is quite long, but my main point is this: it is better to learn to speak a language when you’re very young and exposed to other kids who speak different languages. The only language I wasn’t exposed to in that manner was Afrikaans, which I was forced to learn at school for a few years and had nobody to speak it with, which could be why I don’t remember most of the stuff I learnt in that language; it’s a mission to understand a simple paragraph in it.

    So one of the things I plan to do with my children is to give them that kind of exposure from a young age, but instead of being in one country, I would move between countries with my wife. It’s a crazy Idea, but I’m looking forward to it.

    1. Yea, children have remarkable language learning ability. Southern China has a similar situation where you can grow up speaking several regional languages (Cantonese + Min Nan) plus the national language (Mandarin). Unfortunately as a child you don’t really get to choose what languages you’re exposed to.
