The biggest headache with Chinese NLP: indeterminate word segmentation

I’ve had a few opportunities to work with NLP in Chinese. English and Chinese are very different languages, yet generally the same techniques apply to both. But there is one source of frustration that comes up from time to time, and it’s perhaps not what you’d expect.

The difficulty is that Chinese doesn’t put spaces between words. Soallyourwordsarejumbledtogetherlikethis.

“Okay, that’s fine,” you say. “We’ll just have to run a tokenizer to separate the words before we do anything else. And here’s a neural network that can do this with 93% accuracy (Qi et al., 2020). That should be good enough, right?”

Well, kind of. Accuracy here isn’t very well-defined because native Chinese speakers don’t agree on how to segment words either. When you ask two native Chinese speakers to segment a sentence into words, they only agree about 90% of the time (Wang et al., 2017). Chinese has a lot of compound words and multi-word expressions, so there’s no widely accepted definition of what counts as a word. Some examples: 吃饭, 外国人, 开车, 受不了. It is also possible (but rare) for a sentence to have multiple segmentations that mean different things.

Arguably, word boundaries are ill-defined in all languages, not just Chinese. Haspelmath (2011) defined 10 linguistic criteria to determine if something is a word (vs an affix or expression), but it’s hard to come up with anything consistent. Most writing systems put spaces in between words, so there’s no confusion. Other than Chinese, only a handful of languages (Japanese, Vietnamese, Thai, Khmer, Lao, and Burmese) have this problem.

Word segmentation ambiguity causes problems in NLP systems when different components expect different ways of segmenting a sentence. Another way the problem can appear is if the segmentation for some human-annotated data doesn’t match what a model expects.

Here is a more concrete example from one of my projects. I’m trying to get a language model to predict a tag for every word (imagine POS tagging using BERT). The language model uses SentencePiece encoding, so when a word is out-of-vocab, it gets converted into multiple subword tokens.

“expedite ratification of the proposed law”
=> [“expedi”, “-te”, “ratifica”, “-tion”, “of”, “the”, “propose”, “-d”, “law”]

In English, a standard approach is to use the first subword token of every word, and ignore the other tokens, like this:

This doesn’t work in Chinese — because of the word segmentation ambiguity, the tokenizer might produce tokens that span across multiple of our words:
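The two cases can be sketched in a few lines of Python. This is only an illustration: `char_spans` and `first_subword_per_word` are hypothetical helper names, the token lists are written by hand rather than produced by SentencePiece (with spaces and the "-" markers dropped), and the Chinese split is made up to show the failure.

```python
def char_spans(pieces):
    """Return (start, end) character offsets for consecutive pieces."""
    spans, pos = [], 0
    for piece in pieces:
        spans.append((pos, pos + len(piece)))
        pos += len(piece)
    return spans


def first_subword_per_word(words, tokens):
    """For each word, return the index of the token that starts at the
    word's first character, or None if no token starts there (which
    means some token straddles the word boundary)."""
    if "".join(words) != "".join(tokens):
        raise ValueError("words and tokens must cover the same text")
    token_at_start = {start: i for i, (start, _) in enumerate(char_spans(tokens))}
    return [token_at_start.get(start) for start, _ in char_spans(words)]


# English: every word begins at a token boundary, so this works.
english = first_subword_per_word(
    ["expedite", "ratification"],
    ["expedi", "te", "ratifica", "tion"],
)

# Chinese (hypothetical split): our annotation says 外国 + 人, but the
# tokenizer produced 外 + 国人, so the second word has no first subword.
chinese = first_subword_per_word(["外国", "人"], ["外", "国人"])

print(english)  # [0, 2]
print(chinese)  # [0, None]
```

A `None` in the output marks a word whose tag has no token to attach to, which is exactly the situation that the first-subword trick cannot handle.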

So that’s why Chinese is sometimes headache-inducing when you’re doing multilingual NLP. You can work around the problem in a few ways:

  1. Ensure that all parts of the system use a consistent word segmentation scheme. This is easy if you control all the components, but hard when working with other people’s models and data.
  2. Work on the level of characters and don’t do word segmentation at all. This is what I ended up doing, and it’s not too bad, because individual characters do carry semantic meaning. But some words are unrelated to their character meanings, like transliterations of foreign words.
  3. Do some kind of segment alignment using Levenshtein distance — see the appendix of this paper by Tenney et al. (2019). I’ve never tried this method.
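As a rough sketch of the second option, word-level tags can be pushed down to individual characters so that no word boundaries are needed at prediction time. The B-/I- prefix scheme and the function name `word_tags_to_char_tags` are assumptions made for this example, not necessarily what I used in the project, and the tagged sentence is invented.

```python
def word_tags_to_char_tags(words, tags):
    """Expand one tag per word into one tag per character.

    The first character of a word gets a "B-" prefix and the rest get
    "I-", so the original word boundaries can still be recovered.
    """
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")
    char_tags = []
    for word, tag in zip(words, tags):
        for i, ch in enumerate(word):
            prefix = "B" if i == 0 else "I"
            char_tags.append((ch, f"{prefix}-{tag}"))
    return char_tags


# Hypothetical annotated sentence: 我 (PRON) + 受不了 (VERB)
example = word_tags_to_char_tags(["我", "受不了"], ["PRON", "VERB"])
print(example)
# [('我', 'B-PRON'), ('受', 'B-VERB'), ('不', 'I-VERB'), ('了', 'I-VERB')]
```

The model then predicts one label per character, and any downstream component can regroup characters into whatever segmentation it prefers.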

One final thought: surprisingly, the non-ASCII Chinese characters never caused any difficulties for me. I would’ve expected to run into encoding problems occasionally, as I had in the past, but I never had any character encoding problems with Python 3.


  1. Haspelmath, Martin. “The indeterminacy of word segmentation and the nature of morphology and syntax.” Folia linguistica 45.1 (2011): 31-80.
  2. Qi, Peng, et al. “Stanza: A python natural language processing toolkit for many human languages.” Association for Computational Linguistics (ACL) System Demonstrations. 2020.
  3. Tenney, Ian, et al. “What do you learn from context? Probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019.
  4. Wang, Shichang, et al. “Word intuition agreement among Chinese speakers: a Mechanical Turk-based study.” Lingua Sinica 3.1 (2017): 13.

Representation Learning for Discovering Phonemic Tone Contours

My paper titled “Representation Learning for Discovering Phonemic Tone Contours” was recently presented at the SIGMORPHON workshop, held concurrently with ACL 2020. This is joint work with Jing Yi Xie and Frank Rudzicz.

Problem: Can an algorithm learn the shapes of phonemic tones in a tonal language, given a list of spoken words?

Answer: We train a convolutional autoencoder to learn a representation for each contour, then use the mean shift algorithm to find clusters in the latent space.


By feeding the center of each cluster into the decoder, we produce a prototypical contour that represents each cluster. Here are the results for Mandarin and Cantonese.
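For readers unfamiliar with mean shift, here is a minimal flat-kernel version in plain Python. It is a toy sketch, not the code from the paper: the 2D points stand in for latent vectors that would come from the encoder, the `bandwidth` value is arbitrary, and the decoder step is not shown.

```python
import math


def mean_shift(points, bandwidth, n_iter=50):
    """Flat-kernel mean shift.

    Each candidate mode starts at a data point and is repeatedly moved
    to the mean of all data points within `bandwidth` of it. Modes that
    end up close together are merged into one cluster center.
    """
    modes = [list(p) for p in points]
    for _ in range(n_iter):
        shifted = []
        for m in modes:
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            shifted.append([sum(coord) / len(near) for coord in zip(*near)])
        modes = shifted

    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(k)
                break
        else:  # no existing center is close enough: start a new cluster
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Toy "latent space" with two obvious groups.
latent = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = mean_shift(latent, bandwidth=1.0)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A useful property of mean shift here is that the number of clusters is not fixed in advance; it falls out of the bandwidth and the data.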


We evaluate on mutual information with the ground truth tones, and the method is partially successful, but contextual effects and allophonic variation present considerable difficulties.
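To make the evaluation metric concrete, here is a small sketch that computes mutual information (in bits) between cluster assignments and ground truth tone labels. The function name and the toy labels are made up for illustration, and the paper may use a normalized variant, so treat this as the basic definition only.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information in bits between two equal-length label lists."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Clusters that line up perfectly with two tones: 1 bit of information.
perfect = mutual_information([0, 0, 1, 1], [1, 1, 4, 4])

# Clusters that are unrelated to the tones: 0 bits.
unrelated = mutual_information([0, 1, 0, 1], [1, 1, 4, 4])

print(perfect, unrelated)  # 1.0 0.0
```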

For the full details, read my paper here!

I didn’t break the bed, the bed broke: Exploring semantic roles with VerbNet / FrameNet

Some time ago, my bed fell apart, and I entered into a dispute with my landlord. “You broke the bed,” he insisted, “so you will have to pay for a new one.”

Being a poor grad student, I wasn’t about to let him have his way. “No, I didn’t break the bed,” I replied. “The bed broke.”


Above: My sad and broken bed. Did it break, or did I break it?

What am I implying here? It’s interesting how this argument relies on a crucial semantic difference between the two sentences:

  1. I broke the bed
  2. The bed broke

The difference is that (1) means I caused the bed to break (eg: by jumping on it), whereas (2) means the bed broke by itself (eg: through normal wear and tear).

This is intuitive to a native speaker, but it may not be obvious why. One might guess from this example that any intransitive verb when used transitively (“X VERBed Y”) always means “X caused Y to VERB”. But this is not the case: consider the following pair of sentences:

  1. I attacked the bear
  2. The bear attacked

Even though the syntax is identical to the previous example, the semantic structure is quite different. Unlike in the bed example, sentence (1) cannot possibly mean “I caused the bear to attack”. In (1), the bear is the one being attacked, while in (2), the bear is the one attacking something.

Above: Semantic roles for verbs “break” and “attack”.

Sentences which are very similar syntactically can have different structures semantically. To address this, linguists assign semantic roles to the arguments of verbs. There are many semantic roles (and nobody agrees on a precise list of them), but two of the most fundamental ones are Agent and Patient.

  • Agent: entity that intentionally performs an action.
  • Patient: entity that changes state as a result of an action.
  • Many more.

The way that a verb’s syntactic arguments (eg: Subject and Object) line up with its semantic arguments (eg: Agent and Patient) is called the verb’s argument structure. Note that an agent is not simply the subject of a verb: for example, in “the bed broke”, the bed is syntactically a subject but is semantically a patient, not an agent.
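One way to picture argument structure is as a lookup table from syntactic positions to semantic roles. The sketch below is hand-written for the two verbs in this post; the table `ARGUMENT_STRUCTURES` and the function `semantic_roles` are hypothetical, the role labels are simplified to Agent and Patient, and none of this is taken from an actual lexical resource.

```python
# Hand-written toy table: (verb, frame) -> {syntactic position: semantic role}
ARGUMENT_STRUCTURES = {
    ("break", "transitive"): {"subject": "Agent", "object": "Patient"},
    ("break", "intransitive"): {"subject": "Patient"},
    ("attack", "transitive"): {"subject": "Agent", "object": "Patient"},
    ("attack", "intransitive"): {"subject": "Agent"},
}


def semantic_roles(verb, subject, obj=None):
    """Map the subject (and optional object) of a verb to semantic roles."""
    frame = "transitive" if obj is not None else "intransitive"
    mapping = ARGUMENT_STRUCTURES[(verb, frame)]
    roles = {mapping["subject"]: subject}
    if obj is not None:
        roles[mapping["object"]] = obj
    return roles


print(semantic_roles("break", "I", "the bed"))  # {'Agent': 'I', 'Patient': 'the bed'}
print(semantic_roles("break", "the bed"))       # {'Patient': 'the bed'}
print(semantic_roles("attack", "the bear"))     # {'Agent': 'the bear'}
```

The two intransitive entries are where the verbs differ: the subject of intransitive “break” is a Patient, while the subject of intransitive “attack” is an Agent.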

Computational linguists have created several corpora to make this information accessible to computers. Two of these corpora are VerbNet and FrameNet. Let’s see how a computer would be able to understand “I didn’t break the bed; the bed broke” using these corpora.


Above: Excerpt from VerbNet entry for the verb “break”.

VerbNet is a database of verbs, containing syntactic patterns where the verb can be used. Each entry contains a mapping from syntactic positions to semantic roles, and restrictions on the arguments. The first entry for “break” has the transitive form: “Tony broke the window“.

Looking at the semantics, you can conclude that: (1) the agent “Tony” must have caused the breaking event, (2) something must have made contact with the window during this event, (3) the window must have its material integrity degraded as a result, and (4) the window must be a physical object. In the intransitive usage, the semantics is simpler: there is no agent that caused the event, and no instrument that made contact during the event.

The word “break” can take arguments in other ways, not just transitive and intransitive. VerbNet lists 10 different patterns for this word, such as “Tony broke the piggy bank open with a hammer“. This sentence contains a result (open), and also an instrument (a hammer). The entry for “break” also groups together a list of words like “fracture”, “rip”, “shatter”, etc, that have similar semantic patterns as “break”.


Above: Excerpt from FrameNet entry for the verb “break”.

FrameNet is a similar database, but based on frame semantics. The idea is that in order to define a concept, you have to define it in terms of other concepts, and it’s hard to avoid a cycle in the definition graph. Instead, it’s sometimes easier to define a whole semantic frame at once, which describes a conceptual situation with many different participants. The frame then defines each participant by what role they play in the situation.

The word “break” is contained in the frame called “render nonfunctional“. In this frame, an agent affects an artifact so that it’s no longer capable of performing its function. The core (semantically obligatory) arguments are the agent and the artifact. There are a bunch of optional non-core arguments, like the manner that the event happened, the reason that the agent broke the artifact, the time and place it happened, and so on. FrameNet tries to make explicit all of the common-sense world knowledge that you need to understand the meaning of an event.

Compared to VerbNet, FrameNet is less concerned with the syntax of verbs: for instance, it does not mention that “break” can be used intransitively. Also, it has more fine-grained categories of semantic roles, and contains a description in English (rather than VerbNet’s predicate logic) of how each semantic argument participates in the frame.

An open question is: how can computers use VerbNet and FrameNet to understand language? Nowadays, deep learning has come to dominate NLP research, so that VerbNet and FrameNet are often seen as relics of a past era, when people still used rule-based systems to do NLP. It turned out to be hard to use VerbNet and FrameNet to make computers do useful tasks.

But recently, the NLP community has been realizing that deep learning has limitations when it comes to common-sense reasoning, and that you can’t solve them just by adding more layers to BERT and feeding it more data. So maybe deep learning systems can benefit from these lexical semantic resources.

Edge probing BERT and other language models

Recently, there has been a lot of research in the field of “BERTology”: investigating what aspects of language BERT does and doesn’t understand, using probing techniques. In this post, I will describe the “edge probing” technique developed by Tenney et al., in two papers titled “What do you learn from context? Probing for sentence structure in contextualized word representations” and “BERT Rediscovers the Classical NLP Pipeline“. On my first read through these papers, the method didn’t seem very complicated, and the authors only spend a paragraph explaining how it works. But upon closer look, the method is actually nontrivial, and took me some time to understand. The details are there, but hidden in an appendix in the first of the two papers.

The setup for edge probing is you have some classification task that takes a sentence, and a span of consecutive tokens within the sentence, and produces an N-way classification. This can be generalized to two spans in the sentence, but we’ll focus on the single-span case. The tasks cover a range of syntactic and semantic functions, ranging from part-of-speech tagging to dependency parsing to coreference resolution.


Above: Examples of tasks where edge probing may be used. Table taken from Tenney (2019a).

Let’s go through the steps of how edge probing works. Suppose we want to probe for which parts of BERT-BASE contain the most information about named entity recognition (NER). In this NER setup, we’re given the named entity spans and only need to classify which type of entity each one is (e.g., person, organization, location). The first step is to feed the sentence through BERT, which has 12 layers, each producing a 768-dimensional vector for every token.


Above: Edge probing example for NER.

The probing model has several stages:

  1. Mix: learn a task-specific linear combination of the layers, giving a single 768-dimensional vector for each span token. The weights that are learned indicate how much useful information for the task is contained in each layer.
  2. Projection: learn a linear mapping from 768 down to 256 dimensions.
  3. Self-attention pooling: learn a function to generate a scalar weight for each span vector. Then, we normalize the weights to sum up to 1, and take a weighted average. The purpose of this is to collapse the variable-length sequence of span vectors into a single fixed-length 256-dimensional vector.
  4. Feedforward NN: learn a multi-layer perceptron classifier with 2 hidden layers.
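The first three stages can be sketched in plain Python to make the shapes explicit. This is a toy forward pass with random, untrained weights and tiny dimensions (8 and 4 in place of 768 and 256); using a softmax to normalize the layer weights and the pooling weights is my assumption, and the final MLP classifier is left out.

```python
import math
import random


def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_logits):
    """Stage 1: weighted combination of one token's per-layer vectors."""
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
            for i in range(dim)]


def project(vector, matrix):
    """Stage 2: linear map; `matrix` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_pool(span_vectors, scorer):
    """Stage 3: score each span token, normalize, take weighted average."""
    scores = [sum(s * x for s, x in zip(scorer, vec)) for vec in span_vectors]
    weights = softmax(scores)
    dim = len(span_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, span_vectors))
            for i in range(dim)]


random.seed(0)
N_LAYERS, HIDDEN, PROJ, SPAN_LEN = 12, 8, 4, 3  # toy sizes


def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]


# span_layers[t][l] is the frozen model's vector for span token t at layer l.
span_layers = [[rand_vec(HIDDEN) for _ in range(N_LAYERS)] for _ in range(SPAN_LEN)]

# Probe parameters (these would be learned; here they are just initialized).
layer_logits = [0.0] * N_LAYERS          # uniform mix to start with
proj_matrix = [rand_vec(HIDDEN) for _ in range(PROJ)]
scorer = rand_vec(PROJ)

mixed = [mix_layers(token, layer_logits) for token in span_layers]
projected = [project(vec, proj_matrix) for vec in mixed]
span_repr = attention_pool(projected, scorer)

print(len(span_repr))  # 4: one fixed-length vector, whatever the span length
```

After training, inspecting `softmax(layer_logits)` is what would tell you which layers the probe found most useful for the task.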

For the two-span case, they use two separate sets of weights for the mix, projection, and self-attention steps. Then, the feedforward neural network takes a concatenated 512-dimensional vector instead of a 256-dimensional vector.

The edge probing model needs to be trained on a dataset of classification instances. The probe has weights that are initialized randomly and trained using gradient descent, but the BERT weights are kept constant while the probe is being trained.

This setup was more sophisticated than it looked at first glance, and it turns out the probe itself is quite powerful. In one of the experiments, the authors found that even with randomized input, the probe was able to get over 70% of the full model’s performance!

Edge probing is not the only way of probing deep language models. Other probing experiments (Liu et al., 2019) used a simple linear classifier: this is better for measuring what information can easily be recovered from representations.


  1. Tenney, Ian, et al. “What do you learn from context? probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019a.
  2. Tenney, Ian, Dipanjan Das, and Ellie Pavlick. “BERT Rediscovers the Classical NLP Pipeline.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019b.
  3. Liu, Nelson F., et al. “Linguistic Knowledge and Transferability of Contextual Representations.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

What if the government hadn’t released any Coronavirus economic stimulus?

It is March 23, 2020. After a month of testing fiascos, COVID-19 is ravaging the United States, with 40,000 cases in the US and 360,000 worldwide. There is a growing sense of panic as cities begin to lock down. Market circuit breakers have triggered four times in quick succession, with the stock market losing 30% of its value in mere weeks. There’s no sign that the worst is over.

Above: Global Coronavirus stats on March 23, 2020, when the S&P 500 reached its lowest point during the pandemic (Source).

With businesses across the country closed and millions out of work, it’s clear that a massive financial stimulus is needed to prevent a total economic collapse. However, Congress is divided and is unable to pass the bill. Even when urgent action is needed, they squabble and debate over minor details, refusing to come to a compromise. The president denies that there’s any need for action. Both the Democrats and the Republicans are willing to do anything to prevent the other side from scoring a victory. The government is in gridlock.

Let the businesses fail, they say. Don’t bail them out, they took the risk when the times were good, now you reap what you sow. Let them go bankrupt, punish the executives taking millions of dollars of bonuses. Let the free market do its job, after all, they can always start new businesses once this is all over.

April comes without any help from the government. Massive layoffs sweep across all sectors of the economy as companies see their revenues drop to a fraction of normal levels and lay off employees to try to preserve their cash. Retail and travel are the most heavily affected sectors, but soon all companies are affected, since people are hesitant to spend money. Unemployment numbers skyrocket to levels even greater than during the Great Depression.

Without a job, millions of people miss their rent payments, instead saving their money for food and essential items. Restaurants and other small businesses shut down. When people and businesses cannot pay rent, their landlords cannot pay the mortgages that they owe to the bank. A few small banks go bankrupt, and Wall Street waits anxiously for a government bailout. But unlike 2008, the state is in a deadlock, and there is no bailout coming. In 2020, no bank is too big to fail.

Each bank that goes down takes another bank down with it, and soon there is a cascading domino effect of bank failures. Everyone rushes to withdraw cash from their checking accounts before the bank collapses, which of course makes matters worse. Those too late to withdraw their cash lose their savings. This is devastating for businesses: even those that escaped the pandemic cannot escape systemic bank failure. Companies have no money in the bank to pay suppliers or make payroll, and thus, thousands of companies go bankrupt overnight.

Across the nation, people are angry at the government’s inaction, and take to the streets in protest. Having depleted their savings, some rob and steal from grocery stores to avoid starvation. The government finally steps in and deploys the military to keep order in the cities. They arrange for emergency supplies, enough to keep everybody fed, but just barely.

The lockdown lasts a few more months, and the virus is finally under control. Everyone is free to go back to work, but the problem is there are no jobs to go back to. In the process of all the biggest corporations going bankrupt, society has lost its complex network of dependencies and organizational knowledge. It only takes a day to lay off 100,000 employees, but to build up this structure from scratch will take decades.

A new president is elected, but it is too late: the damage has been done and cannot be reversed. The economy slowly recovers, but with less efficiency than before and with workers employed in less productive roles. The loss of productivity means that everyone enjoys a lower standard of living. Five years later, the virus is long gone, but the economy is nowhere close to its original state. By then, China emerges as the new dominant world power. The year 2020 goes down in history as a year of failure, where through inaction, a temporary health crisis led to societal collapse.

In our present timeline, fortunately, none of the above actually happened. The Democrats and Republicans put aside their differences and on March 25, swiftly passed a $2 trillion economic stimulus. The stock market immediately rebounded.

There was a period in March when it seemed the government was in gridlock, and it wasn’t clear whether the US was politically capable of passing such a large stimulus bill. Was an economic collapse likely? Not really: no reasonable government would have allowed all of the banks to fail, so we would likely have had a recession and not a total collapse. Banks did fail during the Great Depression, but macroeconomic theory was in its infancy at that time, and there’s no way such mistakes would’ve been repeated today. Still, this is the closest we’ve come to an economic collapse in a long time, and it’s fun to speculate about what it would have been like.

Three books about AI / ML / Stats

In this edition of my book review blog post series, I will summarize three books I recently read about artificial intelligence. AI is a hot topic nowadays, and people inside and outside the field have very different perspectives. Even as an AI researcher, I found a lot to learn from these books.

Emergence of the statistics discipline

Machine learning and statistics are closely related areas: ML can be viewed as statistics but with computers. Thus, to understand machine learning, it’s natural to start from the beginning and study the history of statistics.

The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century by David Salsburg

This book tells the story of how statistics emerged as a scientific discipline in the 20th century. The title of the book comes from a story where Fisher wanted to see if a lady could tell whether the tea or the milk was added to the cup first; he came up with a series of randomized experiments that motivate modern hypothesis testing. The book describes the lives and circumstances of the people involved, explaining the math pretty well using words, without getting too technical with equations. Some of the founding fathers of statistics:

  • Karl Pearson (1857-1936) was the founder of mathematical statistics, devised methods of estimating statistical parameters from data, founded the journal Biometrika, and applied these methods to confirm Darwin’s theory of natural selection. He had a dominating personality, and his son Egon Pearson also became a famous statistician.

  • William Gosset (1876-1937) discovered the Student t-distribution while working for Guinness, improving methods to brew beer. He had to publish under the pseudonym “Student” because Guinness wouldn’t let their employees publish.

  • Ronald Fisher (1890-1962) was a genius who invented a lot of modern statistics including MLE for estimating parameters, ANOVA, and experimental design. He originally used these methods to study the effects of fertilizers on crop variation, and eventually became a distinguished professor. Fisher did not get along with Pearson, and also dismissed evidence that smoking caused cancer long after it was accepted by the scientific community.

  • Jerzy Neyman (1894-1981) invented the standard textbook formulation of hypothesis testing against a null hypothesis, and introduced the concept of the confidence interval. Fisher and many others were skeptical, since it’s unclear how to interpret a p-value, or what the 95% probability in a 95% confidence interval actually refers to.

Statistics is now a crucial part of experiments across many scientific disciplines. Undoubtedly, statistics changed the way we do science, and this book tells the story of how it happened. I liked the first part of the book more, since it talks about the most influential figures in early statistics. By the latter half of the book, statistics had already diversified into numerous sub-disciplines, and the book jumps rapidly between a plethora of scientists.

Causal reasoning: a limitation of ML

The next book is by Judea Pearl, one of the leaders of causal inference, who received the 2011 Turing Award for his work on Bayesian networks. Pearl points out a flaw that affects all machine learning models, from the simplest linear regression to the deepest neural networks: it’s impossible to tell the difference between causation and correlation using data alone. Every morning I hear the birds chirp before sunrise, so do the birds cause the sun to rise? Obviously not, but for a machine, this is surprisingly difficult to deduce.

The Book of Why: The New Science of Cause and Effect by Judea Pearl

Pearl gives three levels of causation, where each level can’t be built up from tools of the lower levels:

  • Level 1 — Association: this is where most machine learning and statistics methods stand today. They can find correlations but can’t differentiate them from causation.

  • Level 2 — Intervention: using causal diagrams and do-notation, you can tell whether X causes Y. The first step is to use this machinery to determine if a causal relation is possible from the data, then apply level 1 methods to compute the strength of the causality.

  • Level 3 — Counterfactuals: given that you did X and Y happened, determine what would have happened if you did X’ instead.
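The gap between level 1 and level 2 can be seen in a small simulation. This is my own toy model, not an example from the book: a hidden variable `z` drives both `x` and `y`, and `x` has no effect on `y` at all. In observational data the two are strongly correlated, but once we intervene and set `x` ourselves, the correlation disappears.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
N = 10_000

# Hidden common cause (think: dawn approaching).
z = [rng.gauss(0, 1) for _ in range(N)]

# Observation: x (birds chirping) and y (sunrise) both depend on z,
# but x has no influence on y.
x_obs = [zi + rng.gauss(0, 1) for zi in z]
y_obs = [zi + rng.gauss(0, 1) for zi in z]

# Intervention: we set x ourselves, ignoring z. y is generated as before.
x_do = [rng.gauss(0, 1) for _ in range(N)]
y_do = [zi + rng.gauss(0, 1) for zi in z]

r_obs = correlation(x_obs, y_obs)
r_do = correlation(x_do, y_do)
print(f"observed: {r_obs:.2f}, under intervention: {r_do:.2f}")
```

With this setup the observed correlation should come out near 0.5 and the interventional one near 0. A level 1 method that only sees `x_obs` and `y_obs` has no way of telling this apart from a world where `x` really does cause `y`.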

The most reliable way to determine causality is through a randomized trial, but often this is impractical due to cost or ethics, and we only have observational data. A lot of scientists just control for as many variables as possible, but there are situations where this strategy is flawed. Using causal diagrams, the book explains more sophisticated techniques to determine causality, and a quick algorithm to decide whether a variable should be controlled for or not.

Causal inference is an active area of machine learning research, although an area that’s often ignored by mainstream ML. Judea Pearl thinks that figuring out a better representation of causation is a key missing ingredient for strong AI.

AI in the far future

When will we have superhuman artificial general intelligence (AGI)? Well, it depends on who you ask. The media often portrays AGI as on the verge of being achieved in just a few years, but AI researchers predict it to be out of reach for several decades or even centuries.

Superintelligence: Paths, Dangers, Strategies by Nick Bostrom

Bostrom believes strong AGI has a serious possibility of being achieved in the near future, say, by 2050. And once this happens, AI is an existential threat to humanity. Once AI has initially exceeded human ability, it will rapidly improve itself or use its programming skills to develop even stronger AI, and humans will be left in the dust. It will be very difficult to keep a superintelligent AI boxed, since it can develop advanced technology and there are many ways that it can escape from its sandbox.

Depending on how the AI is programmed, it may have very different values from humans. In many of Bostrom’s hypothetical scenarios, an AI designed for some narrow task (eg: producing paperclips) decides to take over the world and unleash an army of self-replicating nanobots to turn every atom in the universe into paperclips. There are a lot of unsolved questions about how to design an agent that single-mindedly maximizes an objective function without the risk of it taking catastrophic unintended actions along the way.

For now, there is no imminent possibility of AGI, so it’s unclear to what extent specifying the value function will actually be a problem. There are much more immediate dangers of AI technology, for example, unfair bias against certain groups, and economic consequences of automation taking over jobs. Andrew Ng famously said: “fearing a rise of killer robots is like worrying about overpopulation on Mars”. Nevertheless, Bostrom makes a valid point: the risks of superhuman AI to humanity are so great that they are worth taking seriously and investing in further research.

Why do polysynthetic languages all have very few speakers?

Polysynthetic languages are able to express complex ideas in one word that in most languages would require a whole sentence. For example, in Inuktitut:


“I’ll have to go to the airport”

There’s no widely accepted definition of a polysynthetic language. Generally, polysynthetic languages have noun incorporation (where noun arguments are expressed as affixes of a verb) and serial verb construction (where a single word contains multiple verbs). They are considered some of the most grammatically complex languages in the world.

Polysynthetic languages are most commonly found among the indigenous languages of North America. Only a few such languages have more than 100k speakers: Nahuatl (1.5m speakers), Navajo (170k speakers), and Cree (110k speakers). Most polysynthetic languages are spoken by a very small number of people and many are in danger of becoming extinct.

Why aren’t there more polysynthetic languages that are major national languages with millions of speakers? Is it mere coincidence that the most complex languages have few speakers? According to Wray (2007), it’s not just coincidence; rather, languages spoken within a small, close-knit community with little outside contact tend to develop grammatical complexity. Languages with lots of external contact and adult learners tend to be more simplified and regular.

It’s well known that children are better language learners than adults. L1 and L2 language acquisition processes work very differently, so that children and adults have different needs when learning a language. Adult learners prefer regularity and expressions that can be decomposed into smaller parts. Anyone who has studied a foreign language has seen tables of verb conjugations like these:



For adult learners, the ideal language is predictable and has few exceptions. The number 12 is pronounced “ten-two“, not “twelve“. A doctor who treats your teeth is a “tooth-doctor“, rather than a “dentist“. Exceptions give the adult learner difficulties since they have to be individually memorized. An example of a very predictable language is the constructed language Esperanto, designed to have as few exceptions as possible and be easy to learn for native speakers of any European language.

Children learn languages differently. At the age of 12 months (the holophrastic stage), children start producing single words that can represent complex ideas. Even though they are multiple words in the adult language, the child initially treats them as a single unit:

whasat (what’s that)

gimme (give me)

Once they reach 18-24 months of age, children pick up morphology and start using multiple words at a time. Children learn whole phrases first, then only later learn to analyze them into parts on an as-needed basis, thus they have no difficulty with opaque idioms and irregular forms. They don’t really benefit from regularity either: when children learn Esperanto as a native language, they introduce irregularities, even when the language is perfectly regular.

We see evidence of this process in English. Native speakers frequently make mistakes like using “could of” instead of “could’ve“, or using “your” instead of “you’re“. This is evidence that native English speakers think of them as a single unit, and don’t naturally analyze them into their sub-components: “could+have” and “you+are“.

According to the theory, a language spoken in an isolated community, where few adults try to learn it, ends up with complex and irregular words. When lots of grown-ups try to learn a language, they struggle with the grammatical complexity and simplify it. Over time, these simplifications become a standard part of the language.

Among the world’s languages, various studies have found correlations between grammatical complexity and smaller population size, supporting this theory. However, the theory is not without its problems. As with any observational study, correlation doesn’t imply causation. The European conquest of the Americas decimated the native population, and consequently, speakers of indigenous languages have declined drastically in the last few centuries. Framing it this way, the answer to “why aren’t there more polysynthetic languages with millions of speakers” is simply: “they all died of smallpox or got culturally assimilated”.

If instead, Native Americans had sailed across the ocean and colonized Europe, would more of us be speaking polysynthetic languages now? Until we can go back in time and rewrite history, we’ll never know the answer for sure.

Further reading

  • Atkinson, Mark David. “Sociocultural determination of linguistic complexity.” (2016). PhD Thesis. Chapter 1 provides a good overview of how languages are affected by social structure.
  • Kelly, Barbara, et al. “The acquisition of polysynthetic languages.” Language and Linguistics Compass 8.2 (2014): 51-64.
  • Trudgill, Peter. Sociolinguistic typology: Social determinants of linguistic complexity. Oxford University Press, 2011.
  • Wray, Alison, and George W. Grace. “The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form.” Lingua 117.3 (2007): 543-578. This paper proposes the theory and explains it in detail.

Predictions for 2030

Now that it’s Jan 1, 2020, I’m going to make some predictions about what we will see in the next decade. By the year 2030:

  • Deep learning will be a standard tool and integrated into workflows of many professions, eg: code completion for programmers, note taking during meetings. Speech recognition will surpass human accuracy. Machine translation will still be inferior to human professionals.

  • Open-domain conversational dialogue (aka the Turing Test) will be on par with an average human, using a combination of deep learning and some new technique not available today. It will be regarded as more of a “trick” than strong AI; the bar for true AGI will be shifted higher.

  • Driverless cars will be in commercial use in a few limited scenarios. Most cars will have some autonomous features, but full autonomy will still not be widely deployed.

  • The S&P 500 index (a measure of the US economy, currently at 3230) will double to between 6000 and 7000. Bitcoin will still exist but its price will fall under 1000 USD (currently ~7000 USD).

  • Real estate prices in Toronto will either have a sharp fall or flatten out; overall increase in 2020-2030 period will not exceed inflation.

  • All western nations will have implemented some kind of carbon tax as political pressure increases from young people; no serious politician will suggest removing carbon tax.

  • By the age of 35, about half of my Waterloo cohort will be married, but the majority will not have any kids.

  • China will overtake USA as world’s biggest economy, but growth will slow down, and PPP per capita will still be well below USA.

Five books to understand controversial political issues

Climate change, housing crisis, China, and recycling. Here are four highly complex and controversial topics that come up again and again in the news, in election debates, and in discussions with friends and family.

Despite all the media coverage, it’s difficult to get a balanced view. Individual news articles tend to present a simplistic, one-sided stance on a complex problem. Furthermore, these issues are politically polarizing so it’s easy to find yourself in a filter bubble and only see one side of the story.

This year, I resolved to read books to get a well-rounded understanding of some of the world’s most pressing and controversial issues. A book can go into much greater depth than an article, and should ideally be well-researched and present both sides of an argument.

Issue 1: Climate Change

The Climate Casino by William Nordhaus

This book, by a Nobel Prize-winning economist, talks about climate change from an economic perspective. Climate change typically evokes extreme responses: conservatives deny it altogether, while environmentalists warn of doomsday scenarios. The reality is somewhere in the middle: if we don’t do anything about climate change, average surface temperature is projected to rise by 3.5C by the end of the century, which would cost 1-4% of world GDP. It’s definitely a serious concern, but probably won’t cause the collapse of civilization either.

There have been thousands of papers at the IPCC studying various aspects of climate change, but there are still large uncertainties in all the projections. There is a small chance that we might cross a “tipping point” like the melting of polar ice caps, where the system changes in a catastrophic and irreversible way after a certain temperature point. It’s poorly understood exactly what temperature triggers a tipping point, thus adding even more uncertainty to models. Every time we emit CO2, we are gambling in the climate casino.

The three approaches to dealing with climate change are mitigation (emit less CO2), adapting to the effects of climate change, and geoengineering. We can reduce emissions quite a lot with only modest cost, but costs go up exponentially as you try to cut more and more emissions. It’s crucial that all countries participate: climate targets become impossible if only half of countries participate.

Economists agree that carbon tax is a simple and effective way of reducing carbon emissions across the board, and is more effective than direct government regulation. A carbon tax sends a signal to the market and significantly discourages high-carbon technologies like coal, while only increasing gas and electricity costs by a modest amount.

Currently, climate change is a partisan issue: opinions on climate change are highly correlated with political views. The scientific evidence is overwhelming, and public opinion will gradually change.

Issue 2: Rise of China

China is an emerging superpower, projected to overtake the US as the world’s biggest economy in the next decade. There’s been a lot of tension between the two countries recently, as seen in the trade wars and the Hong Kong protests. I picked up two books on this topic: one with a Chinese perspective, and another with a western perspective.

China Emerging by Wu Xiaobo

I got this book in Shenzhen, where it was one of the few English books about China in the bookstore. It describes the history of China from 1978 to today. In 1978, China was very poor, having experienced famines and the Cultural Revolution under Mao Zedong’s rule. The early 1980s were a turning point for China, when Deng Xiaoping started to open up the country to foreign investment and capitalism. He started by setting up “Special Economic Zones” in coastal cities like Shenzhen and Xiamen where he experimented with capitalism, with great success.

The 1980s and 1990s saw a gradual shift from communism to capitalism, where the state relinquished control to entrepreneurs and investors. By the 2000s, China was a manufacturing giant and everything was “made in China”. During the mid 2000s, there was a boom in massive construction projects like high speed rail and hundreds of skyscrapers. Development at such speed and scale is historically unprecedented — the book describes it as “a big ship sailing towards the future”.

This book helped me understand the Chinese mindset, although it is quite one-sided and at times reads like state propaganda. Of course, being published in China, it leaves out sensitive topics like the Tiananmen massacre and internet censorship.

China’s Economy: What Everyone Needs to Know by Arthur R. Kroeber

This book describes all aspects of China’s economy, written by a western author, and presents a quite different view of the situation. Since the economic reforms of 1978, life has gotten tremendously better, with the population living in extreme poverty falling from 90% in 1978 to less than 1% now. However, the growth has been uneven, and there is a high level of inequality.

One example of inequality is the hukou system. Industrialization brought many people into the cities, and now about 2/3 of the population is urban, but there is a lot of inequality between migrant workers and those with urban hukou. Only people with hukou have access to social services and healthcare. However, you can’t just give everyone hukou because the top-tier cities don’t have the infrastructure to support so many migrants.

Another example of inequality is in real estate. Around 2003, the government sold urban property at very low rates, essentially a large transfer of wealth. At the same time, local governments forcefully bought rural land at below market rates, exacerbating the urban-rural inequality. Chinese people like to invest in real estate because they believe it will always go up (as it had for the last 20 years).

In the early stages of economic reform, the priority was to mobilize the country’s enormous workforce, so some inefficiency and corruption were tolerated. China focused on labor-intensive light industry, like clothing and consumer appliances, rather than capital-intensive heavy industry. Recently, as wages have risen, cheap labor has become less abundant, so further growth requires greater economic efficiency. However, China struggles to produce high-quality, more advanced technology like cars, aircraft, and electronics (lots of phones are made in China, but only the final assembly stage), and mostly produces cheap items of medium quality.

China will be the world’s largest economy in the next decade, although this doesn’t mean much after accounting for population size. Despite the size of its economy, it has limited political influence, and has no strong allied countries, even in East Asia. It also struggles to become a technological leader: most of its tech companies only have a domestic market, and don’t gain traction outside the country. It’s clear that China has vastly different values from Western countries: they don’t value elections and democracy; rather, the government is considered good as long as it keeps the economy running. These ideologies will need to learn to coexist in the future.

Issue 3: Canadian Housing Bubble

When the Bubble Bursts by Hilliard MacBeth

Real estate prices and rents have risen a lot in the last 10 years in many Canadian cities (most notably Toronto and Vancouver), so that housing is now unaffordable for many people. Hilliard MacBeth has the controversial opinion that Canada is in a housing bubble that is about to burst, with corrections on the order of 40-50%. Many people blame immigration and foreign investment, but he argues that the rise in real estate prices is due to low interest rates and the willingness of banks to make large mortgage loans.

Many people assume that house prices will always go up. This has been the case for the last 20 years, but there are clear counterexamples, like the USA in 2008 and Japan in the 1990s. The Case-Shiller index shows that over a 100-year period, real estate values in the USA have approximately matched inflation. We’re likely overfitting to the most recent data: we develop heuristics from the last few decades of growth and extrapolate them indefinitely into the future.

A few years ago, I asked a question on Stack Exchange about why we should expect stocks to go up in the long term, and there is strong reason to believe the annual growth of 5-7% will continue indefinitely. For real estate, there’s no good reason why it should increase over the long term, so it’s more like speculation than investment. Instead of an investment, it’s better to think of real estate as paying for a lot of future rent upfront. For young people, taking on debt to get a mortgage is risky, especially in this market.

The author is extremely pessimistic about the future of Canadian real estate, which I don’t think is justified. Nevertheless, it’s good to question some common assumptions about real estate investing. In particular, the tradeoff between renting and ownership depends on a lot of factors, and we shouldn’t jump to the conclusion that ownership is better.

Issue 4: Recycling

Junkyard Planet by Adam Minter

The recycling industry is in chaos as China announced this year that it is no longer importing recyclable waste from western countries. Some media investigations have found that recycling is just a show, and much of it ends up in landfills or being burnt. Wait, what is going on exactly?

This book is written by a journalist who is also the son of a scrapyard owner. In western countries, we normally think of recycling as an environmentalist act, but in reality it’s more accurate to think of it as harvesting valuable materials out of what would otherwise be trash. Metals like copper, steel, and aluminum are harvested from all kinds of things like Christmas lights, cables, cars, etc. It’s a lot cheaper and takes less energy to harvest metal from used items than to mine it from ore: copper is worth a few thousand dollars a ton, and it can take 100 tons of ore to produce one ton of metal.

There’s a large international trade in scrap metal. America is the biggest producer of scrap, and it gets sent to China because labor there is cheap and because China needs a lot of metal to build its developing infrastructure. The scrap then goes into secondary markets, where all kinds of scrap are graded by quality and sold to the highest bidder, with little concern for the environment. The free market with economic incentives is much more efficient at recycling than citizens with good intentions.

Metals are highly recyclable, whereas plastic is almost impossible to recycle profitably because its value per ton is so low compared to metals. For some time, plastic recycling was done in Wen’an, but it was only possible because there were no environmental regulations and the workers didn’t wear protective equipment when handling dangerous chemicals. The government shut it down after people started getting sick. It’s more costly to recycle while complying with regulations, which is why very little of it is done in the US; the scrap is simply exported to countries with weak oversight.

Recycling usually takes place where labor is cheap. Often you have the choice between building a machine to do something and hiring humans to do it. It all comes down to price: in one case, the task of sorting different metals is done by hundreds of young women in China and by an expensive machine in America. Labor is getting more expensive in China with its rising middle class, so this is rapidly changing.

In the developed world, we have a misguided idea of how recycling works, and often we think of recycling as a “free pass” that allows us to consume as much as we want, as long as we recycle. In reality, recycling is imperfect, and the material is usually turned into a lower-grade product. In the “reduce, reuse, recycle” mantra, reducing consumption has by far the biggest impact, reusing is good, and recycling should be considered a distant third option.

Directionality of word class conversion

Many nouns (like google, brick, bike) can be used as verbs:

  • Let me google that for you.
  • The software update bricked my phone.
  • Bob biked to work yesterday.

Conversely, many verbs (like talk, call) can be used as nouns:

  • She gave a talk at the conference.
  • I’m on a call with my boss.

Here, we just assumed that {google, brick, bike} are primarily nouns and {talk, call} are primarily verbs — but is this justified? After all, all five of these words can be used as either a noun or a verb. Then, what’s the difference between the first group {google, brick, bike} and the second group {talk, call}?

These are examples of word class flexibility: words that can be used across multiple part-of-speech classes. In this blog post, I’ll describe some objective criteria to determine if a random word like “sleep” is primarily a noun or a verb.

Five criteria for deciding directionality

Linguists have studied the problem of deciding what is the base / dominant part-of-speech category (equivalently, deciding the directionality of conversion). Five methods are commonly listed in the literature: frequency of occurrence, attestation date, semantic range, semantic dependency, and semantic pattern (Balteiro, 2007; Bram, 2011).

  1. Frequency of occurrence: a word is noun-dominant if it occurs more often as a noun than a verb. This is the easiest to compute since all you need is a POS-tagged corpus. The issue is the direction now depends on which corpus you use, and there can be big differences between genres.
  2. Attestation date: a word is noun-dominant if it was used first as a noun and only later as a verb. This works for newer words: Google (the company) existed for a while before anyone started “googling” things. But we run into problems with older words, and the direction then depends on the precise dating of Middle English manuscripts. If the word is from Proto-Germanic / Proto-Indo-European then finding the attestation date becomes impossible. This method is also philosophically questionable because you shouldn’t need to know the history of a language to describe its current form.
  3. Semantic range: if a dictionary lists more noun meanings than verb meanings for a word, then it’s noun-dominant. This is not so reliable because different dictionaries disagree on how many senses to include, and how different must two senses be in order to have separate entries. Also, some meanings are rare or domain specific (eg: “call option” in finance) and it doesn’t seem right to count them equally.
  4. Semantic dependency: if the definition of the verb meaning refers to the noun meaning, then the word is noun-dominant. For example, “to bottle” means “to put something into a bottle”. This criterion is not always easy to apply: sometimes you can define the word either way, or have neither definition refer to the other.
  5. Semantic pattern: a word is noun-dominant if it refers to an entity / object, and verb-dominant if it refers to an action. A bike is something that you can touch and feel; a walk is not. Haspelmath (2012) encourages distinguishing {entity, action, property} rather than {noun, verb, adjective}. However, it’s hard to determine without subjective judgement whether the entity or the action sense is more primary, especially for abstract nouns like “test” or “work”.
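The frequency criterion is the easiest of the five to turn into code. Here is a minimal sketch, assuming the corpus is already POS-tagged and lemmatized so that “biked” counts as “bike”. The toy corpus, the tag names, and the function name are all invented for illustration, not taken from any of the cited studies.

```python
from collections import Counter

# Toy POS-tagged corpus of (lemma, tag) pairs, invented for illustration.
# A real corpus would be far larger and would need lemmatization first.
TAGGED = [
    ("bike", "NOUN"), ("bike", "NOUN"), ("bike", "VERB"),
    ("talk", "VERB"), ("talk", "VERB"), ("talk", "VERB"), ("talk", "NOUN"),
    ("sleep", "NOUN"), ("sleep", "VERB"),
]

def dominant_by_frequency(tagged_tokens, word):
    """Return 'noun', 'verb', or 'tie' by comparing tag counts for `word`.

    A word that never occurs has zero counts for both tags, so it is
    reported as a 'tie' as well.
    """
    counts = Counter(tag for lemma, tag in tagged_tokens if lemma == word)
    nouns, verbs = counts["NOUN"], counts["VERB"]
    if nouns > verbs:
        return "noun"
    if verbs > nouns:
        return "verb"
    return "tie"

for w in ["bike", "talk", "sleep"]:
    print(w, dominant_by_frequency(TAGGED, w))
```

On this toy data, “bike” comes out noun-dominant, “talk” verb-dominant, and “sleep” a tie. Swapping in a corpus from a different genre could easily flip these results, which is exactly the weakness of this criterion.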

Comparisons using corpus methods

How do we make sense of all these competing criteria? To answer this question, Balteiro (2007) compared 231 flexible noun/verb pairs and rated them all according to the five criteria I listed above, as well as a few more that I didn’t include. Later, Bram (2011) surveyed a larger set of 2048 pairs.

The details are quite messy, because applying the criteria is not so straightforward. For example, polysemy: the word “call” has more than 40 definitions in the OED, and some of them are obsolete, so which one do you use for attestation date? How do you deal with homonyms like “bank” that have two unrelated meanings? With hundreds of pages of painstaking analysis, the researchers came to a judgement for each word. Then, they measured the agreement between each pair of criteria:

[Figure: Table of pairwise agreement (adapted from Table 5.2 of Bram’s thesis)]

There is only a moderate level of agreement between the different criteria, on average about 65% — better than random, but not too convincing either. Only frequency and attestation date agree more than 80% of the time. Only a small minority of words have all of the criteria agree.
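To make “pairwise agreement” concrete, here is a small sketch of how such a table could be computed once every word has a verdict under each criterion. The verdicts below are made up for illustration and are not taken from Balteiro’s or Bram’s data; the function names are my own, and only three of the five criteria are included to keep it short.

```python
from itertools import combinations

# Hypothetical verdicts per word and criterion, invented for illustration.
JUDGEMENTS = {
    "bike":  {"frequency": "noun", "attestation": "noun", "semantic_range": "noun"},
    "talk":  {"frequency": "verb", "attestation": "verb", "semantic_range": "noun"},
    "sleep": {"frequency": "verb", "attestation": "noun", "semantic_range": "noun"},
    "call":  {"frequency": "verb", "attestation": "verb", "semantic_range": "verb"},
}

def pairwise_agreement(judgements):
    """Map each pair of criteria to the fraction of words where they agree."""
    criteria = sorted({c for verdicts in judgements.values() for c in verdicts})
    result = {}
    for a, b in combinations(criteria, 2):
        # Only compare words that have a verdict under both criteria.
        shared = [v for v in judgements.values() if a in v and b in v]
        if shared:
            result[(a, b)] = sum(v[a] == v[b] for v in shared) / len(shared)
    return result

def unanimous_fraction(judgements):
    """Fraction of words on which every available criterion agrees."""
    unanimous = sum(len(set(v.values())) == 1 for v in judgements.values())
    return unanimous / len(judgements)

for pair, score in pairwise_agreement(JUDGEMENTS).items():
    print(pair, score)
print("all criteria agree:", unanimous_fraction(JUDGEMENTS))
```

With these invented verdicts, frequency and attestation date agree on 3 of the 4 words, while frequency and semantic range agree on only 2, and just half of the words are unanimous. The real studies had to do the hard part first: deciding each individual verdict.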

Theoretical ramifications

This puts us in a dilemma: how do we make sense of these results? What’s the direction of conversion if these criteria don’t agree? Are some of the criteria better than others, or should we perhaps take a majority vote? Is it even possible to determine a direction at all?

Linguists have disagreed for decades over what to do with this situation. Van Lier and Rijkhoff (2013) give a survey of the various views. Some linguists maintain that flexible words must be either noun-dominant or verb-dominant, and are converted to the other category. Other linguists note the disagreements between criteria and propose instead that words are underspecified. Just like a stem cell that can morph into a skin or lung cell as needed, a word like “sleep” is neither a noun nor a verb, but a pre-categorical form that can morph into either a noun or a verb depending on context.

Can we really determine the dominant category of a conversion pair? It seems doubtful that this issue will ever be resolved. Presently, none of the theories make any scientific predictions that can be tested and falsified. Until then, the theories co-exist as different ways to view and analyze the same data.

The idea of a “dominant” category doesn’t exist in nature; it is merely an artificial construct to help explain the data. In mathematics, it’s nonsensical to ask if imaginary numbers really “exist”. Nobody has seen an imaginary number, but mathematicians use them because they’re good for describing a lot of things. Likewise, it doesn’t make sense to ask if flexible words really have a dominant category. We can only ask whether a theory that assumes the existence of a dominant category is simpler than a theory that does not.


  1. Balteiro, Isabel. The directionality of conversion in English: A dia-synchronic study. Vol. 59. Peter Lang, 2007.
  2. Bram, Barli. “Major total conversion in English: The question of directionality.” (2011). PhD Thesis.
  3. Haspelmath, Martin. “How to compare major word-classes across the world’s languages.” Theories of everything: In honor of Edward Keenan 17 (2012): 109-130.
  4. Van Lier, Eva, and Jan Rijkhoff. “Flexible word classes in linguistic typology and grammatical theory.” Flexible word classes: a typological study of underspecified parts-of-speech (2013): 1-30.