Three books about AI / ML / Stats

In this edition of my book review blog post series, I will summarize three books I recently read about artificial intelligence. AI is a hot topic nowadays, and people inside and outside the field have very different perspectives. Even as an AI researcher, I found a lot to learn from these books.

Emergence of the statistics discipline

Machine learning and statistics are closely related areas: ML can be viewed as statistics but with computers. Thus, to understand machine learning, it’s natural to start from the beginning and study the history of statistics.

The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century by David Salsburg

This book tells the story of how statistics emerged as a scientific discipline in the 20th century. The title comes from a story in which Fisher wanted to test whether a lady could tell if tea or milk was added to the cup first; he came up with a series of randomized experiments that motivated modern hypothesis testing (a small worked example follows the list below). The book describes the lives and circumstances of the people involved, explaining the math quite well in words, without getting too technical with equations. Some of the founding fathers of statistics:

  • Karl Pearson (1857-1936) was the founder of mathematical statistics, devised methods of estimating statistical parameters from data, founded the journal Biometrika, and applied these methods to confirm Darwin’s theory of natural selection. He had a dominating personality, and his son Egon Pearson also became a famous statistician.

  • William Gosset (1876-1937) discovered the Student t-distribution while working for Guinness, improving methods to brew beer. He had to publish under the pseudonym “Student” because Guinness wouldn’t let their employees publish.

  • Ronald Fisher (1890-1962) was a genius who invented much of modern statistics, including maximum likelihood estimation (MLE), ANOVA, and experimental design. He originally used these methods to study the effects of fertilizers on crop variation, and eventually became a distinguished professor. Fisher did not get along with Pearson, and he dismissed evidence that smoking caused cancer long after it was accepted by the scientific community.

  • Jerzy Neyman (1894-1981) invented the standard textbook formulation of hypothesis testing against a null hypothesis, and introduced the concept of the confidence interval. Fisher and many others were skeptical, since the interpretation of the p-value, and of the "95% probability" attached to a 95% confidence interval, is subtle and easy to get wrong.
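
To make the hypothesis-testing idea concrete, here is a minimal worked version of the tea-tasting calculation in Python. The book tells the story in words; the numbers below follow the standard textbook account of Fisher's design (eight cups, four of each preparation), assumed here for illustration:

```python
from math import comb

# Standard textbook version of Fisher's design (assumed here): 8 cups,
# 4 with milk poured first and 4 with tea poured first; the lady must
# identify the 4 milk-first cups.
arrangements = comb(8, 4)         # 70 equally likely ways to choose 4 cups
p_all_correct = 1 / arrangements  # chance of a perfect score by pure guessing

print(f"P(all correct by guessing) = 1/{arrangements} = {p_all_correct:.4f}")
# about 0.014: small enough that a perfect score is strong evidence against
# the null hypothesis that she is merely guessing.
```

This tiny calculation is essentially the p-value whose interpretation Fisher and Neyman argued about: the probability of a result at least this extreme, assuming the null hypothesis is true.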

Statistics is now a crucial part of experiments across many scientific disciplines. Undoubtedly, statistics changed the way we do science, and this book tells the story of how that happened. I liked the first part of the book more, since it covers the most influential figures in early statistics. By the latter half of the book, statistics had already diversified into numerous sub-disciplines, and the book jumps rapidly between a plethora of scientists.

Causal reasoning: a limitation of ML

This book is by Judea Pearl, one of the leaders of causal inference, who received the 2011 Turing Award for his work on Bayesian networks. Pearl points out a flaw that affects all machine learning models, from the simplest linear regression to the deepest neural networks: it's impossible to tell the difference between causation and correlation using data alone. Every morning I hear the birds chirp before sunrise, so do the birds cause the sun to rise? Obviously not, but for a machine, this is surprisingly difficult to deduce.

The Book of Why: The New Science of Cause and Effect by Judea Pearl

Pearl gives three levels of causation, where each level cannot be reached using only the tools of the levels below it:

  • Level 1 — Association: this is where most machine learning and statistics methods stand today. They can find correlations but can’t differentiate them from causation.

  • Level 2 — Intervention: using causal diagrams and do-notation, you can tell whether X causes Y. The first step is to use this machinery to determine whether the causal effect can even be identified from the data; then level 1 methods are applied to estimate its strength.

  • Level 3 — Counterfactuals: given that you did X and Y happened, determine what would have happened if you did X’ instead.

The most reliable way to determine causality is through a randomized trial, but often this is impractical due to cost or ethics, and we only have observational data. Many scientists simply control for as many variables as possible, but there are situations where this strategy backfires (controlling for the wrong variable, such as a collider, can introduce spurious correlations). Using causal diagrams, the book explains more sophisticated techniques to determine causality, and a quick algorithm for deciding whether a variable should be controlled for or not.
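
As a minimal illustration of why naive curve-fitting can mislead, here is a toy simulation (my own sketch, not an example from the book; the numbers are made up): a confounder Z drives both X and Y, and adjusting for Z recovers the true causal effect of X on Y, while the naive fit overstates it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical causal structure: Z -> X, Z -> Y, and X -> Y with true effect 2.0
z = rng.normal(size=n)                      # confounder
x = 1.5 * z + rng.normal(size=n)            # X is partly caused by Z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)  # Y is caused by X (effect 2.0) and Z

# Level 1, association only: regress Y on X and ignore Z
naive_slope = np.polyfit(x, y, 1)[0]

# Adjust for the confounder: regress Y on both X and Z
design = np.column_stack([x, z, np.ones(n)])
adjusted_slope = np.linalg.lstsq(design, y, rcond=None)[0][0]

print(f"naive slope    = {naive_slope:.2f}")     # well above 2.0 (biased by Z)
print(f"adjusted slope = {adjusted_slope:.2f}")  # close to the true effect 2.0
```

The adjustment only works because we assumed the causal diagram (Z is a confounder, not a collider or mediator); Pearl's point is that this structural knowledge cannot be read off the data alone.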

Causal inference is an active area of machine learning research, although an area that’s often ignored by mainstream ML. Judea Pearl thinks that figuring out a better representation of causation is a key missing ingredient for strong AI.

AI in the far future

When will we have superhuman artificial general intelligence (AGI)? Well, it depends on who you ask. The media often portrays AGI as being just a few years away, while many AI researchers expect it to remain out of reach for several decades or even centuries.

Superintelligence: Paths, Dangers, Strategies by Nick Bostrom

Bostrom believes there is a serious possibility that strong AGI will be achieved in the near future, say, by 2050, and that once this happens, AI will be an existential threat to humanity. Once an AI has exceeded human ability, it will rapidly improve itself or use its programming skills to develop even stronger AI, and humans will be left in the dust. It will be very difficult to keep a superintelligent AI boxed in, since it can develop advanced technology and has many ways of escaping its sandbox.

Depending on how the AI is programmed, it may have very different values from humans. In many of Bostrom's hypothetical scenarios, an AI designed for some narrow task (e.g., producing paperclips) decides to take over the world and unleash an army of self-replicating nanobots to turn every atom in the universe into paperclips. There are many unsolved questions about how to design an agent that single-mindedly maximizes an objective function without the risk of catastrophic, unintended actions taken in pursuit of that objective.

For now, there is no imminent prospect of AGI, so it's unclear to what extent specifying the value function will actually be a problem. There are much more immediate dangers of AI technology, for example, unfair bias against certain groups and the economic consequences of automation taking over jobs. Andrew Ng famously said that "fearing a rise of killer robots is like worrying about overpopulation on Mars". Nevertheless, Bostrom makes a valid point: the risks of superhuman AI to humanity are so great that they are worth taking seriously and investing in further research.

Why do polysynthetic languages all have very few speakers?

Polysynthetic languages can express in a single word ideas that in most languages would require a whole sentence. For example, in Inuktitut:

qangatasuukkuvimmuuriaqalaaqtunga

“I’ll have to go to the airport”

There's no widely accepted definition of a polysynthetic language. Generally, polysynthetic languages have noun incorporation (where noun arguments are expressed as affixes on a verb) and serial verb construction (where a single word contains multiple verbs). They are considered some of the most grammatically complex languages in the world.

Polysynthetic languages are most commonly found among the indigenous languages of North America. Only a few such languages have more than 100k speakers: Nahuatl (1.5m speakers), Navajo (170k speakers), and Cree (110k speakers). Most polysynthetic languages are spoken by a very small number of people and many are in danger of becoming extinct.

Why aren't there more polysynthetic languages that are major national languages with millions of speakers? Is it mere coincidence that the most complex languages have few speakers? According to Wray and Grace (2007), it's not just coincidence: languages spoken within a small, close-knit community with little outside contact tend to develop grammatical complexity, while languages with lots of external contact and adult learners tend to become more simplified and regular.

It’s well known that children are better language learners than adults. L1 and L2 language acquisition processes work very differently, so that children and adults have different needs when learning a language. Adult learners prefer regularity and expressions that can be decomposed into smaller parts. Anyone who has studied a foreign language has seen tables of verb conjugations like these:

[Figures: French and Korean verb conjugation tables]

For adult learners, the ideal language is predictable and has few exceptions: the number 12 would be pronounced "ten-two" rather than "twelve", and a doctor who treats your teeth would be a "tooth-doctor" rather than a "dentist". Exceptions give the adult learner difficulty, since they have to be memorized individually. An example of a very predictable language is the constructed language Esperanto, designed to have as few exceptions as possible and to be easy to learn for native speakers of any European language.

Children learn languages differently. Around the age of 12 months (the holophrastic stage), children start producing single words that can represent complex ideas. Even when an utterance corresponds to multiple words in the adult language, the child initially treats it as a single unit:

whasat (what’s that)

gimme (give me)

Once they reach 18-24 months of age, children pick up morphology and start using multiple words at a time. Children learn whole phrases first and only later analyze them into parts on an as-needed basis, so they have no difficulty with opaque idioms and irregular forms. They don't really benefit from regularity either: when children learn Esperanto as a native language, they introduce irregularities, even though the language is perfectly regular.

We see evidence of this process in English. Native speakers frequently make mistakes like writing "could of" instead of "could've", or "your" instead of "you're". This is evidence that native English speakers treat these as single units and don't naturally analyze them into their sub-components: "could + have" and "you + are".

According to this theory, a language spoken in an isolated community, where few adults try to learn it, ends up with complex and irregular words. When lots of grown-ups try to learn the language, they struggle with the grammatical complexity and simplify it. Over time, these simplifications become a standard part of the language.

Among the world’s languages, various studies have found correlations between grammatical complexity and smaller population size, supporting this theory. However, the theory is not without its problems. As with any observational study, correlation doesn’t imply causation. The European conquest of the Americas decimated the native population, and consequently, speakers of indigenous languages have declined drastically in the last few centuries. Framing it this way, the answer to “why aren’t there more polysynthetic languages with millions of speakers” is simply: “they all died of smallpox or got culturally assimilated”.

If instead, Native Americans had sailed across the ocean and colonized Europe, would more of us be speaking polysynthetic languages now? Until we can go back in time and rewrite history, we’ll never know the answer for sure.

Further reading

  • Atkinson, Mark David. “Sociocultural determination of linguistic complexity.” (2016). PhD Thesis. Chapter 1 provides a good overview of how languages are affected by social structure.
  • Kelly, Barbara, et al. “The acquisition of polysynthetic languages.” Language and Linguistics Compass 8.2 (2014): 51-64.
  • Trudgill, Peter. Sociolinguistic typology: Social determinants of linguistic complexity. Oxford University Press, 2011.
  • Wray, Alison, and George W. Grace. “The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form.” Lingua 117.3 (2007): 543-578. This paper proposes the theory and explains it in detail.

Predictions for 2030

Now that it’s Jan 1, 2020, I’m going to make some predictions about what we will see in the next decade. By the year 2030:

  • Deep learning will be a standard tool and integrated into the workflows of many professions, e.g., code completion for programmers and note-taking during meetings. Speech recognition will surpass human accuracy. Machine translation will still be inferior to human professionals.

  • Open-domain conversational dialogue (aka the Turing Test) will be on par with an average human, using a combination of deep learning and some new technique not available today. It will be regarded as more of a “trick” than strong AI; the bar for true AGI will be shifted higher.

  • Driverless cars will be in commercial use in a few limited scenarios. Most cars will have some autonomous features, but full autonomy will still not be widely deployed.

  • The S&P 500 index (a measure of the US economy, currently at 3230) will double, to between 6000 and 7000. Bitcoin will still exist, but its price will fall under 1000 USD (currently ~7000 USD).

  • Real estate prices in Toronto will either fall sharply or flatten out; the overall increase over the 2020-2030 period will not exceed inflation.

  • All western nations will have implemented some kind of carbon tax as political pressure from young people increases; no serious politician will suggest removing the carbon tax.

  • At the age of 35, about half of my Waterloo cohort will be married, but the majority will not have any kids.

  • China will overtake the USA as the world's biggest economy, but its growth will slow down, and its GDP per capita (even at PPP) will still be well below that of the USA.

Five books to understand controversial political issues

Climate change, the housing crisis, the rise of China, and recycling: four highly complex and controversial topics that come up again and again in the news, in election debates, and in discussions with friends and family.

Despite all the media coverage, it’s difficult to get a balanced view. Individual news articles tend to present a simplistic, one-sided stance on a complex problem. Furthermore, these issues are politically polarizing so it’s easy to find yourself in a filter bubble and only see one side of the story.

This year, I resolved to read books to get a well-rounded understanding of some of the world’s most pressing and controversial issues. A book can go into much greater depth than an article, and should ideally be well-researched and present both sides of an argument.

Issue 1: Climate Change

The Climate Casino by William Nordhaus

This book, by a Nobel Prize-winning economist, discusses climate change from an economic perspective. Climate change typically evokes extreme responses: conservatives deny it altogether, while environmentalists warn of doomsday scenarios. The reality is somewhere in the middle: if we don't do anything about climate change, average surface temperature is projected to rise by about 3.5°C by the end of the century, costing roughly 1-4% of world GDP. It's definitely a serious concern, but it probably won't cause the collapse of civilization either.

Thousands of papers reviewed by the IPCC study various aspects of climate change, but there are still large uncertainties in all the projections. There is a small chance that we might cross a "tipping point", like the melting of the polar ice caps, where the system changes catastrophically and irreversibly past a certain temperature. It's poorly understood exactly what temperature triggers a tipping point, which adds even more uncertainty to the models. Every time we emit CO2, we are gambling in the climate casino.

The three approaches to dealing with climate change are mitigation (emitting less CO2), adapting to the effects of climate change, and geoengineering. We can reduce emissions quite a lot at only modest cost, but costs go up exponentially as you try to cut more and more emissions. It's crucial that all countries participate: climate targets become impossible to meet if only half of the countries participate.

Economists agree that a carbon tax is a simple and effective way of reducing carbon emissions across the board, more effective than direct government regulation. A carbon tax sends a signal to the market and strongly discourages high-carbon technologies like coal, while increasing gas and electricity costs by only a modest amount.

Currently, climate change is a partisan issue: opinions on climate change are highly correlated with political views. But the scientific evidence is overwhelming, and public opinion will gradually change.

Issue 2: Rise of China

China is an emerging superpower, projected to overtake the US as the world's biggest economy in the next decade. There's been a lot of tension between the two countries recently, as seen in the trade war and the Hong Kong protests. I picked up two books on this topic: one with a Chinese perspective, and another with a Western perspective.

China Emerging by Wu Xiaobo

I got this book in Shenzhen; it was one of the few English-language books about China in the bookstore. It describes the history of China from 1978 to today. In 1978, China was very poor, having experienced famines and the Cultural Revolution under Mao Zedong's rule. The early 1980s were a turning point, when Deng Xiaoping started to open up the country to foreign investment and capitalism. He began by setting up "Special Economic Zones" in coastal cities like Shenzhen and Xiamen, where he experimented with capitalism, with great success.

The 1980s and 1990s saw a gradual shift from communism to capitalism, where the state relinquished control to entrepreneurs and investors. By the 2000s, China was a manufacturing giant and everything was “made in China”. During the mid 2000s, there was a boom in massive construction projects like high speed rail and hundreds of skyscrapers. Development at such speed and scale is historically unprecedented — the book describes it as “a big ship sailing towards the future”.

This book helped me understand the Chinese mindset, although it is quite one-sided and at times reads like state propaganda. Of course, being published in China, it leaves out sensitive topics like the Tiananmen massacre and internet censorship.

China’s Economy: What Everyone Needs to Know by Arthur R. Kroeber

This book, written by a Western author, describes all aspects of China's economy and presents a quite different view of the situation. Since the economic reforms of 1978, life has gotten tremendously better, with the share of the population living in extreme poverty falling from 90% in 1978 to less than 1% now. However, the growth has been uneven, and there is a high level of inequality.

One example of inequality is the hukou system. Industrialization brought many people into the cities, and now about two-thirds of the population is urban, but there is a lot of inequality between migrant workers and those with urban hukou. Only people with hukou have access to social services and healthcare. However, you can't just give everyone hukou, because the top-tier cities don't have the infrastructure to support so many migrants.

Another example of inequality is in real estate. Around 2003, the government sold urban property at very low rates, essentially a large transfer of wealth. At the same time, local governments forcefully bought rural land at below market rates, exacerbating the urban-rural inequality. Chinese people like to invest in real estate because they believe it will always go up (as it had for the last 20 years).

In the early stages of economic reform, the priority was to mobilize the country's enormous workforce, so some inefficiency and corruption was tolerated. China focused on labor-intensive light industry, like clothing and consumer appliances, rather than capital-intensive heavy industry. Recently, as wages have risen, cheap labor has become less abundant, so further growth requires greater economic efficiency. However, China still struggles to produce high-quality, more advanced technology like cars, aircraft, and electronics (lots of phones are "made in China", but only the final assembly stage happens there), and mostly produces cheap items of medium quality.

China will be the world's largest economy in the next decade, although this doesn't mean much after accounting for population size. Despite its economic weight, it has limited political influence and no strong allies, even in East Asia. It also struggles to become a technological leader: most of its tech companies serve only the domestic market and don't gain traction outside the country. It's clear that China has vastly different values from Western countries: elections and democracy are not valued; rather, a government is considered good as long as it keeps the economy running. These ideologies will need to learn to coexist in the future.

Issue 3: Canadian Housing Bubble

When the Bubble Bursts by Hilliard MacBeth

Real estate prices and rents have risen a lot in the last 10 years in many Canadian cities (most notably Toronto and Vancouver), to the point that housing is now unaffordable for many people. Hilliard MacBeth holds the controversial opinion that Canada is in a housing bubble that is about to burst, with corrections on the order of 40-50%. Many people blame immigration and foreign investment, but he argues that the rise in real estate prices is due to low interest rates and the willingness of banks to make large mortgage loans.

Many people assume that house prices will always go up. This has been the case for the last 20 years, but there are clear counterexamples, like the USA in 2008 and Japan in the 1990s. The Case-Shiller index shows that over a 100-year period, real estate values in the USA have approximately matched inflation. We're likely overfitting to the most recent data: we develop heuristics from the last few decades of growth and extrapolate them indefinitely into the future.

A few years ago, I asked a question on StackExchange about why we should expect stocks to go up in the long term; there are strong reasons to believe that annual growth of 5-7% will continue indefinitely. For real estate, there's no comparable reason why prices should increase over the long term, so buying in the hope of appreciation is more like speculation than investment. Rather than an investment, it's better to think of real estate as paying for a lot of future rent upfront, and taking on a large mortgage is risky for young people, especially in this market.

The author is extremely pessimistic about the future of Canadian real estate, which I don’t think is justified. Nevertheless, it’s good to question some common assumptions about real estate investing. In particular, the tradeoff between renting and ownership depends on a lot of factors, and we shouldn’t jump to the conclusion that ownership is better.

Issue 4: Recycling

Junkyard Planet by Adam Minter

The recycling industry is in chaos as China announced this year that it is no longer importing recyclable waste from western countries. Some media investigations have found that recycling is just a show, and much of it ends up in landfills or being burnt. Wait, what is going on exactly?

This book is written by a journalist who is the son of a scrapyard owner. In Western countries, we normally think of recycling as an environmentalist act, but in reality it's more accurate to think of it as harvesting valuable materials out of what would otherwise be trash. Metals like copper, steel, and aluminum are harvested from all kinds of things: Christmas lights, cables, cars, and so on. It's a lot cheaper and takes less energy to harvest metal from used items (copper is worth a few thousand dollars a ton) than to mine it from ore, which can require processing a hundred tons of ore to produce one ton of metal.

There's a large international trade in scrap metal. America is the biggest producer of scrap, and much of it gets sent to China because of cheap labor and China's huge demand for metal to build its developing infrastructure. The scrap then flows into secondary markets, where it is graded by quality and sold to the highest bidder; there is little concern for the environment. The free market, with its economic incentives, is much more efficient at recycling than citizens with good intentions.

Metals are highly recyclable, whereas plastic is almost impossible to recycle profitably because its value per ton is so low compared to metals. For some time, plastic recycling was done in Wen'an, but it was only possible because there were no environmental regulations and the workers didn't wear protective equipment when handling dangerous chemicals. The government shut it down after people started getting sick. It's more costly to recycle while complying with regulations, which is why very little of it is done in the US; the scrap is simply exported to countries with weaker oversight.

Recycling usually takes place in places with cheap labor. Often you have the choice between building a machine to do something or hiring humans to do it. It all comes down to price: the same task of sorting different metals is done by hundreds of young women in China and by an expensive machine in America. Labor is getting more expensive in China with its rising middle class, so this is rapidly changing.

In the developed world, we have a misguided idea of how recycling works; we often treat recycling as a "free pass" that allows us to consume as much as we want, as long as we recycle. In reality, recycling is imperfect, and recycled material is usually turned into a lower-grade product. In the "reduce, reuse, recycle" mantra, reducing consumption has by far the biggest impact, reusing is good, and recycling should be considered a distant third option.

Directionality of word class conversion

Many nouns (like google, brick, bike) can be used as verbs:

  • Let me google that for you.
  • The software update bricked my phone.
  • Bob biked to work yesterday.

Conversely, many verbs (like talk, call) can be used as nouns:

  • She gave a talk at the conference.
  • I’m on a call with my boss.

Here, we just assumed that {google, brick, bike} are primarily nouns and {talk, call} are primarily verbs — but is this justified? After all, all five of these words can be used as either a noun or a verb. Then, what’s the difference between the first group {google, brick, bike} and the second group {talk, call}?

These are examples of word class flexibility: words that can be used across multiple part-of-speech classes. In this blog post, I’ll describe some objective criteria to determine if a random word like “sleep” is primarily a noun or a verb.

Five criteria for deciding directionality

Linguists have studied the problem of deciding what is the base / dominant part-of-speech category (equivalently, deciding the directionality of conversion). Five methods are commonly listed in the literature: frequency of occurrence, attestation date, semantic range, semantic dependency, and semantic pattern (Balteiro, 2007; Bram, 2011).

  1. Frequency of occurrence: a word is noun-dominant if it occurs more often as a noun than as a verb. This is the easiest criterion to compute, since all you need is a POS-tagged corpus (a small sketch of this follows the list). The issue is that the direction then depends on which corpus you use, and there can be big differences between genres.
  2. Attestation date: a word is noun-dominant if it was used first as a noun and only later as a verb. This works for newer words: Google (the company) existed for a while before anyone started "googling" things. But we run into problems with older words, where the direction depends on the precise dating of Middle English manuscripts. If the word goes back to Proto-Germanic or Proto-Indo-European, finding the attestation date becomes impossible. This method is also philosophically questionable, because you shouldn't need to know the history of a language to describe its current form.
  3. Semantic range: if a dictionary lists more noun meanings than verb meanings for a word, then it's noun-dominant. This is not so reliable, because dictionaries disagree on how many senses to include and on how different two senses must be to warrant separate entries. Also, some meanings are rare or domain-specific (e.g., a "call option" in finance), and it doesn't seem right to count them equally.
  4. Semantic dependency: if the definition of the verb meaning refers to the noun meaning, then the word is noun-dominant. For example, "to bottle" means "to put something into a bottle". This criterion is not always clear-cut: sometimes you can define it either way, or neither definition refers to the other.
  5. Semantic pattern: a word is noun-dominant if it refers to an entity / object, and verb-dominant if it refers to an action. A bike is something that you can touch and feel; a walk is not. Haspelmath (2012) encourages distinguishing {entity, action, property} rather than {noun, verb, adjective}. However, it's hard to determine without subjective judgement, especially for abstract words like "test" or "work", whether the entity or the action sense is more primary.
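
To make criterion 1 concrete, here is the small sketch referenced in item 1, using NLTK's POS-tagged Brown corpus as the corpus of choice; that choice is an assumption for illustration, and, as noted above, a different corpus or genre could flip the verdict for borderline words.

```python
from collections import Counter

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

# Count (word, tag) pairs once over the whole corpus
# (universal tagset: NOUN, VERB, ADJ, ...).
counts = Counter(
    (token.lower(), tag) for token, tag in brown.tagged_words(tagset="universal")
)

def dominance(word):
    """Frequency-of-occurrence verdict for one word: noun- or verb-dominant."""
    n, v = counts[(word, "NOUN")], counts[(word, "VERB")]
    if n == v:
        return n, v, "tie / not attested"
    return n, v, "noun-dominant" if n > v else "verb-dominant"

for word in ["brick", "bike", "talk", "call", "sleep"]:
    n, v, verdict = dominance(word)
    print(f"{word}: {n} noun vs. {v} verb tokens -> {verdict}")
```

Counts in a 1960s corpus like Brown will be tiny for words like "bike" (and "google" obviously won't appear at all), which is exactly the corpus-dependence problem mentioned in item 1.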

Comparisons using corpus methods

How do we make sense of all these competing criteria? To answer this question, Balteiro (2007) compared 231 flexible noun/verb pairs and rated them according to the five criteria listed above, as well as a few more that I didn't include. Later, Bram (2011) surveyed a larger set of 2048 pairs.

The details are quite messy, because applying the criteria is not so straightforward. For example, polysemy: the word "call" has more than 40 definitions in the OED, some of them obsolete, so which one do you use for the attestation date? How do you deal with homonyms like "bank" that have two unrelated meanings? After hundreds of pages of painstaking analysis, the researchers came to a judgement for each word. Then, they measured the agreement between each pair of criteria:

[Table: pairwise agreement between criteria, adapted from Table 5.2 of Bram's thesis]

There is only a moderate level of agreement between the different criteria, on average about 65% — better than random, but not too convincing either. Only frequency and attestation date agree more than 80% of the time. Only a small minority of words have all of the criteria agree.
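
To be explicit about what that measurement is, here is a toy sketch of the agreement computation; the judgements below are invented placeholders, not Balteiro's or Bram's actual data. For each pair of criteria, agreement is simply the fraction of words on which the two criteria assign the same direction.

```python
from itertools import combinations

# Invented judgements for illustration: for each criterion, the inferred
# direction per word ("N" = noun-dominant, "V" = verb-dominant).
judgements = {
    "frequency":   {"brick": "N", "bike": "N", "talk": "V", "call": "V", "sleep": "V"},
    "attestation": {"brick": "N", "bike": "N", "talk": "V", "call": "V", "sleep": "N"},
    "sem_range":   {"brick": "N", "bike": "V", "talk": "V", "call": "N", "sleep": "N"},
}

def pairwise_agreement(a, b):
    """Fraction of shared words on which criteria a and b agree."""
    words = judgements[a].keys() & judgements[b].keys()
    same = sum(judgements[a][w] == judgements[b][w] for w in words)
    return same / len(words)

for a, b in combinations(judgements, 2):
    print(f"{a} vs {b}: {pairwise_agreement(a, b):.0%}")
```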

Theoretical ramifications

This puts us in a dilemma: how do we make sense of these results? What’s the direction of conversion if these criteria don’t agree? Are some of the criteria better than others, perhaps take a majority vote? Is it even possible to determine a direction at all?

Linguists have disagreed for decades over what to do with this situation. Van Lier and Rijkhoff (2013) give a survey of the various views. Some linguists maintain that every flexible word must be either noun-dominant or verb-dominant, with one category converted to the other. Other linguists note the disagreements between criteria and propose instead that such words are underspecified. Just like a stem cell that can morph into a skin or lung cell as needed, a word like "sleep" is neither a noun nor a verb, but a pre-categorical form that can morph into either depending on context.

Can we really determine the dominant category of a conversion pair? It seems doubtful that this issue will ever be resolved. At present, none of the theories make scientific predictions that can be tested and falsified. Until that changes, the theories co-exist as different ways to view and analyze the same data.

The idea of a "dominant" category doesn't exist in nature; it is merely a construct to help explain the data. In mathematics, it's nonsensical to ask whether imaginary numbers really "exist". Nobody has ever seen an imaginary number, but mathematicians use them because they're good for describing a lot of things. Likewise, it doesn't make sense to ask whether flexible words really have a dominant category. We can only ask whether a theory that assumes the existence of a dominant category is simpler than one that does not.

References

  1. Balteiro, Isabel. The directionality of conversion in English: A dia-synchronic study. Vol. 59. Peter Lang, 2007.
  2. Bram, Barli. “Major total conversion in English: The question of directionality.” (2011). PhD Thesis.
  3. Haspelmath, Martin. “How to compare major word-classes across the world’s languages.” Theories of everything: In honor of Edward Keenan 17 (2012): 109-130.
  4. Van Lier, Eva, and Jan Rijkhoff. “Flexible word classes in linguistic typology and grammatical theory.” Flexible word classes: a typological study of underspecified parts-of-speech (2013): 1-30.

Explaining chain-shift tone sandhi in Min Nan Chinese

In my previous post on the Teochew dialect, I noted that Teochew has a complex system of tone sandhi. The last syllable of a word keeps its citation (base) form, while all preceding syllables undergo sandhi. For example:

gu5 (cow) -> gu1 nek5 (cow-meat = beef)

seng52 (play) -> seng35 iu3 hi1 (play a game)

The sandhi system is quite regular — for instance, if a word’s base tone is 52 (falling tone), then its sandhi tone will be 35 (rising tone), across many words:

toin52 (see) -> toin35 dze3 (see-book = read)

mang52 (mosquito) -> mang35 iu5 (mosquito-oil)

We can represent this relationship as an edge in a directed graph 52 -> 35. Similarly, words with base tone 5 have sandhi tone 1, so we have an edge 5 -> 1. In Teochew, the sandhi graph of the six non-checked tones looks like this:

[Figure: Teochew tone sandhi, Jieyang dialect, adapted from Xu (2007). For simplicity, checked tones (ending in -p, -t, -k), which have different sandhi patterns, are ignored.]

This type of pattern is not unique to Teochew; it exists in many dialects of Min Nan. Other dialects have different tone values but a similar system. It's called right-dominant chain-shift, because the rightmost syllable of a word keeps its base tone; it's also called a "tone circle" when the graph has a cycle. Most notably, the chain-shift pattern, where A -> B and B -> C yet A does not become C, is quite rare cross-linguistically and does not occur in any Chinese dialect outside the Min family.
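
As a small illustration of the right-dominant rule, here is a sketch that applies sandhi mechanically, using only the two Teochew edges given above (52 -> 35 and 5 -> 1); the full six-tone graph in the figure would simply add more entries to the dictionary.

```python
# Partial Teochew (Jieyang) sandhi graph: base tone -> sandhi tone.
# Only the two edges mentioned in the text are included; this is a sketch,
# not the full six-tone system.
SANDHI = {"52": "35", "5": "1"}

def apply_sandhi(word):
    """word is a list of (syllable, tone) pairs: every syllable except the
    last takes its sandhi tone; the last keeps its citation tone."""
    out = []
    for i, (syl, tone) in enumerate(word):
        is_last = i == len(word) - 1
        out.append((syl, tone if is_last else SANDHI.get(tone, tone)))
    return out

print(apply_sandhi([("gu", "5"), ("nek", "5")]))     # [('gu', '1'), ('nek', '5')]
print(apply_sandhi([("toin", "52"), ("dze", "3")]))  # [('toin', '35'), ('dze', '3')]
```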

Is there any explanation for this unusual tone sandhi system? In this blog post, I give an overview of some attempts at an explanation from theoretical phonology and historical linguistics.

Xiamen tone circle and Optimality Theory

The Xiamen / Amoy dialect is perhaps the most studied variety of Min Nan. Its sandhi system looks like this:

[Figure: Xiamen tone sandhi circle]

Barrie (2006) and Thomas (2008) attempt to explain this system with Optimality Theory (OT). In modern theoretical phonology, OT is a framework that describes how underlying phonemes are mapped to output phonemes, not with rules but with a set of constraints. The constraints dictate what kinds of patterns are considered "bad" in the language, but some violations are worse than others, so the constraints are ranked in a hierarchy. The output is then the candidate that is "least bad" according to the ranking.

To explain the Xiamen tone circle sandhi, Thomas begins by introducing the following OT constraints:

  • *RISE: incur a penalty for every sandhi tone that has a rising contour.
  • *MERGE: incur a penalty when two citation tones are mapped to the same sandhi tone.
  • DIFFER: incur a penalty when a base tone is mapped to itself as its sandhi tone.

Without any constraints, there are 5^5 = 3125 possible sandhi systems in a 5-tone language. With these constraints, most of the hypothetical systems are eliminated — for example, the null system (where every tone is mapped to itself) incurs 5 violations of the DIFFER constraint.
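
Here's a rough sketch of that counting argument, treating the five tones as abstract labels 0-4. Which tones count as rising is a made-up assumption here, and Thomas's full constraint set and ranking are not reproduced; the point is only to show how candidates are enumerated and scored.

```python
from itertools import product

TONES = range(5)
# Assumption for illustration: pretend tone 1 is the only rising tone.
# The real Xiamen tone inventory would determine this set.
RISING = {1}

def violations(mapping):
    """mapping[i] is the sandhi tone assigned to citation tone i."""
    rise = sum(1 for t in mapping if t in RISING)             # *RISE
    merge = len(mapping) - len(set(mapping))                  # *MERGE: collisions
    differ = sum(1 for i, t in enumerate(mapping) if i == t)  # DIFFER: tone kept as-is
    return rise, merge, differ

systems = list(product(TONES, repeat=5))  # 5^5 = 3125 hypothetical sandhi systems

null_system = tuple(TONES)                # every tone maps to itself
print(violations(null_system))            # five DIFFER violations, as noted above
                                          # (plus one *RISE hit from the assumed RISING set)

# No candidate is violation-free here: avoiding both *MERGE and DIFFER requires a
# derangement of the five tones, and a non-empty RISING set leaves only four
# permissible target tones for five inputs.
print(len([m for m in systems if violations(m) == (0, 0, 0)]))  # 0
```

In this toy setup no candidate satisfies every constraint, which is exactly why OT ranks the constraints and selects the least-bad candidate rather than demanding a perfect one.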

These three constraints aren't quite enough to fully explain the Xiamen tone system: there are still 84 hypothetical systems that are just as good as the actual one. With the aid of a Perl script, Thomas then introduces more constraints until only one system (the actual observed one) emerges as the best under the ranking.

Problems with the OT explanation

There are several reasons why I didn’t find this explanation very satisfying. First, it’s not falsifiable: if your constraints don’t generate the right result, you can keep adding more and more constraints, and tweak the ranking, until they produce the result you want.

Second, the constraints are very arbitrary and lack any cognitive or linguistic motivation. You can explain the *MERGE constraint as trying to preserve contrasts, which makes sense from an information-theoretic point of view, but what about DIFFER? It's unclear why a base tone shouldn't be allowed to keep its own value as its sandhi tone, especially since many languages (like Cantonese) manage fine with no sandhi at all.

Even Teochew, which is closely related to the Xiamen dialect, violates all three constraints. I'm not aware of any OT analysis of Teochew sandhi; it would be interesting to see one, but it would surely need a very different set of constraints from the Xiamen system.

Nevertheless, OT has been an extremely successful framework in modern phonology. In some cases, OT can cleanly describe a pattern that would otherwise require very complicated rules. In those cases, the set of OT constraints is a good explanation for the pattern.

Also, if the same constraint shows up in OT grammars for many languages, that increases the credibility that it reflects a true cross-linguistic tendency rather than a rule made up to explain the data. For example, if the *RISE constraint appears in many languages, you could claim that there's a general tendency for languages to prefer falling tones over rising tones.

Evidence from Middle Chinese

Chen (2000) gives a different perspective. Essentially, he claims that it’s impossible to make sense of the data in any particular modern-day dialect. Instead, we should compare multiple dialects together in the context of historical sound changes.

The evidence he gives is from the Zhangzhou dialect, located about 40km inland from Xiamen. The Zhangzhou dialect has a similar tone circle as Xiamen, but with different values!

[Figure: Zhangzhou tone circle]

It’s not obvious how the two systems are related, until you consider the mapping to Middle Chinese tone categories:

[Figure: Xiamen and Zhangzhou tone circles aligned with Middle Chinese tone categories]

The Roman numerals I, II, III denote tones of Middle Chinese, spoken around 600 AD. Middle Chinese had four tones, but after centuries of tone splits and merges, none of the present-day Chinese dialects retain this system. In many dialects, a Middle Chinese tone split into two tones depending on whether the initial consonant was voiced or voiceless. When comparing tones across dialects, it's often useful to refer to historical tone categories like "IIIa", which roughly means "syllables that had tone III in Middle Chinese and a voiceless initial consonant".

It’s unlikely that both Xiamen and Zhangzhou coincidentally developed sandhi patterns that map to the same Middle Chinese tone categories. It’s far more likely that the tone circle developed in a common ancestral language, then their phonetic values diverged afterwards in the respective present-day dialects.

That still leaves open the question of: how exactly did the tone circle develop in the first place? It’s likely that we’ll never know for sure: the details are lost to time, and the processes driving historical tone change are not very well understood.

In summary, theoretical phonology and historical linguistics offer complementary insights that explain the chain-shift sandhi patterns in Min Nan languages. Optimality Theory proposes tendencies for languages to prefer certain structures over others. This partially explains the pattern; a lot of it is simply due to historical accident.

References

  1. Barrie, Michael. “Tone circles and contrast preservation.” Linguistic Inquiry 37.1 (2006): 131-141.
  2. Chen, Matthew Y. Tone sandhi: Patterns across Chinese dialects. Vol. 92. Cambridge University Press, 2000. Pages 38-49.
  3. Thomas, Guillaume. “An analysis of Xiamen tone circle.” Proceedings of the 27th West Coast Conference on Formal Linguistics. Cascadilla Proceedings Project, Somerville, MA. 2008.
  4. Xu, Hui Ling. “Aspect of Chaozhou grammar: a synchronic description of the Jieyang variety.” (2007).

Non-technical challenges of medical NLP research

Machine learning has recently made a lot of headlines in healthcare applications, like identifying tumors from images, or technology for personalized treatment. In this post, I describe my experiences as a healthcare ML researcher: the difficulties in doing research in this field, as well as reasons for optimism.

My research group focuses on applications of NLP to healthcare. For a year or two, I was involved in a number of projects in this area (specifically, detecting dementia through speech). From my own projects and from talking to others in my research group, I noticed a few recurring difficulties in healthcare NLP research, things that rarely came up in other branches of ML. These are non-technical challenges that take up time and impede progress, and they are generally considered not very interesting to solve. I'll give some examples of what I mean.

Collecting datasets is hard. Any time you want to do anything involving patient data, you have to undergo a lengthy ethics approval process. Even with something as innocent as an anonymous online questionnaire, there is a mandatory review by an ethics board before the experiment is allowed to proceed. As a result, most datasets in healthcare ML are small: a few dozen patient samples is common, and you’re lucky to have more than a hundred samples to work with. This is tiny compared to other areas of ML where you can easily find thousands of samples.

In my master's research project, where I studied dementia detection from speech, the largest available corpus had about 300 patients, and other corpora had fewer than 100. This constrained the types of experiments that were possible. Prior work in this area relied heavily on feature engineering, because it was commonly believed that you needed at least a few thousand examples to do deep learning; with less data than that, deep learning would simply overfit.

Even after the data has been collected, it is difficult to share with others, again due to the conservative ethics processes required for data sharing. Data transfer agreements need to be reviewed and signed, and in some cases the data must remain physically on servers in a particular hospital. Researchers rarely open-source their code along with the paper, since there's no point in doing so without giving access to the data; this makes it hard to reproduce any experimental results.

Medical data is messy. Data access issues aside, healthcare NLP has some of the messiest datasets in machine learning. Many datasets in ML are carefully constructed and annotated for the purpose of research, but this is not the case for medical data. Instead, the data comes from real patients and hospitals: clinical notes are full of doctors' shorthand and abbreviations of medical terms that mean different things depending on context. Unsurprisingly, many NLP techniques fail to work on them. Missing values and otherwise unreliable data are common, so a lot of not-so-glamorous preprocessing is often needed.


I’ve so far painted a bleak picture of medical NLP, but I don’t want to give off such a negative image of my field. In the second part of this post, I give some counter-arguments to the above points as well as some of the positive aspects of research.

On difficulties in data access. There are good reasons for caution — patient data is sensitive and real people can be harmed if the data falls into the wrong hands. Even after removing personally identifiable information, there’s still a risk of a malicious actor deanonymizing the data and extracting information that’s not intended to be made public.

The situation is improving though. The community recognizes the need to share clinical data, to strike a balance between protecting patient privacy and allowing research. There have been efforts like the relatively open MIMIC critical care database to promote more collaborative research.

On small / messy datasets. With every challenge comes an opportunity. In fact, my own master's research was driven by a lack of data: I was trying to extend dementia detection to Chinese, but there wasn't much data available, so I proposed a way to transfer knowledge from the much larger English dataset to Chinese, and got a conference paper and a master's thesis out of it. If it weren't for the lack of data, you could have just taken the existing algorithm and applied it to Chinese, which wouldn't have been as interesting.

Also, deep learning in NLP has recently gotten a lot better at learning from small datasets. Other research groups have had some success on the same dementia detection task using deep learning. With new papers every week on few-shot learning, one-shot learning, transfer learning, etc, small datasets may not be too much of a limitation.

The same applies to messy data, missing values, label leakage, and so on. I'll refer to this survey paper for the details, but the take-away is that these shouldn't be thought of as barriers, but as opportunities to make a research contribution.

In summary, as a healthcare NLP researcher, you have to deal with difficulties that other machine learning researchers don’t have. However, you also have the unique opportunity to use your abilities to help sick and vulnerable people. For many people, this is an important consideration — if this is something you care deeply about, then maybe medical NLP research is right for you.

Thanks to Elaine Y. and Chloe P. for their comments on drafts of this post.