I didn’t break the bed, the bed broke: Exploring semantic roles with VerbNet / FrameNet

Some time ago, my bed fell apart, and I entered into a dispute with my landlord. “You broke the bed,” he insisted, “so you will have to pay for a new one.”

Being a poor grad student, I wasn’t about to let him have his way. “No, I didn’t break the bed,” I replied. “The bed broke.”


Above: My sad and broken bed. Did it break, or did I break it?

What am I implying here? It’s interesting how this argument relies on a crucial semantic difference between the two sentences:

  1. I broke the bed
  2. The bed broke

The difference is that (1) means I caused the bed to break (eg: by jumping on it), whereas (2) means the bed broke by itself (eg: through normal wear and tear).

This is intuitive to a native speaker, but it's not so obvious why. From this example, one might guess that using an intransitive verb transitively ("X VERBed Y") always means "X caused Y to VERB". But this is not the case: consider the following pair of sentences:

  1. I attacked the bear
  2. The bear attacked

Even though the syntax is identical to the previous example, the semantic structure is quite different. Unlike in the bed example, sentence (1) cannot possibly mean “I caused the bear to attack”. In (1), the bear is the one being attacked, while in (2), the bear is the one attacking something.

Above: Semantic roles for verbs "break" and "attack".

Sentences which are very similar syntactically can have different structures semantically. To address this, linguists assign semantic roles to the arguments of verbs. There are many semantic roles (and nobody agrees on a precise list of them), but two of the most fundamental ones are Agent and Patient.

  • Agent: entity that intentionally performs an action.
  • Patient: entity that changes state as a result of an action.
  • Many more, such as Instrument (eg: the hammer in "Tony broke the piggy bank open with a hammer").

The way that a verb’s syntactic arguments (eg: Subject and Object) line up with its semantic arguments (eg: Agent and Patient) is called the verb’s argument structure. Note that an agent is not simply the subject of a verb: for example, in “the bed broke“, the bed is syntactically a subject but is semantically a patient, not an agent.

Computational linguists have created several corpora to make this information accessible to computers. Two of these corpora are VerbNet and FrameNet. Let’s see how a computer would be able to understand “I didn’t break the bed; the bed broke” using these corpora.


Above: Excerpt from VerbNet entry for the verb “break”.

VerbNet is a database of verbs that lists the syntactic patterns in which each verb can be used. Each entry contains a mapping from syntactic positions to semantic roles, and restrictions on the arguments. The first entry for "break" covers the transitive form: "Tony broke the window".

Looking at the semantics, you can conclude that: (1) the agent “Tony” must have caused the breaking event, (2) something must have made contact with the window during this event, (3) the window must have its material integrity degraded as a result, and (4) the window must be a physical object. In the intransitive usage, the semantics is simpler: there is no agent that caused the event, and no instrument that made contact during the event.

The word "break" can take arguments in other ways, not just transitive and intransitive. VerbNet lists 10 different patterns for this word, such as "Tony broke the piggy bank open with a hammer". This sentence contains a result (open) and an instrument (a hammer). The entry for "break" also groups together words like "fracture", "rip", and "shatter" that have semantic patterns similar to "break".
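If you want to poke at VerbNet programmatically, NLTK ships a corpus reader for it. Here's a minimal sketch, assuming you've run nltk.download('verbnet'); the class id break-45.1 is my assumption of the relevant class, so check the output of classids() on your version first:

```python
from nltk.corpus import verbnet as vn  # assumes nltk.download('verbnet')

# List the VerbNet classes that contain "break"
print(vn.classids('break'))

# Pretty-print one class: members, thematic roles, frames, and semantics
print(vn.pprint(vn.vnclass('break-45.1')))
```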


Above: Excerpt from FrameNet entry for the verb “break”.

FrameNet is a similar database, but based on frame semantics. The idea is that in order to define a concept, you have to define it in terms of other concepts, and it’s hard to avoid a cycle in the definition graph. Instead, it’s sometimes easier to define a whole semantic frame at once, which describes a conceptual situation with many different participants. The frame then defines each participant by what role they play in the situation.

The word “break” is contained in the frame called “render nonfunctional“. In this frame, an agent affects an artifact so that it’s no longer capable of performing its function. The core (semantically obligatory) arguments are the agent and the artifact. There are a bunch of optional non-core arguments, like the manner that the event happened, the reason that the agent broke the artifact, the time and place it happened, and so on. FrameNet tries to make explicit all of the common-sense world knowledge that you need to understand the meaning of an event.

Compared to VerbNet, FrameNet is less concerned with the syntax of verbs: for instance, it does not mention that “break” can be used intransitively. Also, it has more fine-grained categories of semantic roles, and contains a description in English (rather than VerbNet’s predicate logic) of how each semantic argument participates in the frame.
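FrameNet is also accessible through NLTK. A quick sketch, assuming nltk.download('framenet_v17'); the identifier Render_nonfunctional is my guess at the internal name of the "render nonfunctional" frame, so verify it against the frames_by_lemma() output:

```python
from nltk.corpus import framenet as fn  # assumes nltk.download('framenet_v17')

# Find frames whose lexical units match "break"
for f in fn.frames_by_lemma(r'(?i)break'):
    print(f.name)

# Inspect one frame: its definition and core frame elements
frame = fn.frame('Render_nonfunctional')   # frame name is an assumption
print(frame.definition)
print([name for name, fe in frame.FE.items() if fe.coreType == 'Core'])
```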

An open question is: how can computers use VerbNet and FrameNet to understand language? Nowadays, deep learning has come to dominate NLP research, so that VerbNet and FrameNet are often seen as relics of a past era, when people still used rule-based systems to do NLP. It turned out to be hard to use VerbNet and FrameNet to make computers do useful tasks.

But recently, the NLP community is realizing that deep learning has limitations when it comes to common-sense reasoning, limitations that can't be solved just by adding more layers onto BERT and feeding it more data. So maybe deep learning systems can benefit from these lexical semantic resources.

Edge probing BERT and other language models

Recently, there has been a lot of research in the field of “BERTology”: investigating what aspects of language BERT does and doesn’t understand, using probing techniques. In this post, I will describe the “edge probing” technique developed by Tenney et al., in two papers titled “What do you learn from context? Probing for sentence structure in contextualized word representations” and “BERT Rediscovers the Classical NLP Pipeline“. On my first read through these papers, the method didn’t seem very complicated, and the authors only spend a paragraph explaining how it works. But upon closer look, the method is actually nontrivial, and took me some time to understand. The details are there, but hidden in an appendix in the first of the two papers.

The setup for edge probing is as follows: you have some classification task that takes a sentence and a span of consecutive tokens within it, and produces an N-way classification. This can be generalized to two spans, but we'll focus on the single-span case. The tasks cover a range of syntactic and semantic functions, ranging from part-of-speech tagging and dependency parsing to coreference resolution.


Above: Examples of tasks where edge probing may be used. Table taken from Tenney (2019a).

Let's go through the steps of how edge probing works. Suppose we want to probe for which parts of BERT-BASE contain the most information about named entity recognition (NER). In this NER setup, we're given the named entity spans and only need to classify which type of entity it is (e.g: person, organization, location, etc). The first step is to feed the sentence through BERT, giving 12 layers of activations, each layer containing a 768-dimensional vector for every token.


Above: Edge probing example for NER.

The probing model has several stages:

  1. Mix: learn a task-specific linear combination of the layers, giving a single 768-dimensional vector for each span token. The weights that are learned indicate how much useful information for the task is contained in each layer.
  2. Projection: learn a linear mapping from 768 down to 256 dimensions.
  3. Self-attention pooling: learn a function to generate a scalar weight for each span vector. Then, we normalize the weights to sum up to 1, and take a weighted average. The purpose of this is to collapse the variable-length sequence of span vectors into a single fixed-length 256-dimensional vector.
  4. Feedforward NN: learn a multi-layer perceptron classifier with 2 hidden layers.

For the two-span case, they use two separate sets of weights for the mix, projection, and self-attention steps. Then, the feedforward neural network takes a concatenated 512-dimensional vector instead of a 256-dimensional vector.
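To make the architecture concrete, here is a rough PyTorch sketch of the single-span probe. The four stages follow the description above, but the exact MLP widths and other details are my simplifications, not the authors' code:

```python
import torch
import torch.nn as nn

class EdgeProbe(nn.Module):
    """Single-span edge probe: mix -> project -> attention-pool -> MLP.
    The frozen BERT model that produces `layers` lives outside this module."""

    def __init__(self, n_layers=12, d_model=768, d_proj=256, n_classes=10):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # 1. mix
        self.proj = nn.Linear(d_model, d_proj)                    # 2. projection
        self.attn = nn.Linear(d_proj, 1)                          # 3. pooling scores
        self.mlp = nn.Sequential(                                 # 4. classifier
            nn.Linear(d_proj, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, layers, span):
        # layers: (n_layers, seq_len, d_model) activations for one sentence
        # span: (start, end) token indices of the labelled span
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w[:, None, None] * layers).sum(dim=0)   # (seq_len, d_model)
        h = self.proj(mixed[span[0]:span[1]])            # (span_len, d_proj)
        a = torch.softmax(self.attn(h), dim=0)           # weights sum to 1
        pooled = (a * h).sum(dim=0)                      # (d_proj,)
        return self.mlp(pooled)                          # class logits

probe = EdgeProbe()
layers = torch.randn(12, 20, 768)   # stand-in for frozen BERT activations
logits = probe(layers, (5, 8))      # classify tokens 5..7 as one span
```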

The edge probing model needs to be trained on a dataset of classification instances. The probe has weights that are initialized randomly and trained using gradient descent, but the BERT weights are kept constant while the probe is being trained.

This setup was more sophisticated than it looked at first glance, and it turns out the probe itself is quite powerful. In one of the experiments, the authors found that even with randomized input, the probe was able to get over 70% of the full model’s performance!

Edge probing is not the only way of probing deep language models. Other probing experiments (Liu et al., 2019) used a simple linear classifier: this is better for measuring what information can easily be recovered from representations.

References

  1. Tenney, Ian, et al. “What do you learn from context? probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019a.
  2. Tenney, Ian, Dipanjan Das, and Ellie Pavlick. “BERT Rediscovers the Classical NLP Pipeline.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019b.
  3. Liu, Nelson F., et al. “Linguistic Knowledge and Transferability of Contextual Representations.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

What if the government hadn’t released any Coronavirus economic stimulus?

It is March 23, 2020. After a month of testing fiascos, COVID-19 is ravaging the United States, with 40,000 cases in the US and 360,000 worldwide. There is a growing sense of panic as cities begin to lock down. Market circuit breakers have triggered four times in quick succession, with the stock market losing 30% of its value in mere weeks. There’s no sign that the worst is over.

Above: Global Coronavirus stats on March 23, 2020, when the S&P 500 reached its lowest point during the pandemic (Source).

With businesses across the country closed and millions out of work, it's clear that a massive financial stimulus is needed to prevent a total economic collapse. However, Congress is divided and unable to pass a relief bill. Even when urgent action is needed, they squabble and debate over minor details, refusing to come to a compromise. The president denies that there's any need for action. Both the Democrats and the Republicans are willing to do anything to prevent the other side from scoring a victory. The government is in gridlock.

Let the businesses fail, they say. Don’t bail them out, they took the risk when the times were good, now you reap what you sow. Let them go bankrupt, punish the executives taking millions of dollars of bonuses. Let the free market do its job, after all, they can always start new businesses once this is all over.

April comes without any help from the government. As companies see their revenues drop to a fraction of normal levels, they lay off employees to preserve their cash, and massive layoffs sweep every sector of the economy. Retail and travel are hit hardest, but soon all companies are affected, since people are hesitant to spend money. Unemployment skyrockets to levels even greater than during the Great Depression.

Without a job, millions of people miss their rent payments, instead saving their money for food and essential items. Restaurants and other small businesses shut down. When people and businesses cannot pay rent, their landlords cannot pay the mortgages that they owe to the bank. A few small banks go bankrupt, and Wall Street waits anxiously for a government bailout. But unlike 2008, the state is in a deadlock, and there is no bailout coming. In 2020, no bank is too big to fail.

Each bank that goes down takes another bank down with it, until there is a cascading domino effect of bank failures. Everyone rushes to withdraw cash from their checking accounts before their bank collapses, which of course makes matters worse. Those too late to withdraw their cash lose their savings. This is devastating for businesses: even those that escaped the pandemic cannot escape systemic bank failure. Companies have no money in the bank to pay suppliers or make payroll, and thus thousands of companies go bankrupt overnight.

Across the nation, people are angry at the government’s inaction, and take to the streets in protest. Having depleted their savings, some rob and steal from grocery stores to avoid starvation. The government finally steps in and deploys the military to keep order in the cities. They arrange for emergency supplies, enough to keep everybody fed, but just barely.

The lockdown lasts a few more months, and the virus is finally under control. Everyone is free to go back to work, but the problem is there are no jobs to go back to. In the process of all the biggest corporations going bankrupt, society has lost its complex network of dependencies and organizational knowledge. It only takes a day to lay off 100,000 employees, but to build up this structure from scratch will take decades.

A new president is elected, but it is too late: the damage has been done and cannot be reversed. The economy slowly recovers, but with less efficiency than before and with workers employed in less productive roles; the loss of productivity means that everyone enjoys a lower standard of living. Five years later, the virus is long gone, but the economy is nowhere close to its original state. By then, China has emerged as the new dominant world power. The year 2020 goes down in history as a year of failure, where through inaction, a temporary health crisis led to societal collapse.


In our present timeline, fortunately, none of the above actually happened. The Democrats and Republicans put aside their differences and on March 25 swiftly passed a $2 trillion economic stimulus. The stock market immediately rebounded.

There was a period in March when it seemed the government was in gridlock, and it wasn't clear whether the US was politically capable of passing such a large stimulus bill. Was an economic collapse ever likely? Not really — no reasonable government would have allowed all of the banks to fail, so we would more likely have gotten a recession than a total collapse. Banks did fail during the Great Depression, but macroeconomic theory was in its infancy at that time, and such mistakes would not have been repeated today. Still, this is the closest we've come to an economic collapse in a long time, and it's fun to speculate about what it would have been like.

Why do polysynthetic languages all have very few speakers?

Polysynthetic languages can express complex ideas in a single word that in most languages would require a whole sentence. For example, in Inuktitut:

qangatasuukkuvimmuuriaqalaaqtunga

“I’ll have to go to the airport”

There's no widely accepted definition of a polysynthetic language. Generally, polysynthetic languages have noun incorporation (where noun arguments are expressed as affixes of a verb) and serial verb construction (where a single word contains multiple verbs). They are considered some of the most grammatically complex languages in the world.

Polysynthetic languages are most commonly found among the indigenous languages of North America. Only a few such languages have more than 100k speakers: Nahuatl (1.5m speakers), Navajo (170k speakers), and Cree (110k speakers). Most polysynthetic languages are spoken by a very small number of people and many are in danger of becoming extinct.

Why aren't there more polysynthetic languages — major national languages with millions of speakers? Is it mere coincidence that the most complex languages have few speakers? According to Wray (2007), it's not just coincidence: languages spoken within a small, close-knit community with little outside contact tend to develop grammatical complexity, while languages with lots of external contact and adult learners tend to be more simplified and regular.

It’s well known that children are better language learners than adults. L1 and L2 language acquisition processes work very differently, so that children and adults have different needs when learning a language. Adult learners prefer regularity and expressions that can be decomposed into smaller parts. Anyone who has studied a foreign language has seen tables of verb conjugations like these:

[Images: French and Korean verb conjugation tables]

For adult learners, the ideal language is predictable and has few exceptions: the number 12 would be pronounced "ten-two" rather than "twelve", and a doctor who treats your teeth would be a "tooth-doctor" rather than a "dentist". Exceptions give the adult learner difficulty, since each one has to be individually memorized. An example of a very predictable language is the constructed language Esperanto, designed to have as few exceptions as possible and to be easy to learn for native speakers of any European language.

Children learn languages differently. At the age of 12 months (the holophrastic stage), children start producing single words that can represent complex ideas. Even though they are multiple words in the adult language, the child initially treats them as a single unit:

whasat (what’s that)

gimme (give me)

Once they reach 18-24 months of age, children pick up morphology and start using multiple words at a time. Children learn whole phrases first and only later analyze them into parts on an as-needed basis; thus, they have no difficulty with opaque idioms and irregular forms. They don't really benefit from regularity either: when children learn Esperanto as a native language, they introduce irregularities, even though the language is perfectly regular.

We see evidence of this process in English. Native speakers frequently make mistakes like using “could of” instead of “could’ve“, or using “your” instead of “you’re“. This is evidence that native English speakers think of them as a single unit, and don’t naturally analyze them into their sub-components: “could+have” and “you+are“.

According to the theory, a language spoken in an isolated community, where few adults try to learn it, ends up with complex and irregular words. When lots of grown-ups try to learn the language, they struggle with the grammatical complexity and simplify it. Over time, these simplifications become a standard part of the language.

Among the world’s languages, various studies have found correlations between grammatical complexity and smaller population size, supporting this theory. However, the theory is not without its problems. As with any observational study, correlation doesn’t imply causation. The European conquest of the Americas decimated the native population, and consequently, speakers of indigenous languages have declined drastically in the last few centuries. Framing it this way, the answer to “why aren’t there more polysynthetic languages with millions of speakers” is simply: “they all died of smallpox or got culturally assimilated”.

If instead, Native Americans had sailed across the ocean and colonized Europe, would more of us be speaking polysynthetic languages now? Until we can go back in time and rewrite history, we’ll never know the answer for sure.

Further reading

  • Atkinson, Mark David. “Sociocultural determination of linguistic complexity.” (2016). PhD Thesis. Chapter 1 provides a good overview of how languages are affected by social structure.
  • Kelly, Barbara, et al. “The acquisition of polysynthetic languages.” Language and Linguistics Compass 8.2 (2014): 51-64.
  • Trudgill, Peter. Sociolinguistic typology: Social determinants of linguistic complexity. Oxford University Press, 2011.
  • Wray, Alison, and George W. Grace. “The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form.” Lingua 117.3 (2007): 543-578. This paper proposes the theory and explains it in detail.

Predictions for 2030

Now that it’s Jan 1, 2020, I’m going to make some predictions about what we will see in the next decade. By the year 2030:

  • Deep learning will be a standard tool and integrated into workflows of many professions, eg: code completion for programmers, note taking during meetings. Speech recognition will surpass human accuracy. Machine translation will still be inferior to human professionals.

  • Open-domain conversational dialogue (aka the Turing Test) will be on par with an average human, using a combination of deep learning and some new technique not available today. It will be regarded as more of a “trick” than strong AI; the bar for true AGI will be shifted higher.

  • Driverless cars will be in commercial use in a few limited scenarios. Most cars will have some autonomous features, but full autonomy will still not be widely deployed.

  • S&P 500 index (a measure of the US economy, currently at 3230) will double to between 6000 and 7000. Bitcoin will still exist, but its price will fall under 1000 USD (currently ~7000 USD).

  • Real estate prices in Toronto will either have a sharp fall or flatten out; overall increase in 2020-2030 period will not exceed inflation.

  • All western nations will have implemented some kind of carbon tax as political pressure increases from young people; no serious politician will suggest removing the carbon tax.

  • About half of my Waterloo cohort will be married at the age of 35, but the majority will not have any kids.

  • China will overtake USA as world’s biggest economy, but growth will slow down, and PPP per capita will still be well below USA.

Directionality of word class conversion

Many nouns (like google, brick, bike) can be used as verbs:

  • Let me google that for you.
  • The software update bricked my phone.
  • Bob biked to work yesterday.

Conversely, many verbs (like talk, call) can be used as nouns:

  • She gave a talk at the conference.
  • I’m on a call with my boss.

Here, we just assumed that {google, brick, bike} are primarily nouns and {talk, call} are primarily verbs — but is this justified? After all, all five of these words can be used as either a noun or a verb. Then, what’s the difference between the first group {google, brick, bike} and the second group {talk, call}?

These are examples of word class flexibility: words that can be used across multiple part-of-speech classes. In this blog post, I’ll describe some objective criteria to determine if a random word like “sleep” is primarily a noun or a verb.

Five criteria for deciding directionality

Linguists have studied the problem of deciding what is the base / dominant part-of-speech category (equivalently, deciding the directionality of conversion). Five methods are commonly listed in the literature: frequency of occurrence, attestation date, semantic range, semantic dependency, and semantic pattern (Balteiro, 2007; Bram, 2011).

  1. Frequency of occurrence: a word is noun-dominant if it occurs more often as a noun than as a verb. This is the easiest criterion to compute, since all you need is a POS-tagged corpus (see the sketch after this list). The issue is that the direction then depends on which corpus you use, and there can be big differences between genres.
  2. Attestation date: a word is noun-dominant if it was used first as a noun and only later as a verb. This works for newer words: Google (the company) existed for a while before anyone started "googling" things. But we run into problems with older words, where the direction depends on the precise dating of Middle English manuscripts. If the word goes back to Proto-Germanic or Proto-Indo-European, then finding the attestation date becomes impossible. This method is also philosophically questionable, because you shouldn't need to know the history of a language to describe its current form.
  3. Semantic range: if a dictionary lists more noun meanings than verb meanings for a word, then it's noun-dominant. This is not so reliable, because dictionaries disagree on how many senses to include and on how different two senses must be to warrant separate entries. Also, some meanings are rare or domain-specific (eg: "call option" in finance), and it doesn't seem right to count them equally.
  4. Semantic dependency: if the definition of the verb meaning refers to the noun meaning, then the word is noun-dominant. For example, “to bottle” means “to put something into a bottle”. This criterion is not always clear to decide, sometimes you can define it either way, or have neither definition refer to the other.
  5. Semantic pattern: a word is noun-dominant if it refers to an entity / object, and verb-dominant if refers to an action. A bike is something that you can touch and feel; a walk is not. Haspelmath (2012) encourages distinguishing {entity, action, property} rather than {noun, verb, adjective}. However, it’s hard to determine without subjective judgement (especially for abstract nouns like “test” or “work”), whether the entity or action sense is more primary.
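As a concrete illustration of criterion 1, here's a small Python sketch that counts noun vs. verb occurrences in NLTK's Brown corpus (assuming nltk.download('brown') and nltk.download('universal_tagset')). The word list is arbitrary, and as noted above, the verdicts may flip on a different corpus:

```python
from collections import Counter
from nltk.corpus import brown   # assumes the Brown corpus is downloaded

# Count each word's noun vs. verb occurrences in one pass over the corpus
counts = {w: Counter() for w in ['bike', 'talk', 'sleep']}
for token, tag in brown.tagged_words(tagset='universal'):
    w = token.lower()
    if w in counts and tag in ('NOUN', 'VERB'):
        counts[w][tag] += 1

for w, c in counts.items():
    label = 'noun-dominant' if c['NOUN'] >= c['VERB'] else 'verb-dominant'
    print(w, dict(c), '->', label)
```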

Comparisons using corpus methods

How do we make sense of all these competing criteria? To answer this question, Balteiro (2007) compared 231 flexible noun/verb pairs, rating each according to the five criteria listed above, as well as a few more that I didn't include. Later, Bram (2011) surveyed a larger set of 2048 pairs.

The details are quite messy, because applying the criteria is not so straightforward. Take polysemy: the word "call" has more than 40 definitions in the OED, and some of them are obsolete, so which one do you use for attestation date? How do you deal with homonyms like "bank" that have two unrelated meanings? After hundreds of pages of painstaking analysis, the researchers came to a judgement for each word. Then, they measured the agreement between each pair of criteria:

Above: Table of pairwise agreement (adapted from Table 5.2 of Bram's thesis)

There is only a moderate level of agreement between the different criteria, on average about 65% — better than random, but not too convincing either. Only frequency and attestation date agree more than 80% of the time. Only a small minority of words have all of the criteria agree.

Theoretical ramifications

This puts us in a dilemma: how do we make sense of these results? What’s the direction of conversion if these criteria don’t agree? Are some of the criteria better than others, perhaps take a majority vote? Is it even possible to determine a direction at all?

Linguists have disagreed for decades over what to do with this situation. Van Lier and Rijkhoff (2013) give a survey of the various views. Some linguists maintain that a flexible word must be either noun-dominant or verb-dominant, and is converted to the other category. Other linguists note the disagreements between criteria and propose instead that such words are underspecified. Just like a stem cell that can morph into a skin or lung cell as needed, a word like "sleep" is neither a noun nor a verb, but a pre-categorical form that can morph into either depending on context.

Can we really determine the dominant category of a conversion pair? It seems doubtful that this issue will ever be resolved. Presently, none of the theories make any scientific predictions that can be tested and falsified. Until then, the theories co-exist as different ways to view and analyze the same data.

The idea of a "dominant" category doesn't exist in nature; it is merely an artificial construct to help explain the data. In mathematics, it's nonsensical to ask whether imaginary numbers really "exist". Nobody has seen an imaginary number, but mathematicians use them because they're good for describing a lot of things. Likewise, it doesn't make sense to ask if flexible words really have a dominant category. We can only ask whether a theory that assumes the existence of a dominant category is simpler than one that does not.

References

  1. Balteiro, Isabel. The directionality of conversion in English: A dia-synchronic study. Vol. 59. Peter Lang, 2007.
  2. Bram, Barli. “Major total conversion in English: The question of directionality.” (2011). PhD Thesis.
  3. Haspelmath, Martin. “How to compare major word-classes across the world’s languages.” Theories of everything: In honor of Edward Keenan 17 (2012): 109-130.
  4. Van Lier, Eva, and Jan Rijkhoff. “Flexible word classes in linguistic typology and grammatical theory.” Flexible word classes: a typological study of underspecified parts-of-speech (2013): 1-30.

Explaining chain-shift tone sandhi in Min Nan Chinese

In my previous post on the Teochew dialect, I noted that Teochew has a complex system of tone sandhi. The last syllable of a word keeps its citation (base) form, while all preceding syllables undergo sandhi. For example:

gu5 (cow) -> gu1 nek5 (cow-meat = beef)

seng52 (play) -> seng35 iu3 hi1 (play a game)

The sandhi system is quite regular — for instance, if a word’s base tone is 52 (falling tone), then its sandhi tone will be 35 (rising tone), across many words:

toin52 (see) -> toin35 dze3 (see-book = read)

mang52 (mosquito) -> mang35 iu5 (mosquito-oil)

We can represent this relationship as an edge in a directed graph 52 -> 35. Similarly, words with base tone 5 have sandhi tone 1, so we have an edge 5 -> 1. In Teochew, the sandhi graph of the six non-checked tones looks like this:


Above: Teochew tone sandhi, Jieyang dialect, adapted from Xu (2007). For simplicity, we ignore checked tones (ending in -p, -t, -k), which have different sandhi patterns.

This type of pattern is not unique to Teochew; it exists in many dialects of Min Nan. Other dialects have different tones but a similar system. It's called right-dominant chain-shift, because the rightmost syllable of a word keeps its base tone; it's also called a "tone circle" when the graph has a cycle. Notably, the chain-shift pattern where A -> B and B -> C, yet A does not become C, is quite rare cross-linguistically, and does not occur in any Chinese dialect outside the Min family.
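Computationally, right-dominant sandhi is simple to state: look up every non-final syllable's tone in the graph, and leave the final syllable alone. Here's a toy Python sketch with only the two Teochew edges attested above filled in (the full Jieyang graph covers all six non-checked tones):

```python
# Base tone -> sandhi tone; only the edges shown above are included
SANDHI = {52: 35, 5: 1}

def apply_sandhi(word):
    """word: list of (syllable, citation_tone) pairs."""
    out = []
    for i, (syl, tone) in enumerate(word):
        if i < len(word) - 1:                 # non-final syllables change
            tone = SANDHI.get(tone, tone)
        out.append(f"{syl}{tone}")            # final syllable keeps its tone
    return " ".join(out)

print(apply_sandhi([("gu", 5), ("nek", 5)]))               # gu1 nek5
print(apply_sandhi([("seng", 52), ("iu", 3), ("hi", 1)]))  # seng35 iu3 hi1
```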

Is there any explanation for this unusual tone sandhi system? In this blog post, I give an overview of some attempts at an explanation from theoretical phonology and historical linguistics.

Xiamen tone circle and Optimality Theory

The Xiamen / Amoy dialect is perhaps the most studied variety of Min Nan. Its sandhi system looks like this:

[Image: Xiamen tone sandhi circle]

Barrie (2006) and Thomas (2008) attempt to explain this system with Optimality Theory (OT). In modern theoretical phonology, OT is a framework that describes how underlying phonemes are mapped to output phonemes, not with rules, but with a set of constraints. The constraints dictate what kinds of patterns are considered "bad" in the language, but some violations are worse than others, so the constraints are ranked in a hierarchy. The output is then the candidate that is "least bad" according to the ranking.

To explain the Xiamen tone circle sandhi, Thomas begins by introducing the following OT constraints:

  • *RISE: incur a penalty for every sandhi tone that has a rising contour.
  • *MERGE: incur a penalty when two citation tones are mapped to the same sandhi tone.
  • DIFFER: incur a penalty when a base tone is mapped to itself as a sandhi tone.

Without any constraints, there are 5^5 = 3125 possible sandhi systems in a 5-tone language. With these constraints, most of the hypothetical systems are eliminated — for example, the null system (where every tone is mapped to itself) incurs 5 violations of the DIFFER constraint.

These 3 rules aren’t quite enough to fully explain the Xiamen tone system: there are still 84 hypothetical systems that are equally good as the actual system. With the aid of a Perl script, Thomas then introduces more rules until only one system (the actual observed one) emerges as the best under the constraints.
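To get a feel for this search, here's a toy Python sketch in the spirit of Thomas's Perl script. It enumerates all 3125 candidate systems and scores them on *MERGE and DIFFER only; summing violations instead of ranking constraints is a simplification of real OT, and *RISE is omitted because it needs the tones' phonetic shapes:

```python
from itertools import product

TONES = [1, 2, 3, 4, 5]          # abstract labels for a 5-tone language

def violations(mapping):
    """Count *MERGE and DIFFER violations for one candidate system."""
    outputs = [mapping[t] for t in TONES]
    merge = len(outputs) - len(set(outputs))            # *MERGE: collisions
    differ = sum(1 for t in TONES if mapping[t] == t)   # DIFFER: identity maps
    return merge + differ

best_score, best_systems = None, []
for outputs in product(TONES, repeat=len(TONES)):       # all 5^5 = 3125 systems
    mapping = dict(zip(TONES, outputs))
    score = violations(mapping)
    if best_score is None or score < best_score:
        best_score, best_systems = score, [mapping]
    elif score == best_score:
        best_systems.append(mapping)

# Prints 0 and 44: many tied systems remain, so these two constraints
# alone underdetermine the winner, just as in the Xiamen analysis.
print(best_score, len(best_systems))
```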

Problems with the OT explanation

There are several reasons why I didn’t find this explanation very satisfying. First, it’s not falsifiable: if your constraints don’t generate the right result, you can keep adding more and more constraints, and tweak the ranking, until they produce the result you want.

Second, the constraints are very arbitrary and lack any cognitive-linguistic motivation. You can explain the *MERGE constraint as trying to preserve contrasts, which makes sense from an information-theoretic point of view, but what about DIFFER? It's unclear why a base tone shouldn't be allowed to keep its own value as its sandhi tone, especially since many languages (like Cantonese) manage fine with no sandhi at all.

Even Teochew, which is closely related to the Xiamen dialect, violates all three constraints. I'm not aware of any OT analysis of Teochew sandhi, and it would be interesting to see one, but it would surely require a very different set of constraints from the Xiamen system.

Nevertheless, OT has been an extremely successful framework in modern phonology. In some cases, OT can describe a pattern very cleanly, where you’d need very complicated rules to describe them. In that case, the set of OT constraints would be a good explanation for the pattern.

Also, if the same constraint shows up in a lot of languages, that increases the credibility that it's a true cross-language tendency, rather than just a made-up rule to explain the data. For example, if the *RISE constraint shows up in OT grammars for many languages, then you could claim that there's a general tendency for languages to prefer falling tones over rising tones.

Evidence from Middle Chinese

Chen (2000) gives a different perspective. Essentially, he claims that it’s impossible to make sense of the data in any particular modern-day dialect. Instead, we should compare multiple dialects together in the context of historical sound changes.

The evidence he gives is from the Zhangzhou dialect, located about 40km inland from Xiamen. The Zhangzhou dialect has a similar tone circle as Xiamen, but with different values!

[Image: Zhangzhou tone sandhi circle]

It’s not obvious how the two systems are related, until you consider the mapping to Middle Chinese tone categories:

[Image: the Xiamen and Zhangzhou tone circles labelled with Middle Chinese tone categories]

The roman numerals I, II, III denote tones of Middle Chinese, spoken during ~600AD. Middle Chinese had four tones, but none of the present day Chinese dialects retain this system, after centuries of tone splits and merges. In many dialects, a Middle Chinese tone splits into two tones depending on whether the initial is voiced or voiceless. When comparing tones from different dialects, it’s often useful to refer to historical tone categories like “IIIa”, which roughly means “syllables that were tone III in Middle Chinese and the initial consonant is voiceless”.

It’s unlikely that both Xiamen and Zhangzhou coincidentally developed sandhi patterns that map to the same Middle Chinese tone categories. It’s far more likely that the tone circle developed in a common ancestral language, then their phonetic values diverged afterwards in the respective present-day dialects.

That still leaves open the question of: how exactly did the tone circle develop in the first place? It’s likely that we’ll never know for sure: the details are lost to time, and the processes driving historical tone change are not very well understood.

In summary, theoretical phonology and historical linguistics offer complementary insights that explain the chain-shift sandhi patterns in Min Nan languages. Optimality Theory proposes tendencies for languages to prefer certain structures over others. This partially explains the pattern; a lot of it is simply due to historical accident.

References

  1. Barrie, Michael. “Tone circles and contrast preservation.” Linguistic Inquiry 37.1 (2006): 131-141.
  2. Chen, Matthew Y. Tone sandhi: Patterns across Chinese dialects. Vol. 92. Cambridge University Press, 2000. Pages 38-49.
  3. Thomas, Guillaume. “An analysis of Xiamen tone circle.” Proceedings of the 27th West Coast Conference on Formal Linguistics. Cascadilla Proceedings Project, Somerville, MA. 2008.
  4. Xu, Hui Ling. “Aspect of Chaozhou grammar: a synchronic description of the Jieyang variety.” (2007).

Non-technical challenges of medical NLP research

Machine learning has recently made a lot of headlines in healthcare applications, like identifying tumors from images, or technology for personalized treatment. In this post, I describe my experiences as a healthcare ML researcher: the difficulties in doing research in this field, as well as reasons for optimism.

My research group focuses on applications of NLP to healthcare. For a year or two, I was involved in a number of projects in this area (specifically, detecting dementia through speech). From my own projects and from talking to others in my research group, I noticed a few recurring difficulties in healthcare NLP research — things that rarely occurred in other branches of ML. These are non-technical challenges that take up time and impede progress, and are generally considered not very interesting to solve. I'll give some examples of what I mean.

Collecting datasets is hard. Any time you want to do anything involving patient data, you have to undergo a lengthy ethics approval process. Even with something as innocent as an anonymous online questionnaire, there is a mandatory review by an ethics board before the experiment is allowed to proceed. As a result, most datasets in healthcare ML are small: a few dozen patient samples is common, and you’re lucky to have more than a hundred samples to work with. This is tiny compared to other areas of ML where you can easily find thousands of samples.

In my master's research project on dementia detection from speech, the largest available corpus had about 300 patients, and other corpora had fewer than 100. This constrained the types of experiments that were possible. Prior work in this area mostly used feature engineering approaches, because it was commonly believed that you needed at least a few thousand examples to do deep learning; with less data, deep learning would just overfit.

Even after the data has been collected, it is difficult to share with others. This is again due to the conservative ethics processes required to share data. Data transfer agreements need to be reviewed and signed, and in some cases, data must remain physically on servers in a particular hospital. Researchers rarely open-source their code along with the paper, since there's no point in doing so without giving access to the data; this makes it hard to reproduce any experimental results.

Medical data is messy. Data access issues aside, healthcare NLP has some of the messiest datasets in machine learning. Many datasets in ML are carefully constructed and annotated for the purpose of research, but medical data is not: it comes from real patients and hospitals, and is full of doctors' shorthand and abbreviations of medical terms, which mean different things depending on context. Unsurprisingly, many NLP techniques fail to work. Missing values and otherwise unreliable data are common, so a lot of not-so-glamorous data preprocessing is often needed.


I’ve so far painted a bleak picture of medical NLP, but I don’t want to give off such a negative image of my field. In the second part of this post, I give some counter-arguments to the above points as well as some of the positive aspects of research.

On difficulties in data access. There are good reasons for caution — patient data is sensitive and real people can be harmed if the data falls into the wrong hands. Even after removing personally identifiable information, there’s still a risk of a malicious actor deanonymizing the data and extracting information that’s not intended to be made public.

The situation is improving though. The community recognizes the need to share clinical data, to strike a balance between protecting patient privacy and allowing research. There have been efforts like the relatively open MIMIC critical care database to promote more collaborative research.

On small / messy datasets. With every challenge, there comes an opportunity. In fact, my own master’s research was driven by lack of data. I was trying to extend dementia detection to Chinese, but there wasn’t much data available. So I proposed a way to transfer knowledge from the much larger English dataset to Chinese, and got a conference paper and a master’s thesis from it. If it wasn’t for lack of data, then you could’ve just taken the existing algorithm and applied it to Chinese, which wouldn’t be as interesting.

Also, deep learning in NLP has recently gotten a lot better at learning from small datasets. Other research groups have had some success on the same dementia detection task using deep learning. With new papers every week on few-shot learning, one-shot learning, transfer learning, etc, small datasets may not be too much of a limitation.

The same applies to messy data, missing values, label leakage, etc. I'll refer to this survey paper for the details, but the takeaway is that these shouldn't be thought of as barriers, but as opportunities to make a research contribution.

In summary, as a healthcare NLP researcher, you have to deal with difficulties that other machine learning researchers don’t have. However, you also have the unique opportunity to use your abilities to help sick and vulnerable people. For many people, this is an important consideration — if this is something you care deeply about, then maybe medical NLP research is right for you.

Thanks to Elaine Y. and Chloe P. for their comments on drafts of this post.

Learning the Teochew (Chaozhou) Dialect

Lately I’ve been learning my girlfriend’s dialect of Chinese, called the Teochew dialect.  Teochew is spoken in the eastern part of the Guangdong province by about 15 million people, including the cities of Chaozhou, Shantou, and Jieyang. It is part of the Min Nan (闽南) branch of Chinese languages.


Above: Map of major dialect groups of Chinese, with Teochew circled. Teochew is part of the Min branch of Chinese. Source: Wikipedia.

Although the different varieties of Chinese are usually referred to as "dialects", linguists consider them different languages, as they are not mutually intelligible. Teochew is not intelligible to either Mandarin or Cantonese speakers. Teochew and Mandarin diverged about 2000 years ago, so today they are about as similar as French is to Portuguese. Interestingly, linguists claim that Teochew is one of the most conservative Chinese dialects, preserving many archaic words and features from Old Chinese.

Above: Sample of Teochew speech from entrepreneur Li Ka-shing.

Since I like learning languages, I naturally started learning my girlfriend's native tongue soon after we started dating. It helped that I spoke Mandarin, but Teochew is not close enough to simply pick up by osmosis; it still requires deliberate study. Compared to other languages I've learned, Teochew is challenging because very few people learn it as a foreign language, so there are few language-learning resources for it.

Writing System

The first hurdle is that Teochew is primarily spoken, not written, and does not have a standard writing system. This is the case with most Chinese dialects. Almost all Teochews are bilingual in Standard Chinese, which they are taught in school to read and write.

Sometimes people try to write Teochew using Chinese characters by finding the equivalent Standard Chinese cognates, but there are many dialectal words which don’t have any Mandarin equivalent. In these cases, you can invent new characters or substitute similar sounding characters, but there’s no standard way of doing this.

Still, I needed a way to write Teochew, to take notes on new vocabulary and grammar. At first, I used IPA, but as I became more familiar with the language, I devised my own romanization system that captured the sound differences.

Cognates with Mandarin

Note (Jul 2020): People in the comments have pointed out that some of these examples are incorrect. I’ll keep this section the way it is because I think the high-level point still stands, but these are not great examples.

Knowing Mandarin was very helpful for learning Teochew, since there are lots of cognates. Some cognates are obviously recognizable:

  • Teochew: kai shim, happy. Cognate to Mandarin: kai xin, 开心.
  • Teochew: ing ui, because. Cognate to Mandarin: ying wei, 因为

Some words have cognates in Mandarin, but mean something slightly different, or aren’t commonly used:

  • Teochew: ou, black. Cognate to Mandarin: wu, 乌 (dark). The usual Mandarin word is hei, 黑 (black).
  • Teochew: dze, book. Cognate to Mandarin: ce, 册 (booklet). The usual Mandarin word is shu, 书 (book).

Sometimes, a word has a cognate in Mandarin but sounds quite different due to centuries of sound change:

  • Teochew: hak hau, school. Cognate to Mandarin: xue xiao, 学校.
  • Teochew: de, pig. Cognate to Mandarin: zhu, 猪.
  • Teochew: dung, center. Cognate to Mandarin: zhong, 中.

In the last two examples, we see a fairly common sound change, where a dental stop initial (d- and t-) in Teochew corresponds to an affricate (zh- or ch-) in Mandarin. It’s not usually enough to guess the word, but serves as a useful memory aid.

Finally, a lot of dialectal Teochew words (I’d estimate about 30%) don’t have any recognizable cognate in Mandarin. Examples:

  • da bo: man
  • no gya: child
  • ge lai: home

Grammatical Differences

Generally, I found Teochew grammar to be fairly similar to Mandarin, with only minor differences. Most grammatical constructions can transfer cognate by cognate and still make sense in the other language.

One significant difference in Teochew is its many fused negation markers, where the initial b- or m- joins with a final to form a single negative syllable. Some examples:

  • bo: not have
  • boi: will not
  • bue: not yet
  • mm: not
  • mai: not want
  • ming: not have to

Phonology and Tone Sandhi

The sound structure of Teochew is not too different from Mandarin, and I didn't find it difficult to pronounce. The biggest difference is that syllables may end with a stop (-p, -t, -k) or the nasal -m, whereas Mandarin syllables can only end with a vowel or the nasals -n and -ng. The characteristic of a Teochew accent in Mandarin is replacing /f/ with /h/, and indeed there is no /f/ sound in Teochew.

The hardest part of learning Teochew for me were the tones. Teochew has either six or eight tones depending on how you count them, which isn’t difficult to produce in isolation. However, Teochew has a complex system of tone sandhi rules, where the tone of each syllable changes depending on the tone of the following syllable. Mandarin has tone sandhi to some extent (for example, the third tone sandhi rule where nǐ + hǎo is pronounced níhǎo rather than nǐhǎo). But Teochew takes this to a whole new level, where nearly every syllable undergoes contextual tone change.

Some examples (the numbers are Chao tone numerals, with 1 meaning lowest and 5 meaning highest tone):

  • gu5: cow
  • gu1 nek5: beef

Another example, where a falling tone changes to a rising tone:

  • seng52: to play
  • seng35 iu3 hi1: to play a game

There are tables of tone sandhi rules describing in detail how each tone gets converted to what other tone, but this process is not entirely regular and there are exceptions. As a result, I frequently get the tone wrong by mistake.

Update: In this blog post, I explore Teochew tone sandhi in more detail.

Resources for Learning Teochew

Teochew is seldom studied as a foreign language, so there aren’t many language learning resources for it. Even dictionaries are hard to find. One helpful dictionary is Wiktionary, which has the Teochew pronunciation for most Chinese characters.

Also helpful were formal linguistic grammars:

  1. Xu, Huiling. “Aspects of Chaoshan grammar: A synchronic description of the Jieyang dialect.” Monograph Series Journal of Chinese Linguistics 22 (2007).
  2. Yeo, Pamela Yu Hui. “A sketch grammar of Singapore Teochew.” (2011).

The first is a massively detailed, 300-page description of Teochew grammar, while the second is a shorter grammar sketch on a similar variety spoken in Singapore. They require some linguistics background to read. Of course, the best resource is my girlfriend, a native speaker of Teochew.

Visiting the Chaoshan Region

After practicing my Teochew for a few months with my girlfriend, we paid a visit to her hometown and relatives in the Chaoshan region. More specifically, Raoping County located on the border between Guangdong and Fujian provinces.

 

Left: Chaoshan railway station, China. Right: Me learning the Gongfu tea ceremony, an essential aspect of Teochew culture.

Teochew people are traditional and family oriented, very much unlike the individualistic Western values that I’m used to. In Raoping and Guangzhou, we attended large family gatherings in the afternoon, chatting and gossiping while drinking tea. Although they are still Han Chinese, the Teochew consider themselves a distinct subgroup within Chinese, with their unique culture and language. The Teochew are especially proud of their language, which they consider to be extremely hard for outsiders to learn. Essentially, speaking Teochew is what separates “ga gi nang” (roughly translated as “our people”) from the countless other Chinese.

My Teochew is not great. Sometimes I struggle to get the tones right and make myself understood. But at a large family gathering, a relative asked me why I was learning Teochew, and I was able to reply, albeit with a Mandarin accent: “I want to learn Teochew so that I can be part of your family”.


Above: Me, Elaine, and her grandfather, on a quiet early morning excursion to visit the sea. Raoping County, Guangdong Province, China.

Thanks to my girlfriend Elaine Ye for helping me write this post. Elaine is fluent in Teochew, Mandarin, Cantonese, and English.

Clustering Autoencoders: Comparing DEC and DCN

Deep autoencoders are a good way to learn representations and structure from unlabelled data. There are many variations, but the main idea is simple: the network consists of an encoder, which converts the input into a low-dimensional latent vector, and a decoder, which reconstructs the original input. Then, the latent vector captures the most essential information in the input.


Above: Diagram of a simple autoencoder (Source)

One of the uses of autoencoders is to discover clusters of similar instances in an unlabelled dataset. In this post, we examine some ways of clustering with autoencoders. That is, we are given a dataset and K, the number of clusters, and need to find a low-dimensional representation that contains K clusters.

Problem with Naive Method

A naive and obvious solution is to take the autoencoder and run K-means on the latent points generated by the encoder. The problem is that the autoencoder is trained only to reconstruct the input, with no constraints on the latent representation, and this may not produce a representation suitable for K-means clustering.


Above: Failure example with naive autoencoder clustering — K-means fails to find the appropriate clusters

Above is an example from one of my projects. The left diagram shows the hidden representation, and the four classes are generally well-separated. This representation is reasonable and the reconstruction error is low. However, when we run K-means (right), it fails spectacularly because the two latent dimensions are highly correlated.
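In code, the naive pipeline is straightforward, which is part of its appeal. Here's a minimal self-contained sketch in PyTorch and scikit-learn; the toy data and layer sizes are arbitrary:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

X = torch.randn(1000, 20)    # toy data: 1000 points in 20 dimensions

encoder = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 2))
decoder = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 20))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

# Step 1: train the autoencoder on reconstruction loss alone
for _ in range(500):
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).mean()
    loss.backward()
    opt.step()

# Step 2: run K-means on the latent codes and hope they separate
Z = encoder(X).detach().numpy()
labels = KMeans(n_clusters=4, n_init=10).fit_predict(Z)
```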

Thus, our autoencoder can’t trivially be used for clustering. Fortunately, there’s been some research in clustering autoencoders; in this post, we study two main approaches: Deep Embedded Clustering (DEC), and Deep Clustering Network (DCN).

DEC: Deep Embedded Clustering

DEC was proposed by Xie et al. (2016), perhaps the first model to use deep autoencoders for clustering. The training consists of two stages. In the first stage, we initialize the autoencoder by training it the usual way, without clustering. In the second stage, we throw away the decoder, and refine the encoder to produce better clusters with a “cluster hardening” procedure.


Above: Diagram of DEC model (Xie et al., 2016)

Let's examine the second stage in more detail. After training the autoencoder, we run K-means on the hidden layer to get the initial centroids \{\mu_j\}_{j=1}^K. The assumption is that the initial cluster assignments are mostly correct, but we can still refine them to be more distinct and separated.

First, we soft-assign each latent point z_i to the cluster centroids \{\mu_j\}_{j=1}^K using the Student's t-distribution as a kernel:

q_{ij} = \frac{(1 + ||z_i - \mu_j||^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j'} (1 + ||z_i - \mu_{j'}||^2 / \alpha)^{-\frac{\alpha+1}{2}}}

In the paper, they fix \alpha=1 (the degrees of freedom), so the above can be simplified to:

q_{ij} = \frac{(1 + ||z_i - \mu_j||^2)^{-1}}{\sum_{j'} (1 + ||z_i - \mu_{j'}||^2)^{-1}}

Next, we define an auxiliary distribution P by:

p_{ij} = \frac{q_{ij}^2/f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}

where f_j = \sum_i q_{ij} is the soft cluster frequency of cluster j. Intuitively, squaring q_{ij} draws the probability distribution closer to the centroids.


Above: The auxiliary distribution P is derived from Q, but more concentrated around the centroids

Finally, we define the objective to minimize as the KL divergence between the soft assignment distribution Q and the auxiliary distribution P:

L = KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

Using standard backpropagation and stochastic gradient descent, we can train the encoder to produce latent points z_i to minimize the KL divergence L. We repeat this until the cluster assignments are stable.
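Putting the formulas together, the cluster-hardening objective is only a few lines of PyTorch. This is my sketch assembled from the equations above, not the authors' code; in particular, the paper recomputes P only periodically, which the detach() crudely stands in for:

```python
import torch

def dec_loss(z, mu, alpha=1.0):
    """DEC cluster-hardening loss. z: (n, d) latent points; mu: (K, d) centroids."""
    dist2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, K)

    # Soft assignments q_ij under a Student's t kernel
    q = (1 + dist2 / alpha) ** (-(alpha + 1) / 2)
    q = q / q.sum(dim=1, keepdim=True)

    # Auxiliary target p_ij, treated as a fixed target (hence detach)
    f = q.sum(dim=0)                        # soft cluster frequencies f_j
    p = q ** 2 / f
    p = (p / p.sum(dim=1, keepdim=True)).detach()

    return (p * (p / q).log()).sum()        # KL(P || Q)
```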

DCN: Deep Clustering Network

DCN was proposed by Yang et al. (2017) at around the same time as DEC. Similar to DEC, it initializes the network by training the autoencoder to only reconstruct the input, and initialize K-means on the hidden representations. But unlike DEC, it then alternates between training the network and improving the clusters, using a joint loss function.


Above: Diagram of DCN model (Yang et al., 2017)

We define the optimization objective as a combination of reconstruction error (first term below) and clustering error (second term below). There’s a hyperparameter \lambda to balance the two terms:

L = \sum_{i=1}^{N} \left( \left\| x_i - g(f(x_i)) \right\|^2 + \frac{\lambda}{2} \left\| f(x_i) - M s_i \right\|^2 \right)

Here f is the encoder, g is the decoder, M is the matrix of cluster centroids, and s_i is the one-hot cluster assignment of point i (this is the least-squares form of the paper's objective).

This function is complicated and difficult to optimize directly. Instead, we alternate between fixing the clusters while updating the network parameters, and fixing the network while updating the clusters. When we fix the clusters (centroid locations and point assignments), then the gradient of L with respect to the network parameters can be computed with backpropagation.

Next, when we fix the network parameters, we can update the cluster assignments and centroid locations. The paper uses a rolling average trick to update the centroids in an online manner, but I won’t go into the details here. The algorithm as presented in the paper looks like this:

[Algorithm: DCN training procedure, from Yang et al. (2017)]
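In code, one alternating step might look like the following sketch, based on my reading of the paper; note that the paper's online rolling-average centroid update is replaced here with a simple per-batch average:

```python
import torch

def dcn_step(x, encoder, decoder, M, s, opt, lam=1.0):
    """One alternating DCN step. M: (K, d) centroids; s: (n,) assignments."""
    # (a) Fix the clusters; update network parameters on the joint loss
    opt.zero_grad()
    z = encoder(x)
    recon = ((decoder(z) - x) ** 2).sum(dim=1).mean()
    cluster = ((z - M[s]) ** 2).sum(dim=1).mean()
    (recon + (lam / 2) * cluster).backward()
    opt.step()

    # (b) Fix the network; update assignments and centroids
    with torch.no_grad():
        z = encoder(x)
        dist2 = ((z[:, None, :] - M[None, :, :]) ** 2).sum(-1)
        s = dist2.argmin(dim=1)               # nearest-centroid assignment
        for k in range(M.shape[0]):           # batch average in place of the
            if (s == k).any():                # paper's rolling update
                M[k] = z[s == k].mean(dim=0)
    return M, s
```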

Comparisons and Further Reading

To recap, DEC and DCN are both models to perform unsupervised clustering using deep autoencoders. When evaluated on MNIST clustering, their accuracy scores are comparable. For both models, the scores depend a lot on initialization and hyperparameters, so it’s hard to say which is better.

One theoretical disadvantage of DEC is that in the cluster refinement phase, there is no longer any reconstruction loss to force the representation to remain reasonable. So the theoretical global optimum can be achieved trivially by mapping every input to the zero vector, but this does not happen in practice when using SGD for optimization.

Recently, there have been lots of innovations in deep learning for clustering, which I won’t be covering in this post; the review papers by Min et al. (2018) and Aljalbout et al. (2018) provide a good overview of the topic. Still, DEC and DCN are strong baselines for the clustering task, which newer models are compared against.

References

  1. Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International Conference on Machine Learning. 2016.
  2. Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.
  3. Min, Erxue, et al. “A survey of clustering with deep learning: From the perspective of network architecture.” IEEE Access 6 (2018): 39501-39514.
  4. Aljalbout, Elie, et al. “Clustering with deep learning: Taxonomy and new methods.” arXiv preprint arXiv:1801.07648 (2018).