How can deep learning be combined with theoretical linguistics?

Natural language processing is mostly done with deep learning and neural networks nowadays. In a typical NLP paper, you might see Transformers, RNNs, and plenty of linear algebra and statistics, but very little linguistic theory. Is linguistics irrelevant to NLP now, or can the two fields still contribute to each other?

In a series of articles in the journal Language, Joe Pater discussed the shared history of neural networks and generative linguistics, and invited experts to give their perspectives on how the two may be combined going forward. I found their discussion very interesting, although a bit long (almost 100 pages), so in this blog post I will give a brief summary of it.

Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion

Research in generative syntax and neural networks began around the same time, in 1957, and both were broadly considered part of AI, but the two schools mostly stayed separate, at least for a few decades. In neural network research, Rosenblatt proposed the perceptron learning algorithm and realized that hidden layers were needed to learn XOR, but didn’t know of a procedure to train multi-layer networks (backpropagation hadn’t been invented yet). In generative grammar, Chomsky studied natural language with the tools of formal language theory and proposed controversial transformational rules. Interestingly, both schools faced challenges concerning the learnability of their systems.

Above: Frank Rosenblatt and Noam Chomsky, two pioneers of neural networks and generative grammar, respectively.

The first time these two schools were combined was in 1986, when a neural network was used to learn a model of the English past tense. This shows that neural networks and generative grammar are not incompatible, and the dichotomy is a false one. Another point of contact is Harmonic Grammar in theoretical phonology, which extends OT with numerically weighted constraints; the procedure for learning the weights is similar to gradient descent.

Neural models have proved capable of learning a remarkable amount of syntax, despite having far weaker structural priors than Chomsky’s model of Universal Grammar. At the same time, they still fail on certain complex examples, so maybe it’s time to add back some linguistic structure.

Linzen’s Response

Linguistics and DL can be combined in two ways. First, linguistics is useful for constructing minimal pairs for evaluating neural models, when such examples are hard to find in natural corpora. Second, neural models can be trained quickly on data, so they’re useful for testing learnability. By comparing human language acquisition data against various neural architectures, we can gain insights into how human language acquisition works. (But I’m not sure how such a deduction would logically work.)

Potts’s Response

Formal semantics has not had much contact with DL, since formal semantics is built around higher-order logic, while deep learning is built around matrices of numbers. Socher did some work on representing tree-based semantic composition as operations on vectors.

Above: Formal semantics uses higher-order logic to build representations of meaning. Is this compatible with deep learning?

In several ways, semanticists make different assumptions from deep learning researchers. Semantics likes to distinguish meaning from use, considering compositional meaning separately from pragmatics and context, whereas DL cares most of all about generalization and has no reason to discard context or to separate semantics from pragmatics. Compositional semantics does not try to analyze the meaning of lexical items, leaving them as atoms; DL represents them as word vectors, but linguists object that the individual dimensions of a word vector are not easily interpretable.

Rawski and Heinz’s Response

Above: Natural languages exhibit features that span various levels of the Chomsky hierarchy.

The “no free lunch” theorem in machine learning says that you can’t get better performance for free: any gains on some problems must be paid for by worse performance on others. A model performs well if it has an inductive bias well-suited to the type of problems it is applied to. This is true for neural networks as well, so we need to study the inductive biases of neural networks: which classes of languages in the Chomsky hierarchy are NNs capable of learning? We must not confuse ignorance of bias with absence of bias.

Berent and Marcus’s Response

There are significant differences between how generative syntax and neural networks view language, and these must be resolved before the fields can make progress on integration. The biggest difference is the “algebraic hypothesis” — the assumption that there exist abstract algebraic categories like Noun, distinct from their instances. This allows you to make powerful generalizations using rules that operate on abstract categories. Neural models, on the other hand, try to process language without structural representations, and this results in failures to generalize.

Dunbar’s Response

The central problem in connecting neural networks to generative grammar is the implementational mapping problem: how do you decide if a neural network N is implementing a linguistic theory T? The physical system might not look anything like the abstract theory, eg: implementing addition can look like squiggles on a piece of paper. Some limited classes of NNs may be mapped to harmonic grammar, but most NNs cannot, and the success criterion is unclear right now. Future work should study this problem.

Pearl’s Response

Neural networks learn language but don’t really try to model human neural processes. This could be an advantage, as neural models might find generalizations and building blocks that a human would never have thought of, and new tools in interpretability can help us discover these building blocks contained within the model.

Representation Learning for Discovering Phonemic Tone Contours

My paper titled “Representation Learning for Discovering Phonemic Tone Contours” was recently presented at the SIGMORPHON workshop, held concurrently with ACL 2020. This is joint work with Jing Yi Xie and Frank Rudzicz.

Problem: Can an algorithm learn the shapes of phonemic tones in a tonal language, given a list of spoken words?

Answer: We train a convolutional autoencoder to learn a representation for each contour, then use the mean shift algorithm to find clusters in the latent space.
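To make the pipeline concrete, here is a minimal, self-contained sketch of the clustering stage (not the actual trained model from the paper): random stand-in latent vectors take the place of the encoder’s output, mean shift groups them into tone-like clusters, and the clusters are compared against ground-truth tone labels.

```python
# A toy sketch of the clustering stage. In the real pipeline, `latents` would be the
# convolutional autoencoder's encoding of each pitch contour, and each cluster center
# would be passed through the decoder to produce a prototypical contour.
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Stand-in data: 200 syllables with 4 underlying tones and toy 8-dimensional latents.
true_tones = rng.integers(0, 4, size=200)
latents = rng.normal(size=(200, 8)) + 3.0 * true_tones[:, None]

ms = MeanShift()                  # mean shift chooses the number of clusters itself
labels = ms.fit_predict(latents)
print("clusters found:", len(ms.cluster_centers_))

# Evaluation: mutual information between the discovered clusters and the true tones.
print("adjusted mutual information:", adjusted_mutual_info_score(true_tones, labels))
```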


By feeding the center of each cluster into the decoder, we produce a prototypical contour that represents each cluster, for both Mandarin and Cantonese.


We evaluate on mutual information with the ground truth tones, and the method is partially successful, but contextual effects and allophonic variation present considerable difficulties.

For the full details, read my paper here!

I didn’t break the bed, the bed broke: Exploring semantic roles with VerbNet / FrameNet

Some time ago, my bed fell apart, and I entered into a dispute with my landlord. “You broke the bed,” he insisted, “so you will have to pay for a new one.”

Being a poor grad student, I wasn’t about to let him have his way. “No, I didn’t break the bed,” I replied. “The bed broke.”


Above: My sad and broken bed. Did it break, or did I break it?

What am I implying here? It’s interesting how this argument relies on a crucial semantic difference between the two sentences:

  1. I broke the bed
  2. The bed broke

The difference is that (1) means I caused the bed to break (eg: by jumping on it), whereas (2) means the bed broke by itself (eg: through normal wear and tear).

This is intuitive to a native speaker, but it’s not so obvious why. One might guess from this example that an intransitive verb, when used transitively (“X VERBed Y”), always means “X caused Y to VERB“. But this is not the case: consider the following pair of sentences:

  1. I attacked the bear
  2. The bear attacked

Even though the syntax is identical to the previous example, the semantic structure is quite different. Unlike in the bed example, sentence (1) cannot possibly mean “I caused the bear to attack”. In (1), the bear is the one being attacked, while in (2), the bear is the one attacking something.

Above: Semantic roles for the verbs “break” and “attack”.

Sentences which are very similar syntactically can have different structures semantically. To address this, linguists assign semantic roles to the arguments of verbs. There are many semantic roles (and nobody agrees on a precise list of them), but two of the most fundamental ones are Agent and Patient.

  • Agent: entity that intentionally performs an action.
  • Patient: entity that changes state as a result of an action.
  • Many more.

The way that a verb’s syntactic arguments (eg: Subject and Object) line up with its semantic arguments (eg: Agent and Patient) is called the verb’s argument structure. Note that an agent is not simply the subject of a verb: for example, in “the bed broke“, the bed is syntactically a subject but is semantically a patient, not an agent.
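As a toy illustration (my own, not drawn from any linguistic resource), we can write argument structure down as a mapping from syntactic positions to semantic roles:

```python
# Argument structure as a mapping from syntactic positions to semantic roles.
argument_structure = {
    ("break", "transitive"):    {"subject": "Agent", "object": "Patient"},  # I broke the bed
    ("break", "intransitive"):  {"subject": "Patient"},                     # The bed broke
    ("attack", "transitive"):   {"subject": "Agent", "object": "Patient"},  # I attacked the bear
    ("attack", "intransitive"): {"subject": "Agent"},                       # The bear attacked
}

# The same surface syntax maps the subject to different roles depending on the verb.
print(argument_structure[("break", "intransitive")]["subject"])   # Patient
print(argument_structure[("attack", "intransitive")]["subject"])  # Agent
```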

Computational linguists have created several corpora to make this information accessible to computers. Two of these corpora are VerbNet and FrameNet. Let’s see how a computer would be able to understand “I didn’t break the bed; the bed broke” using these corpora.


Above: Excerpt from VerbNet entry for the verb “break”.

VerbNet is a database of verbs, containing syntactic patterns where the verb can be used. Each entry contains a mapping from syntactic positions to semantic roles, and restrictions on the arguments. The first entry for “break” has the transitive form: “Tony broke the window“.

Looking at the semantics, you can conclude that: (1) the agent “Tony” must have caused the breaking event, (2) something must have made contact with the window during this event, (3) the window must have its material integrity degraded as a result, and (4) the window must be a physical object. In the intransitive usage, the semantics is simpler: there is no agent that caused the event, and no instrument that made contact during the event.

The word “break” can take arguments in other ways, not just transitive and intransitive. VerbNet lists 10 different patterns for this word, such as “Tony broke the piggy bank open with a hammer“. This sentence contains a result (open), and also an instrument (a hammer). The entry for “break” also groups together a list of words like “fracture”, “rip”, “shatter”, etc, that have similar semantic patterns as “break”.
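If you want to play with VerbNet yourself, NLTK bundles a copy of it with a corpus reader. Here is a rough sketch of looking up “break” (the exact class IDs and members depend on the VerbNet version shipped with NLTK, so treat the outputs as illustrative):

```python
# Querying VerbNet through NLTK's corpus reader; vnclass() returns the raw XML element,
# so members and thematic roles can be pulled out of it directly.
import nltk
nltk.download("verbnet", quiet=True)
from nltk.corpus import verbnet as vn

for cid in vn.classids("break"):                 # e.g. a class ID like 'break-45.1'
    vnclass = vn.vnclass(cid)
    members = [m.get("name") for m in vnclass.findall("MEMBERS/MEMBER")]
    roles = [t.get("type") for t in vnclass.findall("THEMROLES/THEMROLE")]
    print(cid)
    print("  members:", members)                 # verbs with similar patterns: fracture, rip, ...
    print("  thematic roles:", roles)            # e.g. Agent, Patient, Instrument, Result
```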


Above: Excerpt from FrameNet entry for the verb “break”.

FrameNet is a similar database, but based on frame semantics. The idea is that in order to define a concept, you have to define it in terms of other concepts, and it’s hard to avoid a cycle in the definition graph. Instead, it’s sometimes easier to define a whole semantic frame at once, which describes a conceptual situation with many different participants. The frame then defines each participant by what role they play in the situation.

The word “break” is contained in the frame called “render nonfunctional“. In this frame, an agent affects an artifact so that it’s no longer capable of performing its function. The core (semantically obligatory) arguments are the agent and the artifact. There are a bunch of optional non-core arguments, like the manner that the event happened, the reason that the agent broke the artifact, the time and place it happened, and so on. FrameNet tries to make explicit all of the common-sense world knowledge that you need to understand the meaning of an event.
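NLTK also ships FrameNet (version 1.7). Assuming the frame is named Render_nonfunctional in that release, a lookup might look like the sketch below, listing which frame elements are core and which are optional:

```python
# Looking up the frame evoked by "break" in FrameNet via NLTK.
import nltk
nltk.download("framenet_v17", quiet=True)
from nltk.corpus import framenet as fn

frame = fn.frame("Render_nonfunctional")
print(frame.definition)

# Frame elements, with their coreness: Agent and Artifact should come out as Core,
# while Manner, Reason, Time, Place, etc. are non-core.
for name, fe in frame.FE.items():
    print(f"  {name}: {fe.coreType}")

# Lexical units that evoke this frame ("break.v" and similar verbs).
print(sorted(frame.lexUnit.keys()))
```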

Compared to VerbNet, FrameNet is less concerned with the syntax of verbs: for instance, it does not mention that “break” can be used intransitively. Also, it has more fine-grained categories of semantic roles, and contains a description in English (rather than VerbNet’s predicate logic) of how each semantic argument participates in the frame.

An open question is: how can computers use VerbNet and FrameNet to understand language? Nowadays, deep learning has come to dominate NLP research, so that VerbNet and FrameNet are often seen as relics of a past era, when people still used rule-based systems to do NLP. It turned out to be hard to use VerbNet and FrameNet to make computers do useful tasks.

But recently, the NLP community has been realizing that deep learning has limitations when it comes to common-sense reasoning, limitations that you can’t solve just by adding more layers onto BERT and feeding it more data. So maybe deep learning systems can benefit from these lexical semantic resources after all.

Edge probing BERT and other language models

Recently, there has been a lot of research in the field of “BERTology”: investigating what aspects of language BERT does and doesn’t understand, using probing techniques. In this post, I will describe the “edge probing” technique developed by Tenney et al., in two papers titled “What do you learn from context? Probing for sentence structure in contextualized word representations” and “BERT Rediscovers the Classical NLP Pipeline“. On my first read through these papers, the method didn’t seem very complicated, and the authors only spend a paragraph explaining how it works. But upon closer look, the method is actually nontrivial, and took me some time to understand. The details are there, but hidden in an appendix in the first of the two papers.

The setup for edge probing is as follows: you have some classification task that takes a sentence and a span of consecutive tokens within that sentence, and produces an N-way classification. This can be generalized to two spans in the sentence, but we’ll focus on the single-span case. The tasks cover a range of syntactic and semantic functions, from part-of-speech tagging and dependency parsing to coreference resolution.


Above: Examples of tasks where edge probing may be used. Table taken from Tenney (2019a).

Let’s go through the steps of how edge probing works. Suppose we want to probe for which parts of BERT-BASE contain the most information about named entity recognition (NER). In this NER setup, we’re given the named entity spans and only need to classify which type of entity each one is (eg: person, organization, location). The first step is to feed the sentence through BERT, giving 12 layers of activations, with a 768-dimensional vector for each token at each layer.


Above: Edge probing example for NER.
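To make this step concrete, here is a short snippet (assuming the HuggingFace transformers library) that pulls out the per-layer activations; the probe sketched a little further below would consume a stack of tensors shaped like this.

```python
# Extracting all 12 layers of BERT-BASE activations for one sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tok("John Smith visited Minneapolis last week .", return_tensors="pt")
with torch.no_grad():
    out = bert(**inputs)

# out.hidden_states has 13 entries: the embedding layer plus 12 encoder layers,
# each of shape (batch, seq_len, 768). Drop the embedding layer to keep the 12 layers.
layers = torch.stack(out.hidden_states[1:]).squeeze(1)   # (12, seq_len, 768)
print(layers.shape)
```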

The probing model has several stages:

  1. Mix: learn a task-specific linear combination of the layers, giving a single 768-dimensional vector for each span token. The weights that are learned indicate how much useful information for the task is contained in each layer.
  2. Projection: learn a linear mapping from 768 down to 256 dimensions.
  3. Self-attention pooling: learn a function to generate a scalar weight for each span vector. Then, we normalize the weights to sum up to 1, and take a weighted average. The purpose of this is to collapse the variable-length sequence of span vectors into a single fixed-length 256-dimensional vector.
  4. Feedforward NN: learn a multi-layer perceptron classifier with 2 hidden layers.

For the two-span case, they use two separate sets of weights for the mix, projection, and self-attention steps. Then, the feedforward neural network takes a concatenated 512-dimensional vector instead of a 256-dimensional vector.
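Here is a minimal PyTorch sketch of the single-span probing head as I understand it; the hidden sizes and the exact form of the scalar mix are my own simplifications, not necessarily what Tenney et al. used. BERT itself stays frozen; only these parameters are trained.

```python
# A simplified single-span edge probing head: scalar mix over layers, projection,
# self-attention pooling over the span, then a 2-hidden-layer MLP classifier.
import torch
import torch.nn as nn

class EdgeProbe(nn.Module):
    def __init__(self, n_layers=12, hidden=768, proj=256, n_classes=18):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # 1. mix
        self.project = nn.Linear(hidden, proj)                    # 2. projection
        self.attn_score = nn.Linear(proj, 1)                      # 3. pooling weights
        self.classifier = nn.Sequential(                          # 4. feedforward NN
            nn.Linear(proj, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, layer_acts, span):
        # layer_acts: (n_layers, seq_len, hidden), frozen BERT activations
        # span: (start, end) indices of the labelled span, end exclusive
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w[:, None, None] * layer_acts).sum(dim=0)       # (seq_len, hidden)
        span_vecs = self.project(mixed[span[0]:span[1]])         # (span_len, proj)
        attn = torch.softmax(self.attn_score(span_vecs), dim=0)  # (span_len, 1)
        pooled = (attn * span_vecs).sum(dim=0)                   # (proj,)
        return self.classifier(pooled)

# Toy usage with random activations standing in for the 12 BERT-BASE layers.
probe = EdgeProbe(n_classes=18)            # number of entity types is task-dependent
fake_layers = torch.randn(12, 20, 768)     # 12 layers, 20 tokens
logits = probe(fake_layers, span=(3, 6))   # classify the span covering tokens 3-5
print(logits.shape)                        # torch.Size([18])
```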

The edge probing model needs to be trained on a dataset of classification instances. The probe has weights that are initialized randomly and trained using gradient descent, but the BERT weights are kept constant while the probe is being trained.

This setup was more sophisticated than it looked at first glance, and it turns out the probe itself is quite powerful. In one of the experiments, the authors found that even with randomized input, the probe was able to get over 70% of the full model’s performance!

Edge probing is not the only way of probing deep language models. Other probing experiments (Liu et al., 2019) used a simple linear classifier: this is better for measuring what information can easily be recovered from representations.

References

  1. Tenney, Ian, et al. “What do you learn from context? probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019a.
  2. Tenney, Ian, Dipanjan Das, and Ellie Pavlick. “BERT Rediscovers the Classical NLP Pipeline.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019b.
  3. Liu, Nelson F., et al. “Linguistic Knowledge and Transferability of Contextual Representations.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

Directionality of word class conversion

Many nouns (like google, brick, bike) can be used as verbs:

  • Let me google that for you.
  • The software update bricked my phone.
  • Bob biked to work yesterday.

Conversely, many verbs (like talk, call) can be used as nouns:

  • She gave a talk at the conference.
  • I’m on a call with my boss.

Here, we just assumed that {google, brick, bike} are primarily nouns and {talk, call} are primarily verbs — but is this justified? After all, all five of these words can be used as either a noun or a verb. Then, what’s the difference between the first group {google, brick, bike} and the second group {talk, call}?

These are examples of word class flexibility: words that can be used across multiple part-of-speech classes. In this blog post, I’ll describe some objective criteria to determine if a random word like “sleep” is primarily a noun or a verb.

Five criteria for deciding directionality

Linguists have studied the problem of deciding what is the base / dominant part-of-speech category (equivalently, deciding the directionality of conversion). Five methods are commonly listed in the literature: frequency of occurrence, attestation date, semantic range, semantic dependency, and semantic pattern (Balteiro, 2007; Bram, 2011).

  1. Frequency of occurrence: a word is noun-dominant if it occurs more often as a noun than as a verb. This is the easiest to compute, since all you need is a POS-tagged corpus (see the corpus sketch after this list). The issue is that the direction now depends on which corpus you use, and there can be big differences between genres.
  2. Attestation date: a word is noun-dominant if it was used first as a noun and only later as a verb. This works for newer words: Google (the company) existed for a while before anyone started “googling” things. But we run into problems with older words, where the direction depends on the precise dating of Middle English manuscripts. If the word goes back to Proto-Germanic or Proto-Indo-European, then finding the attestation date becomes impossible. This method is also philosophically questionable, because you shouldn’t need to know the history of a language to describe its current form.
  3. Semantic range: if a dictionary lists more noun meanings than verb meanings for a word, then it’s noun-dominant. This is not so reliable, because different dictionaries disagree on how many senses to include, and on how different two senses must be to warrant separate entries. Also, some meanings are rare or domain-specific (eg: “call option” in finance), and it doesn’t seem right to count them equally.
  4. Semantic dependency: if the definition of the verb meaning refers to the noun meaning, then the word is noun-dominant. For example, “to bottle” means “to put something into a bottle”. This criterion is not always clear-cut: sometimes you can define it either way, or have neither definition refer to the other.
  5. Semantic pattern: a word is noun-dominant if it refers to an entity / object, and verb-dominant if it refers to an action. A bike is something that you can touch and feel; a walk is not. Haspelmath (2012) encourages distinguishing {entity, action, property} rather than {noun, verb, adjective}. However, it’s hard to determine without subjective judgement whether the entity or action sense is more primary, especially for abstract words like “test” or “work”.
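Here is a rough sketch of criterion 1, using NLTK’s POS-tagged Brown corpus; the verdicts (and whether a word is attested at all) will of course vary with the corpus you choose.

```python
# Counting noun vs. verb occurrences in the Brown corpus (universal tagset).
from collections import Counter
import nltk
nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)
from nltk.corpus import brown

targets = {"google", "brick", "bike", "talk", "call", "sleep"}
counts = {w: Counter() for w in targets}
for word, tag in brown.tagged_words(tagset="universal"):
    w = word.lower()
    if w in targets:
        counts[w][tag] += 1

for word in sorted(targets):
    n, v = counts[word]["NOUN"], counts[word]["VERB"]
    if n == 0 and v == 0:
        verdict = "not attested in this corpus"   # e.g. "google" in a 1960s corpus
    else:
        verdict = "noun-dominant" if n >= v else "verb-dominant"
    print(f"{word}: {n} noun vs {v} verb tokens -> {verdict}")
```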

Comparisons using corpus methods

How do we make sense of all these competing criteria? To answer this question, Balteiro (2007) compared 231 flexible noun/verb pairs and rated them all according to the five criteria listed above, as well as a few more that I didn’t include. Later, Bram (2011) surveyed a larger set of 2048 pairs.

The details are quite messy, because applying the criteria is not so straightforward. For example, polysemy: the word “call” has more than 40 definitions in the OED, and some of them are obsolete, so which one do you use for the attestation date? How do you deal with homonyms like “bank” that have two unrelated meanings? After hundreds of pages of painstaking analysis, the researchers came to a judgement for each word. Then, they measured the agreement between each pair of criteria:

Table of pairwise agreement between the criteria (adapted from Table 5.2 of Bram’s thesis)

There is only a moderate level of agreement between the different criteria, on average about 65% — better than chance, but not too convincing either. Only frequency and attestation date agree more than 80% of the time. Only for a small minority of words do all of the criteria agree.

Theoretical ramifications

This puts us in a dilemma: how do we make sense of these results? What is the direction of conversion if the criteria don’t agree? Are some of the criteria better than others, or should we take a majority vote? Is it even possible to determine a direction at all?

Linguists have disagreed for decades over what to do with this situation. Van Lier and Rijkhoff (2013) give a survey of the various views. Some linguists maintain that a flexible word must be either noun-dominant or verb-dominant, and is converted into the other category. Other linguists note the disagreements between the criteria and propose instead that such words are underspecified. Just like a stem cell that can morph into a skin or lung cell as needed, a word like “sleep” is neither a noun nor a verb, but a pre-categorical form that can morph into either depending on context.

Can we really determine the dominant category of a conversion pair? It seems doubtful that this issue will ever be resolved. Presently, none of the theories make any scientific predictions that can be tested and falsified. Until then, the theories co-exist as different ways to view and analyze the same data.

The idea of a “dominant” category doesn’t exist in nature; it is merely an artificial construct to help explain the data. In mathematics, it’s nonsensical to ask whether imaginary numbers really “exist”. Nobody has ever seen an imaginary number, but mathematicians use them because they’re good for describing a lot of things. Likewise, it doesn’t make sense to ask whether flexible words really have a dominant category. We can only ask whether a theory that assumes the existence of a dominant category is simpler than one that does not.

References

  1. Balteiro, Isabel. The directionality of conversion in English: A dia-synchronic study. Vol. 59. Peter Lang, 2007.
  2. Bram, Barli. “Major total conversion in English: The question of directionality.” PhD Thesis. 2011.
  3. Haspelmath, Martin. “How to compare major word-classes across the world’s languages.” Theories of everything: In honor of Edward Keenan 17 (2012): 109-130.
  4. Van Lier, Eva, and Jan Rijkhoff. “Flexible word classes in linguistic typology and grammatical theory.” Flexible word classes: a typological study of underspecified parts-of-speech (2013): 1-30.

Non-technical challenges of medical NLP research

Machine learning has recently made a lot of headlines in healthcare applications, like identifying tumors from images, or technology for personalized treatment. In this post, I describe my experiences as a healthcare ML researcher: the difficulties in doing research in this field, as well as reasons for optimism.

My research group focuses on applications of NLP to healthcare. For a year or two, I was involved in a number of projects in this area (specifically, detecting dementia through speech). From my own projects and from talking to others in my research group, I noticed a few recurring difficulties in healthcare NLP research — things that rarely come up in other branches of ML. These are non-technical challenges that take up time and impede progress, and they are generally considered not very interesting to solve. I’ll give some examples of what I mean.

Collecting datasets is hard. Any time you want to do anything involving patient data, you have to undergo a lengthy ethics approval process. Even with something as innocent as an anonymous online questionnaire, there is a mandatory review by an ethics board before the experiment is allowed to proceed. As a result, most datasets in healthcare ML are small: a few dozen patient samples is common, and you’re lucky to have more than a hundred samples to work with. This is tiny compared to other areas of ML where you can easily find thousands of samples.

In my master’s research project, where I studied dementia detection from speech, the largest available corpus had about 300 patients, and other corpora had fewer than 100. This constrained the types of experiments that were possible. Prior work in this area used a lot of feature engineering approaches, because it was commonly believed that you needed at least a few thousand examples to do deep learning. With less data than that, deep learning would simply overfit.

Even after the data has been collected, it is difficult to share with others. This is again due to the conservative ethics processes required to share data. Data transfer agreements need to be reviewed and signed, and in some cases, data must remain physically on servers in a particular hospital. Researchers rarely open-source their code along with the paper, since there’s no point in doing so without giving access to the data; this makes it hard to reproduce any experimental results.

Medical data is messy. Data access issues aside, healthcare NLP has some of the messiest datasets in machine learning. Many datasets in ML are carefully constructed and annotated for the purpose of research, but this is not the case for medical data. Instead, the data comes from real patients and hospitals; clinical notes are full of doctors’ shorthand and abbreviations of medical terms that mean different things depending on context. Unsurprisingly, many NLP techniques fail to work. Missing values and otherwise unreliable data are common, so a lot of not-so-glamorous data preprocessing is often needed.


I’ve so far painted a bleak picture of medical NLP, but I don’t want to give off such a negative image of my field. In the second part of this post, I give some counter-arguments to the above points as well as some of the positive aspects of research.

On difficulties in data access. There are good reasons for caution — patient data is sensitive and real people can be harmed if the data falls into the wrong hands. Even after removing personally identifiable information, there’s still a risk of a malicious actor deanonymizing the data and extracting information that’s not intended to be made public.

The situation is improving though. The community recognizes the need to share clinical data, to strike a balance between protecting patient privacy and allowing research. There have been efforts like the relatively open MIMIC critical care database to promote more collaborative research.

On small / messy datasets. With every challenge, there comes an opportunity. In fact, my own master’s research was driven by lack of data. I was trying to extend dementia detection to Chinese, but there wasn’t much data available. So I proposed a way to transfer knowledge from the much larger English dataset to Chinese, and got a conference paper and a master’s thesis from it. If it wasn’t for lack of data, then you could’ve just taken the existing algorithm and applied it to Chinese, which wouldn’t be as interesting.

Also, deep learning in NLP has recently gotten a lot better at learning from small datasets. Other research groups have had some success on the same dementia detection task using deep learning. With new papers every week on few-shot learning, one-shot learning, transfer learning, etc, small datasets may not be too much of a limitation.

The same applies to messy data, missing values, label leakage, and so on. I’ll refer to this survey paper for the details, but the take-away is that these shouldn’t be thought of as barriers, but as opportunities to make a research contribution.

In summary, as a healthcare NLP researcher, you have to deal with difficulties that other machine learning researchers don’t have. However, you also have the unique opportunity to use your abilities to help sick and vulnerable people. For many people, this is an important consideration — if this is something you care deeply about, then maybe medical NLP research is right for you.

Thanks to Elaine Y. and Chloe P. for their comments on drafts of this post.

NAACL 2019, my first conference talk, and general impressions

Last week, I attended my first NLP conference, NAACL, which was held in Minneapolis. My paper was selected for a short talk of 12 minutes in length, plus 3 minutes for questions. I presented my research on dementia detection in Mandarin Chinese, which I did during my master’s.

Here’s a video of my talk:

Visiting Minneapolis

Going to conferences is a good way for a grad student to travel for free. Some of my friends balked at the idea of going to Minneapolis rather than somewhere more “interesting”. However, I had never been there before, and in the summer, Minneapolis was quite nice.

Minneapolis is very flat and good for biking — you can rent a bike for $2 per 30 minutes. I took the light rail to Minnehaha Falls and biked along the Mississippi River to the city center. The downside is that, compared to Toronto, the food choices are quite limited. The majority of restaurants serve American food (burgers, sandwiches, pasta, etc).

Meeting people

It’s often said that most of the value of a conference happens in the hallways, not in the scheduled talks (which you can often find on YouTube for free). For me, this was a good opportunity to finally meet some of my previous collaborators in person; previously, we had only communicated via Skype and email. I also ran into people whose names I recognized from reading their papers but had never seen in person.

Despite all the advances in video conferencing technology, nothing beats face-to-face interaction over lunch. There’s a reason why businesses spend so much money to send employees abroad to conduct their meetings.

Talks and posters

The accepted papers were split roughly 50-50 into talks and poster presentations. I preferred the poster format, because you get to have a 1-on-1 discussion with the author about their work, and ask clarifying questions.

Talks were a mixed bag — some were great, but for many it was difficult to make sense of anything. The most common problem was that speakers tended to dive into complex technical details and lose sight of the “big picture”. The better talks spent a good chunk of time covering the background and motivation, with lots of examples, before describing their own contribution.

It’s difficult to give a coherent talk in only 12 minutes. A research paper is inherently a very narrow and focused contribution, while the audience comes from all areas of NLP and has probably never seen your problem before. The organizers tried to group talks into related topics like “Speech” or “Multilingual NLP”, but even then, the subfields of NLP are so diverse that two random papers had very little in common.

Research trends in NLP

Academia has a notorious reputation for inventing impractically complex models to squeeze out a 0.2% improvement on a benchmark. This may be true in some areas of ML, but it certainly wasn’t the case here. There was a lot of variety in the problems people were solving. Many papers worked with new datasets, and even those using existing datasets often proposed new tasks that weren’t considered before.

A lot of papers used similar model architectures, like some sort of Bi-LSTM with attention, perhaps with a CRF on top. None of the results are directly comparable, because everybody is solving a different problem. I guess it shows the flexibility of Bi-LSTMs to be so widely applicable. For me, the papers that did something different (like applying quantum physics to NLP) really stood out.

Interestingly, many papers did experiments with BERT, which was presented at this very conference! Last October, the BERT paper bypassed the usual conventions and announced its results without peer review, so the NLP community had known about it for a long time, but only now was it officially presented at a conference.

MSc Thesis: Automatic Detection of Dementia in Mandarin Chinese

My master’s thesis is done! Read it here:

MSc Thesis (PDF)

Video

Slides

Talk Slides (PDF)

Part of this thesis is replicated in my paper “Detecting dementia in Mandarin Chinese using transfer learning from a parallel corpus” which I will be presenting at NAACL 2019. However, the thesis contains more details and background information that were omitted in the paper.

Onwards to PhD!

I trained a neural network to describe images, then I gave it dementia

This blog post is a summary of my work from earlier this year: Dropout during inference as a model for neurological degeneration in an image captioning network.

For a long time, deep learning has had an interesting connection to neuroscience. The neuron in neural networks was inspired by early models of biological neurons. Later, convolutional neural networks were inspired by the structure of neurons in the visual cortex. Many other models also drew inspiration from how the brain functions, like visual attention, which replicated how humans look at different areas of an image when interpreting it.

The connection was always loose and superficial, however. Despite advances in neuroscience toward better models of neurons, these never really caught on among deep learning researchers. Real neurons obviously don’t learn by gradient back-propagation and stochastic gradient descent.

In this work, we study how human neurological degeneration can have a parallel in the universe of deep neural networks. In humans, neurodegeneration can occur by several mechanisms, such as Alzheimer’s disease (which affects connections between individual neurons) or stroke (in which large sections of brain tissue die). The effect of Alzheimer’s disease is dementia, where language, motor, and other cognitive abilities gradually become impaired.

To simulate this effect, we give our neural network a sort of dementia, by interfering with connections between neurons using a method called dropout.


Yup, this probably puts me high up on the list of humans to exact revenge on in the event of an AI apocalypse.

The Model

We started with an encoder-decoder style image captioning neural network (described in this post), which looks at an image and outputs a sentence that describes it. This is inspired by a picture description task that we give to patients suspected of having dementia: given a picture, describe it in as much detail as possible. Patients with dementia typically exhibit patterns of language different from healthy patients, which we can detect using machine learning.

To simulate neurological degeneration in the neural network, we apply dropout in inference mode, which randomly selects a portion of the neurons in a layer and sets their outputs to zero. Dropout is a common technique during training to regularize neural networks and prevent overfitting, but usually you turn it off during evaluation for the best possible accuracy. To our knowledge, nobody had experimented with applying dropout at the evaluation stage in a language model before.

We train the model using a small amount of dropout, then apply a larger amount of dropout during inference. Then, we evaluate the quality of the generated sentences using the BLEU-4 and METEOR metrics, as well as sentence length and the similarity of the vocabulary distribution to the training corpus.
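The trick itself is tiny in code. Here is a small PyTorch sketch (with a stand-in module, not the actual captioning decoder) showing how dropout can be kept active at inference time:

```python
# Keeping dropout active at inference: switch only the Dropout modules back to train mode.
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # the "lesion": neurons zeroed out at inference time
    nn.Linear(512, 10000),    # vocabulary logits
)

decoder.eval()                # normal evaluation mode disables dropout...
for m in decoder.modules():
    if isinstance(m, nn.Dropout):
        m.train()             # ...so re-enable just the dropout layers

with torch.no_grad():
    features = torch.randn(1, 512)             # pretend image/RNN features
    logits_a = decoder(features)
    logits_b = decoder(features)
    print(torch.allclose(logits_a, logits_b))  # False: the same input now gives different outputs
```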

Results

When we applied dropout during inference, the accuracy of the captions (measured by BLEU-4 and METEOR) decreased as dropout increased. However, with a moderate amount of dropout, the generated vocabulary was more diverse, and the word frequency distribution was closer to that of the training set (measured by KL-divergence).
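For the vocabulary comparison, the KL-divergence can be computed from smoothed unigram distributions over a shared vocabulary; here is a small sketch with toy captions, not our actual data:

```python
# KL divergence between the unigram distribution of generated captions and training captions.
from collections import Counter
from scipy.stats import entropy

def smoothed_counts(texts, vocab):
    counts = Counter(w for t in texts for w in t.split())
    return [counts[w] + 1 for w in vocab]     # add-one smoothing; entropy() normalizes

train_caps = ["a dog runs in the park", "a man rides a bike"]
generated  = ["a dog dog dog park", "a man a a bike"]

vocab = sorted({w for t in train_caps + generated for w in t.split()})
p = smoothed_counts(generated, vocab)
q = smoothed_counts(train_caps, vocab)
print("KL(generated || train):", entropy(p, q))   # lower = closer to the training distribution
```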


When the dropout was too high, the model degenerated into generating essentially random words. Qualitatively, the effects of dropout seemed to cause two types of errors:

  • Caption starts out normally, then repeats the same word several times: “a small white kitten with red collar and yellow chihuahua chihuahua chihuahua”
  • Caption starts out normally, then becomes nonsense: “a man in a baseball bat and wearing a uniform helmet and glove preparing their handles won while too frown”

This was not that similar to speech produced by people with Alzheimer’s, but kind of resembled fluent aphasia (caused by damage to the part of the brain responsible for understanding language).

Challenges and Difficulties

Excited by our results, we submitted the paper to EMNLP 2018. Unfortunately, our paper was rejected. Despite the novelty of our approach, the reviewers pointed out that our work had some serious drawbacks:

  1. Unclear connection to neuroscience. Adding dropout during inference has no connection to any biological model of what happens to the brain during atrophy.
  2. Only superficial resemblance to aphasic speech. A similar result could have been generated by sampling words randomly from a dictionary, without any complicated RNN models.
  3. Not really useful for anything. We couldn’t think of any situation where this model would be useful, such as helping to detect aphasia.

We decided that there was no way around these roadblocks, so we scrapped the idea, put the paper up on arXiv and worked on something else.

For more technical details, refer to our paper on arXiv.

How to read research papers for fun and profit

One skill that I’ve learned after a year in grad school is how to effectively read research papers. Previously I had found them impenetrable, but now I find them a great source of information about cutting-edge science as it’s being done, before it makes its way into textbooks. Now I read about 4-5 of them every week.

My research area is natural language processing and machine learning, but I read papers in lots of fields, not just in AI and computer science. Papers are my go-to source for a myriad of scientific inquiries, for example: does drinking alcohol cause cancer? Are women more talkative than men? Was winter in Toronto abnormally cold this year? Etc.

Why read scientific papers?

If you try to Google questions like these, you typically end up on Wikipedia or some random article on the internet. Research papers are an underutilized resource that have several advantages over other common sources of information on the internet.

Advantages over articles on the internet: no matter the topic, you will undoubtedly find articles about it on the internet. Some of these articles are excellent, but others are opinionated nonsense, and without being an expert yourself, it can be difficult to decide what information to trust. Peer-reviewed research papers are held to a much higher minimum quality standard, and for every claim they make, they have to clearly state their evidence, assumptions, how they arrived at the conclusion, and their degree of confidence in the result. You can examine the paper for yourself and decide whether the assumptions are reasonable and the conclusions follow logically, rather than take someone else’s word for it. With some deeper digging and some critical thinking, you can avoid a lot of misinformation on the internet.

Advantages over Wikipedia: Wikipedia is a pretty reliable source of truth; in fact, it often cites scientific papers as its sources. However, Wikipedia is written to be concise, so that oftentimes, a 30-page research paper is summarized to 1-2 sentences. If you only read Wikipedia, you will miss a lot of the nuances contained in the original paper, and only develop a cursory understanding compared to going directly to the source.

Finding the right paper to read

If your professor or colleague has assigned you a specific paper to read, then you can skip this section.

A big part of the challenge of reading papers is deciding which ones to read. There are a lot of papers out there, and only a few will be relevant to you. Therefore, deciding what to read is a nontrivial skill in itself.

Research papers are the most useful when you have a specific problem or question in mind. When I first started out reading papers, I approached this the wrong way. One day, I’d suddenly decide “hmm, complexity theory is pretty interesting, let’s go on arXiv and look at some recent complexity theory papers“. Then, I’d open a few, attempt to read them, get confused, and conclude I’m not smart enough to read complexity theory papers. Why is this a bad idea? A research paper exists to answer a very specific question, so it makes no sense to pick up a random paper without the background context. What is the problem? What approaches have been tried in the past, and how have they failed? Without understanding background information like this, it’s impossible to appreciate the contribution of a specific paper.

Above: Use the forward citation and related article buttons on Google Scholar to explore relevant papers.

It’s helpful to think of each research paper as a node in a massive, interconnected graph. Rather than each paper existing as a standalone item, a paper is deeply connected to the research that came before and after it.

Google Scholar is your best friend for exploring this graph. Begin by entering a few keywords and picking a few promising hits from the first 2-3 pages. Good, this is your starting point. Here are some heuristics for traversing the paper graph:

  • To go forward in time, look at works that cited this paper. A paper being cited usually means one of two things: (1) the future paper uses some technique or result developed in the current paper for some other purpose, or (2) the future paper improves on the techniques in the current paper. Citations of the second type are more useful.
  • To go backward in time, look at the paper’s introduction and related work. This puts the paper in context of previous work. Occasionally, you find a survey paper that doesn’t contribute anything novel of its own, but summarizes a bunch of previous related work; these are really helpful when you’re beginning your research in a topic.
  • Citation count is a good indicator of a paper’s importance and merit. If the paper has under 10 citations, take its claims with a grain of salt (even more so if it’s an arXiv preprint and not a peer-reviewed paper). Over 100 citations means the paper has made a significant contribution; over 1000 citations indicates a landmark paper in the field and is probably worth reading. Citation count is not a perfect metric, especially for very recent work, but it’s a useful heuristic that’s applicable across disciplines.

The first pass: High level overview

Great, you’ve decided on a paper to read. Now how to read it effectively?

Reading a paper is not like reading a novel. When you read a novel, you start at the beginning and read linearly until you reach the end. With a paper, however, it’s most efficient to hop around the sections as appropriate, rather than reading linearly from beginning to end.

The goal of your first pass over a paper is to get a high-level overview of it before diving into the details. As you go through the paper, here are some good questions to ask yourself:

  • What is the problem being solved?
  • What approaches have been tried before, and what are their limitations?
  • What is this paper’s novel contribution?
  • What experiments were done, using what dataset? How successful were the results?
  • Can the method in this paper be applied to my problem?
  • If not, what assumptions are needed for this method to work?

Above: Treat each paper as a node in a massive graph of research, rather than a standalone item in a vacuum.

When I read a paper, I usually proceed in the following order:

  1. Abstract: a long paragraph that summarizes the entire paper. Read this to decide if the rest of the paper is worth reading or not.
  2. Introduction, diagrams, tables, and conclusion. Often, reading the diagrams and captions gives you a good idea of what’s going on with minimal effort.
  3. If the field is unfamiliar to you, then note down any interesting references in the introduction and related works sections to explore later. If the field is familiar, then just skim these sections.
  4. Read the main body of the paper: model, experiment, and discussion, without getting too bogged down in the details. If a section is confusing, skip it for now and come back to it on a second reading.

That’s it — you’ve finished reading a paper! Now you can either go back and read it again, focusing on the details you skimmed over on the first pass, or move on to a different paper that you’ve added to your backlog.

When reading a paper, you should not expect to understand every aspect of the paper by the time you’re done. You can always refer back to the paper at a later time, as needed. Generally, you don’t need to understand all the details, unless you’re trying to replicate or extend the paper.

Help, I’m stuck!

Sometimes, despite your best efforts, you find that a paper is impenetrable. It’s not necessarily your fault — some papers are hastily written hours before a conference deadline. What do you do now?

Look for a video or blog post explaining the paper. If you’re lucky, someone may have recorded a lecture where the author presents the paper at a conference. Maybe somebody wrote a blog post summarizing the paper (Colah’s blog has great summaries of machine learning research). These are often better at explaining things than the actual paper.

If there’s a lot of background terminology that doesn’t make sense, it may be better to consult other sources like textbooks and course lectures rather than papers. This is especially true if the research is not new (>10 years old). Research papers are not always the best at explaining a concept clearly: by their nature, they document research as it’s being done. Sometimes, the paper paints an incomplete picture of something that’s better understood later. Textbook writers can look back on research after it’s done, and thereby benefit from hindsight knowledge that didn’t exist when the paper was written.

Basic statistics is useful in many experimental fields — concepts like linear / logistic regression, p-values, hypothesis testing, and common statistical distributions. Any paper that deals with experimental data will use at least some statistics, so it’s worthwhile to be comfortable with basic stats.


That’s it for my advice. The densely packed two-column pages of text may appear daunting to the uninitiated reader, but they can be conquered with a bit of practice. Whether it’s for work or for fun, you definitely don’t need a PhD to read papers.