How can deep learning be combined with theoretical linguistics?

Natural language processing is mostly done using deep learning and neural networks nowadays. In a typical NLP paper, you might see some Transformer models, some RNNs built using linear algebra and statistics, but very little linguistic theory. Is linguistics irrelevant to NLP now, or can the two fields still contribute to each other?

In a series of articles in the journal Language, Joe Pater discussed the history of neural networks and generative linguistics, and invited experts to give their perspectives on how the two may be combined going forward. I found their discussion very interesting, although a bit long (almost 100 pages). In this blog post, I will give a brief summary of it.

Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion

Research in generative syntax and research in neural networks began at the same time, in 1957, and both were broadly considered part of AI, but the two schools mostly stayed separate, at least for a few decades. In neural network research, Rosenblatt proposed the perceptron learning algorithm and realized that you needed hidden layers to learn XOR, but didn’t know of a procedure to train multi-layer NNs (backpropagation hadn’t been invented yet). In generative grammar, Chomsky studied natural languages as if they were formal languages, and proposed controversial transformational rules. Interestingly, both schools faced challenges concerning the learnability of their systems.
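As a rough illustration of the XOR limitation, here is a minimal sketch of a single-layer perceptron with a step activation. The function names, learning rate, and epoch count are illustrative choices, not taken from Rosenblatt's work or from Pater's article. The perceptron fits AND perfectly, but no setting of its weights can fit XOR, because XOR is not linearly separable.

```python
def predict(w, b, x1, x2):
    # Step activation on a weighted sum of the two inputs.
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=25, lr=1.0):
    # Perceptron rule: nudge the weights by the prediction error on each sample.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            err = target - predict(w, b, x1, x2)
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def accuracy(samples, w, b):
    hits = sum(predict(w, b, x1, x2) == t for (x1, x2), t in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))
xor_acc = accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

However long the XOR run is trained, at least one of the four points stays misclassified. That is why a hidden layer is needed.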

Above: Frank Rosenblatt and Noam Chomsky, two pioneers of neural networks and generative grammar, respectively.

The first time these two schools were combined was in 1986, when an RNN was used to learn a probabilistic model of past tense. This shows that neural networks and generative grammar are not incompatible, and the dichotomy is a false one. Another method of combining them comes from Harmonic Optimality Theory in theoretical phonology, which extends Optimality Theory (OT) to continuous constraints; the procedure for learning these constraints is similar to gradient descent.
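To give a feel for what "continuous constraints learned by something like gradient descent" can mean, here is a minimal sketch of a weighted-constraint grammar in the spirit of Harmonic Grammar. Everything concrete in it is an assumption made for illustration: the two constraint names (NoCoda, Max), the toy candidates, the learning rate, and the error-driven update rule. It is not the specific formulation discussed in Pater's article.

```python
def harmony(weights, violations):
    # Harmony is the negated weighted sum of constraint violations.
    return -sum(w * v for w, v in zip(weights, violations))


def predict(weights, candidates):
    # The grammar outputs the candidate with the highest harmony.
    return max(candidates, key=lambda form: harmony(weights, candidates[form]))


def learn(weights, data, lr=0.1, epochs=20):
    # Error-driven update: when the grammar picks the wrong candidate, raise the
    # weights of constraints the wrong guess violates more than the observed form.
    weights = list(weights)
    for _ in range(epochs):
        for candidates, observed in data:
            guess = predict(weights, candidates)
            if guess != observed:
                for i in range(len(weights)):
                    delta = candidates[guess][i] - candidates[observed][i]
                    weights[i] = max(0.0, weights[i] + lr * delta)
    return weights


# Hypothetical input /pat/ with violation counts for [NoCoda, Max].
CANDIDATES = {"pat": [1, 0], "pa": [0, 1]}
DATA = [(CANDIDATES, "pa")]  # toy language that deletes codas

learned = learn([1.0, 1.0], DATA)
print(learned, predict(learned, CANDIDATES))
```

The update has the same shape as a perceptron or gradient step: weights move in proportion to the difference between the violations of the wrong guess and those of the observed form.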

Neural models have proved capable of learning a remarkable amount of syntax, despite having far fewer structural priors than Chomsky’s model of Universal Grammar. At the same time, they fail on certain complex examples, so maybe it’s time to add back some linguistic structure.

Linzen’s Response

Linguistics and DL can be combined in two ways. First, linguistics is useful for constructing minimal pairs for evaluating neural models, when such examples are hard to find in natural corpora. Second, neural models can be quickly trained on data, so they’re useful for testing learnability. By comparing human language acquisition data with various neural architectures, we can gain insights about how human language acquisition works. (But I’m not sure how such a deduction would logically work.)
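Minimal-pair evaluation is easy to sketch: a model passes a pair if it scores the grammatical sentence higher than the ungrammatical one. In the sketch below, the scorer is a toy bigram counter standing in for a neural language model's log probability. The corpus, the sentence pairs, and all function names are made up for illustration.

```python
from collections import Counter

CORPUS = ["the dog barks", "the dogs bark", "the cat sleeps", "the cats sleep"]


def make_bigram_scorer(corpus):
    # Toy stand-in for a language model: count the bigrams seen in training.
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))

    def score(sentence):
        tokens = sentence.split()
        return sum(counts[bigram] for bigram in zip(tokens, tokens[1:]))

    return score


def minimal_pair_accuracy(pairs, score):
    # A pair counts as correct if the grammatical sentence scores strictly higher.
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)


# (grammatical, ungrammatical) pairs differing only in verb agreement.
PAIRS = [
    ("the dog barks", "the dog bark"),
    ("the cats sleep", "the cats sleeps"),
    ("the dogs near the cat bark", "the dogs near the cat barks"),
]

pair_accuracy = minimal_pair_accuracy(PAIRS, make_bigram_scorer(CORPUS))
print(pair_accuracy)
```

The toy scorer gets the two local agreement pairs right but cannot separate the third, where an intervening noun sits between subject and verb. Sentences like that are rare in natural corpora, which is where linguistically constructed test items are useful.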

Potts’s Response

Formal semantics has not had much contact with DL, as formal semantics is based around higher-order logic, while deep learning is based on matrices of numbers. Socher did some work on representing tree-based semantic composition as operations on vectors.

Above: Formal semantics uses higher-order logic to build representations of meaning. Is this compatible with deep learning?

In several ways, semanticists make different assumptions from deep learning researchers. Semantics likes to distinguish meaning from use, and considers compositional meaning separately from pragmatics and context, whereas DL cares most of all about generalization, and has no reason to discard context or separate semantics from pragmatics. Compositional semantics does not try to analyze the meaning of lexical items, leaving them as atoms; DL has word vectors, but linguists object that individual dimensions of word vectors are not easily interpretable.

Rawski and Heinz’s Response

Above: Natural languages exhibit features that span various levels of the Chomsky hierarchy.

The “no free lunch” theorem in machine learning says that you can’t get better performance for free: any gains on some problems must be compensated by decreases in performance on other problems. A model performs well if it has an inductive bias well-suited to the type of problems it is applied to. This is true for neural networks as well, and we need to study the inductive biases in neural networks: which classes of languages in the Chomsky hierarchy are NNs capable of learning? We must not confuse ignorance of bias with absence of bias.
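To make "classes of languages in the Chomsky hierarchy" concrete, here is a small sketch contrasting two standard textbook formal languages. The examples and function names are illustrative and not taken from Rawski and Heinz's response. The regular language (ab)* can be recognized with a fixed amount of memory, while the context-free language a^n b^n requires counting without bound.

```python
import re


def is_ab_star(s):
    # Regular language (ab)*: a finite-state pattern is enough.
    return re.fullmatch(r"(ab)*", s) is not None


def is_an_bn(s):
    # Context-free language a^n b^n: needs an unbounded counter,
    # which no finite-state machine can provide.
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False  # an "a" after a "b" breaks the pattern
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:
                return False  # more b's than a's so far
        else:
            return False
    return count == 0


for s in ["", "ab", "abab", "aabb", "aab"]:
    print(repr(s), is_ab_star(s), is_an_bn(s))
```

Asking which of these recognizers a given network architecture can learn from examples, and generalize to longer strings, is one way to probe its inductive bias.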

Berent and Marcus’s Response

There are significant differences between how generative syntax and neural networks view language, and these must be resolved before the fields can make progress on integration. The biggest difference is the “algebraic hypothesis”: the assumption that there exist abstract algebraic categories like Noun that are distinct from their instances. This allows you to make powerful generalizations using rules that operate on abstract categories. Neural models, on the other hand, try to process language without structural representations, and this results in failures to generalize.

Dunbar’s Response

The central problem in connecting neural networks to generative grammar is the implementational mapping problem: how do you decide if a neural network N is implementing a linguistic theory T? The physical system might not look anything like the abstract theory; e.g., an implementation of addition can look like squiggles on a piece of paper. Some limited classes of NNs can be mapped to harmonic grammar, but most NNs cannot, and the success criterion is unclear right now. Future work should study this problem.

Pearl’s Response

Neural networks learn language but don’t really try to model human neural processes. This could be an advantage, as neural models might find generalizations and building blocks that a human would never have thought of, and new tools in interpretability can help us discover these building blocks contained within the model.

Non-technical challenges of medical NLP research

Machine learning has recently made a lot of headlines in healthcare applications, like identifying tumors from images, or technology for personalized treatment. In this post, I describe my experiences as a healthcare ML researcher: the difficulties in doing research in this field, as well as reasons for optimism.

My research group focuses on applications of NLP to healthcare. For a year or two, I was involved in a number of projects in this area (specifically, detecting dementia through speech). From my own projects and from talking to others in my research group, I noticed that a few difficulties came up again and again in healthcare NLP research, things that rarely occurred in other branches of ML. These are non-technical challenges that take up time and impede progress, and are generally considered not very interesting to solve. I’ll give some examples of what I mean.

Collecting datasets is hard. Any time you want to do anything involving patient data, you have to undergo a lengthy ethics approval process. Even with something as innocent as an anonymous online questionnaire, there is a mandatory review by an ethics board before the experiment is allowed to proceed. As a result, most datasets in healthcare ML are small: a few dozen patient samples is common, and you’re lucky to have more than a hundred samples to work with. This is tiny compared to other areas of ML where you can easily find thousands of samples.

In my master’s research project, where I studied dementia detection from speech, the largest available corpus had about 300 patients, and other corpora had fewer than 100. This constrained the types of experiments that were possible. Prior work in this area relied heavily on feature engineering, because it was commonly believed that you needed at least a few thousand examples to do deep learning. With less data than that, a deep learning model would simply overfit.

Even after the data has been collected, it is difficult to share with others. This is again due to the conservative ethics processes required to share data. Data transfer agreements need to be reviewed and signed, and in some cases, data must remain physically on servers in a particular hospital. Researchers rarely open-source their code along with the paper, since there’s no point in doing so without giving access to the data; this makes it hard to reproduce any experimental results.

Medical data is messy. Data access issues aside, healthcare NLP has some of the messiest datasets in machine learning. Many datasets in ML are carefully constructed and annotated for the purpose of research, but this is not the case for medical data. Instead, the data comes from real patients and hospitals, and the records are full of shorthand abbreviations of medical terms written by doctors, which mean different things depending on context. Unsurprisingly, many NLP techniques fail to work. Missing values and otherwise unreliable data are common, so a lot of not-so-glamorous data preprocessing is often needed.


I’ve so far painted a bleak picture of medical NLP, but I don’t want to give off such a negative image of my field. In the second part of this post, I give some counter-arguments to the above points as well as some of the positive aspects of research.

On difficulties in data access. There are good reasons for caution — patient data is sensitive and real people can be harmed if the data falls into the wrong hands. Even after removing personally identifiable information, there’s still a risk of a malicious actor deanonymizing the data and extracting information that’s not intended to be made public.

The situation is improving though. The community recognizes the need to share clinical data, to strike a balance between protecting patient privacy and allowing research. There have been efforts like the relatively open MIMIC critical care database to promote more collaborative research.

On small / messy datasets. With every challenge, there comes an opportunity. In fact, my own master’s research was driven by a lack of data. I was trying to extend dementia detection to Chinese, but there wasn’t much data available. So I proposed a way to transfer knowledge from the much larger English dataset to Chinese, and got a conference paper and a master’s thesis from it. If it weren’t for the lack of data, you could have just taken the existing algorithm and applied it to Chinese, which wouldn’t be as interesting.

Also, deep learning in NLP has recently gotten a lot better at learning from small datasets. Other research groups have had some success on the same dementia detection task using deep learning. With new papers every week on few-shot learning, one-shot learning, transfer learning, etc, small datasets may not be too much of a limitation.

The same applies to messy data, missing values, label leakage, etc. I’ll refer to this survey paper for the details, but the take-away is that these shouldn’t be thought of as barriers, but as opportunities to make a research contribution.

In summary, as a healthcare NLP researcher, you have to deal with difficulties that other machine learning researchers don’t have. However, you also have the unique opportunity to use your abilities to help sick and vulnerable people. For many people, this is an important consideration — if this is something you care deeply about, then maybe medical NLP research is right for you.

Thanks to Elaine Y. and Chloe P. for their comments on drafts of this post.

I trained a neural network to describe pictures and it’s hilariously bad

This month, I’ve been working on a neural network to describe in a sentence what’s happening in a picture, otherwise known as image captioning. My model roughly follows the architecture outlined in the paper “Show and Tell: A Neural Image Caption Generator” by Vinyals et al., 2014.

A high level overview: the neural network first uses a convolutional neural network to turn the picture into an abstract representation. Then, it uses this representation as the initial hidden state of a recurrent neural network or LSTM, which generates a natural language sentence. This type of neural network is called an encoder-decoder network and is commonly used for a lot of NLP tasks like machine translation.

Above: Encoder-decoder image captioning neural network (Figure 1 of paper)

When I first encountered LSTMs, I was really confused about how they worked, and how to train them. If your output is a sequence of words, what is your loss function and how do you backpropagate it? In fact, the training and inference passes of an LSTM are quite different. In this blog post, I’ll try to explain this difference.

Above: Training procedure for caption LSTM, given known image and caption

In training mode, we train the neural network to minimize the perplexity of the image-caption pair. Perplexity measures how unlikely the neural network is to generate the given caption when it sees the given image: the lower the perplexity, the higher the likelihood. If we’re training it to output the caption “a cute cat”, the likelihood of the caption is:

P(“a” | image) *

P(“cute” | image, “a”) *

P(“cat” | image, “a”, “cute”)

(Note: for numerical stability reasons, we typically work with sums of negative log likelihoods rather than products of probabilities, so the quantity we actually minimize is the negative log of that whole product. Strictly speaking, perplexity is the exponential of the average per-word negative log likelihood, so minimizing one also minimizes the other.)

After passing the whole sequence through the LSTM one word at a time, we get a single value, the perplexity, which we can minimize using backpropagation and gradient descent. As perplexity gets lower and lower, the LSTM is more likely to produce similar captions to the ground truth when it sees a similar image. This is how the network learns to caption images.
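As a small worked example of the loss described above, the sketch below computes the summed negative log likelihood of a caption from per-step probabilities, along with the corresponding perplexity (the exponential of the average per-word negative log likelihood). The probabilities are made-up numbers for “a cute cat”, not outputs of a real model, and the function names are illustrative.

```python
import math


def caption_nll(step_probs):
    # Sum of negative log probabilities of each ground-truth word,
    # each conditioned on the image and the preceding words.
    return -sum(math.log(p) for p in step_probs)


def perplexity(step_probs):
    # Exponential of the average per-word negative log likelihood.
    return math.exp(caption_nll(step_probs) / len(step_probs))


# Hypothetical values of P("a" | image), P("cute" | image, "a"),
# and P("cat" | image, "a", "cute").
step_probs = [0.5, 0.25, 0.5]

nll = caption_nll(step_probs)
ppl = perplexity(step_probs)
print(nll, ppl)
```

In a real training loop the per-step probabilities come from a softmax over the vocabulary, and the summed negative log likelihood is what gets backpropagated.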

Above: Inference procedure for caption LSTM, given only the image but no caption

During inference mode, we repeatedly sample the neural network, one word at a time, to produce a sentence. On each step, the LSTM outputs a probability distribution for the next word, over the entire vocabulary. We pick the highest probability word, add it to the caption, and feed it back into the LSTM. This is repeated until the LSTM generates the end marker. Hopefully, if we trained it properly, the resulting sentence will actually describe what’s happening in the picture.
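The inference loop can be sketched in a few lines. Here the LSTM is replaced by a hypothetical lookup table that maps the words generated so far to a next-word distribution. The table, the `<end>` marker name, and the `max_len` cutoff are assumptions for illustration only.

```python
def greedy_decode(next_word_probs, end_token="<end>", max_len=20):
    # Repeatedly pick the most probable next word and feed it back in,
    # stopping at the end marker (or at max_len as a safety cutoff).
    caption = []
    while len(caption) < max_len:
        probs = next_word_probs(caption)
        word = max(probs, key=probs.get)
        if word == end_token:
            break
        caption.append(word)
    return caption


# Toy stand-in for the LSTM: prefix -> distribution over the next word.
TOY_TABLE = {
    (): {"a": 0.9, "cute": 0.05, "<end>": 0.05},
    ("a",): {"cute": 0.6, "cat": 0.3, "<end>": 0.1},
    ("a", "cute"): {"cat": 0.8, "<end>": 0.2},
    ("a", "cute", "cat"): {"<end>": 0.9, "cat": 0.1},
}


def toy_model(prefix):
    return TOY_TABLE[tuple(prefix)]


caption = greedy_decode(toy_model)
print(" ".join(caption))
```

Always taking the single most probable word is the simplest strategy. Beam search keeps several candidate captions alive at each step instead, at the cost of more computation.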

This is the main idea of the paper, and I omitted a lot of details. I encourage you to read the paper for the finer points.


I implemented the model using PyTorch and trained it using the MS COCO dataset, which contains about 80,000 images of common objects and situations, and each image is human annotated with 5 captions.

To speed up training, I used a pretrained VGG16 convnet, and pretrained GloVe word embeddings from spaCy. Using lots of batching, the Adam optimizer, and a Titan X GPU, the neural network trains in about 4 hours. It’s one thing to understand how it works on paper, but watching it actually spit out captions for real images felt like magic.

Above: How I felt when I got this working

How are the results? For some of the images, the neural network does great:

[Image: COCO_val2014_000000431896.jpg] “A train is on the tracks at a station”

[Image: COCO_val2014_000000226376.jpg] “A woman is holding a cat in her arms”

Other times the neural network gets confused, with amusing results:

[Image: COCO_val2014_000000333406.jpg] “A little girl holding a stuffed animal in her hand”

[Image: COCO_val2014_000000085826.jpg] “A baby laying on a bed with a stuffed animal”

[Image: COCO_val2014_000000027617.jpg] “A dog is running with a frisbee in its mouth”

I’d say we needn’t worry about the AI singularity anytime soon 🙂

The original paper has some more examples of correct and incorrect captions that might be generated. Newer models have also improved caption accuracy: for example, adding a visual attention mechanism improved the results a bit. However, the state-of-the-art models still fall short of human performance; they often make mistakes when describing pictures with objects in unusual configurations.

This is a work in progress; source code is on Github here.

Paper Review: Linguistic Features to Identify Alzheimer’s Disease

Today I’m going to be sharing a paper I’ve been looking at, related to my research: “Linguistic Features Identify Alzheimer’s Disease in Narrative Speech” by Katie Fraser, Jed Meltzer, and my adviser Frank Rudzicz. The paper was published in 2016 in the Journal of Alzheimer’s Disease. It uses NLP to automatically diagnose patients with Alzheimer’s disease, given a sample of their speech.


You have probably heard of Alzheimer’s disease, but unlike cancer and stroke, it doesn’t get much attention in the media. It is a neurodegenerative disease that mostly affects elderly people. 5 million Americans are living with Alzheimer’s, including 1 in 9 over the age of 65, and 1 in 3 over the age of 85.

Alzheimer’s is also the most expensive disease in America. After diagnosis, patients may continue to live for over 10 years, and during much of this time, they are unable to care for themselves and require a constant caregiver. In 2017, 68% of Medicare and Medicaid’s budget was spent on patients with Alzheimer’s, and this number is expected to increase as the elderly population grows.

Despite a lot of recent advances in our understanding of the disease, there is currently no cure for Alzheimer’s. Since the disease is so prevalent and harmful, research in this direction is highly impactful.

Previous tests to diagnose Alzheimer’s

One of the early signs of Alzheimer’s is having difficulty remembering things, including words, leading to a decrease in vocabulary. A reliable way to test for this is a retrieval question like the following (Monsch et al., 1992):

In the next 60 seconds, name as many items as possible that can be found in a supermarket.

A healthy person could rattle off about 20-30 items in a minute, whereas someone with Alzheimer’s could only produce about 10. By setting the threshold at 16 items, they could classify even mild cases of Alzheimer’s with about 92% accuracy.

This doesn’t quite capture the signs of Alzheimer’s disease though. Patients with Alzheimer’s tend to be rambly and incoherent. This can be tested with a picture description task, where the patient is given a picture and asked to describe it with as much detail as possible (Giles, Patterson, Hodges, 1994).

Above: Boston Cookie Theft picture used for picture description task

There was no time limit: the patients talked until they indicated they had nothing more to say, or until they had been silent for 15 seconds.

Patients with Alzheimer’s disease produced descriptions with varying degrees of incoherence. Here’s an example transcript, from the above paper:

Experimenter: Tell me everything you see going on in this picture

Patient: oh yes there’s some washing up going on / (laughs) yes / …… oh and the other / ….. this little one is taking down the cookie jar / and this little girl is waiting for it to come down so she’ll have it / ………. er this girl has got a good old splash / she’s left the taps on (laughs) she’s gone splash all down there / um …… she’s got splash all down there

You can clearly tell that something’s off, but it’s hard to put a finger on exactly what the problem is. Well, time to apply some machine learning!

Results of Paper

Fraser’s 2016 paper uses data from the DementiaBank corpus, consisting of 240 narrative samples from patients with Alzheimer’s, and 233 from a healthy control group. The two groups were matched to have similar age, gender, and education levels. Each participant was asked to describe the Boston Cookie Theft picture above.

Fraser’s analysis used both the original audio data, as well as a detailed computer-readable transcript. She looked at 370 different features covering all sorts of linguistic metrics, like ratios of different parts of speech, syntactic structures, vocabulary richness, and repetition. Then, she performed a factor analysis and identified a set of 35 features that achieves about 81% accuracy in distinguishing between Alzheimer’s patients and controls.

According to the analysis, a few of the most important distinguishing features are:

  • Pronoun to noun ratio. Alzheimer’s patients produce vague statements and tend to substitute pronouns like “he” for nouns like “the boy”. This also applies to adverbial constructions like “the boy is reaching up there” rather than “the boy is reaching into the cupboard”.
  • Usage of high frequency words. Alzheimer’s patients have difficulty remembering specific words and replace them with more general, therefore higher frequency words.
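As a rough sketch of the first feature, the pronoun-to-noun ratio can be computed from part-of-speech tagged tokens. The tags below are hand-assigned for two made-up utterances. The tag names, the helper function, and the handling of the zero-noun case are assumptions for illustration, not the feature extraction code from the paper.

```python
def pronoun_noun_ratio(tagged_tokens):
    # tagged_tokens: list of (word, part-of-speech tag) pairs.
    pronouns = sum(1 for _, tag in tagged_tokens if tag == "PRON")
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    if nouns == 0:
        # No nouns at all: treat as maximally vague if any pronoun is present.
        return float("inf") if pronouns else 0.0
    return pronouns / nouns


# Hand-tagged toy utterances (hypothetical, not from DementiaBank).
specific = [
    ("the", "DET"), ("boy", "NOUN"), ("is", "AUX"), ("reaching", "VERB"),
    ("into", "ADP"), ("the", "DET"), ("cupboard", "NOUN"),
    ("and", "CCONJ"), ("the", "DET"), ("stool", "NOUN"),
    ("is", "AUX"), ("falling", "VERB"),
]
vague = [
    ("he", "PRON"), ("is", "AUX"), ("reaching", "VERB"),
    ("up", "ADV"), ("there", "ADV"),
    ("and", "CCONJ"), ("the", "DET"), ("stool", "NOUN"),
    ("is", "AUX"), ("falling", "VERB"),
]

specific_ratio = pronoun_noun_ratio(specific)
vague_ratio = pronoun_noun_ratio(vague)
print(specific_ratio, vague_ratio)
```

In practice the tags would come from an automatic part-of-speech tagger run over the transcript, and this ratio would be one column among the hundreds of features fed to the classifier.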

Future directions

Shortly after this research was published, my adviser Frank Rudzicz co-founded WinterLight Labs, a company that’s working on turning this proof-of-concept into an actual usable product. It also diagnoses various other cognitive disorders like Primary Progressive Aphasia.

A few other grad students in my research group are working on Talk2Me, which is a large longitudinal study to collect more data from patients with various neurodegenerative disorders. More data is always helpful for future research.

So this is the starting point for my research. Stay tuned for updates!