Publishing Negative Results in Machine Learning is like Proving Dragons don’t Exist

I’ve been reading a lot of machine learning papers lately, and one thing I’ve noticed is that the vast majority of papers report positive results — “we used method X on problem Y, and beat the state-of-the-art results”. Very rarely do you see a paper that reports that something doesn’t work.

The result is publication bias — if we only publish the results of experiments that succeed, even statistically significant results could be due to random chance, rather than anything actually significant happening. Many areas of science are facing a replication crisis, where published research cannot be replicated.

There is some community discussion of encouraging more negative paper submissions, but as of now, negative results are rarely publishable. If you attempt an experiment but don’t get the results you expected, your best hope is to try a bunch of variations of the experiment until you get some positive result (perhaps on a special case of the problem), after which you pretend the failed experiments never happened. With few exceptions, any positive result is better than a negative result, like “we tried method X on problem Y, and it didn’t work”.

Why publication bias is not so bad

I just described a cynical view of academia, but actually, there’s a good reason why the community prefers positive results. Negative results are simply not very useful, and contribute very little to human knowledge.

Now why is that? When a new paper beats the state-of-the-art results on a popular benchmark, that’s definite proof that the method works. The converse is not true. If your model fails to produce good results, it could be due to a number of reasons:

  • Your dataset is too small / too noisy
  • You’re using the wrong batch size / activation function / regularization
  • You’re using the wrong loss function / wrong optimizer
  • Your model is overfitting
  • You have a bug in your code

lattice2.pngAbove: Only when everything is correct will you get positive results; many things can cause a model to fail. (Source)

So if you try method X on problem Y and it doesn’t work, you gain very little information. In particular, you haven’t proved that method X cannot work. Sure, you found that your specific setup didn’t work, but have you tried making modification Z? Negative results in machine learning are rare because you can’t possibly anticipate all possible variations of your method and convince people that all of them won’t work.

Searching for dragons

Suppose we’re scientists attending the International Conference of Flying Creatures (ICFC). Somebody mentioned it would be nice if we had dragons. Dragons are useful. You could do all sorts of cool stuff with a dragon, like ride it into battle.


“But wait!” you exclaim: “Dragons don’t exist!”

I glance at you questioningly: “How come? We haven’t found one yet, but we’ll probably find one soon.”

Your intuition tells you dragons shouldn’t exist, but you can’t articulate a convincing argument why. So you go home, and you and your team of grad students labor for a few years and publish a series of papers:

  • “We looked for dragons in China and we didn’t find any”
  • “We looked for dragons in Europe and we didn’t find any”
  • “We looked for dragons in North America and we didn’t find any”

Eventually, the community is satisfied that dragons probably don’t exist, for if they did, someone would have found one by now. But a few scientists still harbor the possibility that there may be dragons lying around in a remote jungle somewhere. We just don’t know for sure.

This remains the state of things for a few years until a colleague publishes a breakthrough result:

  • “Here’s a calculation that shows that any dragon with a wing span longer than 5 meters will collapse under its own weight”

You read the paper, and indeed, the logic is impeccable. This settles the matter once and for all: dragons don’t exist (or at least the large, flying sort of dragons).

When negative results are actually publishable

The research community dislikes negative results because they don’t prove a whole lot — you can have a lot of negative results and still not be sure that the task is impossible. In order for a negative result to be valuable, it needs to present a convincing argument why the task is impossible, and not just a list of experiments that you tried that failed.

This is difficult, but it can be done. Let me give an example from computational linguistics. Recurrent neural networks (RNNs) can, in theory, compute any function defined over a sequence. In practice, however, they had difficulty remembering long-term dependencies. Attempts to train RNNs using gradient descent ran into numerical difficulties known as the vanishing / exploding gradient problem.

Then, Bengio et al. (1994) formulated a mathematical model of an RNN as an iteratively applied function. Using ideas from dynamical systems theory, they showed that as the input sequence gets longer and longer, the result is more and more sensitive to noise. The details are technical, but the gist of it is that under some reasonable assumptions, training RNNs using gradient descent is impossible. This is a rare example of a negative result in machine learning — it’s an excellent paper and I’d recommend reading it.

3.pngAbove: A Long Short Term Memory (LSTM) network handles long term dependencies by adding a memory cell (Source)

Soon after the vanishing gradient problem was understood, researchers invented the LSTM (Hochreiter and Schmidhuber, 1997). Since training RNNs with gradient descent was hopeless, they added a ‘latching’ mechanism that allows state to persist through many iterations, thus avoiding the vanishing gradient problem. Unlike plain RNNs, LSTMs can handle long term dependencies and can be trained with gradient descent; they are among the most ubiquitous deep learning architectures in NLP today.

After reading the breakthrough dragon paper, you pace around your office, thinking. Large, flying dragons can’t exist after all, as they would collapse under their own weight — but what about smaller, non-flying dragons? Maybe we’ve been looking for the wrong type of dragons all along? Armed with new knowledge, you embark on a new search…

4.jpgAbove: Komodo Dragon, Indonesia

…and sure enough, you find one 🙂

My First Research Paper: State Complexity of Overlap Assembly

My first research paper is completed and has been uploaded to Arxiv! It’s titled “State Complexity of Overlap Assembly”, by Janusz Brzozowski, Lila Kari, Myself, and Marek Szykuła. It’s in the area of formal language theory, which is an area of theoretical computer science. I worked on it as a part-time URA research project for two terms during my undergrad at Waterloo.

The contents of the paper are fairly technical. In this blog post, I will explain the background and motivation for the problem, and give a statement of the main result that we proved.

What is Overlap Assembly?

The subject of this paper is a formal language operation called overlap assembly, which models annealing of DNA strands. When you cool a solution containing a lot of short DNA strands, they tend to stick together in a predictable manner: they stick together to form longer strands, but only if the ends “match”.

We can view this as a formal operation on strings. If we have two strings a and b such that some suffix of a matches some prefix of b, then the overlap assembly is the string we get by combining them together:

1.pngAbove: Example of overlap assembly of two strings

In some cases, there might be more than one way of combining the strings together, for example, a=CATA and b=ATAG — then both CATATAG and CATAG are possible overlaps. Therefore, overlap assembly actually produces a set of strings.

We can extend this idea to languages in the obvious way. The overlap assembly of two languages is the set of all the strings you get by overlapping any two strings in the respective languages. For example, if L1={ab, abab, ababab, …} and L2={ba, baba, bababa, …}, then the overlap language is {aba, ababa, …}.

It turns out that if we start with two regular languages, then the overlap assembly language will always be regular too. I won’t go into the details, but it suffices to construct an NFA that recognizes the overlap language, given two DFAs recognizing the two input languages.

2.pngAbove: Example of construction of an overlap assembly NFA (Figure 2 of our paper)

What is State Complexity?

Given that overlap assembly is closed under regular languages, a natural question to ask is: how “complex” is the regular language that gets produced? One measure of complexity of regular languages is state complexity: the number of states in the smallest DFA that recognizes the language.

State complexity was first studied in 1994 by Sheng Yu et al. Some operations do not increase state complexity very much: if two regular languages have state complexities m and n, then their union has state complexity at most mn. On the other hand, the reversal operation can blow up state complexity exponentially — it’s possible for a language to have state complexity n but its reversal to have state complexity 2^n.

Here’s a table of the state complexities of a few regular language operations:


Over the years, state complexity has been studied for a wide range of other regular language operations. Overlap assembly is another such operation — the paper studies the state complexity of this operation.

Main Result

In our paper, we proved that the state complexity of overlap assembly (for two languages with state complexities m and n) is at most:

2(m-1) 3^{n-1} + 2^n

Further, we constructed a family of DFAs that achieve this bound, so the bound is tight.

That’s it for my not-too-technical summary of this paper. I glossed over a lot of the details, so check out the paper for the full story!

Paper Review: Linguistic Features to Identify Alzheimer’s Disease

Today I’m going to be sharing a paper I’ve been looking at, related to my research: “Linguistic Features Identify Alzheimer’s Disease in Narrative Speech” by Katie Fraser, Jed Meltzer, and my adviser Frank Rudzicz. The paper was published in 2016 in the Journal of Alzheimer’s Disease. It uses NLP to automatically diagnose patients with Alzheimer’s disease, given a sample of their speech.

Alzheimer’s disease is a disease that you might have heard of, but it doesn’t get much attention in the media, unlike cancer and stroke. It is a neurodegenerative disease that mostly affects elderly people. 5 million Americans are living with Alzheimer’s, including 1 in 9 over the age of 65, and 1 in 3 over the age of 85.

Alzheimer’s is also the most expensive disease in America. After diagnosis, patients may continue to live for over 10 years, and during much of this time, they are unable to care for themselves and require a constant caregiver. In 2017, 68% of Medicare and Medicaid’s budget is spent on patients with Alzheimer’s, and this number is expected to increase as the elderly population grows.

Despite a lot of recent advances in our understanding of the disease, there is currently no cure for Alzheimer’s. Since the disease is so prevalent and harmful, research in this direction is highly impactful.

Previous tests to diagnose Alzheimer’s

One of the early signs of Alzheimer’s is having difficulty remembering things, including words, leading to a decrease in vocabulary. A reliable way to test for this is a retrieval question like the following (Monsch et al., 1992):

In the next 60 seconds, name as many items as possible that can be found in a supermarket.

A healthy person could rattle out about 20-30 items in a minute, whereas someone with Alzheimer’s could only produce about 10. By setting the threshold at 16 items, they could classify even mild cases of Alzheimer’s with about 92% accuracy.

This doesn’t quite capture the signs of Alzheimer’s disease though. Patients with Alzheimer’s tend to be rambly and incoherent. This can be tested with a picture description task, where the patient is given a picture and asked to describe it with as much detail as possible (Giles, Patterson, Hodges, 1994).

73c894ea4d2dc12ca69a6380e51f1d62Above: Boston Cookie Theft picture used for picture description task

There is no time limit, and the patients talked until they indicated they had nothing more to say, or if they didn’t say anything for 15 seconds.

Patients with Alzheimer’s disease produced descriptions with varying degrees of incoherence. Here’s an example transcript, from the above paper:

Experimenter: Tell me everything you see going on in this picture

Patient: oh yes there’s some washing up going on / (laughs) yes / …… oh and the other / ….. this little one is taking down the cookie jar / and this little girl is waiting for it to come down so she’ll have it / ………. er this girl has got a good old splash / she’s left the taps on (laughs) she’s gone splash all down there / um …… she’s got splash all down there

You can clearly tell that something’s off, but it’s hard to put a finger on exactly what the problem is. Well, time to apply some machine learning!

Results of Paper

Fraser’s 2016 paper uses data from the DementiaBank corpus, consisting of 240 narrative samples from patients with Alzheimer’s, and 233 from a healthy control group. The two groups were matched to have similar age, gender, and education levels. Each participant was asked to describe the Boston Cookie Theft picture above.

Fraser’s analysis used both the original audio data, as well as a detailed computer-readable transcript. She looked at 370 different features covering all sorts of linguistic metrics, like ratios of different parts of speech, syntactic structures, vocabulary richness, and repetition. Then, she performed a factor analysis and identified a set of 35 features that achieves about 81% accuracy in distinguishing between Alzheimer’s patients and controls.

According to the analysis, a few of the most important distinguishing features are:

  • Pronoun to noun ratio. Alzheimer’s patients produce vague statements and tend to substitute pronouns like “he” for nouns like “the boy”. This also applies to adverbial constructions like “the boy is reaching up there” rather than “the boy is reaching into the cupboard”.
  • Usage of high frequency words. Alzheimer’s patients have difficulty remembering specific words and replace them with more general, therefore higher frequency words.

Future directions

Shortly after this research was published, my adviser Frank Rudzicz co-founded WinterLight Labs, a company that’s working on turning this proof-of-concept into an actual usable product. It also diagnoses various other cognitive disorders like Primary Progressive Aphasia.

A few other grad students in my research group are working on Talk2Me, which is a large longitudinal study to collect more data from patients with various neurodegenerative disorders. More data is always helpful for future research.

So this is the starting point for my research. Stay tuned for updates!