Representation Learning for Discovering Phonemic Tone Contours

My paper titled “Representation Learning for Discovering Phonemic Tone Contours” was recently presented at the SIGMORPHON workshop, held concurrently with ACL 2020. This is joint work with Jing Yi Xie and Frank Rudzicz.

Problem: Can an algorithm learn the shapes of phonemic tones in a tonal language, given a list of spoken words?

Answer: We train a convolutional autoencoder to learn a representation for each contour, then use the mean shift algorithm to find clusters in the latent space.
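
To make the pipeline concrete, here is a minimal sketch in PyTorch and scikit-learn. It is not the architecture or hyperparameters from the paper, just an illustration of the two steps: learn a low-dimensional representation of each pitch contour with a small convolutional autoencoder, then run mean shift on the latent vectors and decode the cluster centers. The contour length, latent size, and the random placeholder data are all made up for the example.

import torch
import torch.nn as nn
from sklearn.cluster import MeanShift

class ContourAutoencoder(nn.Module):
    def __init__(self, contour_len=32, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(8 * contour_len, latent_dim),
        )
        self.decoder = nn.Linear(latent_dim, contour_len)

    def forward(self, x):                  # x: (batch, 1, contour_len)
        z = self.encoder(x)                # (batch, latent_dim)
        return self.decoder(z), z          # reconstruction, latent code

# Placeholder data: one normalized pitch contour per syllable
contours = torch.randn(1000, 1, 32)
model = ContourAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    recon, z = model(contours)
    loss = nn.functional.mse_loss(recon, contours.squeeze(1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Mean shift picks the number of clusters automatically; decoding each
# cluster center gives a prototypical contour for that cluster.
with torch.no_grad():
    _, z = model(contours)
    clusters = MeanShift().fit(z.numpy())
    centers = torch.tensor(clusters.cluster_centers_, dtype=torch.float32)
    prototypes = model.decoder(centers)    # (n_clusters, contour_len)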

sigmorphon1

By feeding the centers of each cluster into the decoder, we produce a prototypical contour that represents each cluster. Here are the results for Mandarin and Cantonese.

sigmorphon2

We evaluate on mutual information with the ground truth tones, and the method is partially successful, but contextual effects and allophonic variation present considerable difficulties.

For the full details, read my paper here!

Deep Learning for NLP: SpaCy vs PyTorch vs AllenNLP

Deep neural networks have become very popular in recent years, producing state-of-the-art results in many areas of NLP, like sentiment analysis, text summarization, question answering, and more. In this blog post, we compare three popular deep learning frameworks for NLP: SpaCy, PyTorch, and AllenNLP, looking at their advantages, disadvantages, and use cases.

SpaCy

Pros: easy to use, very fast, ready for production

Cons: not customizable, internals are opaque

spacy_logo.jpg

SpaCy is a mature and batteries-included framework that comes with prebuilt models for common NLP tasks like classification, named entity recognition, and part-of-speech tagging. It’s very easy to train a model with your data: all the gritty details like tokenization and word embeddings are handled for you. SpaCy is written in Cython which makes it faster than a pure Python implementation, so it’s ideal for production.
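
As a taste of how little code a standard task takes, here is a small example using SpaCy's prebuilt English pipeline. It assumes the small English model has been installed with python -m spacy download en_core_web_sm; the sentence is just an illustration.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Part-of-speech tags and named entities come for free with the prebuilt model
for token in doc:
    print(token.text, token.pos_)       # e.g. "Apple PROPN"
for ent in doc.ents:
    print(ent.text, ent.label_)         # e.g. "Apple ORG", "$1 billion MONEY"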

The design philosophy is that the user should only worry about the task at hand, and not the underlying details. If a newer and more accurate model comes along, SpaCy can update itself to use the improved model, and the user doesn’t need to change anything. This is good for getting a model up and running quickly, but leaves little room for an NLP practitioner to customize the model if the task doesn’t exactly match one of SpaCy’s prebuilt models. For example, you can’t build a classifier that combines text, numerical, and image data to produce a classification.

PyTorch

Pros: very customizable, widely used in deep learning research

Cons: fewer NLP abstractions, not optimized for speed

pytorch_logo.jpeg

PyTorch is a deep learning framework by Facebook, popular among researchers for all kinds of DL models, like image classifiers, deep reinforcement learning agents, and GANs. It uses a clear and flexible design where the model architecture is defined with straightforward Python code (rather than TensorFlow’s static computational graph design).
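
For example, here is a toy text classifier in PyTorch (the sizes are arbitrary, chosen just to show the style): the architecture is ordinary Python code, and the forward pass runs eagerly, so you can drop print statements or a debugger anywhere.

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):               # (batch, seq_len) of word indices
        embedded = self.embedding(token_ids)    # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)    # final hidden state
        return self.fc(hidden.squeeze(0))       # (batch, num_classes) logits

model = TextClassifier()
logits = model(torch.randint(0, 10000, (8, 20)))  # toy batch: 8 sequences of 20 tokens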

NLP-specific functionality, like tokenization and managing word embeddings, is available in torchtext. However, PyTorch is a general-purpose deep learning framework and has relatively few NLP abstractions compared to SpaCy and AllenNLP, which are designed for NLP.

AllenNLP

Pros: excellent NLP functionality, designed for quick prototyping

Cons: not yet mature, not optimized for speed

allennlp_logo.jpg

AllenNLP is built on top of PyTorch and designed for rapid prototyping of NLP models for research purposes. It supports a lot of NLP functionality out of the box, like text preprocessing and character embeddings, and abstracts away the training loop (whereas in PyTorch you have to write the training loop yourself). Currently, AllenNLP is not yet at a 1.0 stable release, but it looks very promising.

Unlike PyTorch, AllenNLP’s design decouples what a model “does” from the architectural details of “how” it’s done. For example, a Seq2VecEncoder is any component that takes a sequence of vectors and outputs a single vector. You can use GloVe embeddings and average them, or you can use an LSTM, or you can put in a CNN. All of these are Seq2VecEncoders so you can swap them out without affecting the model logic.
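
Here is the idea sketched in plain PyTorch rather than AllenNLP's actual classes: anything that maps a (batch, sequence, dimension) tensor to a (batch, dimension) tensor plays the role of a Seq2VecEncoder, so the rest of the model never needs to know which encoder it got. The class and function names below are made up for illustration.

import torch
import torch.nn as nn

class BagOfEmbeddings(nn.Module):
    def forward(self, x):                    # (batch, seq_len, dim)
        return x.mean(dim=1)                 # (batch, dim)

class LSTMEncoder(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        return hidden.squeeze(0)             # (batch, dim)

# The rest of the model doesn't care which encoder it gets:
def classify(encoder, embeddings, classifier):
    return classifier(encoder(embeddings))

classifier = nn.Linear(100, 2)
embeddings = torch.randn(8, 20, 100)
for encoder in (BagOfEmbeddings(), LSTMEncoder()):
    print(classify(encoder, embeddings, classifier).shape)  # torch.Size([8, 2])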

The talk “Writing code for NLP Research” presented at EMNLP 2018 gives a good overview of AllenNLP’s design philosophy and its differences from PyTorch.

Which is the best framework?

It depends on how much you care about flexibility, ease of use, and performance.

  • If your task is fairly standard, then SpaCy is the easiest to get up and running. You can train a model using a small amount of code, you don’t have to think about whether to use a CNN or RNN, and the API is clearly documented. It’s also well optimized to deploy to production.
  • AllenNLP is the best for research prototyping. It supports all the bells and whistles that you’d include in your next research paper, and encourages you to follow the best practices by design. Its functionality is a superset of PyTorch’s, so I’d recommend AllenNLP over PyTorch for all NLP applications.

There are a few runners-up that I will mention briefly:

  • NLTK / Stanford CoreNLP / Gensim are popular libraries for NLP. They’re good libraries, but they don’t do deep learning, so they can’t be directly compared here.
  • Tensorflow / Keras are also popular for research, especially for Google projects. Tensorflow is the only framework supported by Google’s TPUs, and it also has better multi-GPU support than PyTorch. However, multi-GPU setups are relatively uncommon in NLP, and furthermore, its computational graph model is harder to debug than PyTorch’s model, so I don’t recommend it for NLP.
  • PyText is a new framework by Facebook, also built on top of PyTorch. It defines a network using pre-built modules (similar to Keras) and supports exporting models to Caffe2 to run faster in production. However, it’s very new (only released earlier this month) and I haven’t worked with it myself to form an opinion about it yet.

That’s all, let me know if there’s any that I’ve missed!

Useful properties of ROC curves, AUC scoring, and Gini Coefficients

Receiver Operating Characteristic (ROC) curves and AUC values are often used to score binary classification models in Kaggle and in papers. However, for a long time I found them fairly unintuitive and confusing. In this blog post, I will explain some basic properties of ROC curves that are useful to know for Kaggle competitions, and how you should interpret them.

1.png
Above: Example of a ROC curve

First, the definitions. A ROC curve plots the performance of a binary classifier under various threshold settings; this is measured by true positive rate and false positive rate. If your classifier predicts “true” more often, it will have more true positives (good) but also more false positives (bad). If your classifier is more conservative, predicting “true” less often, it will have fewer false positives but fewer true positives as well. The ROC curve is a graphical representation of this tradeoff.

A perfect classifier has a 100% true positive rate and 0% false positive rate, so its ROC curve passes through the upper left corner of the square. A completely random classifier (ie: predicting “true” with probability p and “false” with probability 1-p for all inputs) will by random chance correctly classify proportion p of the actual true values and incorrectly classify proportion p of the false values, so its true and false positive rates are both p. Therefore, a completely random classifier’s ROC curve is a straight line through the diagonal of the plot.

The AUC (Area Under Curve) is the area under the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between. The range of possible AUC values is [0, 1]. However, if your AUC is below 0.5, that means you can invert all the outputs of your classifier and get a better score, so you did something wrong.

The Gini Coefficient is 2*AUC – 1, and its purpose is to normalize the AUC so that a random classifier scores 0, and a perfect classifier scores 1. The range of possible Gini coefficient scores is [-1, 1]. If you search for “Gini Coefficient” on Google, you will find a closely related concept from economics that measures wealth inequality within a country.
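
Here is a quick sketch with scikit-learn, using made-up labels and scores, to show how the numbers fit together:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")

# roc_curve returns the false/true positive rates at each threshold,
# which is what you would plot to get the ROC curve itself.
fpr, tpr, thresholds = roc_curve(y_true, y_score)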


Why do we care about AUC, why not just score by percentage accuracy?

AUC is good for classification problems with a class imbalance. Suppose the task is to detect dementia from speech, and 99% of people don’t have dementia and only 1% do. Then you can submit a classifier that always outputs “no dementia”, and that would achieve 99% accuracy. It would seem like your 99% accurate classifier is pretty good, when in fact it is completely useless. Using AUC scoring, your classifier would score 0.5.

In many classification problems, the cost of a false positive is different from the cost of a false negative. For example, it is worse to falsely imprison an innocent person than to let a guilty criminal get away, which is why our justice system assumes you’re innocent until proven guilty, and not the other way around. In a classification system, we would use a threshold rule, where everything above a certain probability is treated as 1, and everything below is treated as 0. However, deciding on where to draw the line requires weighing the cost of a false positive versus a false negative — this depends on external factors and has nothing to do with the classification problem.

AUC scoring lets us evaluate models independently of the threshold. This is why AUC is so popular in Kaggle: it enables competitors to focus on developing a good classifier without worrying about choosing the threshold, and lets the organizers choose the threshold later.

(Note: This isn’t quite true — a classifier can sometimes be better at certain thresholds and worse at other thresholds. Sometimes it’s necessary to combine classifiers to get the best one for a particular threshold. Details in the paper linked at the end of this post.)


Next, here’s a mix of useful properties to know when working with ROC curves and AUC scoring.

AUC is not directly comparable to accuracy, precision, recall, or F1-score. If your model is achieving 0.65 AUC, it’s incorrect to interpret that as “65% accurate”. The reason is that AUC exists independently of a threshold and is insensitive to class imbalance, whereas accuracy, precision, recall, and F1-score all require you to pick a threshold, so you’re measuring two different things.

Only relative order matters for AUC score. When computing ROC AUC, we predict a probability for each data point, sort the points by predicted probability, and evaluate how close the ordering is to a perfect ordering of the points. Therefore, AUC is invariant under scaling, or any other transformation that preserves relative order. For example, predicting [0.03, 0.99, 0.05, 0.06] is the same as predicting [0.15, 0.92, 0.89, 0.91] because the relative ordering of the 4 items is the same in both cases.
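
You can check this with scikit-learn using the two prediction vectors above (the labels here are made up for illustration): both give exactly the same AUC.

from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0]
print(roc_auc_score(y_true, [0.03, 0.99, 0.05, 0.06]))  # 0.75
print(roc_auc_score(y_true, [0.15, 0.92, 0.89, 0.91]))  # 0.75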

A corollary of this is that we can’t treat the outputs of an AUC-optimized model as the likelihood that a prediction is true. Some models may be poorly calibrated (e.g., their output is always between 0.3 and 0.32) but still achieve a good AUC score because their relative ordering is correct. This is something to look out for when blending together predictions of different models.

That’s my summary of the most important properties to know about ROC curves. There’s more that I haven’t talked about, like how to compute AUC score. If you’d like to learn more, I’d recommend reading “An introduction to ROC analysis” by Tom Fawcett.

Publishing Negative Results in Machine Learning is like Proving Dragons don’t Exist

I’ve been reading a lot of machine learning papers lately, and one thing I’ve noticed is that the vast majority of papers report positive results — “we used method X on problem Y, and beat the state-of-the-art results”. Very rarely do you see a paper that reports that something doesn’t work.

The result is publication bias — if we only publish the results of experiments that succeed, even statistically significant results could be due to random chance, rather than anything actually significant happening. Many areas of science are facing a replication crisis, where published research cannot be replicated.

There is some community discussion of encouraging more negative paper submissions, but as of now, negative results are rarely publishable. If you attempt an experiment but don’t get the results you expected, your best hope is to try a bunch of variations of the experiment until you get some positive result (perhaps on a special case of the problem), after which you pretend the failed experiments never happened. With few exceptions, any positive result is better than a negative result, like “we tried method X on problem Y, and it didn’t work”.

Why publication bias is not so bad

I just described a cynical view of academia, but actually, there’s a good reason why the community prefers positive results. Negative results are simply not very useful, and contribute very little to human knowledge.

Now why is that? When a new paper beats the state-of-the-art results on a popular benchmark, that’s definite proof that the method works. The reverse is not true: if your model fails to produce good results, it could be due to a number of reasons:

  • Your dataset is too small / too noisy
  • You’re using the wrong batch size / activation function / regularization
  • You’re using the wrong loss function / wrong optimizer
  • Your model is overfitting
  • You have a bug in your code

lattice2.png
Above: Only when everything is correct will you get positive results; many things can cause a model to fail. (Source)

So if you try method X on problem Y and it doesn’t work, you gain very little information. In particular, you haven’t proved that method X cannot work. Sure, you found that your specific setup didn’t work, but have you tried making modification Z? Negative results in machine learning are rare because you can’t possibly anticipate all possible variations of your method and convince people that all of them won’t work.

Searching for dragons

Suppose we’re scientists attending the International Conference of Flying Creatures (ICFC). Somebody mentioned it would be nice if we had dragons. Dragons are useful. You could do all sorts of cool stuff with a dragon, like ride it into battle.

1.jpg

“But wait!” you exclaim: “Dragons don’t exist!”

I glance at you questioningly: “How come? We haven’t found one yet, but we’ll probably find one soon.”

Your intuition tells you dragons shouldn’t exist, but you can’t articulate a convincing argument why. So you go home, and you and your team of grad students labor for a few years and publish a series of papers:

  • “We looked for dragons in China and we didn’t find any”
  • “We looked for dragons in Europe and we didn’t find any”
  • “We looked for dragons in North America and we didn’t find any”

Eventually, the community is satisfied that dragons probably don’t exist, for if they did, someone would have found one by now. But a few scientists still harbor the possibility that there may be dragons lying around in a remote jungle somewhere. We just don’t know for sure.

This remains the state of things for a few years until a colleague publishes a breakthrough result:

  • “Here’s a calculation that shows that any dragon with a wing span longer than 5 meters will collapse under its own weight”

You read the paper, and indeed, the logic is impeccable. This settles the matter once and for all: dragons don’t exist (or at least the large, flying sort of dragons).

When negative results are actually publishable

The research community dislikes negative results because they don’t prove a whole lot — you can have a lot of negative results and still not be sure that the task is impossible. In order for a negative result to be valuable, it needs to present a convincing argument why the task is impossible, and not just a list of experiments that you tried that failed.

This is difficult, but it can be done. Let me give an example from computational linguistics. Recurrent neural networks (RNNs) can, in theory, compute any function defined over a sequence. In practice, however, they had difficulty remembering long-term dependencies. Attempts to train RNNs using gradient descent ran into numerical difficulties known as the vanishing / exploding gradient problem.

Then, Bengio et al. (1994) formulated a mathematical model of an RNN as an iteratively applied function. Using ideas from dynamical systems theory, they showed that as the input sequence gets longer and longer, the result is more and more sensitive to noise. The details are technical, but the gist of it is that under some reasonable assumptions, training RNNs using gradient descent is impossible. This is a rare example of a negative result in machine learning — it’s an excellent paper and I’d recommend reading it.

3.png
Above: A Long Short Term Memory (LSTM) network handles long term dependencies by adding a memory cell (Source)

Soon after the vanishing gradient problem was understood, researchers invented the LSTM (Hochreiter and Schmidhuber, 1997). Since training RNNs with gradient descent was hopeless, they added a ‘latching’ mechanism that allows state to persist through many iterations, thus avoiding the vanishing gradient problem. Unlike plain RNNs, LSTMs can handle long term dependencies and can be trained with gradient descent; they are among the most ubiquitous deep learning architectures in NLP today.


After reading the breakthrough dragon paper, you pace around your office, thinking. Large, flying dragons can’t exist after all, as they would collapse under their own weight — but what about smaller, non-flying dragons? Maybe we’ve been looking for the wrong type of dragons all along? Armed with new knowledge, you embark on a new search…

4.jpg
Above: Komodo Dragon, Indonesia

…and sure enough, you find one 🙂

XGBoost learns the Canadian Flag

XGBoost is a machine learning library that’s great for classification tasks. It’s often seen in Kaggle competitions, and usually beats other classifiers like logistic regression, random forests, SVMs, and shallow neural networks. One day, I was feeling slightly patriotic, and wondered: can XGBoost learn the Canadian flag?

canada_original.png
Above: Our home and native land

Let’s find out!

Preparing the dataset

The task is to classify each pixel of the Canadian flag as either red or white, given a limited number of sample points. First, we read in the image with R and extract the green channel, which is close to 0 on red pixels and close to 1 on white pixels (the red channel is 1 everywhere, so it can’t tell the two colors apart):

library(png)
library(ggplot2)
library(xgboost)

img <- readPNG("canada.png")
# Use the green channel: it is ~0 on red pixels and ~1 on white pixels
# (the red channel is 1 for both colors, so it can't tell them apart)
green <- img[,,2]

HEIGHT <- dim(green)[1]
WIDTH <- dim(green)[2]

Next, we sample 7500 random points for training. Also, to make it more interesting, each point has a probability 0.05 of flipping to the opposite color.

ERROR_RATE <- 0.05

get_data_points <- function(N) {
  x <- sample(1:WIDTH, N, replace = T)
  y <- sample(1:HEIGHT, N, replace = T)
  # Label each sampled pixel: red = 1, white = 0
  p <- 1 - round(green[cbind(y, x)])
  # Flip each label to the opposite color with probability ERROR_RATE
  flips <- sample(c(0, 1), N, replace = T,
                  prob = c(1 - ERROR_RATE, ERROR_RATE))
  p[flips == 1] <- 1 - p[flips == 1]
  data.frame(x = as.numeric(x), y = as.numeric(y), p = p)
}

data <- get_data_points(7500)

This is what our classifier sees:

noisy.png

Alright, let’s start training.

Quick introduction to XGBoost

XGBoost implements gradient boosted decision trees, which were first proposed by Friedman in 1999.

1.png

Above: XGBoost learns an ensemble of short decision trees

The output of XGBoost is an ensemble of decision trees. Each individual tree by itself is not very powerful, containing only a few branches. But through gradient boosting, each subsequent tree tries to correct for the mistakes of all the trees before it, making the model better. After many iterations, we get a set of decision trees; the sum of all their outputs is our final prediction.

For more technical details of how this works, refer to this tutorial or the XGBoost paper.

Experiments

Fitting an XGBoost model is very easy in R. For this experiment, we use decision trees of depth 3, but you can play with the hyperparameters.

fit <- xgboost(data = matrix(c(data$x, data$y), ncol = 2), label = data$p,
               nrounds = 1,
               max_depth = 3)

We also need a way of visualizing the results. To do this, we run every pixel through the classifier and display the result:

plot_canada <- function(dataplot) {
  dataplot$y <- -dataplot$y
  dataplot$p <- as.factor(dataplot$p)

  ggplot(dataplot, aes(x = x, y = y, color = p)) +
    geom_point(size = 1) +
    scale_x_continuous(limits = c(0, 240)) +
    scale_y_continuous(limits = c(-120, 0)) +
    theme_minimal() +
    theme(panel.background = element_rect(fill='black')) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    scale_color_manual(values = c("white", "red"))
}

fullimg <- expand.grid(x = as.numeric(1:WIDTH), y = as.numeric(1:HEIGHT))
fullimg$p <- predict(fit, newdata = matrix(c(fullimg$x, fullimg$y), ncol = 2))
fullimg$p <- as.numeric(fullimg$p > 0.5)

plot_canada(fullimg)

In the first iteration, XGBoost immediately learns the two red bands at the sides:

round1.png

After a few more iterations, the maple leaf starts to take form:

round7.png

round15

round60

By iteration 60, it learns a pretty recognizable maple leaf. Note that the decision trees split on x or y coordinates, so XGBoost can’t learn diagonal decision boundaries, only approximate them with horizontal and vertical lines.

If we run it for too long, then it starts to overfit and capture the random noise in the training data. In practice, we would use cross validation to detect when this is happening. But why cross-validate when you can just eyeball it?

round300.png

That was fun. If you liked this, check out this post, which explores various classifiers using the flag of Australia.

The source code for this blog post is posted here. Feel free to experiment with it.

Kaggle Speech Recognition Challenge

For the past few weeks, I’ve been working on the TensorFlow Speech Recognition Challenge on Kaggle. The task is to recognize a one-second audio clip, where the clip contains one of a small number of words, like “yes”, “no”, “stop”, “go”, “left”, and “right”.

In general, speech recognition is a difficult problem, but it’s much easier when the vocabulary is limited to a handful of words. We don’t need a complicated pipeline that detects phonemes and then strings the phonemes into words, like Kaldi does for general speech recognition. Instead, a convolutional neural network works quite well.

First Steps

The dataset consists of about 64000 audio files which have already been split into training / validation / testing sets. You are then asked to make predictions on about 150000 audio files for which the labels are unknown.

Actually, this dataset had already been published in academic literature, and people published code to solve the same problem. I started with GCommandPytorch by Yossi Adi, which implements a speech recognition CNN in Pytorch.

The first step is to convert each audio file into a spectrogram, which is an image representation of sound. This is easily done using LibRosa.

1.png
Above: Sample spectrograms of “yes” and “no”
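
The conversion itself is only a few lines with LibRosa; this sketch uses an example filename and typical parameters, not necessarily the exact settings from GCommandPytorch:

import librosa

waveform, sr = librosa.load("yes_0001.wav", sr=16000)   # one-second clip
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)   # log-scaled spectrogram, roughly what the CNN sees
print(log_mel.shape)                 # (n_mels, n_frames): a small grayscale "image"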

Now we’ve converted the problem to an image classification problem, which is well studied. To an untrained human observer, all the spectrograms may look the same, but neural networks can learn things that humans can’t. Convolutional neural networks work very well for classifying images, for example VGG16:

2.png
Above: A Convolutional Neural Network (LeNet). VGG16 is similar, but has even more layers.

For more details about this approach, refer to these papers:

  1. Convolutional Neural Networks for Small-footprint Keyword Spotting
  2. Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting

Voice Activity Detection

You might ask: if somebody has already implemented this, then what’s left to do other than run their code? Well, the test data contains “silence” samples, which contain background noise but no human speech. It also has words outside the set we care about, which we need to label as “unknown”. The Pytorch CNN produces about 95% validation accuracy by itself, but the accuracy is much lower when we add these two additional requirements.

For silence detection, I first tried the simplest thing I could think of: taking the maximum absolute value of the waveform and deciding it’s “silence” if that value is below a threshold. When combined with VGG16, this gets an accuracy of 0.78 on the leaderboard. This is a crude heuristic, because sufficiently loud background noise gets classified as speech.
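
In code, the heuristic is just a couple of lines (the threshold value below is only an example, not the one I actually used):

import numpy as np

def is_silence(waveform, threshold=0.02):
    # Treat the clip as silence if its peak amplitude never exceeds the threshold
    return np.max(np.abs(waveform)) < threshold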

Next, I tried running openSMILE, which I use in my research to extract various acoustic features from audio. It implements an LSTM for voice activity detection: every 0.05 seconds, it outputs a probability that someone is talking. Combining the openSMILE output with the VGG16 prediction gave a score of 0.81.

More improvements

I tried a bunch of things to improve my score:

  1. Fiddled around with the neural network hyperparameters which boosted my score to 0.85. Each epoch took about 10 minutes on a GPU, and the whole model takes about 2 hours to train. Somehow, Adam didn’t produce good results, and SGD with momentum worked better.
  2. Took 100% of the data for training and used the public LB for validation (don’t do this in real life lol). This improved my score to 0.86.
  3. Trained an ensemble of 3 versions of the same neural network with the same hyperparameters but different randomly initialized weights, and took a majority vote for prediction. This improved the score to 0.87. I would’ve liked to train more, but other people in my research group needed to use the GPUs.

In the end, the top scoring model had a score of 0.91, which beat my model by 4 percentage points. Although not enough to win a Kaggle medal, my model was in the top 15% of all submissions. Not bad!

My source code for the contest is available here.

Simple models in Kaggle competitions

This week I participated in the Porto Seguro Kaggle competition. Basically, you’re asked to predict a binary variable — whether or not an insurance claim will be filed — based on a bunch of numerical and categorical variables.

With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to beginning data scientists.

Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.


The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.

Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.

At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite being within 10% of the top score, I was still in the bottom half of the leaderboard, ranking around the 30th percentile.

The public leaderboard looked like this:

Rplot.png
Above: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. Line in green is my submission, scoring 0.265.

This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything other than “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but after that, further improvements are hard to come by.

The top of the leaderboard is very crowded, with over 1000 teams having a score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat.

Above: To run 90% as fast as Usain Bolt, you need to run 100 meters in 10.5 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().

This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, at least within the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then 5000 smart people working for 2 months can only improve it by 10%. Perhaps this is obvious to someone who has been doing machine learning long enough, but we should look back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.

Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.

Read further discussion of this post on the Kaggle forums!

Paper Review: Linguistic Features to Identify Alzheimer’s Disease

Today I’m going to be sharing a paper I’ve been looking at, related to my research: “Linguistic Features Identify Alzheimer’s Disease in Narrative Speech” by Katie Fraser, Jed Meltzer, and my adviser Frank Rudzicz. The paper was published in 2016 in the Journal of Alzheimer’s Disease. It uses NLP to automatically diagnose patients with Alzheimer’s disease, given a sample of their speech.


Alzheimer’s is a disease that you have probably heard of, but it doesn’t get as much attention in the media as cancer or stroke. It is a neurodegenerative disease that mostly affects elderly people. 5 million Americans are living with Alzheimer’s, including 1 in 9 people over the age of 65, and 1 in 3 over the age of 85.

Alzheimer’s is also the most expensive disease in America. After diagnosis, patients may continue to live for over 10 years, and during much of this time, they are unable to care for themselves and require a constant caregiver. In 2017, Medicare and Medicaid covered about 68% of the cost of caring for patients with Alzheimer’s, and this cost is expected to increase as the elderly population grows.

Despite a lot of recent advances in our understanding of the disease, there is currently no cure for Alzheimer’s. Since the disease is so prevalent and harmful, research in this direction is highly impactful.

Previous tests to diagnose Alzheimer’s

One of the early signs of Alzheimer’s is having difficulty remembering things, including words, leading to a decrease in vocabulary. A reliable way to test for this is a retrieval question like the following (Monsch et al., 1992):

In the next 60 seconds, name as many items as possible that can be found in a supermarket.

A healthy person could rattle out about 20-30 items in a minute, whereas someone with Alzheimer’s could only produce about 10. By setting the threshold at 16 items, they could classify even mild cases of Alzheimer’s with about 92% accuracy.

This test doesn’t capture all the signs of Alzheimer’s disease, though. Patients with Alzheimer’s tend to be rambly and incoherent. This can be tested with a picture description task, where the patient is given a picture and asked to describe it with as much detail as possible (Giles, Patterson, Hodges, 1994).

Above: Boston Cookie Theft picture used for the picture description task

There is no time limit; the patients talked until they indicated they had nothing more to say, or until they didn’t say anything for 15 seconds.

Patients with Alzheimer’s disease produced descriptions with varying degrees of incoherence. Here’s an example transcript, from the above paper:

Experimenter: Tell me everything you see going on in this picture

Patient: oh yes there’s some washing up going on / (laughs) yes / …… oh and the other / ….. this little one is taking down the cookie jar / and this little girl is waiting for it to come down so she’ll have it / ………. er this girl has got a good old splash / she’s left the taps on (laughs) she’s gone splash all down there / um …… she’s got splash all down there

You can clearly tell that something’s off, but it’s hard to put a finger on exactly what the problem is. Well, time to apply some machine learning!

Results of Paper

Fraser’s 2016 paper uses data from the DementiaBank corpus, consisting of 240 narrative samples from patients with Alzheimer’s, and 233 from a healthy control group. The two groups were matched to have similar age, gender, and education levels. Each participant was asked to describe the Boston Cookie Theft picture above.

Fraser’s analysis used both the original audio data and a detailed computer-readable transcript. She looked at 370 different features covering all sorts of linguistic metrics, like ratios of different parts of speech, syntactic structures, vocabulary richness, and repetition. Then, she performed a factor analysis and identified a set of 35 features that achieves about 81% accuracy in distinguishing between Alzheimer’s patients and controls.

According to the analysis, a few of the most important distinguishing features are:

  • Pronoun to noun ratio. Alzheimer’s patients produce vague statements and tend to substitute pronouns like “he” for nouns like “the boy”. This also applies to adverbial constructions like “the boy is reaching up there” rather than “the boy is reaching into the cupboard”.
  • Usage of high frequency words. Alzheimer’s patients have difficulty remembering specific words and replace them with more general, and therefore higher-frequency, words.

Future directions

Shortly after this research was published, my adviser Frank Rudzicz co-founded WinterLight Labs, a company that’s working on turning this proof-of-concept into an actual usable product. It also diagnoses various other cognitive disorders like Primary Progressive Aphasia.

A few other grad students in my research group are working on Talk2Me, which is a large longitudinal study to collect more data from patients with various neurodegenerative disorders. More data is always helpful for future research.

So this is the starting point for my research. Stay tuned for updates!

What’s the difference between Mathematics and Statistics?

Statistics has a sort of funny and peculiar relationship with mathematics. In a lot of universities, they’re lumped together and you have a “Department of Mathematics and Statistics”. Other times, statistics is grouped as a branch of applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s “not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking.

mathstats.png

Above: if pure math and statistics were like games


Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so “you know it when you see it”; things are rarely so black-and-white as in mathematics. This is borne out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take for example the concept of an “outlier”. Many statistical methods behave badly when the data contains outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far it is from the rest of the points, and what kind of model you’re fitting.

scatterplot1

In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.


For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B-test and find that changing a button to blue results in higher clicks, with p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There is no correct answer, you have to use your judgment.


Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and if this assumption is violated then you’ll get wrong results, so you need to use a different model.


1-qYXXQN-1eumcnYgTJZnaww

Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer, you have to use your judgment.


Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this:

Rplot

There is some evidence of non-linearity, but how much “bendiness” can you accept before the data is definitely not “roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.


I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike math, there is no universal procedure that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

  • A random variable is drawn from a normal (Gaussian) distribution
  • Two random variables are independent
  • Two random variables satisfy a linear relationship
  • Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: “all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belong in this category.


Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how can you be sure that anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging “correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. The best known approximation algorithm for the (metric) Traveling Salesman problem has an approximation ratio of 1.5 – this means that even for the worst possible input, the algorithm gives a path that’s no more than 1.5 times as long as the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions.

Above: the winning neural network architecture for the ImageNet Challenge 2012. Currently, theory fails at explaining why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data, but the sample-complexity bounds it gives are usually too loose to tell you how much data you need to achieve a desired accuracy rate.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.


Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better – both approaches have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every estimate comes with a confidence interval, within which you can expect the true value to lie about 95% of the time, but we can never be 100% sure about anything. But given enough data, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data they were trained on. In other words, if we fit a regression model on data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.

Applying to Graduate School in Computer Science

So you’re thinking of grad school. Four years of undergrad is not enough for you and you’re craving more knowledge. Or perhaps you want to delay your entry into the “real world” for a couple more years. Well, grad school is the right place!

About a year ago, I decided I wanted to do grad school. However, most of my peers were on track to work in the industry after graduation. The process of applying for grad school is daunting, especially since it varies from country to country and from one subject to another. This is why I am writing a guide to grad school applications in Computer Science, for Canada and the USA: a compilation of things I wish I knew a year ago.

Why grad school?

People go to grad school for different reasons. Most people I know in computer science and software engineering plan to start working full-time — a reasonable choice, given the high salaries in the industry right now. I figured that I had the rest of my life to code for a company; there’s no rush to start working immediately.

Graduate programs come in three broad categories:

  1. Coursework Master’s. Typically about 1 year, this is basically an extension of undergrad. You take a bunch of graduate courses, which are more in-depth than undergrad classes, but you don’t do any research. This is useful for gaining more knowledge before going to work in the industry.
  2. Thesis Master’s. About 1-2 years, depending on the school. At first, you take some courses like in coursework master’s, but the primary goal is to produce a master’s thesis, which is equivalent to about one conference paper of original research. This is a good way to get some research experience, without the time commitment of a Ph.D (and is valuable as a stepping stone if you do decide to get one).
  3. Ph.D. A longer commitment of 4-5 years. In a Ph. D., you produce an extensive amount of original research; by the time you write your thesis, you will be a world expert on your specific topic. I like this illustrated explanation of what it’s like to do a Ph. D.

There are hybrid programs too: thesis master’s programs often transition directly into a Ph.D., and there are also regional differences in how these programs work (more on this later).

Can I get into grad school?

As you may expect, top grad school programs are very competitive, and a typical grad student is a top student in their undergraduate class. So what do schools look for in the admissions process?

Grades are a big factor: generally, an undergrad GPA of 85% or higher is good for grad school (with 80% being the bare minimum). However, even more important than GPA is your research experience. Publishing papers in conferences would be ideal, and research experience can make up for a lackluster transcript.

Unfortunately, Waterloo students are at a disadvantage here: with the co-op program, most people spend their undergrad years focusing on internships rather than research, which is considered less valuable. Don’t be discouraged though: my only research experience was through two part-time URA’s, and I have zero publications, but I still got into a good grad school.

Picking a research area

In grad school, you specialize in a specific area of computer science, for example, machine learning, or databases, or theory, or programming languages. You have to indicate roughly what area you want to study in your application, but it’s okay to not know exactly what you want to research.

For me, I wanted to do something involving artificial intelligence or data science or machine learning. Eventually I decided on natural language processing (NLP), since it’s an area of machine learning, and I like studying languages.

Some people have a specific professor that they want to work with, in which case it’s helpful to reach out to them beforehand and mention it in your statement of purpose. Otherwise, as in my case, you don’t need to explicitly contact potential advisers if you have nothing to say; you get to indicate your adviser preferences in your application.

Choosing your schools

The most important thing to look for in a grad school is the quality of the research group. You may be tempted to look at overall computer science rankings, but this can be misleading because different schools have strengths in different research areas. There are other factors to consider, like location, city environment (big city or college town), and social life.

It’s a good idea to apply to a variety of schools of different levels of competitiveness. However, each application costs about $100, so it can be expensive to apply to too many — 5-10 applications is a good balance.

I decided to apply to five schools: two in Canada and three in the USA. My main criteria were (1) a reputable research program in NLP, and (2) a location in a big city. After some deliberation, I decided to apply to the following:

  • Ph. D. at University of Washington
  • Ph. D. at UC Berkeley
  • Ph. D. at Columbia University
  • M. Sc. at University of Toronto
  • M. Sc. at University of British Columbia

I didn’t apply to the University of Waterloo, where I’m doing my undergrad, despite it being pretty highly ranked in Canada — after studying here for five years, I needed a change of scenery.

Differences between Canada and USA

You might have noticed that my three applications in the USA were all Ph. D. programs, while my two in Canada were master’s. Graduate school works quite differently in Canada vs in the USA. In Canada, most students do a master’s after undergrad and then go on to do a Ph. D., but in the USA, you enter into Ph. D. directly after undergrad, skipping the master’s.

There are master’s programs in the USA too, but they are almost exclusively coursework master’s, and are very expensive ($50k+ tuition per year). In contrast, thesis master’s programs in Canada and Ph. D. programs in the USA are fully funded, so you get paid a stipend of around $20k-30k a year.

A big reason to do a master’s in the USA is for visa purposes: for Chinese and Indian citizens, getting the H-1B is much easier with a master’s from a school in the country, so the investment can be worth it. Otherwise, it’s probably not worth getting a master’s in the USA; studying in Canada is much cheaper.

If you go to grad school in Canada, you can apply for the CGS-M and OGS government scholarships for master’s students. Unfortunately, Canadian citizens are ineligible for most scholarships if you study in the USA.

Taking the GRE

Another difference is that the Graduate Record Exam (GRE) is required for all grad school admissions in the USA. This is a 4-hour-long computer-administered test with reading, writing, and math components. If you’ve taken the SAT, this test is very similar. For grad school applications in computer science, only the general exam is required, and not the computer science subject test.

The GRE plays a fairly minor role in admissions: a terrible score will hurt your chances, but a perfect score will not help you that much. The quantitative and verbal sections are scored between 130-170, and for science and engineering programs, a good score is around 165+ for quantitative and 160+ for verbal.

The quantitative (math) section is a cakewalk for any computer science major, but the verbal section can be challenging if English is not your native language. It does require some preparation (1-6 months is recommended). I studied for a month and did quite well.

Applications are due on December 15 for most schools, so you should take the GRE in October at the latest (and earlier if you plan on taking it multiple times).

Letters of Recommendation

Most grad school and scholarship applications require three letters of recommendation; out of all requirements, this one requires the most planning. The ideal recommendation comes from a professor that you have done research with. If you go to Waterloo and are considering grad school, doing a part-time URA (undergraduate research assistantship) is a good way to secure a few recommendation letters.

It may be difficult to find three professors that you’ve worked with, so the next best thing is a weaker letter from a professor whose class you did well in. As a last resort, at most one letter may come from a non-academic source (like your co-op supervisor). I was lucky that one of my research projects was co-supervised by two different professors, so I got two letters that way.

Statement of Purpose

The statement of purpose is a two-page essay where you describe your academic history and research interests, and convince the admissions committee that you are the ideal candidate to admit. If you have internship experience, talk about what you learned and why it’s relevant for research.

Chances are that the first revision of your statement of purpose will suck (this was certainly the case for me), so get friends and professors to proofread it. After several revisions, here’s my final statement of purpose.

Offers of Admission

That’s about it — after you hit submit on all your applications by December 15, you can sit back and enjoy your final semester. With any luck, you will receive this in your inbox around the beginning of February:

Untitled

In the end, I got accepted to both master’s programs in Canada (UBC and UofT), but got rejected from all three Ph. D. programs in the USA (Washington, Berkeley, and Columbia). I chose to accept the UofT offer, where I will study NLP starting this fall.

Hopefully this guide has been helpful, and good luck with your applications!