The biggest headache with Chinese NLP: indeterminate word segmentation

I’ve had a few opportunities to work with NLP in Chinese. English and Chinese are very different languages, yet generally the same techniques apply to both. But there is one source of frustration that comes up from time to time, and it’s perhaps not what you’d expect.

The difficulty is that Chinese doesn’t put spaces between words. Soallyourwordsarejumbledtogetherlikethis.

“Okay, that’s fine,” you say. “We’ll just have to run a tokenizer to separate the words before we do anything else. And here’s a neural network that can do this with 93% accuracy (Qi et al., 2020). That should be good enough, right?”

Well, kind of. Accuracy here isn’t very well-defined because Chinese people don’t know how to segment words either. When you ask two native Chinese speakers to segment a sentence into words, they only agree about 90% of the time (Wang et al., 2017). Chinese has a lot of compound words and multi-word expressions, so there’s no widely accepted definition of what counts as a word. Some examples: 吃饭,外国人,开车,受不了. It is also possible (but rare) for a sentence to have multiple segmentations that mean different things.

Arguably, word boundaries are ill-defined in all languages, not just Chinese. Haspelmath (2011) defined 10 linguistic criteria to determine if something is a word (vs an affix or expression), but it’s hard to come up with anything consistent. Most writing systems put spaces between words, so in practice there’s no confusion. Other than Chinese, only a handful of other languages (Japanese, Vietnamese, Thai, Khmer, Lao, and Burmese) have this problem.

Word segmentation ambiguity causes problems in NLP systems when different components expect different ways of segmenting a sentence. Another way the problem can appear is if the segmentation for some human-annotated data doesn’t match what a model expects.

Here is a more concrete example from one of my projects. I’m trying to get a language model to predict a tag for every word (imagine POS tagging using BERT). The language model uses SentencePiece encoding, so when a word is out-of-vocab, it gets converted into multiple subword tokens.

“expedite ratification of the proposed law”
=> [“expedi”, “-te”, “ratifica”, “-tion”, “of”, “the”, “propose”, “-d”, “law”]

In English, a standard approach is to use the first subword token of every word, and ignore the other tokens, like this:

This doesn’t work in Chinese — because of the word segmentation ambiguity, the tokenizer might produce tokens that span across multiple of our words:
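
Both cases can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the project: the function name `first_subword_indices` is hypothetical, the subword continuation markers are assumed to be stripped already, and the Chinese tokens are a made-up example of a tokenizer that disagrees with the annotated segmentation.

```python
def first_subword_indices(words, subwords):
    """Return the index of the first subword token of each word.

    `words` is the reference segmentation and `subwords` is the tokenizer
    output with continuation markers already stripped (an assumption of
    this sketch). Returns None when some word does not start at a token
    boundary, i.e. a token spans across a word boundary.
    """
    if "".join(words) != "".join(subwords):
        raise ValueError("words and subwords must spell the same text")

    # Character offset at which each subword token starts.
    token_at_offset = {}
    offset = 0
    for index, token in enumerate(subwords):
        token_at_offset[offset] = index
        offset += len(token)

    # Look up each word's starting offset among the token starts.
    indices = []
    offset = 0
    for word in words:
        if offset not in token_at_offset:
            return None  # this word begins in the middle of a token
        indices.append(token_at_offset[offset])
        offset += len(word)
    return indices


english_words = ["expedite", "ratification", "of", "the", "proposed", "law"]
english_subwords = ["expedi", "te", "ratifica", "tion", "of", "the", "propose", "d", "law"]
print(first_subword_indices(english_words, english_subwords))  # [0, 2, 4, 5, 6, 8]

# Made-up Chinese example: the annotation splits 外国人 into two words,
# but the (hypothetical) tokenizer keeps it as a single token.
chinese_words = ["外国", "人", "开车"]
chinese_subwords = ["外国人", "开", "车"]
print(first_subword_indices(chinese_words, chinese_subwords))  # None
```

In the English case every word starts at a token boundary, so picking the first subword is well-defined. In the Chinese case the word 人 starts inside the token 外国人, so there is no first subword to pick.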

So that’s why Chinese is sometimes headache-inducing when you’re doing multilingual NLP. You can work around the problem in a few ways:

  1. Ensure that all parts of the system use a consistent word segmentation scheme. This is easy if you control all the components, but hard when working with other people’s models and data.
  2. Work on the level of characters and don’t do word segmentation at all. This is what I ended up doing, and it’s not too bad, because individual characters do carry semantic meaning. But some words are unrelated to their character meanings, like transliterations of foreign words.
  3. Do some kind of segment alignment using Levenshtein distance — see the appendix of this paper by Tenney et al. (2019). I’ve never tried this method.
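
For the character-level option, one possible way to reuse word-level annotations is to copy each word’s tag onto its characters. The sketch below is an assumption about how this could be done, not a description of the actual project code: the function name and the B-/I- prefix scheme are illustrative choices, and the tags in the example are made up.

```python
def word_tags_to_char_tags(words, tags):
    """Expand one tag per word into one tag per character.

    The first character of a word gets a "B-" prefix and the remaining
    characters get "I-", so the original word boundaries stay recoverable.
    """
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")

    chars, char_tags = [], []
    for word, tag in zip(words, tags):
        for position, char in enumerate(word):
            prefix = "B" if position == 0 else "I"
            chars.append(char)
            char_tags.append(f"{prefix}-{tag}")
    return chars, char_tags


# Illustrative input: two words with hypothetical POS tags.
chars, char_tags = word_tags_to_char_tags(["我", "受不了"], ["PRON", "VERB"])
print(list(zip(chars, char_tags)))
```

Since the model then predicts one tag per character, it no longer matters how the tokenizer or the annotators would have grouped the characters into words.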

One final thought: the non-ASCII Chinese characters surprisingly never caused any difficulties for me. I would’ve expected to run into encoding problems occasionally, as I had in the past, but never had any character encoding problems with Python 3.


  1. Haspelmath, Martin. “The indeterminacy of word segmentation and the nature of morphology and syntax.” Folia linguistica 45.1 (2011): 31-80.
  2. Qi, Peng, et al. “Stanza: A python natural language processing toolkit for many human languages.” Association for Computational Linguistics (ACL) System Demonstrations. 2020.
  3. Tenney, Ian, et al. “What do you learn from context? Probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019.
  4. Wang, Shichang, et al. “Word intuition agreement among Chinese speakers: a Mechanical Turk-based study.” Lingua Sinica 3.1 (2017): 13.

Edge probing BERT and other language models

Recently, there has been a lot of research in the field of “BERTology”: investigating what aspects of language BERT does and doesn’t understand, using probing techniques. In this post, I will describe the “edge probing” technique developed by Tenney et al., in two papers titled “What do you learn from context? Probing for sentence structure in contextualized word representations” and “BERT Rediscovers the Classical NLP Pipeline”. On my first read through these papers, the method didn’t seem very complicated, and the authors only spend a paragraph explaining how it works. But upon closer look, the method is actually nontrivial, and took me some time to understand. The details are there, but hidden in an appendix in the first of the two papers.

The setup for edge probing is that you have some classification task that takes a sentence and a span of consecutive tokens within the sentence, and produces an N-way classification. This can be generalized to two spans in the sentence, but we’ll focus on the single-span case. The tasks cover a range of syntactic and semantic functions, including part-of-speech tagging, dependency parsing, and coreference resolution.


Above: Examples of tasks where edge probing may be used. Table taken from Tenney et al. (2019a).

Let’s go through the steps of how edge probing works. Suppose we want to probe for which parts of BERT-BASE contain the most information about named entity recognition (NER). In this NER setup, we’re given the named entity spans and only need to classify which type of entity each one is (e.g., person, organization, location). The first step is to feed the sentence through BERT, giving 12 layers of output, where each layer holds a 768-dimensional vector for every token.


Above: Edge probing example for NER.

The probing model has several stages:

  1. Mix: learn a task-specific linear combination of the layers, giving a single 768-dimensional vector for each span token. The weights that are learned indicate how much useful information for the task is contained in each layer.
  2. Projection: learn a linear mapping from 768 down to 256 dimensions.
  3. Self-attention pooling: learn a function to generate a scalar weight for each span vector. Then, we normalize the weights to sum up to 1, and take a weighted average. The purpose of this is to collapse the variable-length sequence of span vectors into a single fixed-length 256-dimensional vector.
  4. Feedforward NN: learn a multi-layer perceptron classifier with 2 hidden layers.
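
The mix and pooling stages can be sketched in plain Python with toy dimensions. This is a simplified illustration under stated assumptions, not the authors’ implementation: the function names are hypothetical, the weights are fixed numbers instead of learned parameters, softmax is assumed as the way to normalize both the layer weights and the span weights, and the projection and feedforward stages are left out.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_sum(weights, vectors):
    """Combine equal-length vectors using one scalar weight per vector."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def mix_layers(layer_vectors, layer_scores):
    """Mix: combine the per-layer vectors of ONE token into a single vector."""
    return weighted_sum(softmax(layer_scores), layer_vectors)


def attention_pool(span_vectors, scoring_weights):
    """Self-attention pooling: collapse a variable-length span to one vector.

    Each span vector gets a scalar score (here a dot product with
    `scoring_weights`, an assumed form of the scoring function), the scores
    are normalized, and the span vectors are averaged with those weights.
    """
    scores = [sum(w * x for w, x in zip(scoring_weights, vec)) for vec in span_vectors]
    return weighted_sum(softmax(scores), span_vectors)


# Toy example: 3 layers, 2 dimensions, and a span of 2 tokens.
token_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per layer
token_b = [[0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]
layer_scores = [0.0, 0.0, 1.0]                  # would be learned by the probe
span = [mix_layers(token, layer_scores) for token in (token_a, token_b)]
print(attention_pool(span, scoring_weights=[0.5, -0.5]))
```

In the real setup the vectors are 768-dimensional, the projection to 256 dimensions sits between the two stages, and the pooled vector is passed to the MLP classifier.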

For the two-span case, they use two separate sets of weights for the mix, projection, and self-attention steps. Then, the feedforward neural network takes a concatenated 512-dimensional vector instead of a 256-dimensional vector.

The edge probing model needs to be trained on a dataset of classification instances. The probe has weights that are initialized randomly and trained using gradient descent, but the BERT weights are kept constant while the probe is being trained.

This setup was more sophisticated than it looked at first glance, and it turns out the probe itself is quite powerful. In one of the experiments, the authors found that even with randomized input, the probe was able to get over 70% of the full model’s performance!

Edge probing is not the only way of probing deep language models. Other probing experiments (Liu et al., 2019) used a simple linear classifier: this is better for measuring what information can easily be recovered from representations.


  1. Tenney, Ian, et al. “What do you learn from context? Probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019a.
  2. Tenney, Ian, Dipanjan Das, and Ellie Pavlick. “BERT Rediscovers the Classical NLP Pipeline.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019b.
  3. Liu, Nelson F., et al. “Linguistic Knowledge and Transferability of Contextual Representations.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

Deep Learning for NLP: SpaCy vs PyTorch vs AllenNLP

Deep neural networks have become really popular nowadays, producing state-of-the-art results in many areas of NLP, like sentiment analysis, text summarization, question answering, and more. In this blog post, we compare three popular NLP deep learning frameworks (SpaCy, PyTorch, and AllenNLP) and look at their advantages, disadvantages, and use cases.


SpaCy

Pros: easy to use, very fast, ready for production

Cons: not customizable, internals are opaque


SpaCy is a mature and batteries-included framework that comes with prebuilt models for common NLP tasks like classification, named entity recognition, and part-of-speech tagging. It’s very easy to train a model with your data: all the gritty details like tokenization and word embeddings are handled for you. SpaCy is written in Cython which makes it faster than a pure Python implementation, so it’s ideal for production.

The design philosophy is that the user should only worry about the task at hand, and not the underlying details. If a newer and more accurate model comes along, SpaCy can update itself to use the improved model, and the user doesn’t need to change anything. This is good for getting a model up and running quickly, but leaves little room for an NLP practitioner to customize the model if the task doesn’t exactly match one of SpaCy’s prebuilt models. For example, you can’t build a classifier that takes text, numerical, and image data at the same time to produce a classification.


PyTorch

Pros: very customizable, widely used in deep learning research

Cons: fewer NLP abstractions, not optimized for speed


PyTorch is a deep learning framework by Facebook, popular among researchers for all kinds of DL models, like image classifiers or deep reinforcement learning or GANs. It uses a clear and flexible design where the model architecture is defined with straightforward Python code (rather than TensorFlow’s computational graph design).

NLP-specific functionality, like tokenization and managing word embeddings, is available in torchtext. However, PyTorch is a general purpose deep learning framework and has relatively few NLP abstractions compared to SpaCy and AllenNLP, which are designed for NLP.


AllenNLP

Pros: excellent NLP functionality, designed for quick prototyping

Cons: not yet mature, not optimized for speed


AllenNLP is built on top of PyTorch and designed for rapid prototyping of NLP models for research purposes. It supports a lot of NLP functionality out-of-the-box, like text preprocessing and character embeddings, and abstracts away the training loop (whereas in PyTorch you have to write the training loop yourself). Currently, AllenNLP is not yet at a 1.0 stable release, but looks very promising.

Unlike PyTorch, AllenNLP’s design decouples what a model “does” from the architectural details of “how” it’s done. For example, a Seq2VecEncoder is any component that takes a sequence of vectors and outputs a single vector. You can use GloVe embeddings and average them, or you can use an LSTM, or you can put in a CNN. All of these are Seq2VecEncoders so you can swap them out without affecting the model logic.
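
The idea of swapping components behind a common interface can be illustrated without any deep learning library. The sketch below is plain Python and deliberately does not use AllenNLP’s actual classes or signatures: `Seq2Vec`, `mean_pool`, `max_pool`, and `LinearScorer` are all hypothetical names invented for this example.

```python
from typing import List, Protocol, Sequence

Vector = List[float]


class Seq2Vec(Protocol):
    """Anything that maps a sequence of vectors to a single vector."""

    def __call__(self, sequence: Sequence[Vector]) -> Vector: ...


def mean_pool(sequence: Sequence[Vector]) -> Vector:
    """Average the vectors element-wise (like averaging word embeddings)."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def max_pool(sequence: Sequence[Vector]) -> Vector:
    """Take the element-wise maximum over the sequence."""
    return [max(column) for column in zip(*sequence)]


class LinearScorer:
    """Model logic that only depends on the Seq2Vec interface."""

    def __init__(self, encoder: Seq2Vec, weights: Vector) -> None:
        self.encoder = encoder
        self.weights = weights

    def score(self, sequence: Sequence[Vector]) -> float:
        pooled = self.encoder(sequence)
        return sum(w * x for w, x in zip(self.weights, pooled))


sentence = [[1.0, 2.0], [3.0, 4.0]]  # two toy 2-dimensional "embeddings"
for encoder in (mean_pool, max_pool):
    model = LinearScorer(encoder, weights=[1.0, 1.0])
    print(encoder.__name__, model.score(sentence))
```

The scorer never needs to know which encoder it was given, which is the same decoupling of “what” from “how” described above, only on a toy scale.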

The talk “Writing code for NLP Research” presented at EMNLP 2018 gives a good overview of AllenNLP’s design philosophy and its differences from PyTorch.

Which is the best framework?

It depends on how much you care about flexibility, ease of use, and performance.

  • If your task is fairly standard, then SpaCy is the easiest to get up and running. You can train a model using a small amount of code, you don’t have to think about whether to use a CNN or RNN, and the API is clearly documented. It’s also well optimized to deploy to production.
  • AllenNLP is the best for research prototyping. It supports all the bells and whistles that you’d include in your next research paper, and encourages you to follow the best practices by design. Its functionality is a superset of PyTorch’s, so I’d recommend AllenNLP over PyTorch for all NLP applications.

There are a few runners-up that I will mention briefly:

  • NLTK / Stanford CoreNLP / Gensim are popular libraries for NLP. They’re good libraries, but they don’t do deep learning, so they can’t be directly compared here.
  • TensorFlow / Keras are also popular for research, especially for Google projects. TensorFlow is the only framework supported by Google’s TPUs, and it also has better multi-GPU support than PyTorch. However, multi-GPU setups are relatively uncommon in NLP, and furthermore, its computational graph model is harder to debug than PyTorch’s model, so I don’t recommend it for NLP.
  • PyText is a new framework by Facebook, also built on top of PyTorch. It defines a network using pre-built modules (similar to Keras) and supports exporting models to Caffe2 to be faster in production. However, it’s very new (only released earlier this month) and I haven’t worked with it myself to form an opinion about it yet.

That’s all, let me know if there’s any that I’ve missed!

Kaggle Speech Recognition Challenge

For the past few weeks, I’ve been working on the TensorFlow Speech Recognition Challenge on Kaggle. The task is to recognize a one-second audio clip, where the clip contains one of a small number of words, like “yes”, “no”, “stop”, “go”, “left”, and “right”.

In general, speech recognition is a difficult problem, but it’s much easier when the vocabulary is limited to a handful of words. We don’t need to use complicated language models to detect phonemes, and then string the phonemes into words, like Kaldi does for speech recognition. Instead, a convolutional neural network works quite well.

First Steps

The dataset consists of about 64000 audio files which have already been split into training / validation / testing sets. You are then asked to make predictions on about 150000 audio files for which the labels are unknown.

Actually, this dataset had already been published in academic literature, and people published code to solve the same problem. I started with GCommandPytorch by Yossi Adi, which implements a speech recognition CNN in PyTorch.

Its first step is to convert the audio file into a spectrogram, which is an image representation of sound. This is easily done using librosa.

Above: Sample spectrograms of “yes” and “no”

Now we’ve converted the problem to an image classification problem, which is well studied. To an untrained human observer, all the spectrograms may look the same, but neural networks can learn things that humans can’t. Convolutional neural networks work very well for classifying images, for example VGG16:

Above: A Convolutional Neural Network (LeNet). VGG16 is similar, but has even more layers.

For more details about this approach, refer to these papers:

  1. Convolutional Neural Networks for Small-footprint Keyword Spotting
  2. Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting

Voice Activity Detection

You might ask: if somebody already implemented this, then what’s left to do other than run their code? Well, the test data contains “silence” samples, which contain background noise but no human speech. It also has words outside the set we care about, which we need to label as “unknown”. The PyTorch CNN produces about 95% validation accuracy by itself, but the accuracy is much lower when we add these two additional requirements.

For silence detection, I first tried the simplest thing I could think of: taking the maximum absolute value of the waveform and deciding it’s “silence” if the value is below a threshold. When combined with VGG16, this gets accuracy 0.78 on the leaderboard. This is a crude metric because sufficiently loud noise would be considered speech.
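
The peak-amplitude rule fits in a few lines. This is a minimal sketch, not the contest code: the function names are hypothetical, the waveform is assumed to be a list of samples scaled to the range -1 to 1, and the threshold of 0.05 is an arbitrary placeholder rather than the value that was actually used.

```python
def is_silence(waveform, threshold=0.05):
    """Call a clip silent if its loudest sample stays below the threshold.

    `threshold` is a placeholder value; in practice it would be tuned on
    validation data.
    """
    if not waveform:
        return True  # an empty clip contains no speech
    return max(abs(sample) for sample in waveform) < threshold


def final_label(waveform, cnn_label, threshold=0.05):
    """Override the classifier's label when the clip looks silent."""
    return "silence" if is_silence(waveform, threshold) else cnn_label


quiet_clip = [0.01, -0.02, 0.015]
loud_clip = [0.01, -0.6, 0.3]
print(final_label(quiet_clip, cnn_label="yes"))
print(final_label(loud_clip, cnn_label="yes"))
```

The weakness mentioned above is visible here: a clip with a single loud click of background noise exceeds the threshold and is passed on to the classifier as if it were speech.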

Next, I tried running openSMILE, which I use in my research to extract various acoustic features from audio. It implements an LSTM for voice activity detection: every 0.05 seconds, it outputs a probability that someone is talking. Combining the openSMILE output with the VGG16 prediction gave a score of 0.81.

More improvements

I tried a bunch of things to improve my score:

  1. Fiddled around with the neural network hyperparameters which boosted my score to 0.85. Each epoch took about 10 minutes on a GPU, and the whole model takes about 2 hours to train. Somehow, Adam didn’t produce good results, and SGD with momentum worked better.
  2. Took 100% of the data for training and used the public LB for validation (don’t do this in real life lol). This improved my score to 0.86.
  3. Trained an ensemble of 3 versions of the same neural network with the same hyperparameters but different randomly initialized weights, and took a majority vote to do prediction. This improved the score to 0.87. I would’ve liked to train more, but other people in my research group needed to use the GPUs.
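
The majority vote from the last step can be sketched as follows. The function name is hypothetical and the tie-breaking rule (the label that appears first wins) is an assumption made for this example, since the rule used in the contest code isn’t stated.

```python
def majority_vote(predictions):
    """Return the most common label among the models' predictions.

    Ties go to the label that appears first in `predictions`; this is an
    arbitrary choice made for the sketch.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = {}
    for label in predictions:
        counts[label] = counts.get(label, 0) + 1
    # Dicts keep insertion order and max() returns the first maximal key,
    # so ties resolve to the label seen earliest.
    return max(counts, key=counts.get)


# One row per audio clip, one column per model in the ensemble.
per_clip_predictions = [
    ["yes", "yes", "no"],
    ["stop", "go", "go"],
    ["left", "right", "unknown"],
]
print([majority_vote(row) for row in per_clip_predictions])
```

With only 3 models a three-way disagreement is possible, which is why the tie-breaking rule matters more here than it would in a larger ensemble.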

In the end, the top scoring model had a score of 0.91, which beat my model by 4 percentage points. Although not enough to win a Kaggle medal, my model was in the top 15% of all submissions. Not bad!

My source code for the contest is available here.

Paper Review: Linguistic Features to Identify Alzheimer’s Disease

Today I’m going to be sharing a paper I’ve been looking at, related to my research: “Linguistic Features Identify Alzheimer’s Disease in Narrative Speech” by Katie Fraser, Jed Meltzer, and my adviser Frank Rudzicz. The paper was published in 2016 in the Journal of Alzheimer’s Disease. It uses NLP to automatically diagnose patients with Alzheimer’s disease, given a sample of their speech.

Alzheimer’s disease is a disease that you might have heard of, but it doesn’t get much attention in the media, unlike cancer and stroke. It is a neurodegenerative disease that mostly affects elderly people. 5 million Americans are living with Alzheimer’s, including 1 in 9 over the age of 65, and 1 in 3 over the age of 85.

Alzheimer’s is also the most expensive disease in America. After diagnosis, patients may continue to live for over 10 years, and during much of this time, they are unable to care for themselves and require a constant caregiver. In 2017, 68% of Medicare and Medicaid’s budget was spent on patients with Alzheimer’s, and this number is expected to increase as the elderly population grows.

Despite a lot of recent advances in our understanding of the disease, there is currently no cure for Alzheimer’s. Since the disease is so prevalent and harmful, research in this direction is highly impactful.

Previous tests to diagnose Alzheimer’s

One of the early signs of Alzheimer’s is having difficulty remembering things, including words, leading to a decrease in vocabulary. A reliable way to test for this is a retrieval question like the following (Monsch et al., 1992):

In the next 60 seconds, name as many items as possible that can be found in a supermarket.

A healthy person could rattle off about 20-30 items in a minute, whereas someone with Alzheimer’s could only produce about 10. By setting the threshold at 16 items, the authors could classify even mild cases of Alzheimer’s with about 92% accuracy.

This doesn’t quite capture the signs of Alzheimer’s disease though. Patients with Alzheimer’s tend to be rambly and incoherent. This can be tested with a picture description task, where the patient is given a picture and asked to describe it with as much detail as possible (Giles, Patterson, and Hodges, 1994).

Above: Boston Cookie Theft picture used for picture description task

There was no time limit, and the patients talked until they indicated they had nothing more to say, or until they had been silent for 15 seconds.

Patients with Alzheimer’s disease produced descriptions with varying degrees of incoherence. Here’s an example transcript, from the above paper:

Experimenter: Tell me everything you see going on in this picture

Patient: oh yes there’s some washing up going on / (laughs) yes / …… oh and the other / ….. this little one is taking down the cookie jar / and this little girl is waiting for it to come down so she’ll have it / ………. er this girl has got a good old splash / she’s left the taps on (laughs) she’s gone splash all down there / um …… she’s got splash all down there

You can clearly tell that something’s off, but it’s hard to put a finger on exactly what the problem is. Well, time to apply some machine learning!

Results of Paper

Fraser’s 2016 paper uses data from the DementiaBank corpus, consisting of 240 narrative samples from patients with Alzheimer’s, and 233 from a healthy control group. The two groups were matched to have similar age, gender, and education levels. Each participant was asked to describe the Boston Cookie Theft picture above.

Fraser’s analysis used both the original audio data and a detailed computer-readable transcript. She looked at 370 different features covering all sorts of linguistic metrics, like ratios of different parts of speech, syntactic structures, vocabulary richness, and repetition. Then, she performed a factor analysis and identified a set of 35 features that achieves about 81% accuracy in distinguishing between Alzheimer’s patients and controls.

According to the analysis, a few of the most important distinguishing features are:

  • Pronoun to noun ratio. Alzheimer’s patients produce vague statements and tend to substitute pronouns like “he” for nouns like “the boy”. This also applies to adverbial constructions like “the boy is reaching up there” rather than “the boy is reaching into the cupboard”.
  • Usage of high frequency words. Alzheimer’s patients have difficulty remembering specific words and replace them with more general, therefore higher frequency words.
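
To make the first feature concrete, here is a rough sketch of how a pronoun-to-noun ratio could be computed from a transcript that has already been part-of-speech tagged. The function name, the tag names `PRON` and `NOUN`, and the two example sentences are assumptions for illustration; the exact definition used in the paper may differ.

```python
def pronoun_noun_ratio(tagged_tokens, pronoun_tag="PRON", noun_tag="NOUN"):
    """Ratio of pronouns to nouns in a list of (word, tag) pairs.

    Returns None when there are no nouns, since the ratio is undefined.
    """
    pronouns = sum(1 for _, tag in tagged_tokens if tag == pronoun_tag)
    nouns = sum(1 for _, tag in tagged_tokens if tag == noun_tag)
    if nouns == 0:
        return None
    return pronouns / nouns


# Made-up examples with hand-assigned tags.
specific = [("the", "DET"), ("boy", "NOUN"), ("is", "AUX"), ("reaching", "VERB"),
            ("into", "ADP"), ("the", "DET"), ("cupboard", "NOUN")]
vague = [("he", "PRON"), ("is", "AUX"), ("taking", "VERB"),
         ("the", "DET"), ("cookie", "NOUN")]
print(pronoun_noun_ratio(specific), pronoun_noun_ratio(vague))
```

A higher ratio means that more of the referring expressions are pronouns, which is the pattern described for the Alzheimer’s group.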

Future directions

Shortly after this research was published, my adviser Frank Rudzicz co-founded WinterLight Labs, a company that’s working on turning this proof-of-concept into an actual usable product. It also diagnoses various other cognitive disorders like Primary Progressive Aphasia.

A few other grad students in my research group are working on Talk2Me, which is a large longitudinal study to collect more data from patients with various neurodegenerative disorders. More data is always helpful for future research.

So this is the starting point for my research. Stay tuned for updates!

Applying to Graduate School in Computer Science

So you’re thinking of grad school. Four years of undergrad is not enough for you and you’re craving more knowledge. Or perhaps you want to delay your entry into the “real world” for a couple more years. Well, grad school is the right place!

About a year ago, I decided I wanted to do grad school. However, most of my peers were on track to work in the industry after graduation. The process of applying for grad school is daunting, especially since it varies from country to country and from one subject to another. This is why I am writing a guide to grad school applications in Computer Science, for Canada and the USA: a compilation of things I wish I knew a year ago.

Why grad school?

People go to grad school for different reasons. Most people I know in computer science and software engineering plan to start working full-time — a reasonable choice, given the high salaries in the industry right now. I figured that I had the rest of my life to code for a company; there’s no rush to start working immediately.

Graduate programs come in three broad categories:

  1. Coursework Master’s. Typically about 1 year, this is basically an extension of undergrad. You take a bunch of graduate courses, which are more in-depth than undergrad classes, but you don’t do any research. This is useful for gaining more knowledge before going to work in the industry.
  2. Thesis Master’s. About 1-2 years, depending on the school. At first, you take some courses like in a coursework master’s, but the primary goal is to produce a master’s thesis, which is equivalent to about one conference paper of original research. This is a good way to get some research experience, without the time commitment of a Ph. D. (and is valuable as a stepping stone if you do decide to get one).
  3. Ph. D. A longer commitment of 4-5 years. In a Ph. D., you produce an extensive amount of original research; by the time you write your thesis, you will be a world expert on your specific topic. I like this illustrated explanation of what it’s like to do a Ph. D.

There are hybrid programs too: for example, a thesis master’s often transitions directly into a Ph. D. There are also regional differences in how these programs work (more on this later).

Can I get into grad school?

As you may expect, top grad school programs are very competitive, and a typical grad student is a top student in their undergraduate class. So what do schools look for in their grad school admissions process?

Grades are a big factor: generally, an undergrad GPA of 85% or higher is good for grad school (with 80% being the bare minimum). However, even more important than GPA is your research experience. Publishing papers in conferences would be ideal, and research experience can make up for a lackluster transcript.

Unfortunately, Waterloo students are at a disadvantage here: with the co-op program, most people spend their undergrad years focusing on internships rather than research, and internships are considered less valuable. Don’t be discouraged though: my only research experience was through two part-time URAs, and I have zero publications, but I still got into a good grad school.

Picking a research area

In grad school, you specialize in a specific area of computer science, for example, machine learning, or databases, or theory, or programming languages. You have to indicate roughly what area you want to study in your application, but it’s okay to not know exactly what you want to research.

For me, I wanted to do something involving artificial intelligence or data science or machine learning. Eventually I decided on natural language processing (NLP), since it’s an area of machine learning, and I like studying languages.

Some people have a specific professor that they want to work with, in which case it’s helpful to reach out to them beforehand and mention it in your statement of purpose. Otherwise, as in my case, you don’t need to explicitly contact potential advisers if you have nothing to say; you get to indicate your adviser preferences in your application.

Choosing your schools

The most important thing to look for in a grad school is the quality of the research group. You may be tempted to look at overall computer science rankings, but this can be misleading because different schools have strengths in different research areas. There are other factors to consider, like location, city environment (big city or college town), and social life.

It’s a good idea to apply to a variety of schools of different levels of competitiveness. However, each application costs about $100, so it can be expensive to apply to too many — 5-10 applications is a good balance.

I decided to apply to five schools: two in Canada and three in the USA. My main criteria were (1) a reputable research program in NLP, and (2) a location in a big city. After some deliberation, I decided to apply to the following:

  • Ph. D. at University of Washington
  • Ph. D. at UC Berkeley
  • Ph. D. at Columbia University
  • M. Sc. at University of Toronto
  • M. Sc. at University of British Columbia

I didn’t apply to the University of Waterloo, where I’m doing my undergrad, despite it being pretty highly ranked in Canada — after studying here for five years, I needed a change of scenery.

Differences between Canada and USA

You might have noticed that my three applications in the USA were all Ph. D. programs, while my two in Canada were master’s. Graduate school works quite differently in Canada vs in the USA. In Canada, most students do a master’s after undergrad and then go on to do a Ph. D., but in the USA, you enter a Ph. D. program directly after undergrad, skipping the master’s.

There are master’s programs in the USA too, but they are almost exclusively coursework master’s, and are very expensive ($50k+ tuition per year). In contrast, thesis master’s programs in Canada and Ph. D. programs in the USA are fully funded, so you get paid a stipend of around $20k-30k a year.

A big reason to do a master’s in the USA is for visa purposes: for Chinese and Indian citizens, getting the H-1B is much easier with a master’s in the country, so the investment can be worth it. Otherwise, it’s probably not worth getting a master’s in the USA; studying in Canada is much cheaper.

If you go to grad school in Canada, you can apply for the CGS-M and OGS government scholarships for master’s students. Unfortunately, Canadian citizens are ineligible for most scholarships if they study in the USA.

Taking the GRE

Another difference is that the Graduate Record Exam (GRE) is required for all grad school admissions in the USA. This is a 4-hour-long computer-administered test with reading, writing, and math components. If you’ve taken the SAT, this test is very similar. For grad school applications in computer science, only the general exam is required, and not the computer science subject test.

The GRE plays a fairly minor role in admissions: a terrible score will hurt your chances, but a perfect score will not help you that much. The quantitative and verbal sections are each scored from 130 to 170, and for science and engineering programs, a good score is around 165+ for quantitative and 160+ for verbal.

The quantitative (math) section is a cakewalk for any computer science major, but the verbal section can be challenging if English is not your native language. It does require some preparation (1-6 months is recommended). I studied for a month and did quite well.

Applications are due on December 15 for most schools, so you should take the GRE in October at the latest (and earlier if you plan on taking it multiple times).

Letters of Recommendation

Most grad school and scholarship applications require three letters of recommendation; out of all requirements, this one requires the most planning. The ideal recommendation comes from a professor that you have done research with. If you go to Waterloo and are considering grad school, doing a part-time URA (undergraduate research assistantship) is a good way to secure a few recommendation letters.

It may be difficult to find three professors that you’ve worked with, so the next best thing is a weaker letter from a professor whose class you did well in. As a last resort, at most one letter may come from a non-academic source (like your co-op supervisor). I was lucky that one of my research projects was co-supervised by two different professors, so I got two letters that way.

Statement of Purpose

The statement of purpose is a two-page essay where you describe your academic history and research interests, and convince the admissions committee that you are the ideal candidate to admit. If you have internship experience, talk about what you learned and why it’s relevant for research.

Chances are that the first revision of your statement of purpose will suck (this was certainly the case for me), so get friends and professors to proofread it. After several revisions, here’s my final statement of purpose.

Offers of Admission

That’s about it — after you hit submit on all your applications by December 15, you can sit back and enjoy your final semester. With any luck, you will receive this in your inbox around the beginning of February:


In the end, I got accepted to both master’s programs in Canada (UBC and UofT), but got rejected from all three Ph. D. programs in the USA (Washington, Berkeley, and Columbia). I chose to accept the UofT offer, where I will study NLP starting this fall.

Hopefully this guide has been helpful, and good luck with your applications!