# Deep Learning for NLP: SpaCy vs PyTorch vs AllenNLP

Deep neural networks have become enormously popular, producing state-of-the-art results in many areas of NLP, like sentiment analysis, text summarization, question answering, and more. In this blog post, we compare three popular deep learning frameworks for NLP: SpaCy, PyTorch, and AllenNLP. What are their advantages, disadvantages, and use cases?

## SpaCy

Pros: easy to use, very fast, ready for production

Cons: not customizable, internals are opaque

SpaCy is a mature, batteries-included framework that comes with prebuilt models for common NLP tasks like classification, named entity recognition, and part-of-speech tagging. It’s very easy to train a model on your own data: all the gritty details like tokenization and word embeddings are handled for you. SpaCy is written in Cython, which makes it faster than a pure Python implementation, so it’s ideal for production.
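
For example, extracting named entities with a pretrained pipeline takes only a few lines. Here's a minimal sketch using SpaCy's small English model (the example sentence is the one from SpaCy's documentation):

```python
import spacy

# Load the small English model (installed beforehand via:
# python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY
```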

The design philosophy is that the user should only worry about the task at hand, and not the underlying details. If a newer and more accurate model comes along, SpaCy can update itself to use the improved model, and the user doesn’t need to change anything. This is good for getting a model up and running quickly, but leaves little room for an NLP practitioner to customize the model when the task doesn’t exactly match one of SpaCy’s prebuilt models. For example, you can’t build a classifier that takes text, numerical, and image data at the same time to produce a classification.

## PyTorch

Pros: very customizable, widely used in deep learning research

Cons: fewer NLP abstractions, not optimized for speed

PyTorch is a deep learning framework by Facebook, popular among researchers for all kinds of deep learning models, like image classifiers, deep reinforcement learning agents, and GANs. It uses a clear and flexible design where the model architecture is defined with straightforward Python code (rather than TensorFlow’s static computational graph design).
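
For illustration, here's a minimal sketch (my own toy example, not from any particular tutorial) of what defining a model looks like: an ordinary Python class, with the forward pass written as regular code:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Embed tokens, run an LSTM, classify from the final hidden state."""
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=256, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):
        _, (h, _) = self.lstm(self.embed(token_ids))
        return self.fc(h[-1])  # logits over the classes

model = TextClassifier()
logits = model(torch.randint(0, 10000, (8, 20)))  # batch of 8 sequences, 20 tokens each
```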

NLP-specific functionality, like tokenization and managing word embeddings, is available in torchtext. However, PyTorch is a general-purpose deep learning framework and has relatively few NLP abstractions compared to SpaCy and AllenNLP, which are designed for NLP.

## AllenNLP

Pros: excellent NLP functionality, designed for quick prototyping

Cons: not yet mature, not optimized for speed

AllenNLP is built on top of PyTorch and designed for rapidly prototyping NLP models for research purposes. It supports a lot of NLP functionality out of the box, like text preprocessing and character embeddings, and abstracts away the training loop (whereas in PyTorch you have to write the training loop yourself). Currently, AllenNLP is not yet at a 1.0 stable release, but it looks very promising.

Unlike PyTorch, AllenNLP’s design decouples what a model “does” from the architectural details of “how” it’s done. For example, a Seq2VecEncoder is any component that takes a sequence of vectors and outputs a single vector. You can average GloVe embeddings, or use an LSTM, or plug in a CNN. All of these are Seq2VecEncoders, so you can swap them out without affecting the rest of the model logic.
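
In plain PyTorch terms, the contract looks something like the following sketch (my own illustration of the concept, not AllenNLP's actual classes):

```python
import torch
import torch.nn as nn

class BagOfEmbeddings(nn.Module):
    """One 'Seq2VecEncoder': average the vectors across the sequence."""
    def forward(self, seq):              # seq: (batch, seq_len, dim)
        return seq.mean(dim=1)           # -> (batch, dim)

class LSTMEncoder(nn.Module):
    """Another 'Seq2VecEncoder': use an LSTM's final hidden state."""
    def __init__(self, dim=100):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
    def forward(self, seq):
        _, (h, _) = self.lstm(seq)
        return h[-1]                     # -> (batch, dim)

# Both map (batch, seq_len, dim) -> (batch, dim), so a model that expects
# a sequence-to-vector encoder can use either one interchangeably.
encoder = LSTMEncoder()                  # or BagOfEmbeddings()
vec = encoder(torch.randn(8, 20, 100))
```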

The talk “Writing code for NLP Research” presented at EMNLP 2018 gives a good overview of AllenNLP’s design philosophy and its differences from PyTorch.

## Which is the best framework?

It depends on how much you care about flexibility, ease of use, and performance.

• If your task is fairly standard, then SpaCy is the easiest to get up and running. You can train a model with a small amount of code, you don’t have to think about whether to use a CNN or RNN, and the API is clearly documented. It’s also well optimized for production deployment.
• AllenNLP is the best for research prototyping. It supports all the bells and whistles that you’d include in your next research paper, and encourages you to follow best practices by design. Its functionality is a superset of PyTorch’s, so I’d recommend AllenNLP over PyTorch for all NLP applications.

There are a few runners-up that I’ll mention briefly:

• NLTK / Stanford CoreNLP / Gensim are popular libraries for NLP. They’re good libraries, but they don’t do deep learning, so they can’t be directly compared here.
• TensorFlow / Keras are also popular for research, especially for Google projects. TensorFlow is the only framework supported by Google’s TPUs, and it also has better multi-GPU support than PyTorch. However, multi-GPU setups are relatively uncommon in NLP, and furthermore, its computational graph model is harder to debug than PyTorch’s, so I don’t recommend it for NLP.
• PyText is a new framework by Facebook, also built on top of PyTorch. It defines a network using pre-built modules (similar to Keras) and supports exporting models to Caffe2 to run faster in production. However, it’s very new (only released earlier this month) and I haven’t worked with it enough to form an opinion about it yet.

That’s all, let me know if there’s any that I’ve missed!

# The Ethics of (not) Tipping at Restaurants

A customer finishes a meal at a restaurant. He gives a 20-dollar bill to the waiter, and the waiter returns with some change. The customer proceeds to pocket the change in its entirety.

“Excuse me, sir,” the waiter interrupts, “but the gratuity has not been included in your bill.”

The customer nods and calmly smiles at the waiter. “Yes, I know,” he replies. He gathers his belongings and walks out, indifferent to the astonished look on the waiter’s face.

This fictional scenario makes your blood boil just thinking about it. It evokes a feeling of unfairness: a shameless and rude customer has cheated an innocent, hardworking waiter out of his well-deserved money. Not many situations provoke such a strong emotional response yet remain perfectly legal.

There are compelling reasons not to tip. On an individual level, you can save 10-15% on your meal. On a societal level, economists have criticized tipping for its discriminatory effects. Yet we still do it. Why?

In this blog post, we look at some common arguments in favor of tipping, but we see that these arguments may not hold up to scrutiny. Then, we examine the morality of refusing to tip under several ethical frameworks.

## Arguments in favor of tipping (and their rebuttals)

Here are four common reasons for why we should tip:

1. Tipping gives the waiter an incentive to provide better service.
2. Waiters are paid less than minimum wage and need the money.
3. Refusing to tip is embarrassing: it makes you lose face in front of the waiter and your colleagues.
4. Tipping is a strong social norm and violating it is extremely rude.

I’ve ordered these arguments from weakest to strongest. These are good reasons, but I don’t think any of them definitively settles the question. I argue that the first two are factually inaccurate, and for the last two, it’s not obvious why the end effect is bad.

Argument 1: Tipping gives the waiter an incentive to provide better service. Since the customer tips at the end of the meal, the waiter does a better job to make him happy, so that he receives a bigger tip.

Rebuttal: The evidence for this is dubious. One study concluded that service quality has at most a modest correlation with how much people tip; many other factors affected tipping, like group size, day of week, and amount of alcohol consumed. Another study found that waitresses earned more tips from male customers if they wore red lipstick. The connection between good service and tipping is sketchy at best.

Argument 2: Waiters are paid less than minimum wage and need the money. In many parts of the USA, waiters earn a base rate of about $2 an hour and must rely on tips to survive.

Rebuttal: This is false. In Canada, all waiters earn at least minimum wage. In the USA, the base rate for waiters is less than minimum wage in some states, but restaurants are required to pay the difference if they make less than minimum wage after tips.

You may argue that restaurant waiters are poor and deserve more than minimum wage. I find this unconvincing, as there are lots of service workers (cashiers, janitors, retail clerks, fast food workers) who do strenuous labor and make minimum wage, and we don’t tip them. I don’t see why waiters are an exception. Arguably, Uber drivers are the most deserving of tips, since they make less than minimum wage after accounting for costs, but tipping is optional and not expected for Uber rides.

Argument 3: Refusing to tip is embarrassing: it makes you lose face in front of the waiter and your colleagues. You may be treated badly the next time you visit the restaurant and the waiter recognizes you. If you’re on a date and you get confronted for refusing to tip, you’re unlikely to get a second date.

Rebuttal: Indeed, the social shame and embarrassment is a good reason to tip, especially if you’re dining with others. But what if you’re eating by yourself in a restaurant in another city that you will never go to again? Most people will still tip, even though the damage to your social reputation is minimal. So it seems that social reputation isn’t the only reason for tipping. It’s definitely embarrassing to get confronted for not tipping, but it’s not obvious that being embarrassed is bad (especially if the only observer is a waiter who you’ll never interact with again). If I give a public speech despite feeling embarrassed, then I am praised for my bravery. Why can’t the same principle apply here?

Argument 4: Tipping is a strong social norm and violating it is extremely rude. Stiffing a waiter is considered rude in our society, even if no physical or economic damage is done. Giving the middle finger is also offensive, despite no clear damage being done. In both cases, you’re being rude to an innocent stranger.

Rebuttal: Indeed, the above is true. A social norm is a convention that, if violated, makes people consider you rude. The problem is the arbitrariness of social norms. Is it always bad to violate a social norm, or can the social norm itself be wrong? Consider that only a few hundred years ago, slavery was commonplace and accepted. In medieval societies, religion was expected and atheists were condemned, and in other societies, women were considered property of their husbands. All of these are examples of social norms; all of these norms are considered barbaric today. It’s not enough to justify something by saying that “everybody else does it”.

## Tipping under various ethical frameworks

Is it immoral not to tip at restaurants? We consider this question under the ethical frameworks of ethical egoism, utilitarianism, Kant’s categorical imperative, social contract theory, and cultural relativism.

Above: The trolley problem, often used to compare different ethical frameworks, but unlikely to occur in real life. Tipping is a more quotidian situation to apply ethics.

1) Ethical egoism says it is moral to act in your own self-interest. The most moral action is the one that is best for yourself. Clearly, it is in your financial self-interest not to tip.
However, the social stigma and shame creates negative utility, which may or may not be worth more than the money saved by not tipping. This depends on the individual. Verdict: Maybe OK.

2) Utilitarianism says the moral thing to do is maximize the well-being of the greatest number of people. Under utilitarianism, you should tip if the money benefits the waiter more than it would benefit you. This is difficult to answer, as it depends on many things, like your relative wealth compared to the waiter’s. Again, subtract some utility for the social stigma and shame if you refuse to tip. Verdict: Maybe OK.

3) Kant’s categorical imperative says that an action is immoral if the goal of the action would be defeated if everyone started doing it. Essentially, it’s immoral to gain a selfish advantage at the expense of everyone else. If everyone refused to tip, then the prices of food in restaurants would universally go up to compensate, which negates the intended goal of saving money in the first place. Verdict: Not OK.

4) Social contract theory is the set of rules that a society of free, rational people would agree to obey in order to benefit everyone. This is to prevent tragedy-of-the-commons scenarios, where the system would collapse if everyone behaved selfishly. There is no evidence that tipping makes a society better off. Indeed, many societies (eg: China, Japan) don’t practice tipping, and their restaurants operate just fine. Verdict: OK.

5) Cultural relativism says that morals are determined by the society that you live in (ie, social norms). There is a strong norm in our culture that tipping is obligatory in restaurants. Verdict: Not OK.

## Conclusion

In this blog post, we have considered several arguments for tipping, and examined the question under several ethical frameworks. Stiffing the waiter is a legal way of saving some money when eating out. No single argument shows that it’s definitely wrong to do this, and some ethical frameworks consider it acceptable while others don’t. This is often the case in ethics when you’re faced with complicated topics.

However, refusing to tip has several negative effects: the rudeness of violating a strong social norm, the embarrassment to yourself and your colleagues, and potential social backlash. Furthermore, it violates some ethical systems. Therefore, one should consider whether saving 10-15% at restaurants by not tipping is really worth it.

# I trained a neural network to describe images, then I gave it dementia

This blog post is a summary of my work from earlier this year: Dropout during inference as a model for neurological degeneration in an image captioning network.

For a long time, deep learning has had an interesting connection to neuroscience. The definition of the neuron in neural networks was inspired by early models of the neuron. Later, convolutional neural networks were inspired by the structure of neurons in the visual cortex. Many other models also drew inspiration from how the brain functions, like visual attention, which replicated how humans look at different areas of an image when interpreting it.

The connection was always loose and superficial, however. Despite advances in neuroscience toward better models of neurons, these never really caught on among deep learning researchers. Real neurons obviously don’t learn by gradient back-propagation and stochastic gradient descent.

In this work, we study how human neurological degeneration can have a parallel in the universe of deep neural networks.

In humans, neurodegeneration can occur by several mechanisms, such as Alzheimer’s disease (which affects connections between individual neurons) or stroke (in which large sections of brain tissue die). The effect of Alzheimer’s disease is dementia, where language, motor, and other cognitive abilities gradually become impaired. To simulate this effect, we give our neural network a sort of dementia, by interfering with connections between neurons using a method called dropout.

Yup, this probably puts me high up on the list of humans to exact revenge on in the event of an AI apocalypse.

## The Model

We started with an encoder-decoder style image captioning neural network (described in this post), which looks at an image and outputs a sentence that describes it. This is inspired by a picture description task that we give to patients suspected of having dementia: given a picture, describe it in as much detail as possible. Patients with dementia typically exhibit patterns of language different from healthy patients, which we can detect using machine learning.

To simulate neurological degeneration in the neural network, we apply dropout in inference mode, which randomly selects a portion of the neurons in a layer and sets their outputs to zero. Dropout is a common technique during training to regularize neural networks and prevent overfitting, but you usually turn it off during evaluation for the best possible accuracy. To our knowledge, nobody had experimented with applying dropout in the evaluation stage of a language model before.

We train the model using a small amount of dropout, then apply a larger amount of dropout during inference. Then, we evaluate the quality of the sentences produced by the BLEU-4 and METEOR metrics, as well as sentence length and similarity of the vocabulary distribution to the training corpus.
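
Mechanically, keeping dropout active at inference time is simple in PyTorch. Here's a minimal sketch of the mechanism (the dropout rate is illustrative, not the exact value from our experiments):

```python
import torch
import torch.nn.functional as F

def degrade(hidden, p=0.5):
    """Randomly zero a fraction p of units, even outside training.

    Passing training=True keeps dropout active at inference time,
    our stand-in for damaged connections between neurons.
    """
    return F.dropout(hidden, p=p, training=True)

h = torch.randn(1, 512)   # a decoder hidden state
print(degrade(h, p=0.5))  # about half the units zeroed (survivors rescaled by 1/(1-p))
```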

## Results

When we applied dropout during inference, the accuracy of the captions (measured by BLEU-4 and METEOR) decreased with more dropout. However, the vocabulary generated was more diverse, and the word frequency distribution was more similar to the training set (measured by KL-divergence) when a moderate amount of dropout was applied. When the dropout was too high, the model degenerated into essentially generating random words.

Here are some examples of sentences that were generated, at various levels of dropout. Qualitatively, the effects of dropout seemed to cause two types of errors:

• Caption starts out normally, then repeats the same word several times: “a small white kitten with red collar and yellow chihuahua chihuahua chihuahua”
• Caption starts out normally, then becomes nonsense: “a man in a baseball bat and wearing a uniform helmet and glove preparing their handles won while too frown”

This was not that similar to speech produced by people with Alzheimer’s, but somewhat resembled fluent aphasia (caused by damage to the part of the brain responsible for understanding language).

## Challenges and Difficulties

Excited by our results, we submitted the paper to EMNLP 2018. Unfortunately, our paper was rejected. Despite the novelty of our approach, the reviewers pointed out that our work had some serious drawbacks:

1. Unclear connection to neuroscience. Adding dropout during inference mode has no connection to any biological models of what happens to the brain during atrophy.
2. Only superficial resemblance to aphasic speech. A similar result could have been generated by sampling words randomly from a dictionary, without any complicated RNN models.
3. Not really useful for anything. We couldn’t think of any situations where this model would be useful, such as detecting aphasia.

We decided that there was no way around these roadblocks, so we scrapped the idea, put the paper up on arXiv, and worked on something else. For more technical details, refer to our paper.

# First trip to Europe: Portugal, Netherlands, Hungary, Romania, UK

I just came back from my first trip to Europe, covering five countries in about three weeks. A week in a city is not enough to really know it beyond a superficial level, but in this post I will describe my first impressions of the places I visited.

## Lisbon

My journey began by flying in directly from Toronto to Lisbon. At just under seven hours, it’s one of the closest destinations in Europe.

Lisbon reminds me of San Francisco (minus the homeless and tech employees). The city is built on hilly terrain, with streetcars running up and down the narrow and steep streets. A big red suspension bridge crosses the Tagus bay.

The city has a love-hate relationship with tourism. Although tourists drive a large part of the economy, short-term Airbnb rentals have jacked up housing prices and driven the locals out of many neighborhoods. Most stores and restaurants are forced to close before 11pm, due to complaints of noisy tourists.

Sintra is an hour west of Lisbon by train, and is well worth a day trip. It features some gorgeous castles and palaces built on a mountain.

## Amsterdam

The next destination was Amsterdam. Compared to Lisbon’s rolling hills, Amsterdam is as flat as a pancake. Every major street has a dedicated bike lane, so bikes are a great way to get around, and a great way to encourage a healthy lifestyle. Around the old city center, there is a canal every few blocks — just don’t fall into them.

If you’re a budget traveler, beware that Amsterdam is expensive — we spent as much on accommodation in 5 days in Amsterdam as in the rest of the trip combined. The first time at a restaurant, I couldn’t believe that a bottle of water cost 3 euros. My two companions watched in amusement as I demanded tap water (and my request was refused).

## Budapest

Budapest, the Hungarian capital on the Danube, is a city with magnificent architecture. I’d describe the architecture as giving a feeling of abundance: whereas most capital cities are tightly packed to maximize space efficiency, Budapest’s buildings are grand and spacious. Doorways and staircases are so tall that a man riding a horse could’ve walked through without dismounting.

I would recommend taking a Danube cruise at night: the parliament building and other major landmarks are brightly lit by powerful lights. Also, make sure to try some goulash (Hungarian soup) for lunch and kürtőskalács (chimney cake) for dessert.

Out of the places I went to this trip, Budapest is my favorite. I’d definitely recommend adding Budapest to your next trip to Europe.

## Romania

From Budapest, we took an overnight train eastward across the Great Hungarian Plain and the Transylvanian mountains to Bucharest, the capital of Romania.

You can still feel the effects of Soviet communism in Bucharest. Most of the older historical buildings were damaged during war or systematically demolished under Ceausescu’s communist regime, and replaced with rows of austere-looking apartment blocks.

Brasov, our next stop in Romania, escaped relatively unscathed from World War 2. It’s a small town surrounded by lush green mountains, featuring a variety of shops, churches, and other historical buildings.

Nearby, you can visit the Bran and Peles castles: the former is tenuously connected to Vlad the Impaler, a historical figure who inspired Bram Stoker’s Dracula.

Lastly, I made a brief stop in London on the way back, where I got a chance to catch up with some friends living there. It turns out that plane tickets don’t satisfy the triangle inequality: flying from Bucharest to London, staying a few days, then flying to Toronto was a lot cheaper than flying from Bucharest to Toronto directly.

Now I’m back — ready to start a new semester of research at UofT!

# How to read research papers for fun and profit

One skill that I’ve learned after a year in grad school is how to effectively read research papers. Previously, I had found them impenetrable, but now I find them a great source of information about cutting-edge science while it is being done and before it’s made its way into textbooks. Now I read about 4-5 of them every week.

My research area is natural language processing and machine learning, but I read papers in lots of fields, not just in AI and computer science. Papers are my go-to source for a myriad of scientific inquiries, for example: does drinking alcohol cause cancer? Are women more talkative than men? Was winter in Toronto abnormally cold this year?

## Why read scientific papers?

If you try to Google questions like these, you typically end up on Wikipedia or some random article on the internet. Research papers are an underutilized resource with several advantages over other common sources of information on the internet.

Advantages over articles on the internet: no matter the topic, you will undoubtedly find articles about it on the internet. Some of these articles are excellent, but others are opinionated nonsense. Without being an expert yourself, it can be difficult to decide what information to trust. Peer-reviewed research papers are held to a much higher minimum quality standard, and for every claim they make, they have to clearly state their evidence, assumptions, how they arrived at the conclusion, and their degree of confidence in their result. You can examine the paper for yourself and decide whether the assumptions are reasonable and the conclusions follow logically, rather than take someone else’s word for it. With some digging and some critical thinking, you can avoid a lot of misinformation on the internet.

Advantages over Wikipedia: Wikipedia is a pretty reliable source of truth; in fact, it often cites scientific papers as its sources. However, Wikipedia is written to be concise, so oftentimes a 30-page research paper is summarized in 1-2 sentences. If you only read Wikipedia, you will miss a lot of the nuance contained in the original paper, and develop only a cursory understanding compared to going directly to the source.

## Finding the right paper to read

If your professor or colleague has assigned you a specific paper to read, then you can skip this section. A big part of the challenge of reading papers is deciding which ones to read. There are a lot of papers out there, and only a few will be relevant to you. Therefore, deciding what to read is a nontrivial skill in itself.

Research papers are most useful when you have a specific problem or question in mind. When I first started out reading papers, I approached this the wrong way. One day, I’d suddenly decide “hmm, complexity theory is pretty interesting, let’s go on arXiv and look at some recent complexity theory papers”.
Then, I’d open a few, attempt to read them, get confused, and conclude I’m not smart enough to read complexity theory papers.

Why is this a bad idea? A research paper exists to answer a very specific question, so it makes no sense to pick up a random paper without the background context. What is the problem? What approaches have been tried in the past, and how have they failed? Without understanding background information like this, it’s impossible to appreciate the contribution of a specific paper.

Above: Use the forward citation and related article buttons on Google Scholar to explore relevant papers.

It’s helpful to think of each research paper as a node in a massive, interconnected graph. Rather than existing as a standalone item, a paper is deeply connected to the research that came before and after it.

Google Scholar is your best friend for exploring this graph. Begin by entering a few keywords and picking a few promising hits from the first 2-3 pages. Good, this is your starting point. Here are some heuristics for traversing the paper graph:

• To go forward in time, look at works that cited this paper. A paper being cited usually means one of two things: (1) the future paper uses some technique or result developed in the current paper for some other purpose, or (2) the future paper improves on the techniques in the current paper. Citations of the second type are more useful.
• To go backward in time, look at the paper’s introduction and related work. This puts the paper in the context of previous work. Occasionally, you find a survey paper that doesn’t contribute anything novel of its own, but summarizes a bunch of previous related work; these are really helpful when you’re beginning your research in a topic.
• Citation count is a good indicator of a paper’s importance and merit. If the paper has under 10 citations, take its claims with a grain of salt (even more so if it’s an arXiv preprint and not a peer-reviewed paper). Over 100 citations means the paper has made a significant contribution; over 1000 citations indicates a landmark paper in the field that is probably worth reading. Citation count is not a perfect metric, especially for very recent work, but it’s a useful heuristic that’s applicable across disciplines.

## The first pass: High level overview

Great, you’ve decided on a paper to read. Now how do you read it effectively? Reading a paper is not like reading a novel. When you read a novel, you start at the beginning and read linearly until you reach the end. However, a paper is read most efficiently by hopping around the sections as appropriate, rather than reading linearly from beginning to end.

The goal of your first reading of a paper is to get a high-level overview before diving into the details. As you go through the paper, here are some good questions to ask yourself:

• What is the problem being solved?
• What approaches have been tried before, and what are their limitations?
• What is this paper’s novel contribution?
• What experiments were done, using what dataset? How successful were the results?
• Can the method in this paper be applied to my problem?
• If not, what assumptions are needed for this method to work?

Above: Treat each paper as a node in a massive graph of research, rather than a standalone item in a vacuum.

When I read a paper, I usually proceed in the following order:

1. Abstract: a long paragraph that summarizes the entire paper. Read this to decide if the rest of the paper is worth reading or not.
2. Introduction, diagrams, tables, and conclusion. Often, reading the diagrams and captions gives you a good idea of what’s going on with minimal effort.
3. If the field is unfamiliar to you, note down any interesting references in the introduction and related work sections to explore later. If the field is familiar, just skim these sections.
4. Read the main body of the paper: model, experiments, and discussion, without getting too bogged down in the details. If a section is confusing, skip it for now and come back to it on a second reading.

That’s it — you’ve finished reading a paper! Now you can either go back and read it again, focusing on the details you skimmed over on the first pass, or move on to a different paper that you’ve added to your backlog.

You should not expect to understand every aspect of the paper by the time you’re done. You can always refer back to the paper at a later time, as needed. Generally, you don’t need to understand all the details, unless you’re trying to replicate or extend the paper.

## Help, I’m stuck!

Sometimes, despite your best efforts, you find that a paper is impenetrable. It’s not necessarily your fault — some papers are hastily written hours before a conference deadline. What do you do now?

Look for a video or blog post explaining the paper. If you’re lucky, someone may have recorded a lecture where the author presents the paper at a conference. Maybe somebody wrote a blog post summarizing the paper (Colah’s blog has great summaries of machine learning research). These are often better at explaining things than the actual paper.

If there’s a lot of background terminology that doesn’t make sense, it may be better to consult other sources like textbooks and course lectures rather than papers. This is especially true if the research is not new (>10 years old). Research papers are not always the best at explaining a concept clearly: by their nature, they document research as it’s being done. Sometimes, the paper paints an incomplete picture of something that’s better understood later. Textbook writers can look back on research after it’s already done, and thereby benefit from hindsight knowledge that didn’t exist when the paper was written.

Basic statistics is useful in many experimental fields — concepts like linear / logistic regression, p-values, hypothesis testing, and common statistical distributions. Any paper that deals with experimental data will use at least some statistics, so it’s worthwhile to be comfortable with basic stats.

That’s it for my advice. The densely packed two-column pages of text may appear daunting to the uninitiated reader, but they can be conquered with a bit of practice. Whether it’s for work or for fun, you definitely don’t need a PhD to read papers.

# Useful properties of ROC curves, AUC scoring, and Gini Coefficients

Receiver Operating Characteristic (ROC) curves and AUC values are often used to score binary classification models in Kaggle and in papers. However, for a long time I found them fairly unintuitive and confusing. In this blog post, I will explain some basic properties of ROC curves that are useful to know for Kaggle competitions, and how you should interpret them.

Above: Example of a ROC curve

First, the definitions. A ROC curve plots the performance of a binary classifier under various threshold settings, as measured by true positive rate and false positive rate.
If your classifier predicts “true” more often, it will have more true positives (good) but also more false positives (bad). If your classifier is more conservative, predicting “true” less often, it will have fewer false positives but fewer true positives as well. The ROC curve is a graphical representation of this tradeoff.

A perfect classifier has a 100% true positive rate and a 0% false positive rate, so its ROC curve passes through the upper left corner of the square. A completely random classifier (ie: one predicting “true” with probability p and “false” with probability 1-p for all inputs) will by random chance correctly classify proportion p of the actual true values and incorrectly classify proportion p of the false values, so its true and false positive rates are both p. Therefore, a completely random classifier’s ROC curve is a straight line through the diagonal of the plot.

The AUC (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5. Usually, your model will score somewhere in between. The range of possible AUC values is [0, 1]. However, if your AUC is below 0.5, that means you can invert all the outputs of your classifier and get a better score, so you did something wrong.

The Gini Coefficient is 2*AUC - 1, and its purpose is to normalize the AUC so that a random classifier scores 0 and a perfect classifier scores 1. The range of possible Gini coefficient scores is [-1, 1]. If you search for “Gini Coefficient” on Google, you will find a closely related concept from economics that measures wealth inequality within a country.

Why do we care about AUC; why not just score by percentage accuracy? AUC is good for classification problems with a class imbalance. Suppose the task is to detect dementia from speech, and 99% of people don’t have dementia and only 1% do. Then you can submit a classifier that always outputs “no dementia”, and that would achieve 99% accuracy. It would seem like your 99% accurate classifier is pretty good, when in fact it is completely useless. Using AUC scoring, your classifier would score 0.5.
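
Here's a quick scikit-learn sketch of that imbalanced scenario (toy data; the numbers are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10000) < 0.01).astype(int)  # ~1% positive class
pred = np.zeros_like(y_true)                     # always predict "no dementia"

print(accuracy_score(y_true, pred))         # ~0.99: looks impressive
print(roc_auc_score(y_true, pred))          # 0.5: no better than random guessing
print(2 * roc_auc_score(y_true, pred) - 1)  # Gini coefficient = 0
```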

In many classification problems, the cost of a false positive is different from the cost of a false negative. For example, it is worse to falsely imprison an innocent person than to let a guilty criminal get away, which is why our justice system assumes you’re innocent until proven guilty, and not the other way around. In a classification system, we would use a threshold rule, where everything above a certain probability is treated as 1, and everything below is treated as 0. However, deciding where to draw the line requires weighing the cost of a false positive against a false negative — this depends on external factors and has nothing to do with the classification problem.

AUC scoring lets us evaluate models independently of the threshold. This is why AUC is so popular in Kaggle: it enables competitors to focus on developing a good classifier without worrying about choosing the threshold, and lets the organizers choose the threshold later. (Note: This isn’t quite true — a classifier can sometimes be better at certain thresholds and worse at other thresholds. Sometimes it’s necessary to combine classifiers to get the best one for a particular threshold. Details in the paper linked at the end of this post.)

Next, here’s a mix of useful properties to know when working with ROC curves and AUC scoring.

AUC is not directly comparable to accuracy, precision, recall, or F1-score. If your model is achieving 0.65 AUC, it’s incorrect to interpret that as “65% accurate”. The reason is that AUC exists independently of a threshold and is immune to class imbalance, whereas accuracy / precision / recall / F1-score all require you to pick a threshold, so you’re measuring two different things.

Only relative order matters for AUC score. When computing ROC AUC, we predict a probability for each data point, sort the points by predicted probability, and evaluate how close the result is to a perfect ordering of the points. Therefore, AUC is invariant under scaling, or any transformation that preserves relative order. For example, predicting [0.03, 0.99, 0.05, 0.06] is the same as predicting [0.15, 0.92, 0.89, 0.91], because the relative ordering of the 4 items is the same in both cases.

A corollary of this is that we can’t treat the outputs of an AUC-optimized model as the likelihood that the label is true. Some models may be poorly calibrated (eg: their output is always between 0.3 and 0.32) but still achieve a good AUC score because the relative ordering is correct. This is something to look out for when blending together predictions of different models.
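
You can verify the rank-invariance property directly with scikit-learn (hypothetical labels attached to the two example predictions above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 1, 0, 1])              # hypothetical labels
a = np.array([0.03, 0.99, 0.05, 0.06])
b = np.array([0.15, 0.92, 0.89, 0.91])  # same relative ordering as a

print(roc_auc_score(y, a) == roc_auc_score(y, b))       # True: only rank matters
print(roc_auc_score(y, a) == roc_auc_score(y, a * 10))  # True: scaling is harmless
```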

That’s my summary of the most important properties to know about ROC curves. There’s more that I haven’t talked about, like how to compute the AUC score. If you’d like to learn more, I’d recommend reading “An introduction to ROC analysis” by Tom Fawcett.

# I trained a neural network to describe pictures and it’s hilariously bad

This month, I’ve been working on a neural network to describe in a sentence what’s happening in a picture, otherwise known as image captioning. My model roughly follows the architecture outlined in the paper “Show and Tell: A Neural Image Caption Generator” by Vinyals et al., 2014.

A high-level overview: the neural network first uses a convolutional neural network to turn the picture into an abstract representation. Then, it uses this representation as the initial hidden state of a recurrent neural network or LSTM, which generates a natural language sentence. This type of neural network is called an encoder-decoder network and is commonly used for a lot of NLP tasks like machine translation.

Above: Encoder-decoder image captioning neural network (Figure 1 of paper)

When I first encountered LSTMs, I was really confused about how they worked, and how to train them. If your output is a sequence of words, what is your loss function and how do you backpropagate it? In fact, the training and inference passes of an LSTM are quite different. In this blog post, I’ll try to explain this difference.

Above: Training procedure for caption LSTM, given known image and caption

During training mode, we train the neural network to minimize the perplexity of the image-caption pair. Perplexity measures how likely the neural network is to generate the given caption when it sees the given image. If we’re training it to output the caption “a cute cat”, the likelihood is:

P(“a” | image) * P(“cute” | image, “a”) * P(“cat” | image, “a”, “cute”)

(Note: for numerical stability reasons, we typically work with sums of negative log likelihoods rather than products of likelihood probabilities, so the quantity we actually minimize is the negative log of that whole thing.)

After passing the whole sequence through the LSTM one word at a time, we get a single loss value, which we can minimize using backpropagation and gradient descent. As perplexity gets lower and lower, the LSTM is more likely to produce similar captions to the ground truth when it sees a similar image. This is how the network learns to caption images.
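
For the curious, here's a minimal PyTorch sketch of this training computation (the shapes and the way the image is fed in are simplifying assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

# Hypothetical setup: vocab of 1000 words, caption "a cute cat" = ids [4, 17, 52]
vocab_size, hidden = 1000, 512
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
to_vocab = nn.Linear(hidden, vocab_size)

image_feat = torch.randn(1, 1, hidden)  # stand-in for the CNN encoder output
caption = torch.tensor([[4, 17, 52]])   # ground-truth word ids

# Feed the image, then the caption shifted right (teacher forcing):
# step 1 predicts "a", step 2 predicts "cute", step 3 predicts "cat"
inputs = torch.cat([image_feat, embed(caption[:, :-1])], dim=1)
logits = to_vocab(lstm(inputs)[0])      # (1, 3, vocab_size)

# Sum of negative log likelihoods of the ground-truth words
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), caption.reshape(-1), reduction="sum")
loss.backward()  # train with any optimizer from here
```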

Above: Inference procedure for caption LSTM, given only the image but no caption

During inference mode, we repeatedly sample from the neural network, one word at a time, to produce a sentence. On each step, the LSTM outputs a probability distribution for the next word, over the entire vocabulary. We pick the highest-probability word, add it to the caption, and feed it back into the LSTM. This is repeated until the LSTM generates the end marker. Hopefully, if we trained it properly, the resulting sentence will actually describe what’s happening in the picture.

This is the main idea of the paper; I omitted a lot of details. I encourage you to read the paper for the finer points.

I implemented the model using PyTorch and trained it using the MS COCO dataset, which contains about 80,000 images of common objects and situations, each human-annotated with 5 captions. To speed up training, I used a pretrained VGG16 convnet and pretrained GloVe word embeddings from SpaCy. Using lots of batching, the Adam optimizer, and a Titan X GPU, the neural network trains in about 4 hours.

It’s one thing to understand how it works on paper, but watching it actually spit out captions for real images felt like magic.

Above: How I felt when I got this working

How are the results? For some of the images, the neural network does great:

“A train is on the tracks at a station”
“A woman is holding a cat in her arms”

Other times, the neural network gets confused, with amusing results:

“A little girl holding a stuffed animal in her hand”
“A baby laying on a bed with a stuffed animal”
“A dog is running with a frisbee in its mouth”

I’d say we needn’t worry about the AI singularity anytime soon 🙂

The original paper has some more examples of correct and incorrect captions that might be generated. Newer models have also made improvements to generate more accurate captions: for example, adding a visual attention mechanism improved the results a bit. However, state-of-the-art models still fall short of human performance; they often make mistakes when describing pictures with objects in unusual configurations.

This is a work in progress; the source code is on Github here.

# Publishing Negative Results in Machine Learning is like Proving Dragons don’t Exist

I’ve been reading a lot of machine learning papers lately, and one thing I’ve noticed is that the vast majority of papers report positive results — “we used method X on problem Y, and beat the state-of-the-art results”. Very rarely do you see a paper reporting that something doesn’t work.

The result is publication bias — if we only publish the results of experiments that succeed, even statistically significant results could be due to random chance, rather than anything actually significant happening. Many areas of science are facing a replication crisis, where published research cannot be replicated. There is some community discussion of encouraging more negative paper submissions, but as of now, negative results are rarely publishable. If you attempt an experiment but don’t get the results you expected, your best hope is to try a bunch of variations of the experiment until you get some positive result (perhaps on a special case of the problem), after which you pretend the failed experiments never happened. With few exceptions, any positive result is better than a negative result, like “we tried method X on problem Y, and it didn’t work”.

## Why publication bias is not so bad

I just described a cynical view of academia, but actually, there’s a good reason why the community prefers positive results: negative results are simply not very useful, and contribute very little to human knowledge. Now why is that?

When a new paper beats the state-of-the-art results on a popular benchmark, that’s definite proof that the method works. The converse is not true. If your model fails to produce good results, it could be due to a number of reasons:

• Your dataset is too small / too noisy
• You’re using the wrong batch size / activation function / regularization
• You’re using the wrong loss function / wrong optimizer
• Your model is overfitting
• You have a bug in your code

Above: Only when everything is correct will you get positive results; many things can cause a model to fail. (Source)

So if you try method X on problem Y and it doesn’t work, you gain very little information. In particular, you haven’t proved that method X cannot work. Sure, you found that your specific setup didn’t work, but have you tried making modification Z? Negative results in machine learning are rare because you can’t possibly anticipate all possible variations of your method and convince people that all of them won’t work.

## Searching for dragons

Suppose we’re scientists attending the International Conference of Flying Creatures (ICFC). Somebody mentioned it would be nice if we had dragons. Dragons are useful. You could do all sorts of cool stuff with a dragon, like ride it into battle.

“But wait!” you exclaim: “Dragons don’t exist!”

I glance at you questioningly: “How come? We haven’t found one yet, but we’ll probably find one soon.”

Your intuition tells you dragons shouldn’t exist, but you can’t articulate a convincing argument why. So you go home, and you and your team of grad students labor for a few years and publish a series of papers:

• “We looked for dragons in China and we didn’t find any”
• “We looked for dragons in Europe and we didn’t find any”
• “We looked for dragons in North America and we didn’t find any”

Eventually, the community is satisfied that dragons probably don’t exist, for if they did, someone would have found one by now. But a few scientists still harbor the possibility that there may be dragons lying around in a remote jungle somewhere. We just don’t know for sure.

This remains the state of things for a few years, until a colleague publishes a breakthrough result:

• “Here’s a calculation that shows that any dragon with a wing span longer than 5 meters will collapse under its own weight”

You read the paper, and indeed, the logic is impeccable. This settles the matter once and for all: dragons don’t exist (or at least the large, flying sort of dragons).

## When negative results are actually publishable

The research community dislikes negative results because they don’t prove a whole lot — you can have a lot of negative results and still not be sure that the task is impossible. For a negative result to be valuable, it needs to present a convincing argument why the task is impossible, and not just a list of experiments that you tried that failed. This is difficult, but it can be done.

Let me give an example from computational linguistics. Recurrent neural networks (RNNs) can, in theory, compute any function defined over a sequence. In practice, however, they had difficulty remembering long-term dependencies. Attempts to train RNNs using gradient descent ran into numerical difficulties known as the vanishing / exploding gradient problem.

Then, Bengio et al. (1994) formulated a mathematical model of an RNN as an iteratively applied function. Using ideas from dynamical systems theory, they showed that as the input sequence gets longer and longer, the result is more and more sensitive to noise. The details are technical, but the gist of it is that under some reasonable assumptions, training RNNs using gradient descent is impossible. This is a rare example of a negative result in machine learning — it’s an excellent paper and I’d recommend reading it.
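
You can observe the effect in a few lines of PyTorch (a toy demonstration, not Bengio et al.'s analysis): feed a plain RNN ever-longer sequences and measure how much gradient from the last output reaches the first input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(16, 16, batch_first=True)  # plain tanh RNN, no gating

for T in [10, 50, 200]:
    x = torch.randn(1, T, 16, requires_grad=True)
    out, _ = rnn(x)
    out[:, -1].sum().backward()
    # Gradient of the last output with respect to the first input:
    # with default initialization, this norm typically shrinks rapidly as T grows.
    print(T, x.grad[0, 0].norm().item())
```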

Above: A Long Short Term Memory (LSTM) network handles long term dependencies by adding a memory cell (Source)

Soon after the vanishing gradient problem was understood, researchers invented the LSTM (Hochreiter and Schmidhuber, 1997). Since training RNNs with gradient descent was hopeless, they added a ‘latching’ mechanism that allows state to persist through many iterations, thus avoiding the vanishing gradient problem. Unlike plain RNNs, LSTMs can handle long-term dependencies and can be trained with gradient descent; they are among the most ubiquitous deep learning architectures in NLP today.

After reading the breakthrough dragon paper, you pace around your office, thinking. Large, flying dragons can’t exist after all, as they would collapse under their own weight — but what about smaller, non-flying dragons? Maybe we’ve been looking for the wrong type of dragons all along? Armed with new knowledge, you embark on a new search…

Above: Komodo Dragon, Indonesia

…and sure enough, you find one 🙂

# XGBoost learns the Canadian Flag

XGBoost is a machine learning library that’s great for classification tasks. It’s often seen in Kaggle competitions, and usually beats other classifiers like logistic regression, random forests, SVMs, and shallow neural networks. One day, I was feeling slightly patriotic, and wondered: can XGBoost learn the Canadian flag?

Above: Our home and native land

Let’s find out!

## Preparing the dataset

The task is to classify each pixel of the Canadian flag as either red or white, given a limited number of data points. First, we read in the image with R and extract a color channel:

```r
library(png)
library(ggplot2)
library(xgboost)

img <- readPNG("canada.png")
# Channel 2 (green) cleanly separates the flag's two colors:
# it is 0 for the red pixels and 1 for the white pixels.
red <- img[,,2]
HEIGHT <- dim(red)[1]
WIDTH <- dim(red)[2]
```

Next, we sample 7500 random points for training. Also, to make it more interesting, each point has a probability of 0.05 of flipping to the opposite color:

```r
ERROR_RATE <- 0.05

get_data_points <- function(N) {
  x <- sample(1:WIDTH, N, replace = T)
  y <- sample(1:HEIGHT, N, replace = T)
  p <- round(red[cbind(y, x)])
  # Flip each label with probability ERROR_RATE (flips == 1 marks the noisy points)
  flips <- sample(c(0, 1), N, replace = T, prob = c(1 - ERROR_RATE, ERROR_RATE))
  p[flips == 1] <- 1 - p[flips == 1]
  data.frame(x = as.numeric(x), y = as.numeric(y), p = p)
}

data <- get_data_points(7500)
```

This is what our classifier sees:

Alright, let’s start training.

## Quick introduction to XGBoost

XGBoost implements gradient boosted decision trees, which were first proposed by Friedman in 1999.

Above: XGBoost learns an ensemble of short decision trees

The output of XGBoost is an ensemble of decision trees. Each individual tree by itself is not very powerful, containing only a few branches. But through gradient boosting, each subsequent tree tries to correct for the mistakes of all the trees before it, making the model better.

After many iterations, we get a set of decision trees; the sum of all their outputs is our final prediction. For more technical details of how this works, refer to this tutorial or the XGBoost paper.

## Experiments

Fitting an XGBoost model is very easy using R. For this experiment, we use decision trees of height 3, but you can play with the hyperparameters:

```r
fit <- xgboost(data = matrix(c(data$x, data$y), ncol = 2),
               label = data$p,
               nrounds = 1,
               max_depth = 3)
```


We also need a way of visualizing the results. To do this, we run every pixel through the classifier and display the result:

```r
plot_canada <- function(dataplot) {
  dataplot$y <- -dataplot$y
  dataplot$p <- as.factor(dataplot$p)

  ggplot(dataplot, aes(x = x, y = y, color = p)) +
    geom_point(size = 1) +
    scale_x_continuous(limits = c(0, 240)) +
    scale_y_continuous(limits = c(-120, 0)) +
    theme_minimal() +
    theme(panel.background = element_rect(fill = 'black')) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    scale_color_manual(values = c("white", "red"))
}
```

```r
fullimg <- expand.grid(x = as.numeric(1:WIDTH), y = as.numeric(1:HEIGHT))
fullimg$p <- predict(fit, newdata = matrix(c(fullimg$x, fullimg$y), ncol = 2))
fullimg$p <- as.numeric(fullimg$p > 0.5)
plot_canada(fullimg)
```



In the first iteration, XGBoost immediately learns the two red bands at the sides:

After a few more iterations, the maple leaf starts to take form:

By iteration 60, it learns a pretty recognizable maple leaf. Note that the decision trees split on x or y coordinates, so XGBoost can’t learn diagonal decision boundaries, only approximate them with horizontal and vertical lines.

If we run it for too long, then it starts to overfit and capture the random noise in the training data. In practice, we would use cross validation to detect when this is happening. But why cross-validate when you can just eyeball it?

That was fun. If you liked this, check out this post which explores various classifiers using a flag of Australia.

The source code for this blog post is posted here. Feel free to experiment with it.

# Kaggle Speech Recognition Challenge

For the past few weeks, I’ve been working on the TensorFlow Speech Recognition Challenge on Kaggle. The task is to recognize a one-second audio clip, where the clip contains one of a small number of words, like “yes”, “no”, “stop”, “go”, “left”, and “right”.

In general, speech recognition is a difficult problem, but it’s much easier when the vocabulary is limited to a handful of words. We don’t need complicated language models to detect phonemes and then string the phonemes into words, as Kaldi does for general speech recognition. Instead, a convolutional neural network works quite well.

## First Steps

The dataset consists of about 64,000 audio files, which have already been split into training / validation / testing sets. You are then asked to make predictions on about 150,000 audio files for which the labels are unknown.

Actually, this dataset had already been published in the academic literature, and people have published code to solve the same problem. I started with GCommandPytorch by Yossi Adi, which implements a speech recognition CNN in Pytorch.

The first step is to convert the audio file into a spectrogram, which is an image representation of sound. This is easily done using LibRosa.
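
Here's a sketch of that conversion (my own minimal version; the repo's actual preprocessing may differ, and the filename is hypothetical):

```python
import librosa

def audio_to_spectrogram(path):
    """Load a one-second clip and convert it to a log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=16000)
    spec = librosa.feature.melspectrogram(y=y, sr=sr)
    return librosa.power_to_db(spec)  # a 2D array, treated as an image

spec = audio_to_spectrogram("yes_0001.wav")
print(spec.shape)
```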

Above: Sample spectrograms of “yes” and “no”

Now we’ve converted the problem to an image classification problem, which is well studied. To an untrained human observer, all the spectrograms may look the same, but neural networks can learn things that humans can’t. Convolutional neural networks work very well for classifying images, for example VGG16:

Above: A Convolutional Neural Network (LeNet). VGG16 is similar, but has even more layers.

## Voice Activity Detection

You might ask: if somebody already implemented this, then what’s there left to do other than run their code? Well, the test data contains “silence” samples, which contain background noise but no human speech. It also has words outside the set we care about, which we need to label as “unknown”. The Pytorch CNN produces about 95% validation accuracy by itself, but the accuracy is much lower when we add these two additional requirements.

For silence detection, I first tried the simplest thing I could think of: take the maximum absolute value of the waveform and call the clip “silence” if that value is below a threshold. When combined with VGG16, this gets an accuracy of 0.78 on the leaderboard. This is a crude heuristic, because sufficiently loud noise would be considered speech.
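
In code, the heuristic is essentially a one-liner (the threshold value here is illustrative):

```python
import numpy as np

def is_silence(waveform, threshold=0.02):
    # Treat the clip as silence if its peak amplitude never exceeds
    # the threshold (crude: sufficiently loud noise still passes).
    return np.max(np.abs(waveform)) < threshold
```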

Next, I tried running openSMILE, which I use in my research to extract various acoustic features from audio. It implements an LSTM for voice activity detection: every 0.05 seconds, it outputs a probability that someone is talking. Combining the openSMILE output with the VGG16 prediction gave a score of 0.81.

## More improvements

I tried a bunch of things to improve my score:

1. Fiddled around with the neural network hyperparameters, which boosted my score to 0.85. Each epoch took about 10 minutes on a GPU, and the whole model takes about 2 hours to train. Somehow, Adam didn’t produce good results, and SGD with momentum worked better.
2. Took 100% of the data for training and used the public LB for validation (don’t do this in real life lol). This improved my score to 0.86.
3. Trained an ensemble of 3 versions of the same neural network, with the same hyperparameters but different randomly initialized weights, and took a majority vote to do prediction (see the sketch below). This improved the score to 0.87. I would’ve liked to train more, but other people in my research group needed to use the GPUs.
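
Here's a small sketch of the majority-vote step (toy arrays standing in for the three models' predictions):

```python
import numpy as np

def majority_vote(predictions):
    """predictions: list of integer label arrays, one per model, shape (n_samples,)."""
    stacked = np.stack(predictions)  # (n_models, n_samples)
    # For each sample, pick the most common predicted label
    return np.array([np.bincount(col).argmax() for col in stacked.T])

preds_a = np.array([0, 1, 2, 1])  # stand-ins for the three models' outputs
preds_b = np.array([0, 1, 1, 1])
preds_c = np.array([0, 2, 2, 1])
print(majority_vote([preds_a, preds_b, preds_c]))  # [0 1 2 1]
```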

In the end, the top scoring model had a score of 0.91, which beat my model by 4 percentage points. Although not enough to win a Kaggle medal, my model was in the top 15% of all submissions. Not bad!

My source code for the contest is available here.