How to deploy your deep learning side project on a budget

Imagine this scenario: you have successfully trained a nice big neural network model for your side project. The next step is deploying it, making it available for everyone to use and benefit from. However, you quickly realize that deploying your model is easier said than done. Running your model solely on a CPU would be too slow, but upon exploring GPU offerings, you are confronted with a disheartening reality – keeping a GPU running, even for the most basic on-demand g4dn instance on AWS, would cost an exorbitant $400 USD per month, which is probably more than you are willing to pay for hosting your side project. Not to mention that in the beginning, when you may only receive a few requests a day, keeping such a powerful GPU instance running feels wasteful. You may wonder: what viable alternatives exist? How can you deploy your model affordably without compromising on availability and latency?

This is exactly the challenge we encountered when deploying our Efficient NLP translation models. To give you some context, Efficient NLP is the translation engine powering LevelText, a language learning app that I developed. In LevelText, users search for web articles in a foreign language, and our backend must translate a batch of web articles in real time to return the results. So the translation needs to happen fairly quickly, or else the user will get impatient and go do something else.

We decided to deploy our model with serverless GPUs. In the serverless computing architecture, instead of managing instances, you define functions that are triggered by specific events, such as an HTTP request. Typically, you package your model and inference code into a Docker container and push the Docker image to a cloud registry like AWS ECR. The beauty of serverless lies in its automated resource provisioning – the platform takes care of allocating and shutting down resources as needed to execute your function. The good news is that you only pay for the time the function actually runs, so if the function is idle most of the time, your cost will be a few dollars a month rather than hundreds. Also, if your side project suddenly ends up on Hacker News and gains traction, serverless allows you to automatically scale up to meet the demand without additional engineering effort.
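
To make this concrete, here is a minimal sketch of what the code inside such a container might look like, assuming a Hugging Face pipeline. The `handler(event)` entry point and the t5-small model are purely illustrative; each serverless platform defines its own entry-point convention.

```python
# handler.py -- a hypothetical serverless inference function.
from transformers import pipeline

# Load the model once when the container starts (this is the "cold start");
# warm containers reuse it across requests.
translator = pipeline("translation_en_to_fr", model="t5-small")

def handler(event):
    """Called once per request, e.g. for each incoming HTTP event."""
    text = event["text"]
    result = translator(text, max_length=512)
    return {"translation": result[0]["translation_text"]}

if __name__ == "__main__":
    # Local smoke test before building the Docker image.
    print(handler({"text": "Serverless GPUs keep hosting costs low."}))
```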

When it comes to serverless for deep learning models, several challenges arise. The first is that AWS (as well as other major cloud providers like GCP and Azure) does not currently offer serverless GPUs. AWS provides several serverless options like Lambda and SageMaker Serverless Inference, but they are limited to CPU usage only. This limitation stems from their underlying infrastructure: AWS serverless services including Lambda and Fargate are built on Firecracker, a virtualization engine designed for quickly and securely spinning up serverless containers, but Firecracker has no GPU support.

With serverless computing, the most crucial metric is the time to start up a container, commonly known as cold start time. This rules out alternative options such as spinning up EC2 instances on-demand to handle requests: while EC2 instances offer GPU capabilities, they are not optimized for quick startup times (typically taking around 40 seconds to start up), which is impractical for time-sensitive applications.

Fortunately, there are a number of startups that have emerged to fill this gap and offer serverless GPUs, such as Banana, Runpod, Modal, Replicate, and many more. They are similar but differ on cold start time, availability and pricing of different GPUs, and configuration options. We evaluated several and opted for one of these startups, but the serverless GPU landscape is quickly evolving, and the most suitable option for you may be none of the above by the time you read this.

Also worth noting: before choosing a serverless GPU startup, it’s a good idea to run some benchmarks and check whether it’s really not possible to deploy your model on a CPU instance. With multicore processors, large-memory instances, and newer CPU instruction sets like AVX-512, running the model on CPUs may often be fast enough. Generally, deploying on CPUs is easier and gives you more options to choose from on the major cloud providers.
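
Such a benchmark can be as simple as timing a few inference calls on a representative input. Here is a rough sketch, again assuming a Hugging Face pipeline and an illustrative t5-small model rather than our actual one.

```python
import time
from transformers import pipeline

# Load a small translation model on CPU (device=-1 forces CPU inference).
translator = pipeline("translation_en_to_fr", model="t5-small", device=-1)
text = "Serverless GPUs are a good fit for spiky, low-volume workloads. " * 5

translator(text, max_length=512)  # warm-up so one-time setup doesn't skew the timing

n_runs = 5
start = time.perf_counter()
for _ in range(n_runs):
    translator(text, max_length=512)
print(f"Average CPU latency: {(time.perf_counter() - start) / n_runs:.2f} s")
```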

The next step is making your model smaller and/or faster. This can be achieved through several different approaches:

  • Quantization: This means reducing the precision of each weight in the model, such as from fp32 to fp16 or int8, or sometimes even smaller. Converting from fp32 to int8 means a 4x reduction in model size. There are several types of quantization; for int8 quantization in particular, there is the issue of determining the scale factor (how to map floating-point values to integers in the 0-255 range). The easiest approach is dynamic quantization, where the scale factors are determined at runtime (a minimal example follows this list); the alternative is static quantization, where the scale factors are precomputed using a calibration set, but the extra speedup is minimal, so it’s probably not worth it. There are also hardware and runtime considerations that affect the speedup from quantization, which I’ll get into later.
  • Pruning: This entails setting certain weights in the model to zero. The easiest method is magnitude pruning, which eliminates the lowest-magnitude weights, but there are other methods like movement pruning or lottery ticket pruning. In certain circumstances, up to 90% of the weights may be pruned. The challenge with pruning is that it requires a specialized runtime capable of handling sparse models; otherwise, setting weights to zero won’t make the model any faster. One promising library is Neural Magic’s DeepSparse, which runs sparse models efficiently on the CPU, but I haven’t tried it myself because it currently doesn’t support any sequence generation models (including translation).
  • Knowledge Distillation: Distillation takes a larger “teacher” model and uses it to “teach” a smaller “student” model to predict the teacher’s outputs. When done correctly, this is extremely effective, and of all the approaches it has the best potential for making a model faster. The student model can have a completely different architecture that’s optimized for fast inference; for example, in machine translation, knowledge distillation has been applied to train non-autoregressive models or shallow decoder models (see my video on NAR and shallow decoder models for details). However, knowledge distillation is relatively compute-intensive, as it requires setting up the training loop and running the entire training dataset through the teacher model to obtain its predictions. From my experience in other projects, knowledge distillation typically requires about 10% of the compute of training the teacher model from scratch.
  • Engineering optimizations: This aspect is underreported in the literature but can contribute significantly to performance gains: for example, profiling and optimizing critical code paths, performing inference in C++ instead of Python, and using fused kernels and operators optimized for GPU microarchitectures, like FlashAttention. This sounds intimidating, but in practice it often just means exporting your model and using an inference runtime engine that performs the appropriate optimizations (and that also supports your model architecture).
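
To make the quantization option concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch; the tiny two-layer model is a stand-in, not our actual translation model.

```python
import torch
import torch.nn as nn

# Post-training dynamic int8 quantization: scale factors are computed at
# runtime, so no calibration set is needed. The tiny model is a stand-in.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

quantized_model = torch.quantization.quantize_dynamic(
    model,             # the original fp32 model
    {nn.Linear},       # layer types to quantize
    dtype=torch.qint8, # target integer precision
)

# The quantized model is a drop-in replacement for CPU inference.
x = torch.randn(1, 512)
print(quantized_model(x).shape)
```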

Out of these four approaches, quantization and engineering optimizations are usually the most effective ways to achieve a speedup with minimal effort. These methods don’t require additional training, making them relatively straightforward to implement. A notable example is llama.cpp, which makes heavy use of quantization and hardware-specific optimizations to run LLaMA models on CPUs.

Above: Latency for one of our translation models, showing the cold start time and inference time for translating 5000 characters. The original model is run using Hugging Face and the quantized model on an optimized runtime.

In our case, our original model took about 18 seconds to process a single request after a cold start, which was unacceptable. But the majority of this time was spent cold starting the model, which involves retrieving the Docker image containing the model from the cloud registry. Consequently, reducing the model size was the crucial factor, since the cold start time is determined by the size of the model and Docker image. By quantizing the model to int8 and removing unnecessary dependencies from the image, we were able to reduce the cold start time by about 5x and bring the worst-case overall latency down to about 3 seconds.

Note that quantizing a model results in a slight decrease in accuracy. The extent of this drop depends on the quantization method and the model itself; in our case, after quantizing our model from fp32 to int8, we observed a degradation of about 1 BLEU point compared to our original model, which is not really noticeable. Nevertheless, it is a good idea to benchmark your model after quantization to assess the extent of the degradation and validate that the accuracy is still acceptable.
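
A quick way to do this is to score the outputs of both the original and quantized models against the same references with a metric library like sacreBLEU; the sentences below are made-up placeholders.

```python
import sacrebleu

# Toy example with made-up sentences: score both models' outputs against the
# same references and compare.
references = [["The cat sits on the mat.", "It is raining today."]]
original_outputs = ["The cat sits on the mat.", "It is raining today."]
quantized_outputs = ["The cat is sitting on the mat.", "It rains today."]

bleu_original = sacrebleu.corpus_bleu(original_outputs, references)
bleu_quantized = sacrebleu.corpus_bleu(quantized_outputs, references)
print(f"Original BLEU: {bleu_original.score:.1f}, quantized BLEU: {bleu_quantized.score:.1f}")
```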

Another crucial optimization is running the model on an efficient runtime. Why is this necessary? The requirements for the training and prototyping phases differ significantly from those of the deployment phase: training requires a library that can calculate gradients for backpropagation, loop over large datasets, split the model across multiple GPUs, etc., whereas none of this is necessary during inference, where the goal is usually to run a quantized version of the model on specific target hardware, whether it be CPU, GPU, or mobile.

Therefore, you will often use a different library for running the model than training it. To bridge this gap, a commonly used standard is ONNX (Open Neural Network Exchange). By exporting the model to ONNX format, you can convert the weights of your model and import it to execute with a different engine that is specifically tailored for deployment.
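
For instance, a plain PyTorch module can be exported with `torch.onnx.export`; the toy model below is illustrative, and real seq2seq models typically need extra steps, which I touch on later.

```python
import torch
import torch.nn as nn

# Export a toy PyTorch model to ONNX; real seq2seq models usually need extra
# steps (e.g. exporting the encoder and decoder separately).
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 8))
model.eval()

dummy_input = torch.randn(1, 512)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch size
)
```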

The execution engine needs to not only execute the forward pass on the model but also handle any inference logic, which can be quite complex and sometimes model-specific. For instance, generating a sequence output requires autoregressive decoding, so the engine needs to perform the model’s forward pass in a loop to generate tokens one at a time, using some generation algorithm like nucleus sampling or beam search (see my video if you’re interested in the glorious details of how beam search is implemented in Hugging Face).

We considered several libraries for running inference with our translation model (by no means an exhaustive list):

  • Hugging Face: this is a widely used library that is familiar to a lot of people and a good choice for prototyping and general-purpose use, as long as performance is not too crucial. However, since it is written in Python, it is often slower than other libraries; additionally, it is implemented on top of PyTorch and PyTorch lacks support for certain optimizations such as running quantized models on GPU. Nonetheless, Hugging Face is a good starting point to iterate from, and in many cases, the performance will be sufficient and you can move on with your project.
  • ONNX Runtime (ORT): this is probably what I’d use for most models; it handles the model’s forward pass on various deployment targets for models converted to the ONNX format. When converting from Hugging Face models, the Hugging Face Optimum library provides an interface to ORT that is similar to Hugging Face AutoModels, handling the scaffolding logic like beam search (see the sketch after this list). In an ideal world, the interface between ORT and Hugging Face would be identical to that of Hugging Face’s AutoModels, but in reality, due to underlying differences between ORT and PyTorch, achieving an exact match may not be possible. For instance, in our case, ORT seq2seq models require splitting the encoder and decoder into separate models, since ORT cannot execute parts of the model conditionally the way PyTorch allows.
  • Bitsandbytes: this is a specialized library that uses a form of mixed-precision decomposition for quantization. This is necessary for very large models exceeding about 6B parameters, where the authors found that large models have outlier dimensions and the usual quantization method results in a complete degradation of the model’s performance. For models smaller than 6B parameters, this mixed quantization technique is probably not necessary. Bitsandbytes integrates with Hugging Face, allowing you to load the weights onto the GPU in int8 format, but the model weights themselves must be stored on disk in fp32 format. Consequently, there is no reduction in model size on disk, and therefore no reduction in the model’s cold start time.
  • CTranslate2: this is a library for running translation models on CPU and GPU. It is written entirely in C++, including the beam search, and is designed to be fast with minimal dependencies. It supports only a handful of specific model architectures (mostly translation, but also some language and code generation models), and it is not straightforward to extend the library to handle other architectures, but it may be worth considering if your model is on the supported list.
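
As an example of the ORT route mentioned above, here is a minimal sketch using Optimum’s ONNX Runtime backend. The model name is illustrative, and depending on your Optimum version the export argument may be named differently (e.g. `from_transformers=True` in older releases).

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

# Load a translation model through Optimum's ONNX Runtime backend; the
# export flag converts the PyTorch checkpoint to ONNX under the hood.
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForSeq2SeqLM.from_pretrained(model_name, export=True)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4)  # beam search handled by the wrapper
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```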

In this post, our primary reason for quantization was to reduce the model size, but quantization can also make the model faster on certain hardware (though not always). On newer GPUs equipped with tensor cores, executing int8 operations is up to 4x faster than fp32 operations, but on older GPUs, int8 and fp32 operations have similar performance. Additionally, when the model and data occupy less memory on the GPU, larger batch sizes are possible, and consequently higher throughput for batch processing. If these performance considerations are important for you, it is worth looking into the FLOPS for different precision operations, which are listed in the GPU’s technical specifications sheet.

That’s it for now. Finally, if you’re looking for a budget-friendly translation API for your side projects, check out my Efficient NLP Translate API: it’s priced at 30% of the cost of the Google Translate API. You will get $10 in translation credits for free when you sign up.

Further reading:

Effectively self-studying over the Internet

The Internet is a great invention. Just about everything in humanity’s knowledge can be found on the Internet: with just a few keystrokes, you can find dozens of excellent quality textbooks, YouTube videos and blog posts about any topic you want to learn. The information is accessible to a degree that scholars and inventors could have hardly dreamed of even a few decades ago. With the popularity of remote work and everything moving online this decade, it is not surprising that many people are eager to learn new skills from the Internet.

However, despite the abundance of resources, self-studying over the Internet is harder than you think. You can easily find a list of resources on your topic, full of courses and textbooks to study from (like this one for theoretical physics). This knowledge is, in theory, enough to make you an expert, if only you were able to go through it all.

The problem is that internet self-study tends to be a passive activity. Humans cannot learn new skills effectively by only passively absorbing knowledge; we learn by doing. Ideally, 70% of your time should be active practice, trying out the skill and learning from experience, and maybe 30% passive reading to expand your theoretical and background knowledge. You can’t learn to ski just by reading books and watching tutorials about it; at some point, you will need to actually put on a pair of skis and head to the mountains.

Some skills are relatively straightforward to actively practice using only a computer. Learning to program, for example: you can build a fully functional website with just a laptop, and all the tools you need are available for free on the Internet. If you want to learn music production, everything can be done in software, and you need only a modest budget to buy some equipment. If you want to learn digital marketing, just set up an ecommerce website and play around with some Google ad campaigns, and you are good to go.

But many skills cannot be done using just a computer, in which case self-studying over the Internet is much more difficult. If you want to study chemistry, you can’t just set up a lab in your house and start playing around with chemicals. When the subject requires institutional resources to participate actively, this makes it difficult to self-study. In some cases it’s not even clear what counts as “actively” doing it, like studying philosophy or literature.

The situation is not hopeless, though. What you need to do is make the learning as active a process as possible. Here I will talk about some strategies I have used to learn more actively.

If possible, learn by doing. Obvious, but worth reiterating. Even for activities done entirely on a computer, it is very tempting to passively consume content about them instead of actively practicing. Many music production students watch hours of tutorials about how to use various plugins and their features instead of working on producing a song.

This is human nature: we are lazy creatures, and passively consuming content is much easier than doing. It feels like we are learning something, but if we don’t put the knowledge to use, we’ll quickly forget it. If your skill requires some modest investment in tools and equipment to start doing, it is well worth it, since you will learn much faster.

Do the exercises. Many textbooks and courses come with exercises to help you practice. In some fields, like mathematics, the standard way of learning is by doing lots of exercises. And in most university courses, much of your grade comes from assignments that you hand in.

In practice, though, I’ve found it difficult to muster the motivation to do exercises properly when I know that nobody is going to read them, and the experience is not very satisfying. Also, with nobody to check your answers, you don’t know whether you got them right. But even a few exercises are better than none at all, and the first few are the most valuable anyway, since you get diminishing returns the more exercises you do.

Discuss with others. The ideal situation is knowing somebody who is more experienced than you in the activity; then you can meet with them periodically to discuss what you are learning (and buy them lunch for the favor). If you can find a friend who is not an expert but is interested in learning the same subject together, that is helpful as well. Even if they are at the same level as you, they’ll uncover some knowledge gaps that you missed, and vice versa.

Take online lessons. A hack that I’ve used on numerous occasions when I don’t know anybody who knows the topic I’m learning is buying an expert’s time via online lessons. For learning languages, I like Italki, a platform where you pay hourly for lessons with a native speaker. There are platforms where you can schedule lessons on a variety of topics, or buy consulting sessions with an expert.

The most effective way to use online lessons is to come prepared with a list of questions, or ask for feedback on something you attempted to do. In my Italki sessions, what I often do is write a passage of text in Spanish, and during the lesson, have my teacher correct it for mistakes in grammar and phrasing. It’s important to do this because otherwise, many teachers will simply regurgitate lesson material that they prepared in advance, which is no better than just watching a YouTube video.

Write blog posts. This is one of the best ways to actively learn and can be applied to any subject. It is effective because the post will be visible to the public, so you will naturally make a serious effort to do it properly (much more than when doing throwaway exercises). I write blog posts because I prefer the written medium; some people prefer to make YouTube videos, which has a similar effect.

One limitation is that you can get it completely wrong, unwittingly spread misinformation, and nobody will notice or care enough to correct you. In other cases, readers have contacted me to correct errors in my blog posts, which is helpful for improving the content (albeit not so helpful for my learning if it has been several years since I wrote the post and I’ve already forgotten all the details).

Write book reviews. It is a lot of effort to write a blog post, and recently, not something that I’ve done that often. Instead, a lower effort version that I’ve been doing is writing a book review blog where I write a short summary and review of every book that I read. These are relatively quick to write since I’m just summarizing the main points of a book that I’ve just read (and is fresh on my mind), and I’m not trying to think of something original to say like for my blog posts, yet writing it is still somewhat of an active process.

Lessons learned after 6 months of building a language learning startup

For the past six months, I’ve been working on LevelText, a startup that helps you learn languages by reading authentic native material. Actually, calling it a startup is a bit of a stretch, because our app never produced any revenue and attracted only a handful of users.

Nevertheless, I worked on it seriously for about half a year with a cofounder, joined an incubator, incorporated a company, iterated on an idea and launched several times. During this process, I learned a lot about the startup ecosystem, web development, talking to users, and many aspects of running a business. In this article, I will tell the story of how we got here and what we learned.

Building a language learning app had been on my mind long before LevelText. I had always been passionate about languages: I had studied several, including French, Spanish, and Japanese, and I was a PhD student researching linguistics and natural language processing. Thus, during my idle time, I liked to brainstorm ideas for improving language learning. Surely, language learning could be improved with my AI and NLP expertise, right?

There were a lot of language learning apps out there already, the most famous being Duolingo, along with many alternatives including Babbel, Rosetta Stone, Busuu, and dozens of others. Most people in the language learning community liked to criticize these apps: they offered similar features like language games and flashcards, teaching you words and phrases, but you could use them for a long time without achieving much fluency in the language.

According to many language acquisition researchers and educators, a more effective method of learning languages was “comprehensible input”: essentially getting a lot of exposure to authentic material (whether it’s books, articles, TV shows, podcasts, etc). The language is ideally at a level slightly above what you can easily understand, but not so difficult that it’s overwhelming. This was in stark contrast to all the apps that taught language through unnatural games and exercises, thus, despite the crowded marketplace, I was confident that teaching languages using authentic reading material would be a revolutionary app idea.

Iteration 1: Learning languages through reading

Personally, I like to learn languages by reading books. My process was the following: while reading, I would mark all the words that I didn’t understand, then I would look them up in a dictionary and add them to my collection to be reviewed later. I was improving my Chinese at the time, and found this process quite tedious because inputting unknown logographic characters into a dictionary was nontrivial.

So my first idea was an app where you take a picture of a printed text where you highlighted some words, and it would use some computer vision to look up these words for you in a dictionary. Then, if you liked, you could use this information to create Anki flashcards to review later. We called the tool VocabAssist and my wife helped me implement the prototype.

At this time, I started to interview all of my friends who were into language learning. A few people agreed with me that reading in the target language was a pain point, but most people did not have physical books lying around, and preferred reading on their phone or computer. And people who owned books wanted to keep them clean rather than writing all over them.

Therefore, we redesigned the concept: the user would instead paste text into the app. The app would then use some NLP to decide which words you probably don’t know, given your language level, and look up their definitions in a dictionary for you. Yet a problem still remained: how were users supposed to find material to read? Making them copy/paste text from web pages would have been too clumsy.

After doing a bit of market research, we found two existing apps with a similar concept: LingQ and Readlang. These two apps did a decent job of helping learners study from a text, with fully featured dictionaries and flashcards, and they even supported video. However, the user had to already have some material on hand; otherwise, both apps had libraries of user-contributed content to read, but these were small, and you were unlikely to find what you were looking for if you had niche interests. It would’ve been much better if they could leverage all of the content available on the Internet.

Both my wife and I had other full-time responsibilities, so progress was slow. Neither of us knew much about web development (I was a decent backend engineer but knew very little frontend), so the app was barely functional. As a result, we never launched the app to the public and only demoed it to some friends. Before we could launch it for real, I would need a cofounder with more experience building websites, one who could probably handle business responsibilities as well.

Iteration 2: Search engine for French learners

At the end of 2021, I had just moved to Vancouver and was wrapping up my PhD; I had more free time and was ready to give this project a more serious effort. So I set up a profile on the YCombinator cofounder matching platform, and soon I matched with Jonny Kalambay. He was working as a software engineer at Yelp and also had a passion for language learning. We worked on a toy project as a trial of working together and then decided to become cofounders.

Jonny was a talented full-stack engineer who could build and launch prototypes within days. Following the advice of the books “Running Lean” and “The Mom Test”, we started to interview users of our target demographic who were interested in language learning, asking them how they learn languages, what were the biggest struggles, and what tools they wish existed.

It turned out that finding interesting reading material of an appropriate difficulty was a fairly common problem. Thus, LevelText was born. We decided to build a search engine tool for French learners: we would take the user’s search query, translate it into French, search the internet for articles about the topic, and finally, return a sorted list of the articles ranked by estimated reading difficulty.

Our MVP assumed the user spoke English and was studying French. We decided to support only one language in our initial launch, since this would allow us to test out the concept with minimal development. We picked French because it was the most widely studied language in Canada and both of us were familiar with it.

After we launched on ProductHunt and conducted user tests with some French learners, the response was lukewarm. The tool was novel and cool, but it didn’t provide enough value for them to keep coming back to the site. Intermediate and advanced learners already had lists of places to find reading material, and most articles on the internet were too difficult for beginners. We didn’t address the problem of motivation, one of the biggest struggles for language learners, and depended on the user remembering to use our site.

From the business side, the idea was difficult to monetize. There are several options for making money from a consumer-facing software product. Placing advertisements is the lowest effort option, but requires a massive amount of traffic to produce sufficient revenue: possible for a social network, but usually not for language learning apps. Most language learning apps make money by charging monthly or annual subscriptions. But we would need to provide a lot more value to justify charging a subscription – a simple tool wouldn’t cut it.

Iteration 3: Daily bite-sized reading

Our idea to fix the motivation problem was this: instead of requiring the user to make searches, why not send articles straight to their email? They would only need to indicate their interests once during sign-up, and then they would receive a lesson every day in their inbox. With the lesson prepared and ready to go, the friction to start studying would be much lower.

Each lesson consisted of a paragraph of simple-to-read French text accessible to beginners, and a multiple-choice question to test their comprehension. We started prototyping this concept, using the GPT-3 model from OpenAI to simplify French articles down to a simple paragraph, and manually writing the multiple choice questions. We then enrolled a handful of French learners from social language learning sites to validate the idea.

The results were promising: most of the learners opened the e-mail and answered the question for three days in a row, although we never tried to charge money for it (the first real test of product-market fit), and the amount of manual work to prepare lessons meant that we could not get a large sample size of users to experiment with. Also, our mini-lessons could be completed in under a minute, which we felt was too little content for people to pay for.

Founder troubles

It was around this time that we were invited to interview with YCombinator. This was a big deal because getting in meant receiving $500,000 USD of investment and connections to a network of experienced founders and mentors. We didn’t expect to get this far since our product hadn’t demonstrated traction yet; I guess it was our founder credentials that landed us the interview. We spent a day or two rehearsing our pitch and practicing answering questions for the rapid-fire 10-minute interview.

The next day, we received a rejection letter from YCombinator, basically saying that many great founders have tried to break into the language learning apps market, but it is hard to build a business in this space. Language learning (especially for non-English languages) was a hobby for most people with no urgent need, and they urged us to consider solving a must-have rather than a nice-to-have problem.

Our team morale took a hit after the rejection, and truth be told, we had gotten lots of hints at this point that language learning was not a great business direction. However, language learning was what we had been doing from the beginning, and we had no idea what else we could pivot to. Jonny suggested switching to making software for language teachers, because this seemed like a less competitive space than language learning apps. But I wasn’t thrilled with this direction: neither of us had connections or experience in education, so we had no good understanding of the problems that teachers faced and the realities of their daily working environment.

We started to fight more and more over product and business decisions: what level of language learners to aim at, how much to rely on automation versus manual content curation, how much we should care about monetization, etc. Neither of us was more experienced than the other in business, and neither of us trusted the other to take the lead on it. About two weeks after YCombinator, Jonny decided to leave LevelText and start his own company to make tools for language teachers. To be fair, he agreed to leave me with full ownership of the company and all its intellectual property (i.e., the code that we wrote), but I was on my own.

Iteration 4: Personalized courses from native content

Up to this point, we had talked to many users and ran some experiments, but we had never actually tried to charge money for our service. In this iteration, my plan was to take what I had learned thus far and build a product that at least had a chance of succeeding. By “success”, I mean making at least minimum wage, or roughly $3000 CAD/month.

Occasionally, you hear of startups raising investment on the founders’ credentials alone, without any revenue, but we talked to some investors and found this to be rare. Only incubators would invest on the order of $100k in early-stage startups with no revenue, and we thought this was too little to unlock any options we couldn’t already pursue, like hiring engineers. All the bigger investors expected to see monthly recurring revenue of at least $5-10k/month before our request for a $500k seed round would be taken seriously. So investment or no investment, I would need to start making that much revenue in the near future.

My business model was to bundle lessons into courses, which students could buy at a fixed cost. Several successful companies sold educational content bundled into courses, such as LinkedIn Learning and Chessable: the idea being that if most users drop off after a few weeks, we make more money by charging a one-time course purchase cost at the beginning than from a subscription (which they would quickly cancel). Another advantage of the course model was that I only needed to produce a fixed amount of content, whereas if I sold subscriptions, I would need to continuously generate new content for as long as they are subscribed, an unattractive proposition given that the app was not fully automated and a lot of work still needed to be done manually.

With a price point of $20 per course, I would need to sell just over 100 courses per month to make minimum wage: an optimistic, but not an impossible quantity. I built a landing page and a form where users input which topics they wanted to read about. The student then receives one trial lesson for free, at which point they would need to buy a course to receive more lessons.

To prepare a lesson, I would pick a French article of around B1 difficulty on the student’s desired topic, and run an algorithm to guess the most difficult French vocabulary in the article and provide English translations. Additionally, the lesson included a computer text-to-speech reading and an English translation of the full article. For $20, I would prepare 7 daily French lessons matching the student’s interests.

I launched the platform, bought some traffic from Google Adwords, but to my disbelief, nobody bought the courses. To figure out what was wrong, I conducted another round of user interviews. Again, the responses revealed that we were far from product market fit.

Everybody complained that the pricing was too expensive for what we provided: other digital services like Spotify and Netflix cost a similar amount or less, and you got a lot more. People also found the pricing model confusing and assumed it was $20/month for a subscription. They also complained that our app only helped you practice reading, with no exercises or anything to help with speaking, listening, or writing. Worst of all, users didn’t find much value in personalized lessons, which was supposed to be our unique selling point – they rated lessons catered to their personal interests about as interesting as lessons on the default topics.

Reflecting on this feedback, I think what happened makes a lot of sense. Most users, when asked to provide a list of their interests, gave fairly generic responses like “sports” or “politics”: not very useful for personalization. I wanted to make at least minimum wage to justify all the effort I was putting into this project, so I worked backwards from the $3000/month revenue target and designed the pricing scheme to achieve this goal. But my users didn’t care about me or how much money I wanted to make; they only saw that I didn’t offer much for $20, so of course they didn’t buy.

I gathered the user feedback and sketched out some ideas of how to improve the product for the next iteration. But honestly, I was running out of steam – after countless iterations and little to show for my effort, it became difficult to gather the focus and willpower to implement more features. Being a solo founder is not much fun: nobody else understood my problems or cared about my language learning startup. So I decided it was a good time to quit.

Why language learning is a poor business

Despite my passion for it, language learning is ultimately a weak business model, especially software-based apps. For English speakers, learning a language is mainly something that is done for fun; it is seldom a must-have. Learners typically get bored and quit after only a few weeks, when their initial motivation runs out and it becomes a chore. This makes it difficult to make a subscription-based model work (many language learning apps get around this by asking for payment of several months or a whole year upfront).

Most language learners don’t spend much money on their hobby – out of all the people I interviewed, hardly anybody had ever paid for an app. It’s hard to blame them – there are tons of free language learning resources on the internet, so there is rarely a compelling reason to justify taking out your credit card. Even when people do pay to learn a language, it is usually for human tutoring or physical books (which people expect to pay for), not software products.

Finally, the market is saturated with hundreds of language learning products offering similar features. There is no good way to measure results of language learning, so you have no way to prove that your method is better than the others in terms of educational efficacy. With no better alternative, most companies optimize for user retention, inevitably leading to an app filled with gamification features.

Was founding a startup the wrong choice for me?

There are several reasons why being a founder in this space was probably not the right choice for me personally. First of all, my main expertise is in AI: while useful, it is far from the biggest priority in the early stages of most consumer products. Language learning apps needed to be engaging and fun to use, and AI was at most a minor component. Before product-market fit, the main tasks in a startup are developing quick prototypes of web or mobile apps, networking on social media to find users, talking to those users, and aligning the product with their needs. As for AI, it is usually best to throw something together quickly with GPT-3 and improve it later if there is demand for it, because any more sophisticated machine learning would take too long relative to the benefits.

While I could probably learn web development, sales, and product management given enough time, I had no experience in any of these skills, meaning I had no competitive advantage over the thousands of other aspiring entrepreneurs and was at a disadvantage compared to founders with experience in these areas.

Another factor is personal risk tolerance: how long are you willing to work full-time on your startup without any revenue or investment? For me, the answer was six months. This short timeline drove me to only consider business models that would generate a couple thousand dollars a month of revenue right away, and to kill the idea if this goal could not be reached. But in reality, it is common to iterate for several years before achieving this goal (e.g., Readlang took about 3 years to reach $1000/month in revenue). Since I wasn’t prepared to risk several years, potentially with zero payout at the end, it would’ve probably been better to work on it on the side while working a regular job somewhere, instead of going all-in full-time with an overly optimistic timeline.

There is a common belief in the startup community that persistence in your startup is “good” and giving up is “bad”. Part of this is observation bias: you only read about successful companies that raise millions and get acquired, after a lot of failure and persistence, and never hear about the majority of startups that fail. This has led many talented young people to pursue the startup dream, often willing to fail at it for years. I don’t think this is a particularly noble act: there is nothing magical that happens when you incorporate a company and declare yourself to be a startup founder. Your “company” is only worth something if you can provide something of sufficient value to customers that they pay for it; if you have no revenue then you are basically no different from being unemployed (even if you write a lot of code and get a few people to use it).

Even though I didn’t succeed in my venture, this was a valuable learning experience. You learn much more by actually attempting a startup than just reading about them; the life of a founder is not as glamorous as portrayed in the media. Now I know what I don’t know, and have a better idea of what skills I’d need to have a better chance next time. That’s the end of my rather long post, I hope it will be useful to readers thinking about building a startup or a language learning app.

How I write NLP papers in 8 months: idea to publication

Earlier this month I defended my PhD thesis. The bulk of the work of a PhD consists of producing peer-reviewed publications, and in my field, three first-author publications on a coherent topic in top-tier venues (EMNLP, ACL, NAACL, TACL, etc) is typically sufficient to earn a PhD.

Reflecting on my process for producing my three papers, I noticed that all of them took roughly 8-10 months from the time of the initial idea, until submitting the 8-page manuscript to a conference. Furthermore, each paper followed a similar trajectory from how it evolved from a vague idea into a concrete research contribution.

This is definitely not the only way to write papers, but it has worked well for me. All three of my papers were accepted into main conferences on the first attempt, with reviewer scores consistently between 3.5 and 4 (on a 1-5 rating scale). Therefore, I think it’s a good method to iterate on research projects and reliably produce strong NLP papers.

Month 1: Literature review

When I begin a project, I typically only have a vague idea of the direction or some phenomenon I want to explore. Since I don’t have a good understanding yet of the field, it helps to spend some time reading the prior literature at this stage, instead of diving into experiments right away. This is my focus for the first 3-4 weeks of a project.

See my blog post here for a guide on how to find and read research papers.

By the end of the first month, I would’ve read about 50 papers, and have a good understanding of:

  • The theoretical frameworks and assumptions related to the problem.
  • Recent work in this area, and what datasets and methodologies they use to solve it.
  • Major challenges and open problems in this area, and why they remain difficult to solve.

At this point, after familiarizing myself with the recent work, I can usually identify some gaps that have not yet been addressed in the literature, and I have some ideas of how to begin solving them. This is when I begin running experiments – these initial experiments almost never make it into the final paper, but they allow me to start building an intuition for the problem and become familiar with commonly used datasets and techniques.

Months 2-4: Exploration

The next phase of the project is iterative exploration and experimentation. For the next two or three months, I work on experiments that build on top of each other, using lessons learned from one experiment to guide the design of the next. Most of these experiments will be “failures” – inconclusive for various reasons:

  • I discover that some theoretical assumptions turn out to be invalid, rendering the experiment pointless.
  • After running the experiment, I find that the results are not very interesting: they are explainable by something obvious, or there are no consistent patterns.
  • I try to run the experiment, but find that it’s impossible because the dataset is missing some crucial feature, or my tools are not powerful enough.

One thing you should never do is decide beforehand what evidence you want to find, and then run experiments until you find it. That would be bad science. So in this context, an experiment failure means it didn’t produce a result that’s interesting enough to include in a paper. An experiment might produce results that are different from what I expected, while being a very interesting and successful experiment.

During this phase, I read papers in a different way from the first month. Rather than casting a wide net, my reading is more focused on understanding details so that I can execute a specific experiment correctly.

After a few months of iteration and doing about 15-20 experiments, I have at least 2 or 3 with sufficiently interesting or cool results that I want to share with the community. These experiments will form the core of my new paper, but it’s not enough, since I still have to tie them together into a single coherent narrative, and strengthen all the weaknesses that would be mercilessly attacked during peer review.

Month 5: Telling a story

Before you can write a paper, you have to decide on a framing or narrative that aligns with your experiments. If this is not done correctly, the reader will be confused; your experiments will feel incoherent and unmotivated.

The same data and experiments can be framed in many different ways. Is our focus on evaluating several NLP models on how well they represent some linguistic property? Or are we using evidence from NLP models to argue for some theory of human language learnability? Or perhaps our main contribution is releasing a novel dataset and annotation schema?

To decide on a framing, we must consider several possible narratives and pick the one that best aligns holistically with our core experiments. We’ll need to provide a justification for it, which is usually not the original reason we did the experiment (since the exploration phase is so haphazard).

The product of this narrative brainstorming is a draft of an abstract of the final paper, containing the main results and motivation for them. By writing the abstract first, the overall scientific goal and structure of the paper is clarified. This also gives everyone an idea of gaps in the narrative and what experiments are still needed to fill in these gaps. Around this time is when I decide on a conference submission date to aim for, around 3-4 months down the road.

Months 6-7: Make it stronger

Now we are on the home stretch of the project: we have decided on the core contributions, we now just have to patiently take the time to make it as strong as possible. I make a list of experiments to be done to strengthen the result, like running it with different models, different datasets in other languages, ablation studies, controlling for potential confounding variables, etc.

Essentially I look at my own paper from the perspective of a reviewer, asking myself: “why would I reject this paper?” My co-authors will help out by pointing out flaws in my reasoning and methodology, anticipating problems in advance of the official peer review and giving me ample time to fix them. The paper probably has a decent chance of acceptance without all this extra work, but it is worth it because it lowers the risk of having the paper rejected and needing to resubmit, which would waste several valuable months for everyone.

Month 8: Paper writing

It takes me about 3 weeks to actually write the paper. I like to freeze all the code and experimental results one month before the deadline, so that during the last month, I can focus on presentation and writing. When all the tables and figures are in place, it is a lot easier to write the paper without having to worry about which parts will need to be updated when new results materialize.

The experiment methodology and results sections are the easiest to write since that’s what’s been on my mind for the past few months. The introduction is the most difficult since I have to take a step back and think about how to present the work for someone who is seeing it for the first time, but it is the first thing the reader sees and it’s perhaps the most important part of the whole paper.

A week before the deadline, I have a reasonably good first draft. After sending it out to my co-authors one last time to improve the final presentation, I’m ready to press the submit button. Now we cross our fingers and wait eagerly for the acceptance email!

Parting advice

There were two things that helped me tremendously during my PhD: reading lots of NLP papers, and having a good committee.

Reading a lot of NLP papers is really useful because it helps you build an intuition of what good and bad papers look like. Early in my research career, I participated in a lot of paper reading groups, where we discussed recent papers (both published papers and arXiv preprints) and talked about which parts were strong and weak, and why. I noticed recurring trends of common problems and how strong papers handle them, so that I could incorporate the same solutions into my own papers.

This is sort of like training a GAN (generative adversarial network). I trained myself to be a good discriminator of good vs bad papers, and this is useful for my generator as well: when my paper passes my own discriminator, it is usually able to pass peer review as well.

Another thing that helped me was having a solid committee of experts from different academic backgrounds. This turned out to be very useful because they often pointed out weaknesses and faulty assumptions that I did not realize, even if they didn’t have a solution of how to fix these problems. This way I have no surprises when the peer reviews come out: all the weaknesses have already been pointed out.

For the PhD students reading this, I have two pieces of advice. First, read lots of papers to fine-tune your discriminator. Second, get feedback on your papers as often and as early as possible. It is a lot less painful to receive that feedback while you’re still in the exploratory phase of the project than to get the same comments from reviewers after you’ve submitted the paper.

I am looking for a position as an NLP research scientist or machine learning engineer. Here is my CV. I can work in-person from Vancouver, Canada or remotely. If your company is hiring, please leave me a message!

Virtual NLP Conferences: The Good and the Bad

We are now almost two years into the COVID-19 pandemic: international travel has slowed to a trickle and all machine learning conferences have moved online. By now, most of my PhD research has taken place during the pandemic, which I’ve presented at four online conferences (ACL ’20, EMNLP ’20, NAACL ’21, and ACL ’21). I’ve also had the fortune to attend a few in-person conferences before the pandemic, so in this post I’ll compare the advantages of each format.

Travel

One of the perks of choosing grad school for me was the chance to travel to conferences to present my work. Typically the first author of each paper gets funded by the university to present their work. The university pays for the conference fees, hotels, and airfare, which adds up to several thousand dollars per conference. With all conferences online, the school only pays for the conference fee (a trivial amount, about $25-$100). Effectively, a major part of grad student compensation has been cut without replacing it with something else.

There are some advantages though. Before the pandemic, it was mandatory to travel to present your work at the conference. This can be at an inconvenient time or location (such as a 20-hour flight to Australia), so I avoided submitting to certain conferences because of this. With virtual conferences, I can submit anywhere without location being a factor.

Another factor is climate change. The IPCC report this year announced that the earth is warming up at an alarming rate, and at an individual level, one of the biggest contributors to greenhouse gas emissions is air travel. Thousands of grad students travelling internationally several times a year adds up to a significant amount of carbon emissions. Therefore, unless there are substantial benefits to meeting in-person, the climate impact alone is probably worth keeping to virtual conferences post-pandemic.

Talks and Posters

Typically in conferences, paper presentations can be oral (a 12-minute talk with 3 minutes of Q/A) or poster (standing beside a poster for 2 hours). Online conferences mimicked this format: oral presentations were done in Zoom calls, while poster sessions were done in Gather Town (a game-like environment where you move around an avatar). Additionally, most conferences had the authors record their presentations in advance, so the videos were available to watch at any time during the conference and afterwards.

Above: Presenting my poster in Gather Town (ACL 2021)

For me, Gather Town was quite successful at replicating the in-person conference discussion experience. I made a list of people I wanted to talk to, and either attended their poster session if they had one, or logged on to Gather Town several times a day to check if they were online and then went to talk to them. This created an environment where it was easy to enter into spontaneous conversations, without the friction of scheduling a formal Zoom call.

The live oral sessions on Zoom were quite pointless, in my opinion, since the videos were already available to watch asynchronously at your own pace. There was no real opportunity for interaction in the 3-minute Q/A period, so this felt like a strictly worse format than just watching the videos (I usually watch these at 2x speed, which is impossible in a live session). Therefore I didn’t attend any of them.

The paper talk videos were by far the most useful feature of online conferences. Watching a 12-minute video is a good way of getting a high-level overview of a paper, and much faster than reading the paper: I typically watch 5 of them in one sitting. They are available after the conference, leaving a valuable resource for posterity. This is one thing we should keep even if we return to in-person conferences: the benefit is high, while the burden to authors is minimal (if they already prepared to give a talk, it is not much additional effort to record a video).

Collaborations

A commonly stated reason in favor of in-person conferences is the argument that face-to-face interaction is good for developing collaborations across universities. In my experience at in-person conferences, while I did talk to people from other universities, this never resulted in any collaborations with them. Other people’s experiences may vary though, if they’re more extroverted than me.

Virtual meetings certainly put a strain on collaborations, but this concern applies more to collaborations within an institution. (For example, I haven’t talked to most people in my research group for the last year and a half, and I have no idea what they’re working on until I see their papers published.) This probably doesn’t extend to conferences, though: one week is not enough time for any serious cross-institution collaboration to happen.

Final Thoughts

Like many others, I regret having missed out on so many conferences, which used to be a quintessential part of the PhD experience. There is a positive side: virtual conferences are a lot more accessible, making it feasible to attend conferences in adjacent research areas (like ICML, CVPR), since the $50 registration fee is trivial compared to the cost of travelling internationally.

Many aspects of virtual conferences have worked quite well, so we should consider keeping them virtual after the pandemic. The advantages of time, money, and carbon emissions are significant. However, organizers should embrace the virtual format and not try to mimic a physical conference. There is no good reason to have hours of back-to-back Zoom presentations, when the platform supports uploading videos to play back asynchronously. The virtual conference experience can only get better as organizers learn from past experience and the software continues to improve.

The Efficient Market Hypothesis in Research

A classic economics joke goes like this:

Two economists are walking down a road, when one of them notices a $20 bill on the ground. He turns to his friend and exclaims: “Look, a $20 bill!” The other replies: “Nah, if there were a $20 bill on the ground, someone would’ve picked it up already.”

The economists in the joke believe in the Efficient Market Hypothesis (EMH), which roughly says that financial markets are efficient and there’s no way to “beat the market” by making intelligent trades.

If the EMH were true, then why is there still a trillion-dollar finance industry with active mutual funds and hedge funds? In reality, the EMH is not a universal law of economics (like the law of gravity), but more like an approximation. There may exist inefficiencies in markets where stock prices follow a predictable pattern and there is profit to be made (e.g., stock prices fall when it’s cloudy in New York). However, as soon as someone notices the pattern and starts exploiting it (by making a trading algorithm based on weather data), the inefficiency disappears. The next person will find zero correlation between weather in New York and stock prices.

There is a close parallel in academic research. Here, the “market” is generally efficient: most problems that are solvable are already solved. There are still “inefficiencies”: open problems that can be reasonably solved, and one “exploits” such an inefficiency by solving the problem and publishing a paper. Once exploited, it is no longer available: nobody else can publish the same paper solving the same problem.

Where does this leave the EMH? In my view, the EMH is a useful approximation, but its accuracy depends on your skill and expertise. For non-experts, the EMH is pretty much universally true: it’s unlikely that you’ve found an inefficiency that everyone else has missed. For experts, the EMH is less often true: when you’re working in highly specialized areas that only a handful of people understand, you begin to notice more inefficiencies that are still unexploited.

A large inefficiency is like a $20 bill on the ground: it gets picked up very quickly. An example of this is when a new tool is invented that can straightforwardly be applied to a wide range of problems. When the BERT model was released in 2018, breaking the state-of-the-art on all the NLP benchmarks, there was instantly an explosion of activity as researchers raced to apply it to all the important NLP problems and be the first to publish. By mid-2019, all the straightforward applications of BERT were done, and the $20 bill was no more.

Above: Representation of the EMH in research. To outsiders, there are no inefficiencies; to experts, inefficiencies exist briefly before they are exploited. Loosely inspired by this diagram by Matt Might.

The EMH implies various heuristics that I use to guide my daily research. If I have a research idea that’s relatively obvious, and the tools to attack it have existed for a while (say, >= 3 years), then probably one of the following is true:

  1. Someone already published it 3 years ago.
  2. The idea doesn’t work very well.
  3. The result is not that useful or interesting.
  4. One of its basic assumptions is wrong, so the idea doesn’t even make sense.
  5. Etc.

Conversely, a research idea is much more likely to be fruitful (i.e., a true inefficiency) if the tools to solve it have only existed for a few months, if it requires data and resources that nobody else has access to, or if it requires a rare combination of insights that conceivably nobody else has thought of.

Outside the realm of the known (the red area in my diagram), there are many questions that are unanswerable. These include the hard problems of consciousness and free will, P=NP, etc, or more mundane problems where our current methods are not strong enough. For an outsider, these might seem like inefficiencies, but it would be wise to assume they’re not. The EMH ensures that true inefficiencies are quickly picked up.

To give a more relatable example, take the apps Uber (launched in 2009) and Instagram (launched in 2010). Many of the apps on your phone probably launched around the same time. In order for Uber and Instagram to work, people needed to have smartphones that were connected to the internet, with GPS (for Uber) and decent quality cameras (for Instagram). Neither of these ideas would’ve been possible in 2005, but thanks to the EMH, as soon as smartphone adoption took off, we didn’t have to wait very long for all the viable use-cases of the new technology to emerge.

How can deep learning be combined with theoretical linguistics?

Natural language processing is mostly done using deep learning and neural networks nowadays. In a typical NLP paper, you might see Transformer models or RNNs, plenty of linear algebra and statistics, but very little linguistic theory. Is linguistics irrelevant to NLP now, or can the two fields still contribute to each other?

In a series of articles in the journal Language, Joe Pater discussed the history of neural networks and generative linguistics, and invited experts to give their perspectives on how the two may be combined going forward. I found their discussion very interesting, although a bit long (almost 100 pages). In this blog post, I will give a brief summary of it.

Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion

Research in generative syntax and neural networks began at the same time, in 1957, and both were broadly considered part of AI, but the two schools mostly stayed separate, at least for a few decades. In neural network research, Rosenblatt proposed the perceptron learning algorithm and realized that hidden layers were needed to learn XOR, but didn’t know of a procedure to train multi-layer NNs (backpropagation hadn’t been invented yet). In generative grammar, Chomsky studied natural languages as formal languages and proposed controversial transformational rules. Interestingly, both schools faced challenges concerning the learnability of their systems.

Above: Frank Rosenblatt and Noam Chomsky, two pioneers of neural networks and generative grammar, respectively.

The first time these two schools were combined was in 1986, when an RNN was used to learn a probabilistic model of the past tense. This shows that neural networks and generative grammar are not incompatible, and the dichotomy is a false one. Another way of combining them comes from Harmonic Optimality Theory in theoretical phonology, which extends OT to continuous constraint weights; the procedure for learning these weights is similar to gradient descent.
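
To make the flavor of that learning procedure concrete, here is a toy sketch of weighted-constraint learning in the Harmonic Grammar style: each candidate output has a vector of constraint violation counts, its harmony is the negative weighted sum of violations, and the weights are nudged with a perceptron-style update whenever the grammar picks the wrong winner. The constraints, candidates, and observed form below are all made up for illustration; this is not the specific method discussed in the article.

```python
# Toy sketch of weighted-constraint (Harmonic Grammar style) learning.
# Harmony of a candidate = -(weights . violation counts); the grammar picks
# the candidate with the highest harmony. Weights are updated with a
# perceptron-style rule, which is closely related to gradient descent.
import numpy as np

candidates = ["pat", "pa"]
violations = np.array([
    [1.0, 0.0],   # "pat" violates a hypothetical *CODA constraint once
    [0.0, 1.0],   # "pa" violates a hypothetical MAX (no deletion) constraint once
])
observed_winner = 1   # suppose the learner observes "pa"

weights = np.zeros(2)
lr = 0.1

for step in range(100):
    harmony = -violations @ weights
    predicted = int(np.argmax(harmony))
    if predicted != observed_winner:
        # Nudge weights so the observed form becomes the most harmonic one.
        weights += lr * (violations[predicted] - violations[observed_winner])
        weights = np.maximum(weights, 0.0)   # keep weights nonnegative

print(dict(zip(["*CODA", "MAX"], weights.round(2))))   # e.g. {'*CODA': 0.1, 'MAX': 0.0}
```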

Neural models have proved to be capable of learning a remarkable amount of syntax, despite having far fewer structural priors than Chomsky’s model of Universal Grammar. At the same time, they fail on certain complex examples, so maybe it’s time to add back some linguistic structure.

Linzen’s Response

Linguistics and DL can be combined in two ways. First, linguistics is useful for constructing minimal pairs for evaluating neural models, when such examples are hard to find in natural corpora. Second, neural models can be quickly trained on data, so they’re useful for testing learnability. By comparing human language acquisition data with what various neural architectures can learn, we can gain insights into how human language acquisition works. (But I’m not sure how such a deduction would logically work.)

Potts’s Response

Formal semantics has not had much contact with DL, as formal semantics is based around higher-order logic, while deep learning is based on matrices of numbers. Socher did some work on representing tree-based semantic composition as operations on vectors.

Above: Formal semantics uses higher-order logic to build representations of meaning. Is this compatible with deep learning?

In several ways, semanticists make different assumptions from deep learning. Semantics likes to distinguish meaning from use, and to consider compositional meaning separately from pragmatics and context, whereas DL cares most of all about generalization and has no reason to discard context or separate semantics from pragmatics. Compositional semantics does not try to analyze the meaning of lexical items, leaving them as atoms; DL has word vectors, but linguists object that the individual dimensions of word vectors are not easily interpretable.

Rawski and Heinz’s Response

Above: Natural languages exhibit features that span various levels of the Chomsky hierarchy.

The “no free lunch” theorem in machine learning says that you can’t get better performance for free: any gains on some problems must be compensated by losses on others. A model performs well if it has an inductive bias well-suited to the type of problems it is applied to. This is true for neural networks as well, so we need to study their inductive biases: which classes of languages in the Chomsky hierarchy are NNs capable of learning? We must not confuse ignorance of bias with absence of bias.
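
As a concrete illustration of how such a question can be probed empirically, here is a hedged toy sketch: generate strings of the context-free language a^n b^n together with length-matched negatives, train a small LSTM to decide membership, and then test on strings longer than anything seen in training. The architecture, hyperparameters, and data generation are my own illustrative choices, not anything proposed in the response.

```python
# Toy probe of inductive bias: can a small LSTM learn membership in the
# context-free language { a^n b^n }? Positives are a^n b^n; negatives are
# length-matched strings over {a, b} that are not of that form.
import random
import torch
import torch.nn as nn

def sample(max_n=10):
    n = random.randint(1, max_n)
    if random.random() < 0.5:
        return "a" * n + "b" * n, 1                      # positive example
    while True:                                          # negative example
        s = "".join(random.choice("ab") for _ in range(2 * n))
        if s != "a" * (len(s) // 2) + "b" * (len(s) // 2):
            return s, 0

class Classifier(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(2, 8)                  # a -> 0, b -> 1
        self.lstm = nn.LSTM(8, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, ids):
        _, (h, _) = self.lstm(self.embed(ids))
        return self.out(h[-1]).squeeze(-1)               # logit for "is a^n b^n"

model = Classifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    s, y = sample()
    ids = torch.tensor([[0 if c == "a" else 1 for c in s]])
    loss = loss_fn(model(ids), torch.tensor([float(y)]))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probe generalization on a string longer than anything seen in training.
with torch.no_grad():
    test = "a" * 15 + "b" * 15
    ids = torch.tensor([[0 if c == "a" else 1 for c in test]])
    print(torch.sigmoid(model(ids)).item())              # acceptance probability
```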

Berent and Marcus’s Response

There are significant differences between how generative syntax and neural networks view language, and these must be resolved before the two fields can make progress on integration. The biggest difference is the “algebraic hypothesis”: the assumption that there exist abstract algebraic categories, like Noun, that are distinct from their instances. This allows you to make powerful generalizations using rules that operate on abstract categories. On the other hand, neural models try to process language without structural representations, and this results in failures to generalize.

Dunbar’s Response

The central problem in connecting neural networks to generative grammar is the implementational mapping problem: how do you decide whether a neural network N is implementing a linguistic theory T? The physical system might not look anything like the abstract theory (e.g., implementing addition can look like squiggles on a piece of paper). Some limited classes of NNs can be mapped to harmonic grammar, but most cannot, and the success criterion is unclear right now. Future work should study this problem.

Pearl’s Response

Neural networks learn language but don’t really try to model human neural processes. This could be an advantage, as neural models might find generalizations and building blocks that a human would never have thought of, and new tools in interpretability can help us discover these building blocks contained within the model.

Launching Lucky’s Bookshelf

I recently launched a new book review website / blog:

https://luckybookshelf.com/

Here, I will post reviews of all the books I read (about one book a week). It has over 100 book reviews already, all the reviews I wrote since 2016. My books cover a wide range of topics, including science, linguistics, economics, philosophy, and more. Come check it out!

The biggest headache with Chinese NLP: indeterminate word segmentation

I’ve had a few opportunities to work with NLP in Chinese. English and Chinese are very different languages, yet generally the same techniques apply to both. But there is one source of frustration that comes up from time to time, and it’s perhaps not what you’d expect.

The difficulty is that Chinese doesn’t put spaces between words. Soallyourwordsarejumbledtogetherlikethis.

“Okay, that’s fine,” you say. “We’ll just have to run a tokenizer to separate the words before we do anything else. And here’s a neural network that can do this with 93% accuracy (Qi et al., 2020). That should be good enough, right?”

Well, kind of. Accuracy here isn’t very well-defined, because native Chinese speakers don’t agree on how to segment words either. When you ask two native Chinese speakers to segment a sentence into words, they only agree about 90% of the time (Wang et al., 2017). Chinese has a lot of compound words and multi-word expressions, so there’s no widely accepted definition of what counts as a word. Some examples: 吃饭,外国人,开车,受不了. It is also possible (but rare) for a sentence to have multiple segmentations that mean different things.
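
As a concrete example of what a neural segmenter gives you, here is a minimal sketch using Stanza, the Qi et al. (2020) toolkit mentioned above. It assumes the package is installed and downloads the Chinese models on first run; the printed segmentation is one plausible output, not a guaranteed one.

```python
# Minimal sketch: segmenting a Chinese sentence with Stanza (Qi et al., 2020).
# Requires `pip install stanza` and a one-time model download.
import stanza

stanza.download("zh")                               # fetch the Chinese models
nlp = stanza.Pipeline("zh", processors="tokenize")  # word segmentation only

doc = nlp("我受不了外国人开车")
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])
# One plausible output: ['我', '受不了', '外国人', '开车'] -- but another
# segmenter (or a human annotator) might split 外国人 into 外国 / 人,
# and both choices are defensible.
```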

Arguably, word boundaries are ill-defined in all languages, not just Chinese. Haspelmath (2011) defined 10 linguistic criteria to determine whether something is a word (vs. an affix or expression), but it’s hard to come up with anything consistent. Most writing systems put spaces between words, so there’s no confusion. Other than Chinese, only a handful of languages (Japanese, Vietnamese, Thai, Khmer, Lao, and Burmese) have this problem.

Word segmentation ambiguity causes problems in NLP systems when different components expect different ways of segmenting a sentence. Another way the problem can appear is if the segmentation for some human-annotated data doesn’t match what a model expects.

Here is a more concrete example from one of my projects. I’m trying to get a language model to predict a tag for every word (imagine POS tagging using BERT). The language model uses SentencePiece encoding, so when a word is out-of-vocab, it gets converted into multiple subword tokens.

“expedite ratification of the proposed law”
=> [“expedi”, “-te”, “ratifica”, “-tion”, “of”, “the”, “propose”, “-d”, “law”]

In English, a standard approach is to use the first subword token of every word and ignore the other tokens; a small sketch of this alignment is given below.
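
Here is a small, self-contained sketch of that first-subword alignment, using the example sentence above. The subword tokenization and the subword-to-word mapping are hard-coded for illustration; in practice they would come from the tokenizer.

```python
# Sketch of the "use the first subword of each word" alignment. The
# tokenization below is hard-coded for illustration; in practice it would
# come from a SentencePiece (or similar) tokenizer.
words = ["expedite", "ratification", "of", "the", "proposed", "law"]
subwords = ["expedi", "-te", "ratifica", "-tion", "of", "the", "propose", "-d", "law"]

# Hypothetical mapping from each subword position to the word it belongs to.
subword_to_word = [0, 0, 1, 1, 2, 3, 4, 4, 5]

# Keep only the first subword position of every word; word-level tags are then
# predicted (or read off) at exactly these positions.
first_subword_index = {}
for position, word_index in enumerate(subword_to_word):
    first_subword_index.setdefault(word_index, position)

print(sorted(first_subword_index.values()))   # [0, 2, 4, 5, 6, 8]
```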

This doesn’t work in Chinese: because of the word segmentation ambiguity, the tokenizer might produce subword tokens that span across multiple of our words, so some words have no well-defined first subword token to attach a tag to.

So that’s why Chinese is sometimes headache-inducing when you’re doing multilingual NLP. You can work around the problem in a few ways:

  1. Ensure that all parts of the system use a consistent word segmentation scheme. This is easy if you control all the components, but hard when working with other people’s models and data.
  2. Work on the level of characters and don’t do word segmentation at all. This is what I ended up doing (see the sketch after this list), and it’s not too bad, because individual characters do carry semantic meaning. But some words are unrelated to their character meanings, like transliterations of foreign words.
  3. Do some kind of segment alignment using Levenshtein distance — see the appendix of this paper by Tenney et al. (2019). I’ve never tried this method.
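
Here is the sketch referenced in option 2: a minimal example of projecting word-level tags down to characters, so that the training targets no longer depend on any particular segmentation. The segmentation and tags below are made up for illustration.

```python
# Sketch of working at the character level: copy each word-level tag onto all
# of that word's characters, so any segmentation scheme yields the same targets.
words = ["我", "受不了", "外国人", "开车"]
word_tags = ["PRON", "VERB", "NOUN", "VERB"]   # illustrative tags

char_tags = []
for word, tag in zip(words, word_tags):
    char_tags.extend((char, tag) for char in word)

print(char_tags)
# [('我', 'PRON'), ('受', 'VERB'), ('不', 'VERB'), ('了', 'VERB'),
#  ('外', 'NOUN'), ('国', 'NOUN'), ('人', 'NOUN'), ('开', 'VERB'), ('车', 'VERB')]
```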

One final thought: the non-ASCII Chinese characters surprisingly never caused any difficulties for me. I would’ve expected to run into encoding problems occasionally, as I had in the past, but never had any character encoding problems with Python 3.

References

  1. Haspelmath, Martin. “The indeterminacy of word segmentation and the nature of morphology and syntax.” Folia linguistica 45.1 (2011): 31-80.
  2. Qi, Peng, et al. “Stanza: A python natural language processing toolkit for many human languages.” Association for Computational Linguistics (ACL) System Demonstrations. 2020.
  3. Tenney, Ian, et al. “What do you learn from context? Probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019.
  4. Wang, Shichang, et al. “Word intuition agreement among Chinese speakers: a Mechanical Turk-based study.” Lingua Sinica 3.1 (2017): 13.

Representation Learning for Discovering Phonemic Tone Contours

My paper titled “Representation Learning for Discovering Phonemic Tone Contours” was recently presented at the SIGMORPHON workshop, held concurrently with ACL 2020. This is joint work with Jing Yi Xie and Frank Rudzicz.

Problem: Can an algorithm learn the shapes of phonemic tones in a tonal language, given a list of spoken words?

Answer: We train a convolutional autoencoder to learn a representation for each contour, then use the mean shift algorithm to find clusters in the latent space.


By feeding the centers of each cluster into the decoder, we produce a prototypical contour that represents each cluster. Here are the results for Mandarin and Cantonese.
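
For readers who want a concrete picture of the pipeline, here is a condensed sketch: a small 1-D convolutional autoencoder over fixed-length pitch contours, mean shift clustering on the latent codes, and decoding of the cluster centers into prototype contours. The shapes, layer sizes, and training setup are illustrative stand-ins rather than the architecture from the paper, and random noise stands in for real f0 tracks.

```python
# Condensed sketch: conv autoencoder -> mean shift on latents -> decode centers.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import MeanShift

CONTOUR_LEN = 32   # pitch samples per syllable, after length normalization

class ContourAutoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(8 * CONTOUR_LEN, latent_dim),
        )
        self.decoder = nn.Linear(latent_dim, CONTOUR_LEN)

    def forward(self, x):                      # x: (batch, 1, CONTOUR_LEN)
        z = self.encoder(x)
        return self.decoder(z), z

# Stand-in data: random contours in place of real f0 tracks.
contours = torch.randn(500, 1, CONTOUR_LEN)

model = ContourAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):
    recon, _ = model(contours)
    loss = F.mse_loss(recon, contours.squeeze(1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Cluster the latent codes with mean shift.
with torch.no_grad():
    _, latents = model(contours)
clusters = MeanShift().fit(latents.numpy())

# Decode each cluster center into a prototypical contour.
centers = torch.tensor(clusters.cluster_centers_, dtype=torch.float32)
with torch.no_grad():
    prototypes = model.decoder(centers).numpy()
print(prototypes.shape)   # (num_clusters, CONTOUR_LEN)
```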


We evaluate on mutual information with the ground truth tones, and the method is partially successful, but contextual effects and allophonic variation present considerable difficulties.
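
As a rough sketch of that evaluation step, the cluster assignments can be compared against ground-truth tone labels with a mutual-information-based score from scikit-learn; the labels below are made up.

```python
# Sketch of the evaluation: compare cluster assignments against ground-truth
# tone labels using normalized mutual information. Labels here are invented.
from sklearn.metrics import normalized_mutual_info_score

true_tones = [1, 1, 2, 3, 4, 2, 3, 4, 1, 2]    # annotated tone categories
cluster_ids = [0, 0, 1, 2, 2, 1, 2, 2, 0, 1]   # labels from mean shift

print(normalized_mutual_info_score(true_tones, cluster_ids))
```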

For the full details, read my paper here!