How I write NLP papers in 8 months: idea to publication

Earlier this month I defended my PhD thesis. The bulk of the work of a PhD consists of producing peer-reviewed publications, and in my field, three first-author publications on a coherent topic in top-tier venues (EMNLP, ACL, NAACL, TACL, etc) is typically sufficient to earn a PhD.

Reflecting on my process for producing my three papers, I noticed that all of them took roughly 8-10 months from the initial idea to submitting the 8-page manuscript to a conference. Furthermore, each paper followed a similar trajectory as it evolved from a vague idea into a concrete research contribution.

This is definitely not the only way to write papers, but it has worked well for me. All three of my papers were accepted into main conferences on the first attempt, with reviewer scores consistently between 3.5 and 4 (on a 1-5 rating scale). Therefore, I think it’s a good method to iterate on research projects and reliably produce strong NLP papers.

Month 1: Literature review

When I begin a project, I typically only have a vague idea of the direction or some phenomenon I want to explore. Since I don’t have a good understanding yet of the field, it helps to spend some time reading the prior literature at this stage, instead of diving into experiments right away. This is my focus for the first 3-4 weeks of a project.

See my blog post here for a guide on how to find and read research papers.

By the end of the first month, I’ve typically read about 50 papers and have a good understanding of:

  • The theoretical frameworks and assumptions related to the problem.
  • Recent work in this area, and what datasets and methodologies they use to solve it.
  • Major challenges and open problems in this area, and why they remain difficult to solve.

At this point, after familiarizing myself with the recent work, I can usually identify some gaps that have not yet been addressed in the literature, and I have some ideas of how to begin solving them. This is when I begin running experiments – these initial experiments almost never make it into the final paper, but they allow me to start building an intuition for the problem and to become familiar with commonly used datasets and techniques.

Months 2-4: Exploration

The next phase of the project is iterative exploration and experimentation. For the next two or three months, I run experiments that build on top of each other, using lessons learned from one experiment to guide the design of the next. Most of these experiments will be “failures” – inconclusive for various reasons:

  • I discover that some theoretical assumptions turn out to be invalid, rendering the experiment pointless.
  • After running the experiment, I find that the results are not very interesting: they are explainable by something obvious, or there are no consistent patterns.
  • I try to run the experiment, but find that it’s impossible because the dataset is missing some crucial feature, or my tools are not powerful enough.

One thing you should never do is decide beforehand what evidence you want to find, and then run experiments until you find it. That would be bad science. So in this context, an experiment failure means it didn’t produce a result that’s interesting enough to include in a paper. An experiment might produce results that are different from what I expected, while being a very interesting and successful experiment.

During this phase, I read papers in a different way from the first month. Rather than casting a wide net, my reading is more focused on understanding details so that I can execute a specific experiment correctly.

After a few months of iteration and doing about 15-20 experiments, I have at least 2 or 3 with sufficiently interesting or cool results that I want to share with the community. These experiments will form the core of my new paper, but they’re not enough on their own: I still have to tie them together into a single coherent narrative, and shore up all the weaknesses that would be mercilessly attacked during peer review.

Month 5: Telling a story

Before you can write a paper, you have to decide on a framing or narrative that aligns with your experiments. If this is not done correctly, the reader will be confused; your experiments will feel incoherent and unmotivated.

The same data and experiments can be framed in many different ways. Is our focus on evaluating several NLP models on how well they represent some linguistic property? Or are we using evidence from NLP models to argue for some theory of human language learnability? Or perhaps our main contribution is releasing a novel dataset and annotation schema?

To decide on a framing, we must consider several possible narratives and pick the one that best aligns holistically with our core experiments. We’ll need to provide a justification for it, which is usually not the original reason we did the experiment (since the exploration phase is so haphazard).

The product of this narrative brainstorming is a draft of an abstract for the final paper, containing the main results and the motivation for them. Writing the abstract first clarifies the overall scientific goal and structure of the paper. It also gives everyone an idea of the gaps in the narrative and what experiments are still needed to fill them. Around this time, I decide on a conference submission date to aim for, around 3-4 months down the road.

Months 6-7: Make it stronger

Now we are on the home stretch of the project: we have decided on the core contributions, and we just have to patiently take the time to make them as strong as possible. I make a list of experiments to be done to strengthen the results, like running them with different models and with datasets in other languages, ablation studies, controlling for potential confounding variables, etc.

Essentially I look at my own paper from the perspective of a reviewer, asking myself: “why would I reject this paper?” My co-authors will help out by pointing out flaws in my reasoning and methodology, anticipating problems in advance of the official peer review and giving me ample time to fix them. The paper probably has a decent chance of acceptance without all this extra work, but it is worth it because it lowers the risk of having the paper rejected and needing to resubmit, which would waste several valuable months for everyone.

Month 8: Paper writing

It takes me about 3 weeks to actually write the paper. I like to freeze all the code and experimental results one month before the deadline, so that during the last month, I can focus on presentation and writing. When all the tables and figures are in place, it is a lot easier to write the paper without having to worry about which parts will need to be updated when new results materialize.

The experimental methodology and results sections are the easiest to write, since that’s what’s been on my mind for the past few months. The introduction is the most difficult, since I have to take a step back and think about how to present the work to someone who is seeing it for the first time; but it is the first thing the reader sees, and it’s perhaps the most important part of the whole paper.

A week before the deadline, I have a reasonably good first draft. After sending it out to my co-authors one last time to improve the final presentation, I’m ready to press the submit button. Now we cross our fingers and wait eagerly for the acceptance email!

Parting advice

There were two things that helped me tremendously during my PhD: reading lots of NLP papers, and having a good committee.

Reading a lot of NLP papers is really useful because it helps you build an intuition of what good and bad papers look like. Early in my research career, I participated in a lot of paper reading groups, where we discussed recent papers (both published and arXiv preprints) and talked about which parts were strong or weak, and why. I noticed recurring problems and how strong papers handled them, so that I could incorporate the same solutions in my own papers.

This is sort of like training a GAN (generative adversarial network). I trained myself to be a good discriminator of good vs bad papers, and this is useful for my generator too: when my paper passes my own discriminator, it usually passes peer review as well.

Another thing that helped me was having a solid committee of experts from different academic backgrounds. This turned out to be very useful because they often pointed out weaknesses and faulty assumptions that I hadn’t noticed, even if they didn’t have a solution for how to fix these problems. This way there were no surprises when the peer reviews came out: all the weaknesses had already been pointed out.

For the PhD students reading this, I have two pieces of advice. First, read lots of papers to fine-tune your discriminator. Second, get feedback on your papers as often and as early as possible. It is a lot less painful to hear criticisms while you’re still in the exploratory phase of the project than to get the same feedback from reviewers after you’ve submitted the paper.

I am looking for a position as an NLP research scientist or machine learning engineer. Here is my CV. I can work in-person from Vancouver, Canada or remotely. If your company is hiring, please leave me a message!

Virtual NLP Conferences: The Good and the Bad

We are now almost two years into the COVID-19 pandemic: international travel has slowed to a trickle and all machine learning conferences have moved online. By now, most of my PhD research has taken place during the pandemic, and I’ve presented it at four online conferences (ACL ’20, EMNLP ’20, NAACL ’21, and ACL ’21). I’ve also had the fortune to attend a few in-person conferences before the pandemic, so in this post I’ll compare the advantages of each format.

Travel

One of the perks of choosing grad school for me was the chance to travel to conferences to present my work. Typically the first author of each paper gets funded by the university to present their work. The university pays for the conference fees, hotels, and airfare, which adds up to several thousand dollars per conference. With all conferences online, the school only pays for the conference fee (a trivial amount, about $25-$100). Effectively, a major part of grad student compensation has been cut without replacing it with something else.

There are some advantages though. Before the pandemic, it was mandatory to travel to present your work at the conference. This can be at an inconvenient time or location (such as a 20-hour flight to Australia), so I avoided submitting to certain conferences because of this. With virtual conferences, I can submit anywhere without location being a factor.

Another factor is climate change. The IPCC report this year announced that the earth is warming up at an alarming rate, and at an individual level, one of the biggest contributors to greenhouse gas emissions is air travel. Thousands of grad students travelling internationally several times a year adds up to a significant amount of carbon emissions. Therefore, unless there are substantial benefits to meeting in-person, the climate impact alone is probably worth keeping to virtual conferences post-pandemic.

Talks and Posters

Typically in conferences, paper presentations can be oral (a 12-minute talk with 3 minutes of Q/A) or poster (standing beside a poster for 2 hours). Online conferences mimicked this format: oral presentations were done in Zoom calls, while poster sessions were done in Gather Town (a game-like environment where you move around an avatar). Additionally, most conferences got the authors to record their presentations in advance, so the videos were available to watch at any time during the conference and afterwards.

Above: Presenting my poster in Gather Town (ACL 2021)

For me, Gather Town was quite successful at replicating the in-person conference discussion experience. I made a list of people I wanted to talk to, and either attended their poster session if they had one, or logged on to Gather Town several times a day to check if they were online and then went to talk to them. This created an environment where it was easy to enter into spontaneous conversations, without the friction of scheduling a formal Zoom call.

The live oral sessions on Zoom were quite pointless, in my opinion, since the videos were already available to watch asynchronously at your own pace. There was no real opportunity for interaction in the 3-minute Q/A period, so this felt like a strictly worse format than just watching the videos (I usually watch these at 2x speed, which is impossible in a live session). Therefore I didn’t attend any of them.

The paper talk videos were by far the most useful feature of online conferences. Watching a 12-minute video is a good way of getting a high-level overview of a paper, and much faster than reading the paper: I typically watch 5 of them in one sitting. They are available after the conference, leaving a valuable resource for posterity. This is one thing we should keep even if we return to in-person conferences: the benefit is high, while the burden to authors is minimal (if they already prepared to give a talk, it is not much additional effort to record a video).

Collaborations

A commonly stated reason in favor of in-person conferences is the argument that face-to-face interaction is good for developing collaborations across universities. In my experience at in-person conferences, while I did talk to people from other universities, this never resulted in any collaborations with them. Other people’s experiences may vary though, if they’re more extroverted than me.

Virtual meetings certainly put a strain on collaborations, but this argument makes more sense for collaborations within an institution. (For example, I haven’t talked to most people in my research group for the last year and a half, and have no idea what they’re working on until I see their paper published.) It probably doesn’t extend to conferences though: one week is not enough time for any serious cross-institution collaboration to happen.

Final Thoughts

Like many others, I regret having missed out on so many conferences, which used to be a quintessential part of the PhD experience. There is a positive side: virtual conferences are a lot more accessible, making it feasible to attend conferences in adjacent research areas (like ICML, CVPR), since the $50 registration fee is trivial compared to the cost of travelling internationally.

Many aspects of virtual conferences have worked quite well, so we should consider keeping them virtual after the pandemic. The advantages of time, money, and carbon emissions are significant. However, organizers should embrace the virtual format and not try to mimic a physical conference. There is no good reason to have hours of back-to-back Zoom presentations, when the platform supports uploading videos to play back asynchronously. The virtual conference experience can only get better as organizers learn from past experience and the software continues to improve.

The Efficient Market Hypothesis in Research

A classic economics joke goes like this:

Two economists are walking down a road, when one of them notices a $20 bill on the ground. He turns to his friend and exclaims: “Look, a $20 bill!” The other replies: “Nah, if there were a $20 bill on the ground, someone would’ve picked it up already.”

The economists in the joke believe in the Efficient Market Hypothesis (EMH), which roughly says that financial markets are efficient and there’s no way to “beat the market” by making intelligent trades.

If the EMH was true, then why is there still a trillion-dollar finance industry with active mutual funds and hedge funds? In reality, the EMH is not a universal law of economics (like the law of gravity), but more like an approximation. There may exist inefficiencies in markets where stock prices follow a predictable pattern and there is profit to be made (e.g.: stock prices fall when it’s cloudy in New York). However, as soon as someone notices the pattern and starts exploiting it (by making a trading algorithm based on weather data), the inefficiency disappears. The next person will find zero correlation between weather in New York and stock prices.

There is a close parallel in academic research. Here, the “market” is generally efficient: most problems that are solvable are already solved. There are still “inefficiencies”: open problems that can be reasonably solved, and one “exploits” them by solving it and publishing a paper. Once exploited, it is no longer available: nobody else can publish the same paper solving the same problem.

Where does this leave the EMH? In my view, the EMH is a useful approximation, but its accuracy depends on your skill and expertise. For non-experts, the EMH is pretty much universally true: it’s unlikely that you’ve found an inefficiency that everyone else has missed. For experts, the EMH is less often true: when you’re working in highly specialized areas that only a handful of people understand, you begin to notice more inefficiencies that are still unexploited.

A large inefficiency is like a $20 bill on the ground: it gets picked up very quickly. An example of this is when a new tool is invented that can straightforwardly be applied to a wide range of problems. When the BERT model was released in 2018, breaking the state-of-the-art on all the NLP benchmarks, there was instantly an explosion of activity as researchers raced to apply it to all the important NLP problems and be the first to publish. By mid-2019, all the straightforward applications of BERT were done, and the $20 bill was no more.

Above: Representation of the EMH in research. To outsiders, there are no inefficiencies; to experts, inefficiencies exist briefly before they are exploited. Loosely inspired by this diagram by Matt Might.

The EMH implies various heuristics that I use to guide my daily research. If I have a research idea that’s relatively obvious, and the tools to attack it have existed for a while (say, >= 3 years), then probably one of the following is true:

  1. Someone already published it 3 years ago.
  2. The idea doesn’t work very well.
  3. The result is not that useful or interesting.
  4. One of the basic assumptions is wrong, so the idea doesn’t even make sense.
  5. Etc.

Conversely, a research idea is much more likely to be fruitful (i.e., a true inefficiency) if the tools to solve it have only existed for a few months, if it requires data and resources that nobody else has access to, or if it requires a rare combination of insights that conceivably nobody has thought of.

Outside the realm of the known (the red area in my diagram), there are many questions that are unanswerable. These include the hard problems of consciousness and free will, P=NP, etc, or more mundane problems where our current methods are not strong enough. For an outsider, these might seem like inefficiencies, but it would be wise to assume they’re not. The EMH ensures that true inefficiencies are quickly picked up.

To give a more relatable example, take the apps Uber (launched in 2009) and Instagram (launched in 2010). Many of the apps on your phone probably launched around the same time. In order for Uber and Instagram to work, people needed to have smartphones that were connected to the internet, with GPS (for Uber) and decent quality cameras (for Instagram). Neither of these ideas would’ve been possible in 2005, but thanks to the EMH, as soon as smartphone adoption took off, we didn’t have to wait very long to see all the viable use-cases for the new technology to emerge.

How can deep learning be combined with theoretical linguistics?

Natural language processing is mostly done using deep learning and neural networks nowadays. In a typical NLP paper, you might see some Transformer models, some RNNs built using linear algebra and statistics, but very little linguistic theory. Is linguistics irrelevant to NLP now, or can the two fields still contribute to each other?

In a series of articles in the journal Language, Joe Pater discussed the history of neural networks and generative linguistics, and invited experts to give their perspectives on how the two may be combined going forward. I found their discussion very interesting, although a bit long (almost 100 pages). In this blog post, I will give a brief summary of it.

Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion

Research in generative syntax and neural networks began at the same time, in 1957; both were broadly considered part of AI, but the two schools mostly stayed separate, at least for a few decades. In neural network research, Rosenblatt proposed the perceptron learning algorithm and realized that hidden layers were needed to learn XOR, but didn’t know of a procedure to train multi-layer networks (backpropagation hadn’t been invented yet). In generative grammar, Chomsky studied natural languages as formal languages, and proposed controversial transformational rules. Interestingly, both schools faced challenges concerning the learnability of their systems.

Above: Frank Rosenblatt and Noam Chomsky, two pioneers of neural networks and generative grammar, respectively.

The first time these two schools were combined was in 1986, when an RNN was used to learn a probabilistic model of the past tense. This shows that neural networks and generative grammar are not incompatible, and that the dichotomy is a false one. Another way of combining them comes from Harmonic Optimality Theory in theoretical phonology, which extends OT to continuous constraints, with a learning procedure similar to gradient descent.

Neural models have proved capable of learning a remarkable amount of syntax, despite having far fewer structural priors than Chomsky’s model of Universal Grammar. At the same time, they fail on certain complex examples, so maybe it’s time to add back some linguistic structure.

Linzen’s Response

Linguistics and DL can be combined in two ways. First, linguistics is useful for constructing minimal pairs for evaluating neural models, when such examples are hard to find in natural corpora. Second, neural models can be quickly trained on data, so they’re useful for testing learnability. By comparing human language acquisition data with various neural architectures, we can gain insights about how human language acquisition works. (But I’m not sure how such a deduction would logically work.)
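
To make the first point more concrete, here is a rough sketch of my own (not from the article) of how a linguistically constructed minimal pair can probe a masked language model, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # A subject-verb agreement minimal pair with an intervening distractor noun.
    sentence = "The keys to the cabinet [MASK] on the table."
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]

    for word in ["are", "is"]:   # grammatical vs. ungrammatical continuation
        token_id = tokenizer.convert_tokens_to_ids(word)
        print(word, logits[token_id].item())

If the model scores “are” above “is”, it has respected subject-verb agreement across the distractor noun, at least for this one item; a real evaluation would aggregate over many such constructed pairs.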

Potts’s Response

Formal semantics has not had much contact with DL, as formal semantics is based around higher-order logic, while deep learning is based on matrices of numbers. Socher did some work on representing tree-based semantic composition as operations on vectors.

Above: Formal semantics uses higher-order logic to build representations of meaning. Is this compatible with deep learning?

In several ways, semanticists make different assumptions from deep learning. Semantics likes to distinguish meaning from use, and to consider compositional meaning separately from pragmatics and context, whereas DL cares most of all about generalization, and has no reason to discard context or separate semantics from pragmatics. Compositional semantics does not try to analyze the meaning of lexical items, leaving them as atoms; DL has word vectors, but linguists criticize that the individual dimensions of word vectors are not easily interpretable.

Rawski and Heinz’s Response

Above: Natural languages exhibit features that span various levels of the Chomsky hierarchy.

The “no free lunch” theorem in machine learning says that you can’t get better performance for free: any gains on some problems must be compensated by decreases in performance on other problems. A model performs well if it has an inductive bias well-suited to the type of problems it is applied to. This is true for neural networks as well, so we need to study their inductive biases: which classes of languages in the Chomsky hierarchy are NNs capable of learning? We must not confuse ignorance of bias with absence of bias.

Berent and Marcus’s Response

There are significant differences between how generative syntax and neural networks view language, which must be resolved before the fields can make progress on integration. The biggest difference is the “algebraic hypothesis”: the assumption that there exist abstract algebraic categories, like Noun, that are distinct from their instances. This allows you to make powerful generalizations using rules that operate on abstract categories. On the other hand, neural models try to process language without structural representations, and this results in failures to generalize.

Dunbar’s Response

The central problem in connecting neural networks to generative grammar is the implementational mapping problem: how do you decide if a neural network N is implementing a linguistic theory T? The physical system might not look anything like the abstract theory, eg: implementing addition can look like squiggles on a piece of paper. Some limited classes of NNs may be mapped to harmonic grammar, but most NNs cannot, and the success criterion is unclear right now. Future work should study this problem.

Pearl’s Response

Neural networks learn language but don’t really try to model human neural processes. This could be an advantage, as neural models might find generalizations and building blocks that a human would never have thought of, and new tools in interpretability can help us discover these building blocks contained within the model.

Launching Lucky’s Bookshelf

I recently launched a new book review website / blog:

https://luckybookshelf.com/

Here, I will post reviews of all the books I read (about one book a week). It has over 100 book reviews already, all the reviews I wrote since 2016. My books cover a wide range of topics, including science, linguistics, economics, philosophy, and more. Come check it out!

The biggest headache with Chinese NLP: indeterminate word segmentation

I’ve had a few opportunities to work with NLP in Chinese. English and Chinese are very different languages, yet generally the same techniques apply to both. But there is one source of frustration that comes up from time to time, and it’s perhaps not what you’d expect.

The difficulty is that Chinese doesn’t put words between spaces. Soallyourwordsarejumbledtogetherlikethis.

“Okay, that’s fine,” you say. “We’ll just have to run a tokenizer to separate apart the words before we do anything else. And here’s a neural network that can do this with 93% accuracy (Qi et al., 2020). That should be good enough, right?”

Well, kind of. Accuracy here isn’t very well-defined, because even native Chinese speakers don’t agree on how to segment words. When you ask two native Chinese speakers to segment a sentence into words, they only agree about 90% of the time (Wang et al., 2017). Chinese has a lot of compound words and multi-word expressions, so there’s no widely accepted definition of what counts as a word. Some examples: 吃饭,外国人,开车,受不了. It is also possible (but rare) for a sentence to have multiple segmentations that mean different things.

Arguably, word boundaries are ill-defined in all languages, not just Chinese. Haspelmath (2011) defined 10 linguistic criteria to determine whether something is a word (vs an affix or expression), but it’s hard to come up with anything consistent. Most writing systems put spaces between words, so there’s no confusion. Other than Chinese, only a handful of other languages (Japanese, Vietnamese, Thai, Khmer, Lao, and Burmese) have this problem.

Word segmentation ambiguity causes problems in NLP systems when different components expect different ways of segmenting a sentence. Another way the problem can appear is if the segmentation for some human-annotated data doesn’t match what a model expects.

Here is a more concrete example from one of my projects. I’m trying to get a language model to predict a tag for every word (imagine POS tagging using BERT). The language model uses SentencePiece encoding, so when a word is out-of-vocab, it gets converted into multiple subword tokens.

“expedite ratification of the proposed law”
=> [“expedi”, “-te”, “ratifica”, “-tion”, “of”, “the”, “propose”, “-d”, “law”]

In English, a standard approach is to use the first subword token of every word and ignore the other tokens.
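
Here is a rough sketch of that heuristic (my own illustration; subword_tokenize is a hypothetical stand-in for a real SentencePiece or WordPiece tokenizer):

    # Minimal sketch of the "first subword token per word" heuristic for
    # word-level tagging.
    def first_subword_indices(words, subword_tokenize):
        """Return, for each word, the index of its first subword token in the
        flattened subword sequence; tag predictions are read off these tokens."""
        indices, position = [], 0
        for word in words:
            pieces = subword_tokenize(word)   # e.g. "expedite" -> ["expedi", "-te"]
            indices.append(position)
            position += len(pieces)
        return indices

    words = "expedite ratification of the proposed law".split()
    # Hypothetical subword splits, matching the example above.
    splits = {"expedite": ["expedi", "-te"], "ratification": ["ratifica", "-tion"],
              "of": ["of"], "the": ["the"], "proposed": ["propose", "-d"], "law": ["law"]}
    print(first_subword_indices(words, lambda w: splits[w]))  # [0, 2, 4, 5, 6, 8]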

This doesn’t work in Chinese: because of the word segmentation ambiguity, the tokenizer might produce subword tokens that span across multiple of our words, so the tags no longer line up one-to-one with the words.

So that’s why Chinese is sometimes headache-inducing when you’re doing multilingual NLP. You can work around the problem in a few ways:

  1. Ensure that all parts of the system use a consistent word segmentation scheme. This is easy if you control all the components, but hard when working with other people’s models and data.
  2. Work on the level of characters and don’t do word segmentation at all. This is what I ended up doing, and it’s not too bad, because individual characters do carry semantic meaning. But some words are unrelated to their character meanings, like transliterations of foreign words.
  3. Do some kind of segment alignment using Levenshtein distance — see the appendix of this paper by Tenney et al. (2019). I’ve never tried this method, but I sketch the general idea below.
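
The sketch below is my own illustration of the third option, using character offsets rather than the Levenshtein alignment the paper’s appendix actually uses: map every token in both segmentations to its character span, then match tokens whose spans overlap.

    # Align two different segmentations of the same string by character offsets.
    def char_spans(tokens):
        """Map each token to its (start, end) character offsets."""
        spans, start = [], 0
        for token in tokens:
            spans.append((start, start + len(token)))
            start += len(token)
        return spans

    def align(tokens_a, tokens_b):
        """For each token in segmentation A, list the overlapping tokens in B."""
        spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)
        alignment = []
        for ta, (sa, ea) in zip(tokens_a, spans_a):
            overlaps = [tb for tb, (sb, eb) in zip(tokens_b, spans_b)
                        if sb < ea and sa < eb]
            alignment.append((ta, overlaps))
        return alignment

    # Two plausible segmentations of the same sentence ("I am a foreigner").
    print(align(["我", "是", "外国人"], ["我", "是", "外国", "人"]))
    # [('我', ['我']), ('是', ['是']), ('外国人', ['外国', '人'])]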

One final thought: the non-ASCII Chinese characters surprisingly never caused any difficulties for me. I would’ve expected to run into encoding problems occasionally, as I had in the past, but never had any character encoding problems with Python 3.

References

  1. Haspelmath, Martin. “The indeterminacy of word segmentation and the nature of morphology and syntax.” Folia linguistica 45.1 (2011): 31-80.
  2. Qi, Peng, et al. “Stanza: A python natural language processing toolkit for many human languages.” Association for Computational Linguistics (ACL) System Demonstrations. 2020.
  3. Tenney, Ian, et al. “What do you learn from context? Probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019.
  4. Wang, Shichang, et al. “Word intuition agreement among Chinese speakers: a Mechanical Turk-based study.” Lingua Sinica 3.1 (2017): 13.

Representation Learning for Discovering Phonemic Tone Contours

My paper titled “Representation Learning for Discovering Phonemic Tone Contours” was recently presented at the SIGMORPHON workshop, held concurrently with ACL 2020. This is joint work with Jing Yi Xie and Frank Rudzicz.

Problem: Can an algorithm learn the shapes of phonemic tones in a tonal language, given a list of spoken words?

Answer: We train a convolutional autoencoder to learn a representation for each contour, then use the mean shift algorithm to find clusters in the latent space.


By feeding the centers of each cluster into the decoder, we produce a prototypical contour that represents each cluster. Here are the results for Mandarin and Cantonese.
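
Here is a minimal sketch of the clustering step, assuming the autoencoder has already been trained; random vectors stand in for the learned latent representations:

    # Sketch of the clustering step: mean shift on latent contour representations.
    import numpy as np
    from sklearn.cluster import MeanShift

    latent = np.random.randn(500, 8)   # stand-in for encoder outputs, shape (n_words, latent_dim)
    ms = MeanShift().fit(latent)       # mean shift does not need the number of clusters in advance
    centers = ms.cluster_centers_      # one center per discovered tone cluster
    # Each center is then passed through the decoder to produce a prototypical
    # pitch contour for its cluster.
    print(centers.shape)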


We evaluate on mutual information with the ground truth tones, and the method is partially successful, but contextual effects and allophonic variation present considerable difficulties.

For the full details, read my paper here!

I didn’t break the bed, the bed broke: Exploring semantic roles with VerbNet / FrameNet

Some time ago, my bed fell apart, and I entered into a dispute with my landlord. “You broke the bed,” he insisted, “so you will have to pay for a new one.”

Being a poor grad student, I wasn’t about to let him have his way. “No, I didn’t break the bed,” I replied. “The bed broke.”


Above: My sad and broken bed. Did it break, or did I break it?

What am I implying here? It’s interesting how this argument relies on a crucial semantic difference between the two sentences:

  1. I broke the bed
  2. The bed broke

The difference is that (1) means I caused the bed to break (eg: by jumping on it), whereas (2) means the bed broke by itself (eg: through normal wear and tear).

This is intuitive to a native speaker, but maybe not so obvious why. One might guess from this example that any intransitive verb when used transitively (“X VERBed Y”) always means “X caused Y to VERB“. But this is not the case: consider the following pair of sentences:

  1. I attacked the bear
  2. The bear attacked

Even though the syntax is identical to the previous example, the semantic structure is quite different. Unlike in the bed example, sentence (1) cannot possibly mean “I caused the bear to attack”. In (1), the bear is the one being attacked, while in (2), the bear is the one attacking something.

Above: Semantic roles for the verbs “break” and “attack”.

Sentences which are very similar syntactically can have different structures semantically. To address this, linguists assign semantic roles to the arguments of verbs. There are many semantic roles (and nobody agrees on a precise list of them), but two of the most fundamental ones are Agent and Patient.

  • Agent: entity that intentionally performs an action.
  • Patient: entity that changes state as a result of an action.
  • Many more.

The way that a verb’s syntactic arguments (eg: Subject and Object) line up with its semantic arguments (eg: Agent and Patient) is called the verb’s argument structure. Note that an agent is not simply the subject of a verb: for example, in “the bed broke“, the bed is syntactically a subject but is semantically a patient, not an agent.

Computational linguists have created several corpora to make this information accessible to computers. Two of these corpora are VerbNet and FrameNet. Let’s see how a computer would be able to understand “I didn’t break the bed; the bed broke” using these corpora.


Above: Excerpt from VerbNet entry for the verb “break”.

VerbNet is a database of verbs, containing syntactic patterns where the verb can be used. Each entry contains a mapping from syntactic positions to semantic roles, and restrictions on the arguments. The first entry for “break” has the transitive form: “Tony broke the window“.

Looking at the semantics, you can conclude that: (1) the agent “Tony” must have caused the breaking event, (2) something must have made contact with the window during this event, (3) the window must have its material integrity degraded as a result, and (4) the window must be a physical object. In the intransitive usage, the semantics is simpler: there is no agent that caused the event, and no instrument that made contact during the event.

The word “break” can take arguments in other ways, not just transitive and intransitive. VerbNet lists 10 different patterns for this word, such as “Tony broke the piggy bank open with a hammer“. This sentence contains a result (open), and also an instrument (a hammer). The entry for “break” also groups together a list of words like “fracture”, “rip”, “shatter”, etc, that have similar semantic patterns as “break”.


Above: Excerpt from FrameNet entry for the verb “break”.

FrameNet is a similar database, but based on frame semantics. The idea is that in order to define a concept, you have to define it in terms of other concepts, and it’s hard to avoid a cycle in the definition graph. Instead, it’s sometimes easier to define a whole semantic frame at once, which describes a conceptual situation with many different participants. The frame then defines each participant by what role they play in the situation.

The word “break” is contained in the frame called “render nonfunctional“. In this frame, an agent affects an artifact so that it’s no longer capable of performing its function. The core (semantically obligatory) arguments are the agent and the artifact. There are a bunch of optional non-core arguments, like the manner that the event happened, the reason that the agent broke the artifact, the time and place it happened, and so on. FrameNet tries to make explicit all of the common-sense world knowledge that you need to understand the meaning of an event.

Compared to VerbNet, FrameNet is less concerned with the syntax of verbs: for instance, it does not mention that “break” can be used intransitively. Also, it has more fine-grained categories of semantic roles, and contains a description in English (rather than VerbNet’s predicate logic) of how each semantic argument participates in the frame.
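
As a side note, both resources can also be queried programmatically. Here is a small sketch using NLTK’s corpus readers (an assumption on my part; this isn’t how the excerpts above were produced, and the verbnet and framenet_v17 corpora need to be downloaded with nltk.download first):

    # Querying VerbNet and FrameNet for the verb "break" through NLTK.
    from nltk.corpus import verbnet, framenet as fn

    # VerbNet: which classes contain "break", and which verbs share those classes?
    for classid in verbnet.classids('break'):
        print(classid, verbnet.lemmas(classid))

    # FrameNet: which frames can "break" evoke, and what are their frame elements?
    for frame in fn.frames_by_lemma(r'(?i)^break\.v'):
        print(frame.name, sorted(frame.FE.keys()))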

An open question is: how can computers use VerbNet and FrameNet to understand language? Nowadays, deep learning has come to dominate NLP research, so that VerbNet and FrameNet are often seen as relics of a past era, when people still used rule-based systems to do NLP. It turned out to be hard to use VerbNet and FrameNet to make computers do useful tasks.

But recently, the NLP community has been realizing that deep learning has limitations when it comes to common-sense reasoning, which you can’t solve just by adding more layers onto BERT and feeding it more data. So maybe deep learning systems can benefit from these lexical semantic resources.

Edge probing BERT and other language models

Recently, there has been a lot of research in the field of “BERTology”: investigating what aspects of language BERT does and doesn’t understand, using probing techniques. In this post, I will describe the “edge probing” technique developed by Tenney et al., in two papers titled “What do you learn from context? Probing for sentence structure in contextualized word representations” and “BERT Rediscovers the Classical NLP Pipeline“. On my first read through these papers, the method didn’t seem very complicated, and the authors only spend a paragraph explaining how it works. But upon closer look, the method is actually nontrivial, and took me some time to understand. The details are there, but hidden in an appendix in the first of the two papers.

The setup for edge probing is as follows: you have some classification task that takes a sentence and a span of consecutive tokens within the sentence, and produces an N-way classification. This can be generalized to two spans in the sentence, but we’ll focus on the single-span case. The tasks cover a range of syntactic and semantic functions: part-of-speech tagging, dependency parsing, coreference resolution, and so on.


Above: Examples of tasks where edge probing may be used. Table taken from Tenney (2019a).

Let’s go through the steps of how edge probing works. Suppose we want to probe for which parts of BERT-BASE contain the most information about named entity recognition (NER). In this NER setup, we’re given the named entity spans and only need to classify which type of entity each one is (e.g.: person, organization, location, etc). The first step is to feed the sentence through BERT, giving 12 layers of activations, each of which contains a 768-dimensional vector for every token.


Above: Edge probing example for NER.

The probing model has several stages:

  1. Mix: learn a task-specific linear combination of the layers, giving a single 768-dimensional vector for each span token. The weights that are learned indicate how much useful information for the task is contained in each layer.
  2. Projection: learn a linear mapping from 768 down to 256 dimensions.
  3. Self-attention pooling: learn a function to generate a scalar weight for each span vector. Then, we normalize the weights to sum up to 1, and take a weighted average. The purpose of this is to collapse the variable-length sequence of span vectors into a single fixed-length 256-dimensional vector.
  4. Feedforward NN: learn a multi-layer perceptron classifier with 2 hidden layers.

For the two-span case, they use two separate sets of weights for the mix, projection, and self-attention steps. Then, the feedforward neural network takes a concatenated 512-dimensional vector instead of a 256-dimensional vector.
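
Putting the single-span pieces together, here is a minimal PyTorch sketch of the probe as I understand it (my own reconstruction of the description above; the exact hyperparameters and details in Tenney et al. may differ):

    import torch
    import torch.nn as nn

    class EdgeProbe(nn.Module):
        """Single-span edge probe: scalar mix of layers -> projection ->
        self-attention pooling over the span -> 2-hidden-layer MLP classifier."""
        def __init__(self, n_layers=12, hidden_dim=768, proj_dim=256, n_classes=3):
            super().__init__()
            self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # step 1: mix weights
            self.proj = nn.Linear(hidden_dim, proj_dim)               # step 2: projection
            self.attn_score = nn.Linear(proj_dim, 1)                  # step 3: attention scores
            self.classifier = nn.Sequential(                          # step 4: feedforward NN
                nn.Linear(proj_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, n_classes),
            )

        def forward(self, layer_acts, span):
            # layer_acts: (n_layers, seq_len, hidden_dim) frozen BERT activations
            # span: (start, end) token indices, end exclusive
            mix = torch.softmax(self.layer_weights, dim=0)           # task-specific layer mix
            mixed = torch.einsum("l,lsh->sh", mix, layer_acts)       # (seq_len, hidden_dim)
            span_vecs = self.proj(mixed[span[0]:span[1]])            # (span_len, proj_dim)
            attn = torch.softmax(self.attn_score(span_vecs), dim=0)  # (span_len, 1), sums to 1
            pooled = (attn * span_vecs).sum(dim=0)                   # fixed-length (proj_dim,)
            return self.classifier(pooled)                           # (n_classes,) logits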

The edge probing model needs to be trained on a dataset of classification instances. The probe has weights that are initialized randomly and trained using gradient descent, but the BERT weights are kept constant while the probe is being trained.
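
A minimal sketch of that training setup (again my own illustration, reusing the EdgeProbe class from above): freeze the encoder, and give the optimizer only the probe’s parameters.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    for param in bert.parameters():
        param.requires_grad = False                  # BERT weights stay constant

    probe = EdgeProbe()                              # from the sketch above
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    # One toy training step: classify the span covering the WordPiece tokens of "New York".
    inputs = tokenizer("She moved from Boston to New York", return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).hidden_states        # tuple: embedding layer + 12 layers
    acts = torch.stack(hidden[1:]).squeeze(1)        # (12, seq_len, 768)
    loss = loss_fn(probe(acts, (6, 8)).unsqueeze(0), torch.tensor([0]))  # gold label id 0
    loss.backward()
    optimizer.step()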

This setup was more sophisticated than it looked at first glance, and it turns out the probe itself is quite powerful. In one of the experiments, the authors found that even with randomized input, the probe was able to get over 70% of the full model’s performance!

Edge probing is not the only way of probing deep language models. Other probing experiments (Liu et al., 2019) used a simple linear classifier: this is better for measuring what information can easily be recovered from representations.

References

  1. Tenney, Ian, et al. “What do you learn from context? probing for sentence structure in contextualized word representations.” International Conference on Learning Representations. 2019a.
  2. Tenney, Ian, Dipanjan Das, and Ellie Pavlick. “BERT Rediscovers the Classical NLP Pipeline.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019b.
  3. Liu, Nelson F., et al. “Linguistic Knowledge and Transferability of Contextual Representations.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

What if the government hadn’t released any Coronavirus economic stimulus?

It is March 23, 2020. After a month of testing fiascos, COVID-19 is ravaging the United States, with 40,000 cases in the US and 360,000 worldwide. There is a growing sense of panic as cities begin to lock down. Market circuit breakers have triggered four times in quick succession, with the stock market losing 30% of its value in mere weeks. There’s no sign that the worst is over.

Above: Global Coronavirus stats on March 23, 2020, when the S&P 500 reached its lowest point during the pandemic (Source).

With businesses across the country closed and millions out of work, it’s clear that a massive financial stimulus is needed to prevent a total economic collapse. However, Congress is divided and unable to pass the bill. Even when urgent action is needed, they squabble and debate over minor details, refusing to come to a compromise. The president denies that there’s any need for action. Both the Democrats and the Republicans are willing to do anything to prevent the other side from scoring a victory. The government is in gridlock.

Let the businesses fail, they say. Don’t bail them out, they took the risk when the times were good, now you reap what you sow. Let them go bankrupt, punish the executives taking millions of dollars of bonuses. Let the free market do its job, after all, they can always start new businesses once this is all over.

April comes without any help from the government. There are massive layoffs across all sectors of the economy as companies see their revenues drop to a fraction of normal levels and lay off employees to try to preserve their cash. The retail and travel sectors are the most heavily affected, but soon all companies are affected, since people are hesitant to spend money. Unemployment numbers skyrocket to levels even greater than during the Great Depression.

Without a job, millions of people miss their rent payments, instead saving their money for food and essential items. Restaurants and other small businesses shut down. When people and businesses cannot pay rent, their landlords cannot pay the mortgages that they owe to the bank. A few small banks go bankrupt, and Wall Street waits anxiously for a government bailout. But unlike 2008, the state is in a deadlock, and there is no bailout coming. In 2020, no bank is too big to fail.

Each bank that goes down takes another bank with it, until there is soon a cascading domino effect of bank failures. Everyone rushes to withdraw cash from their checking accounts before the bank collapses, which of course makes matters worse. Those too late to withdraw their cash lose their savings. This is devastating for businesses: even those that escaped the pandemic cannot escape systemic bank failure. Companies have no money in the bank to pay suppliers or make payroll, and thus thousands of companies go bankrupt overnight.

Across the nation, people are angry at the government’s inaction, and take to the streets in protest. Having depleted their savings, some rob and steal from grocery stores to avoid starvation. The government finally steps in and deploys the military to keep order in the cities. They arrange for emergency supplies, enough to keep everybody fed, but just barely.

The lockdown lasts a few more months, and the virus is finally under control. Everyone is free to go back to work, but the problem is there are no jobs to go back to. In the process of all the biggest corporations going bankrupt, society has lost its complex network of dependencies and organizational knowledge. It only takes a day to lay off 100,000 employees, but to build up this structure from scratch will take decades.

A new president is elected, but it is too late: the damage has been done and cannot be reversed. The economy slowly recovers, but with less efficiency than before and with workers employed in less productive roles; the loss of productivity means that everyone enjoys a lower standard of living. Five years later, the virus is long gone, but the economy is nowhere close to its original state. By then, China has emerged as the new dominant world power. The year 2020 goes down in history as a year of failure, where through inaction, a temporary health crisis led to societal collapse.


In our present timeline, fortunately, none of the above actually happened. The Democrats and Republicans put aside their differences and, on March 25, swiftly passed a $2 trillion economic stimulus. The stock market immediately rebounded.

There was a period in March when it seemed the government was in gridlock, and it wasn’t clear whether the US was politically capable of passing such a large stimulus bill. Was an economic collapse ever likely? Not really: no reasonable government would have allowed all of the banks to fail, so we would more likely have had a recession than a total collapse. Banks did fail during the Great Depression, but macroeconomic theory was in its infancy then, and such mistakes would not have been repeated today. Still, this is the closest we’ve come to an economic collapse in a long time, and it’s fun to speculate about what it would have been like.