Non-technical challenges of medical NLP research

Machine learning has recently made a lot of headlines in healthcare applications, like identifying tumors from images, or technology for personalized treatment. In this post, I describe my experiences as a healthcare ML researcher: the difficulties in doing research in this field, as well as reasons for optimism.

My research group focuses on applications of NLP to healthcare. For a year or two, I was involved in a number of projects in this area (specifically, detecting dementia through speech). From my own projects and from talking to others in my research group, I noticed that a few recurring difficulties frequently came up in healthcare NLP research — things that rarely occurred in other branches of ML. These are non-technical challenges that take up time and impede progress, and generally considered not very interesting to solve. I’ll give some examples of what I mean.

Collecting datasets is hard. Any time you want to do anything involving patient data, you have to undergo a lengthy ethics approval process. Even with something as innocent as an anonymous online questionnaire, there is a mandatory review by an ethics board before the experiment is allowed to proceed. As a result, most datasets in healthcare ML are small: a few dozen patient samples is common, and you’re lucky to have more than a hundred samples to work with. This is tiny compared to other areas of ML where you can easily find thousands of samples.

In my master’s research project, where I studied dementia detection from speech, the largest available corpus had about 300 patients, and other corpora had less than 100. This constrained the types of experiments that were possible. Prior work in this area used a lot of feature engineering approaches, because it was commonly believed that you needed at least a few thousand examples to do deep learning. With less data than that, deep learning would just learn to overfit.

Even after the data has been collected, it is difficult to share with others. This is again due to the conservative ethics processes required to share data. Data transfer agreements need to be reviewed and signed, and in some cases, data must remain physically on servers in a particular hospital. Researchers rarely open-source their code along with the paper, since there’s no point of doing so without giving access to the data; this makes it hard to reproduce any experimental results.

Medical data is messy. Data access issues aside, healthcare NLP has some of the messiest datasets in machine learning. Many datasets in ML are carefully constructed and annotated for the purpose of research, but this is not the case for medical data. Instead, data comes from real patients and hospitals, which are full of shorthand abbreviations of medical terms written by doctors, which mean different things depending on context. Unsurprisingly, many NLP techniques fail to work. Missing values and otherwise unreliable data are common, so a lot of not-so-glamorous data preprocessing is often needed.


I’ve so far painted a bleak picture of medical NLP, but I don’t want to give off such a negative image of my field. In the second part of this post, I give some counter-arguments to the above points as well as some of the positive aspects of research.

On difficulties in data access. There are good reasons for caution — patient data is sensitive and real people can be harmed if the data falls into the wrong hands. Even after removing personally identifiable information, there’s still a risk of a malicious actor deanonymizing the data and extracting information that’s not intended to be made public.

The situation is improving though. The community recognizes the need to share clinical data, to strike a balance between protecting patient privacy and allowing research. There have been efforts like the relatively open MIMIC critical care database to promote more collaborative research.

On small / messy datasets. With every challenge, there comes an opportunity. In fact, my own master’s research was driven by lack of data. I was trying to extend dementia detection to Chinese, but there wasn’t much data available. So I proposed a way to transfer knowledge from the much larger English dataset to Chinese, and got a conference paper and a master’s thesis from it. If it wasn’t for lack of data, then you could’ve just taken the existing algorithm and applied it to Chinese, which wouldn’t be as interesting.

Also, deep learning in NLP has recently gotten a lot better at learning from small datasets. Other research groups have had some success on the same dementia detection task using deep learning. With new papers every week on few-shot learning, one-shot learning, transfer learning, etc, small datasets may not be too much of a limitation.

Same applies to messy data, missing values, label leakage, etc. I’ll refer to this survey paper for the details, but the take-away is that these shouldn’t be thought of as barriers, but as opportunities to make a research contribution.

In summary, as a healthcare NLP researcher, you have to deal with difficulties that other machine learning researchers don’t have. However, you also have the unique opportunity to use your abilities to help sick and vulnerable people. For many people, this is an important consideration — if this is something you care deeply about, then maybe medical NLP research is right for you.

Thanks to Elaine Y. and Chloe P. for their comments on drafts of this post.