The result is publication bias — if we only publish the results of experiments that succeed, even statistically significant results could be due to random chance, rather than anything actually significant happening. Many areas of science are facing a replication crisis, where published research cannot be replicated.
There is some community discussion of encouraging more negative paper submissions, but as of now, negative results are rarely publishable. If you attempt an experiment but don’t get the results you expected, your best hope is to try a bunch of variations of the experiment until you get some positive result (perhaps on a special case of the problem), after which you pretend the failed experiments never happened. With few exceptions, any positive result is better than a negative result, like “we tried method X on problem Y, and it didn’t work”.
I just described a cynical view of academia, but actually, there’s a good reason why the community prefers positive results. Negative results are simply not very useful, and contribute very little to human knowledge.
Now why is that? When a new paper beats the state-of-the-art results on a popular benchmark, that’s definite proof that the method works. The converse is not true. If your model fails to produce good results, it could be due to a number of reasons:
Above: Only when everything is correct will you get positive results; many things can cause a model to fail. (Source)
So if you try method X on problem Y and it doesn’t work, you gain very little information. In particular, you haven’t proved that method X cannot work. Sure, you found that your specific setup didn’t work, but have you tried making modification Z? Negative results in machine learning are rare because you can’t possibly anticipate all possible variations of your method and convince people that all of them won’t work.
Suppose we’re scientists attending the International Conference of Flying Creatures (ICFC). Somebody mentioned it would be nice if we had dragons. Dragons are useful. You could do all sorts of cool stuff with a dragon, like ride it into battle.
“But wait!” you exclaim: “Dragons don’t exist!”
I glance at you questioningly: “How come? We haven’t found one yet, but we’ll probably find one soon.”
Your intuition tells you dragons shouldn’t exist, but you can’t articulate a convincing argument why. So you go home, and you and your team of grad students labor for a few years and publish a series of papers:
Eventually, the community is satisfied that dragons probably don’t exist, for if they did, someone would have found one by now. But a few scientists still harbor the possibility that there may be dragons lying around in a remote jungle somewhere. We just don’t know for sure.
This remains the state of things for a few years until a colleague publishes a breakthrough result:
You read the paper, and indeed, the logic is impeccable. This settles the matter once and for all: dragons don’t exist (or at least the large, flying sort of dragons).
The research community dislikes negative results because they don’t prove a whole lot — you can have a lot of negative results and still not be sure that the task is impossible. In order for a negative result to be valuable, it needs to present a convincing argument why the task is impossible, and not just a list of experiments that you tried that failed.
This is difficult, but it can be done. Let me give an example from computational linguistics. Recurrent neural networks (RNNs) can, in theory, compute any function defined over a sequence. In practice, however, they had difficulty remembering long-term dependencies. Attempts to train RNNs using gradient descent ran into numerical difficulties known as the vanishing / exploding gradient problem.
Then, Bengio et al. (1994) formulated a mathematical model of an RNN as an iteratively applied function. Using ideas from dynamical systems theory, they showed that as the input sequence gets longer and longer, the result is more and more sensitive to noise. The details are technical, but the gist of it is that under some reasonable assumptions, training RNNs to remember long-term dependencies using gradient descent is effectively impossible. This is a rare example of a negative result in machine learning. It’s an excellent paper and I’d recommend reading it.
Above: A Long Short Term Memory (LSTM) network handles long term dependencies by adding a memory cell (Source)
Soon after the vanishing gradient problem was understood, researchers invented the LSTM (Hochreiter and Schmidhuber, 1997). Since training RNNs with gradient descent was hopeless, they added a ‘latching’ mechanism that allows state to persist through many iterations, thus avoiding the vanishing gradient problem. Unlike plain RNNs, LSTMs can handle long term dependencies and can be trained with gradient descent; they are among the most ubiquitous deep learning architectures in NLP today.
After reading the breakthrough dragon paper, you pace around your office, thinking. Large, flying dragons can’t exist after all, as they would collapse under their own weight — but what about smaller, non-flying dragons? Maybe we’ve been looking for the wrong type of dragons all along? Armed with new knowledge, you embark on a new search…
Above: Komodo Dragon, Indonesia
…and sure enough, you find one
Above: Our home and native land
Let’s find out!
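The task is to classify each pixel of the Canadian flag as either red or white, given limited data points. First, we read in the image with R and take the green channel (index 2), which is close to 0 on red pixels and 1 on white pixels; the red channel itself is close to 1 on both colors, so it cannot tell them apart. The code stores it in a variable named red: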
library(png)
library(ggplot2)
library(xgboost)

img <- readPNG("canada.png")
red <- img[,,2]  # channel 2 is green: near 0 on red pixels, 1 on white pixels
HEIGHT <- dim(red)[1]
WIDTH <- dim(red)[2]
Next, we sample 7500 random points for training. Also, to make it more interesting, each point has a probability 0.05 of flipping to the opposite color.
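ERROR_RATE <- 0.05

get_data_points <- function(N) {
  x <- sample(1:WIDTH, N, replace = TRUE)
  y <- sample(1:HEIGHT, N, replace = TRUE)
  # Label is 1 for red pixels and 0 for white ones. The `red` matrix holds
  # channel 2, which is near 0 on red pixels, so invert it after rounding.
  p <- 1 - round(red[cbind(y, x)])
  # Flip each label with probability ERROR_RATE to simulate noise.
  flips <- sample(c(0, 1), N, replace = TRUE, prob = c(1 - ERROR_RATE, ERROR_RATE))
  p[flips == 1] <- 1 - p[flips == 1]
  data.frame(x = as.numeric(x), y = as.numeric(y), p = p)
}

data <- get_data_points(7500)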
This is what our classifier sees:
Alright, let’s start training.
XGBoost implements gradient boosted decision trees, which were first proposed by Friedman in 1999.
Above: XGBoost learns an ensemble of short decision trees
The output of XGBoost is an ensemble of decision trees. Each individual tree by itself is not very powerful, containing only a few branches. But through gradient boosting, each subsequent tree tries to correct for the mistakes of all the trees before it, and makes the model better. After many iterations, we get a set of decision trees; the sum of all their outputs is our final prediction.
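To make the "each tree corrects the previous ones" idea concrete, here is a minimal Python sketch of boosting with one-split trees (stumps) on a made-up 1D dataset. It is not XGBoost's actual algorithm, which adds regularization and second-order gradient information; the function names, the toy data, and the learning rate of 0.5 are all assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Least-squares decision stump: one threshold on x, one constant per side."""
    best = None
    # Skip the largest x as a threshold, since it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict(stumps, x, learning_rate):
    """The ensemble prediction is the (shrunken) sum of every stump's output."""
    return sum(learning_rate * (left if x <= threshold else right)
               for threshold, left, right in stumps)


def boost(xs, ys, rounds, learning_rate=0.5):
    """Each new stump is fit to the residuals left by all the stumps before it."""
    stumps = []
    for _ in range(rounds):
        residuals = [y - predict(stumps, x, learning_rate) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return stumps


def mean_squared_error(stumps, xs, ys, learning_rate=0.5):
    return sum((y - predict(stumps, x, learning_rate)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Toy data: a "band" of ones, which no single stump can represent.
xs = list(range(10))
ys = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

for rounds in (1, 5, 50):
    stumps = boost(xs, ys, rounds)
    print(rounds, "rounds -> training MSE", round(mean_squared_error(stumps, xs, ys), 4))
```

A single stump can only cut the line in two, so it cannot represent the band. The sum of many stumps can, which is the same reason the flag model needs many rounds before the maple leaf appears.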
For more technical details of how this works, refer to this tutorial or the XGBoost paper.
Fitting an XGBoost model is very easy using R. For this experiment, we use decision trees of height 3, but you can play with the hyperparameters.
fit <- xgboost(data = matrix(c(data$x, data$y), ncol = 2), label = data$p, nrounds = 1, max_depth = 3)
We also need a way of visualizing the results. To do this, we run every pixel through the classifier and display the result:
plot_canada <- function(dataplot) {
  dataplot$y <- -dataplot$y
  dataplot$p <- as.factor(dataplot$p)
  ggplot(dataplot, aes(x = x, y = y, color = p)) +
    geom_point(size = 1) +
    scale_x_continuous(limits = c(0, 240)) +
    scale_y_continuous(limits = c(-120, 0)) +
    theme_minimal() +
    theme(panel.background = element_rect(fill='black')) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    scale_color_manual(values = c("white", "red"))
}

fullimg <- expand.grid(x = as.numeric(1:WIDTH), y = as.numeric(1:HEIGHT))
fullimg$p <- predict(fit, newdata = matrix(c(fullimg$x, fullimg$y), ncol = 2))
fullimg$p <- as.numeric(fullimg$p > 0.5)
plot_canada(fullimg)
In the first iteration, XGBoost immediately learns the two red bands at the sides:
After a few more iterations, the maple leaf starts to take form:
By iteration 60, it learns a pretty recognizable maple leaf. Note that the decision trees split on x or y coordinates, so XGBoost can’t learn diagonal decision boundaries, only approximate them with horizontal and vertical lines.
If we run it for too long, then it starts to overfit and capture the random noise in the training data. In practice, we would use cross validation to detect when this is happening. But why cross-validate when you can just eyeball it?
That was fun. If you liked this, check out this post which explores various classifiers using a flag of Australia.
The source code for this blog post is posted here. Feel free to experiment with it.
In general, speech recognition is a difficult problem, but it’s much easier when the vocabulary is limited to a handful of words. We don’t need to use complicated language models to detect phonemes, and then string the phonemes into words, like Kaldi does for speech recognition. Instead, a convolutional neural network works quite well.
The dataset consists of about 64000 audio files which have already been split into training / validation / testing sets. You are then asked to make predictions on about 150000 audio files for which the labels are unknown.
Actually, this dataset had already been published in academic literature, and people published code to solve the same problem. I started with GCommandPytorch by Yossi Adi, which implements a speech recognition CNN in PyTorch.
Its first step is to convert the audio file into a spectrogram, which is an image representation of sound. This is easily done using LibRosa.
Above: Sample spectrograms of “yes” and “no”
Now we’ve converted the problem to an image classification problem, which is well studied. To an untrained human observer, all the spectrograms may look the same, but neural networks can learn things that humans can’t. Convolutional neural networks work very well for classifying images, for example VGG16:
Above: A Convolutional Neural Network (LeNet). VGG16 is similar, but has even more layers.
For more details about this approach, refer to these papers:
Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting
You might ask: if somebody already implemented this, then what’s left to do other than run their code? Well, the test data contains “silence” samples, which contain background noise but no human speech. It also has words outside the set we care about, which we need to label as “unknown”. The PyTorch CNN produces about 95% validation accuracy by itself, but the accuracy is much lower when we add these two additional requirements.
For silence detection, I first tried the simplest thing I could think of: taking the maximum absolute value of the waveform and deciding it’s “silence” if the value is below a threshold. When combined with VGG16, this gets accuracy 0.78 on the leaderboard. This is a crude metric because sufficiently loud noise would be considered speech.
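The peak-amplitude rule is only a couple of lines of code. Below is a rough Python sketch of it on synthetic waveforms; the threshold of 0.05, the 16 kHz sample rate, and the sine-wave "clips" are assumptions for illustration, not the values or data used in the contest.

```python
import math


def is_silence(waveform, threshold=0.05):
    """Call a clip 'silence' if its peak absolute amplitude is below the threshold."""
    peak = max((abs(sample) for sample in waveform), default=0.0)
    return peak < threshold


SAMPLE_RATE = 16000  # assumed sample rate, one second per clip


def sine_clip(amplitude, frequency):
    """Synthetic stand-in for an audio clip, with samples in [-1, 1]."""
    return [amplitude * math.sin(2 * math.pi * frequency * t / SAMPLE_RATE)
            for t in range(SAMPLE_RATE)]


quiet_hum = sine_clip(amplitude=0.01, frequency=50)
speech_like = sine_clip(amplitude=0.6, frequency=220)

# A single loud click in an otherwise quiet clip defeats the rule.
hum_with_click = list(quiet_hum)
hum_with_click[100] = 0.9

print(is_silence(quiet_hum), is_silence(speech_like), is_silence(hum_with_click))
```

The last clip shows why this metric is crude: one loud sample is enough to make a clip with no speech in it count as non-silent.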
Next, I tried running openSMILE, which I use in my research to extract various acoustic features from audio. It implements an LSTM for voice activity detection: every 0.05 seconds, it outputs a probability that someone is talking. Combining the openSMILE output with the VGG16 prediction gave a score of 0.81.
I tried a bunch of things to improve my score:
In the end, the top scoring model had a score of 0.91, which beat my model by 4 percentage points. Although not enough to win a Kaggle medal, my model was in the top 15% of all submissions. Not bad!
My source code for the contest is available here.
Usually, the organizers make up weights based on how difficult they believe the problems are. However, they sometimes misjudge the difficulty of problems. Wouldn’t it be better if the weightings were determined from data?
Let’s try Principal Component Analysis!
Principal Component Analysis (PCA) is a statistical procedure that finds a transformation of the data that maximizes the variance. In our case, the first principal component gives a relative weighting of the problems that maximizes the variance of the total scores. This makes sense because we want to separate the good and bad students in a math contest.
The International Mathematical Olympiad (IMO) is an annual math competition for top high school students around the world. It consists of six problems, divided between two days: on each day, contestants are given 4.5 hours to solve three problems.
Here are the 2017 problems, if you want to try them.
Above: Score distribution for IMO 2017
This year, 615 students wrote the IMO. Problems 1 and 4 were the easiest, with the majority of contestants receiving full scores. Problems 3 and 6 were the hardest: only 2 students solved the third problem. Problems 2 and 5 were somewhere in between.
This is a good dataset to play with, because the individual results show what each student scored for every problem.
Let $X$ be a matrix containing all the data, where each column represents one problem. There are 615 contestants and 6 problems, so $X$ has 615 rows and 6 columns.
We wish to find a weight vector $w$ such that the variance of $Xw$ is maximized. Of course, scaling up $w$ by a constant factor also increases the variance, so we need the constraint that $\|w\| = 1$.
First, PCA requires that we center $X$ so that the mean for each of the problems is 0, so we subtract from each column its mean. This transformation shifts the total score by a constant, and doesn’t affect the relative weights of the problems.
Now, $Xw$ is a vector containing the total scores of all the contestants. Because the columns are centered, its mean is 0, so its variance is proportional to the sum of squares of its elements, $\|Xw\|^2$.
To maximize $\|Xw\|^2$ subject to $\|w\| = 1$, we take the singular value decomposition $X = U \Sigma V^T$. Then, the leftmost column of $V$ (corresponding to the largest singular value) gives the $w$ that maximizes $\|Xw\|^2$. This gives the first principal axis, and we are done.
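For readers who want to see the computation without a linear algebra library, here is a small pure-Python sketch. It swaps the SVD for power iteration on $X^T X$, which converges to the same first principal axis (up to sign) when the largest singular value is unique. The score table is made up for illustration and is not the IMO 2017 data, and the function names are mine.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every problem has mean score 0."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def first_principal_axis(rows, iterations=500):
    """Unit vector w maximizing ||Xw||^2, via power iteration on X^T X."""
    X = center_columns(rows)
    d = len(X[0])
    gram = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
    w = [1.0 / math.sqrt(d)] * d  # start from equal weights
    for _ in range(iterations):
        w = [sum(gram[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def variance_of_totals(rows, w):
    """Variance of the weighted total scores Xw (columns centered first)."""
    X = center_columns(rows)
    totals = [sum(value * weight for value, weight in zip(row, w)) for row in X]
    return sum(t * t for t in totals) / len(totals)


# Hypothetical scores (0-7) of eight contestants on three problems:
# an easy one, a medium one, and a very hard one.
scores = [
    [7, 7, 0], [7, 0, 0], [7, 3, 0], [0, 0, 0],
    [7, 7, 7], [2, 1, 0], [7, 6, 0], [1, 0, 0],
]

w = first_principal_axis(scores)
equal = [1.0 / math.sqrt(3)] * 3
print("PCA weights:", [round(v, 3) for v in w])
print("variance (PCA):", round(variance_of_totals(scores, w), 3))
print("variance (equal weights):", round(variance_of_totals(scores, equal), 3))
```

Both weight vectors have unit length, so the comparison is fair: the PCA weights can only match or beat equal weighting on variance.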
Running PCA on the IMO 2017 data produced interesting results. After re-scaling the weights so that the minimum possible score is 0 and the maximum possible score is 42 (to match IMO’s scoring), PCA recommends the following weights:
This is the weighting that produces the highest variance. That’s right, solving the hardest problem in the history of the IMO would get you a fraction of 1 point. P4 had the highest variance of the six problems, so PCA gave it the highest weight.
The scores and rankings produced by the PCA scheme are reasonably well-correlated with the original scores. Students that did well still did well, and students that did poorly still did poorly. The top students that solved the harder problems (2, 3, 5, 6) usually also solved the easier problems (1 and 4). The students that would be the unhappiest with this scheme are a small number of people who solved P3 or P6, but failed to solve P4.
Here’s a comparison of score distributions with the original and PCA scheme. There is a lot less separation between the best of the best students and the middle of the pack. It is easy to check that PCA does indeed produce higher variance than weighting all six problems equally.
Now, let me comment on the strange results.
It’s clearly absurd to give 0.15 points to the hardest problem on the IMO, and make P4, a much easier problem, be worth 100 times more. But it makes sense from PCA’s perspective. About 99% of the students scored zero on P3, so its variance is very low. Given that PCA has a limited amount of weight to “spend” to increase the total variance, it would be wasteful to use much of it on P3.
The PCA score distribution has less separation between the good students and the best students. However, by giving a lot of weight to P1 and P4, it clearly separates mediocre students that solve one problem from the ones who couldn’t solve anything at all.
In summary, scoring math contests using PCA doesn’t work very well. Although it maximizes overall variance, math contests are asymmetrical as we care about differentiating between the students on the top end of the spectrum.
If you want to play with the data, I uploaded it as a Kaggle dataset.
The code for this analysis is available here.
Further discussion of this article on /r/math.
A deterministic finite automaton (DFA) is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$ where: $Q$ is a finite set of states…
I could practically feel my students falling asleep in their seats. Inevitably, a student asked the one question you should never ask a theorist:
“So… how is this useful in real life?”
I’ve done some theoretical research on formal language theory and DFAs, so my immediate response was why DFAs are important to theorists.
Above: A DFA requires O(1) memory, regardless of the length of the input.
You might have heard of Turing machines, which abstract the idea of a “computer”. In a similar vein, regular languages describe what is possible to do with a computer with very little memory. No matter how long the input is, a DFA only keeps track of what state it’s currently in, so it only requires a constant amount of memory.
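Here is a small Python sketch of what "only keeps track of the current state" looks like in code. The example DFA, which accepts binary numbers divisible by 3, is my own choice for illustration; its three states are the possible remainders.

```python
def run_dfa(transitions, start, accepting, text):
    """Simulate a DFA. The only thing remembered between symbols is `state`."""
    state = start
    for symbol in text:
        state = transitions[(state, symbol)]
    return state in accepting


# States 0, 1, 2 are the remainder mod 3 of the bits read so far.
# Reading bit b turns a number with remainder r into one with remainder (2r + b) mod 3.
DIVISIBLE_BY_3 = {(r, b): (2 * r + int(b)) % 3 for r in range(3) for b in "01"}

for n in (6, 7, 9, 1024):
    bits = format(n, "b")
    print(n, bits, run_dfa(DIVISIBLE_BY_3, start=0, accepting={0}, text=bits))
```

The input can be a million bits long and the loop still stores a single number between 0 and 2.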
By studying properties of regular languages, we gain a better understanding of what is and what isn’t possible with computers with very little memory.
This explains why theorists care about regular languages — but what are some real world applications?
Regular expressions are a useful tool that every programmer should know. If you wanted to check if a string is a valid email address, you might write something like:
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
Behind the scenes, this regular expression gets converted into an NFA, which can be quickly evaluated to produce an answer.
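As a quick usage sketch, here is the same pattern in Python. One caveat: Python's `re` module is a backtracking matcher rather than an NFA simulation, so this shows how the expression is used, not the conversion described above. The helper name `looks_like_email` is made up for this example.

```python
import re

# Same pattern as above, written as a Python raw string.
EMAIL_PATTERN = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")


def looks_like_email(text):
    """True if the whole string matches the (deliberately simple) email pattern."""
    return EMAIL_PATTERN.match(text) is not None


for candidate in ("john.doe@example.com", "not an email", "John@example.com"):
    print(repr(candidate), looks_like_email(candidate))
```

The pattern only allows lowercase letters, so the last candidate is rejected even though it is a perfectly valid address.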
You don’t need to understand the internals of this in order to use regular expressions, but it’s useful to know some theory so you understand its limitations. Some programmers may try to use regular expressions to parse HTML, but if you’ve seen the Pumping Lemma, you will understand why this is fundamentally impossible.
In every programming language, the first step in the compiler or interpreter is the lexer. The lexer reads in a file of your favorite programming language, and produces a sequence of tokens. For example, if you have this line in C++:
cout << "Hello World" << endl;
The lexer generates something like this:
IDENTIFIER cout
LSHIFT <<
STRING "Hello World"
LSHIFT <<
IDENTIFIER endl
SEMICOLON ;
The lexer uses a DFA to go through the source file, one character at a time, and emit tokens. If you ever design your own programming language, this will be one of the first things you will write.
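Below is a toy hand-written lexer in Python that handles just enough of C++ to tokenize the line above. It is a sketch, not how a production lexer is built (those are usually generated from a description file); the state names and the `lex` function are my own, and only identifiers, string literals without escapes, `<<` and `;` are supported.

```python
def lex(source):
    """Scan one character at a time, tracking only the current state and lexeme."""
    tokens = []
    state = "START"
    lexeme = ""
    i = 0
    while i <= len(source):
        ch = source[i] if i < len(source) else "\0"  # sentinel marks end of input
        if state == "START":
            if ch.isalpha() or ch == "_":
                state, lexeme = "IDENT", ch
            elif ch == '"':
                state, lexeme = "STRING", ch
            elif ch == "<":
                state, lexeme = "LT", ch
            elif ch == ";":
                tokens.append(("SEMICOLON", ch))
            elif ch.isspace() or ch == "\0":
                pass  # skip whitespace and the end-of-input sentinel
            else:
                raise ValueError(f"unexpected character {ch!r}")
            i += 1
        elif state == "IDENT":
            if ch.isalnum() or ch == "_":
                lexeme += ch
                i += 1
            else:
                tokens.append(("IDENTIFIER", lexeme))
                state = "START"  # re-examine ch from the start state
        elif state == "STRING":
            if ch == "\0":
                raise ValueError("unterminated string literal")
            lexeme += ch
            i += 1
            if ch == '"':
                tokens.append(("STRING", lexeme))
                state = "START"
        elif state == "LT":
            if ch != "<":
                raise ValueError("only '<<' is supported in this sketch")
            tokens.append(("LSHIFT", "<<"))
            state = "START"
            i += 1
    return tokens


for kind, text in lex('cout << "Hello World" << endl;'):
    print(kind, text)
```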
Above: Lexer description for JSON numbers, like -3.05
Another application of finite automata is programming simple agents to respond to inputs and produce actions in some way. You can write a full program, but a DFA is often enough to do the job. DFAs are also easier to reason about and easier to implement.
The AI for Pac-Man uses a four-state automaton:
Typically this type of automaton is called a Finite State Machine (FSM) rather than a DFA. The difference is that in an FSM, we do an action depending on the state, whereas in a DFA, we care about accepting or rejecting a string, but they’re the same concept.
What if we took a DFA, but instead of fixed transition rules, the transitions were probabilistic? This is called a Markov Chain!
Above: 3 state Markov chain to model the weather
Markov chains are frequently used in probability and statistics, and have lots of applications in finance and computer science. Google’s PageRank algorithm uses a giant Markov chain to determine the relative importance of web pages!
You can calculate things like the probability of being in a state after a certain number of time steps, or the expected number of steps to reach a certain state.
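Both calculations fit in a few lines of Python. In this sketch the transition probabilities are made up (they are not read off the figure), and the function names are my own. The state distribution after some steps comes from repeatedly multiplying by the transition matrix; the expected number of steps to reach a target state comes from iterating the equations $h_i = 1 + \sum_{j \neq \text{target}} P_{ij} h_j$ until they settle.

```python
STATES = ["sunny", "cloudy", "rainy"]

# P[i][j] = probability of moving from state i to state j (hypothetical values).
P = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]


def step(dist, P):
    """One time step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def distribution_after(start, steps, P):
    """Probability of being in each state after `steps` steps, starting in `start`."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist, P)
    return dist


def expected_steps_to_reach(target, P, sweeps=1000):
    """Expected number of steps to first reach `target` from each state."""
    n = len(P)
    h = [0.0] * n
    for _ in range(sweeps):
        h = [0.0 if i == target
             else 1.0 + sum(P[i][j] * h[j] for j in range(n) if j != target)
             for i in range(n)]
    return h


print("after 3 days:", [round(p, 3) for p in distribution_after(0, 3, P)])
print("expected days until rain:", [round(v, 3) for v in expected_steps_to_reach(2, P)])
```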
In summary, DFAs are powerful and flexible tools with myriad real-world applications. Research in formal language theory is valuable, as it helps us better understand DFAs and what they can do.
Imagine you’re analyzing a dataset. You perform a bunch of statistical tests, and one day you get a p-value of 0.02. This must be significant, right? Not so fast! If you tried a lot of tests, then you’ve fallen into the multiple comparisons fallacy: the more tests you do, the higher the chance that you get a p-value < 0.05 by pure chance.
Here’s an xkcd comic that illustrates this:
They conducted 20 experiments and got a p-value < 0.05 on one of the experiments, thus concluding that green jelly beans cause acne. Later, other researchers will have trouble replicating their results — I wonder why?
What should they have done differently? Well, if they knew about the Bonferroni Correction, they would have divided the significance threshold 0.05 by the number of experiments, 20. Then, only a p-value smaller than 0.0025 would count as a truly significant correlation between jelly beans and acne.
Let’s dive in to explain why this division makes sense.
Time for some basic probability. What’s the chance that the scenario in the xkcd comic would happen? In other words, if we conduct 20 experiments, each with probability 0.05 of producing a significant p-value, then how likely will at least one of the experiments produce a significant p-value? Assume all the experiments are independent.
The probability of an experiment not being significant is $1 - 0.05 = 0.95$, so the probability of all 20 experiments not being significant is $0.95^{20} \approx 0.36$. Therefore the probability of at least 1 of 20 experiments being significant is $1 - 0.36 = 0.64$. Not too surprising now, is it?
We want the probability of accidentally getting a significant p-value by chance across all the experiments to be 0.05, not 0.64. So flip this around: we need to find an adjusted per-experiment threshold $p$ that gives an overall false-positive probability of 0.05:
$$1 - (1 - p)^{20} = 0.05$$
Solving for $p$:
$$p = 1 - 0.95^{1/20} \approx 0.00256$$
Okay, this seems reasonably close to 0.0025. In general, if the overall p-value is $\alpha$ and we are correcting for $n$ comparisons, then
$$p = 1 - (1 - \alpha)^{1/n}$$
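The arithmetic above is easy to check in Python. The two function names below are my own; the formulas are the ones just derived, under the same assumption that the experiments are independent.

```python
def familywise_error_rate(per_test_alpha, num_tests):
    """Probability that at least one of num_tests independent tests is a false positive."""
    return 1 - (1 - per_test_alpha) ** num_tests


def sidak_threshold(overall_alpha, num_tests):
    """Per-test threshold that keeps the overall false-positive probability at overall_alpha."""
    return 1 - (1 - overall_alpha) ** (1 / num_tests)


print("chance of a false positive in 20 tests:", round(familywise_error_rate(0.05, 20), 4))
print("adjusted threshold for 20 tests:", round(sidak_threshold(0.05, 20), 5))
print("round trip:", round(familywise_error_rate(sidak_threshold(0.05, 20), 20), 4))
```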
This is known as the Šidák Correction in literature.
Šidák’s method works great, but eventually people started complaining that Šidák’s name had too many diacritics and looked for something simpler (also, it used to be difficult to compute nth roots back when they didn’t have computers). How can we approximate this formula?
Approximate? Use Taylor series, of course!
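Assume $n$ is constant, and define:
$$f(\alpha) = 1 - (1 - \alpha)^{1/n}$$
We take the first two terms of the Taylor series of $f$ centered at 0:
$$f(\alpha) \approx f(0) + f'(0)\,\alpha$$
Now $f(0) = 0$ and $f'(\alpha) = \frac{1}{n}(1 - \alpha)^{1/n - 1}$, so $f'(0) = \frac{1}{n}$. Therefore,
$$f(\alpha) \approx \frac{\alpha}{n}$$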
That’s the derivation for the Bonferroni Correction.
Since we only took the first two terms of the Taylor series, this produces a $p$ that’s slightly lower than necessary.
In the real world, $\alpha$ is close to zero, so in practice it makes little difference whether we use the exact Šidák Correction or the Bonferroni approximation.
That’s it for now. Next time you do multiple comparisons, just remember to divide your significance threshold $\alpha$ by $n$. Now you know why.
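To see how close the two corrections really are, here is a short Python comparison for an overall threshold of 0.05. The function names are my own, and the formulas are the two derived in this post.

```python
def sidak_threshold(alpha, n):
    """Exact per-test threshold for n independent comparisons."""
    return 1 - (1 - alpha) ** (1 / n)


def bonferroni_threshold(alpha, n):
    """First-order Taylor approximation of the Sidak threshold."""
    return alpha / n


ALPHA = 0.05
for n in (2, 5, 20, 100):
    exact = sidak_threshold(ALPHA, n)
    approx = bonferroni_threshold(ALPHA, n)
    gap = (exact - approx) / exact
    print(f"n={n:3d}  Sidak={exact:.6f}  Bonferroni={approx:.6f}  relative gap={gap:.2%}")
```

The Bonferroni threshold is always the smaller, more conservative of the two, and for this $\alpha$ the relative gap stays under 3% no matter how many comparisons there are.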
With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to beginning data scientists.
Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.
The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.
Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.
At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite being within 10% of the top submission in overall accuracy, I was still in the bottom half of the leaderboard, ranking in the 30th percentile.
The public leaderboard looked like this:
Above: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. Line in green is my submission, scoring 0.265.
This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything that was not “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but afterwards, any further improvement is limited.
The top of the leaderboard is very crowded, with over 1000 teams having the score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat.
Above: To run 90% as fast as Usain Bolt, you need to run 100 meters in 10.5 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().
This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, at least the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then, 5000 smart people working for 2 months can only improve it by 10%. Perhaps this is obvious to someone doing machine learning long enough, but we should look back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.
Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.
Read further discussion of this post on the Kaggle forums!
The contents of the paper are fairly technical. In this blog post, I will explain the background and motivation for the problem, and give a statement of the main result that we proved.
The subject of this paper is a formal language operation called overlap assembly, which models annealing of DNA strands. When you cool a solution containing a lot of short DNA strands, they tend to stick together in a predictable manner: they stick together to form longer strands, but only if the ends “match”.
We can view this as a formal operation on strings. If we have two strings a and b such that some suffix of a matches some prefix of b, then the overlap assembly is the string we get by combining them together:
Above: Example of overlap assembly of two strings
In some cases, there might be more than one way of combining the strings together, for example, a=CATA and b=ATAG — then both CATATAG and CATAG are possible overlaps. Therefore, overlap assembly actually produces a set of strings.
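On plain strings, the operation is short enough to write out. This Python sketch assumes the overlapping part must be non-empty and may be as long as the shorter string; the function name is mine, and the paper's formal definition should be taken as the reference.

```python
def overlap_assembly(a, b):
    """All strings formed by merging a suffix of `a` with an equal prefix of `b`."""
    results = set()
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:          # the last k letters of a match the first k of b
            results.add(a + b[k:])   # keep the shared part only once
    return results


print(sorted(overlap_assembly("CATA", "ATAG")))
print(sorted(overlap_assembly("abab", "ba")))
```

Two strings with no matching ends produce the empty set, which is why the result has to be a set rather than a single string.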
We can extend this idea to languages in the obvious way. The overlap assembly of two languages is the set of all the strings you get by overlapping any two strings in the respective languages. For example, if L1={ab, abab, ababab, …} and L2={ba, baba, bababa, …}, then the overlap language is {aba, ababa, …}.
It turns out that if we start with two regular languages, then the overlap assembly language will always be regular too. I won’t go into the details, but it suffices to construct an NFA that recognizes the overlap language, given two DFAs recognizing the two input languages.
Above: Example of construction of an overlap assembly NFA (Figure 2 of our paper)
Given that regular languages are closed under overlap assembly, a natural question to ask is: how “complex” is the regular language that gets produced? One measure of complexity of regular languages is state complexity: the number of states in the smallest DFA that recognizes the language.
State complexity was first studied in 1994 by Sheng Yu et al. Some operations do not increase state complexity very much: if two regular languages have state complexities m and n, then their union has state complexity at most mn. On the other hand, the reversal operation can blow up state complexity exponentially: it’s possible for a language to have state complexity n but its reversal to have state complexity $2^n$.
Here’s a table of the state complexities of a few regular language operations:
Over the years, state complexity has been studied for a wide range of other regular language operations. Overlap assembly is another such operation — the paper studies the state complexity of this operation.
In our paper, we proved that the state complexity of overlap assembly (for two languages with state complexities m and n) is at most:
Further, we constructed a family of DFAs that achieve this bound, so the bound is tight.
That’s it for my not-too-technical summary of this paper. I glossed over a lot of the details, so check out the paper for the full story!
ETHWaterloo is a hackathon themed around the cryptocurrency Ethereum, which is gaining a lot of traction lately. You have to build something using Ethereum technology, and you can listen to many talks given by experts in the cryptocurrency community. It was a great way to learn more about this new blockchain technology.
Above: Ethereum founder Vitalik Buterin speaking in the opening ceremony
The hackathon opened with a series of talks. I used to be good friends with Vitalik back in first year of university, but haven’t talked to him much after he dropped out of university to work on Ethereum. It was good to see him in person again at this hackathon.
After the opening ceremony, we went to a few tutorials on how to set up the Solidity environment, then brainstormed ideas for what to build. Eventually, we came up with this:
EulerCoin: our new Ethereum-based ICO
We often hear the media describe Bitcoin as “computers solving hard math problems to generate money”. Well, we made a new cryptocurrency, EulerCoin, where tokens are created by correctly submitting the answers to Project Euler problems!
So how does this work?
I have maintained a list of Project Euler answers since 2009, but recently it’s been hard to convince people to post their answers publicly. Something like EulerCoin would incentivize people to contribute their answers.
After working on it for a day or so, we got a prototype working:
Above: Screenshot of EulerCoin prototype
Here’s a simple web UI, where clicking “Submit” opens up Metamask to initiate an Ethereum transaction. To avoid actually spamming Project Euler for a hackathon project, we mocked up the answer-checking component, but everything else works.
This was the first time programming in Ethereum / Solidity for all of us, so it was a novel experience. After doing this hackathon, I have a much better understanding of what Ethereum does. Bitcoin is a digital currency that can be transferred and not much else. Ethereum is a whole platform where you can run your own custom, Turing-complete smart contracts, all on the blockchain.
Ethereum also has some major limitations that may hinder widespread adoption. Here are a few facts about Ethereum that I was surprised to learn:
I’m curious to see how Ethereum will solve these big problems in the upcoming years.
Above: Team EulerCoin (Left to Right: Andrei Danciulescu, Chuyi Liu, Shala Chen, Me)
Our submission didn’t make the final round of judging, but we had a good time. Before the closing ceremony, our team gathered for a group photo.
That’s it for now. I’m not sure if I want to invest my own money into Ethereum just yet, but certainly I will read more about cryptocurrencies. Bubble or not, the mathematical ideas and technologies are worth looking into.
]]>You have probably heard of Alzheimer’s disease, but it gets far less attention in the media than cancer or stroke. It is a neurodegenerative disease that mostly affects elderly people. About 5 million Americans are living with Alzheimer’s: roughly 1 in 9 people over the age of 65, and 1 in 3 over the age of 85.
Alzheimer’s is also the most expensive disease in America. After diagnosis, patients may continue to live for over 10 years, and during much of this time, they are unable to care for themselves and require a constant caregiver. In 2017, Medicare and Medicaid paid for about 68% of the cost of caring for patients with Alzheimer’s, and this spending is expected to increase as the elderly population grows.
Despite a lot of recent advances in our understanding of the disease, there is currently no cure for Alzheimer’s. Since the disease is so prevalent and harmful, research in this direction is highly impactful.
One of the early signs of Alzheimer’s is having difficulty remembering things, including words, leading to a decrease in vocabulary. A reliable way to test for this is a retrieval question like the following (Monsch et al., 1992):
In the next 60 seconds, name as many items as possible that can be found in a supermarket.
A healthy person could rattle off about 20-30 items in a minute, whereas someone with Alzheimer’s could only produce about 10. By setting the threshold at 16 items, the researchers could classify even mild cases of Alzheimer’s with about 92% accuracy.
Vocabulary loss is only one sign of Alzheimer’s disease, though. Patients with Alzheimer’s also tend to be rambly and incoherent. This can be tested with a picture description task, where the patient is given a picture and asked to describe it in as much detail as possible (Giles, Patterson, Hodges, 1994).
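The threshold rule is simple enough to write down directly. Here is a minimal sketch, assuming that “setting the threshold at 16” means flagging anyone who names fewer than 16 items; the function names and the sample counts are made up for illustration and are not data from the study.

```python
def flag_by_fluency(items_named, threshold=16):
    """Return True if the item count falls below the screening threshold."""
    return items_named < threshold


def accuracy(counts, has_alzheimers, threshold=16):
    """Fraction of participants whose flag matches their known label."""
    correct = sum(
        flag_by_fluency(count, threshold) == label
        for count, label in zip(counts, has_alzheimers)
    )
    return correct / len(counts)


# Made-up counts for illustration only; these are NOT data from the study.
counts = [24, 28, 21, 15, 9, 11, 17, 12]
labels = [False, False, False, False, True, True, True, True]
print(accuracy(counts, labels))  # 0.75: one healthy and one patient misclassified
```

A single cutoff like this trades off false alarms against missed cases, so moving `threshold` up or down changes which kind of error dominates.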
Above: Boston Cookie Theft picture used for picture description task
There was no time limit: the patients talked until they indicated they had nothing more to say, or until they had been silent for 15 seconds.
Patients with Alzheimer’s disease produced descriptions with varying degrees of incoherence. Here’s an example transcript, from the above paper:
Experimenter: Tell me everything you see going on in this picture
Patient: oh yes there’s some washing up going on / (laughs) yes / …… oh and the other / ….. this little one is taking down the cookie jar / and this little girl is waiting for it to come down so she’ll have it / ………. er this girl has got a good old splash / she’s left the taps on (laughs) she’s gone splash all down there / um …… she’s got splash all down there
You can clearly tell that something’s off, but it’s hard to put a finger on exactly what the problem is. Well, time to apply some machine learning!
Fraser’s 2016 paper uses data from the DementiaBank corpus, consisting of 240 narrative samples from patients with Alzheimer’s, and 233 from a healthy control group. The two groups were matched to have similar age, gender, and education levels. Each participant was asked to describe the Boston Cookie Theft picture above.
Fraser’s analysis used both the original audio data and a detailed computer-readable transcript. She looked at 370 different features covering all sorts of linguistic metrics, like ratios of different parts of speech, syntactic structures, vocabulary richness, and repetition. Then, she performed a factor analysis and identified a set of 35 features that achieves about 81% accuracy in distinguishing between Alzheimer’s patients and controls.
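To give a flavor of what such features look like, here is a rough sketch that computes a handful of lexical measures from a transcript. These are not the features or the code from the paper: the word lists `PRONOUNS` and `FILLERS`, the function names, and the four measures are simplified stand-ins chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists; a real system would use a POS tagger.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
FILLERS = {"um", "uh", "er", "oh"}


def tokenize(transcript):
    """Lowercase the transcript and keep word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z'’]+", transcript.lower())


def lexical_features(transcript):
    """Compute a few simple vocabulary and repetition measures."""
    tokens = tokenize(transcript)
    if not tokens:
        raise ValueError("transcript contains no words")
    counts = Counter(tokens)
    total = len(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    # Number of bigram occurrences that repeat an earlier bigram.
    repeated_bigrams = len(bigrams) - len(set(bigrams))
    return {
        "type_token_ratio": len(counts) / total,  # vocabulary richness
        "pronoun_ratio": sum(counts[w] for w in PRONOUNS) / total,
        "filler_ratio": sum(counts[w] for w in FILLERS) / total,
        "repeated_bigram_ratio": repeated_bigrams / max(len(bigrams), 1),
    }


# A made-up utterance with heavy repetition, for illustration only.
sample = "she has got splash all down there um she has got splash all down there"
features = lexical_features(sample)
```

For this sample, `features["type_token_ratio"]` is 8/15, since only 8 of the 15 words are distinct. Feature vectors like this one could then be fed to any standard classifier.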
According to the analysis, a few of the most important distinguishing features are:
Shortly after this research was published, my adviser Frank Rudzicz co-founded WinterLight Labs, a company that’s working on turning this proof-of-concept into an actual usable product. The company is also working on diagnosing other cognitive disorders, like Primary Progressive Aphasia.
A few other grad students in my research group are working on Talk2Me, which is a large longitudinal study to collect more data from patients with various neurodegenerative disorders. More data is always helpful for future research.
So this is the starting point for my research. Stay tuned for updates!
]]>