Hypothesis testing for difference in Pearson / Spearman correlations

The Pearson and Spearman correlation coefficients measure how strongly two variables are associated. They’re useful as evaluation metrics in certain machine learning tasks where the model predicts some kind of score, but the actual value of the score is arbitrary: you only care that the model puts high-scoring items above low-scoring items.

An example of this is the STS-B task in the GLUE benchmark: the task is to rate pairs of sentences on how similar they are. The task is evaluated using Pearson and Spearman correlations against the human ground-truth. Now, if model A has Spearman correlation of 0.55 and model B has 0.51, how confident are you that model A is actually better?

Recently, the NLP research community has advocated for more significance testing (Dror et al., 2018): report a p-value when comparing two models, to distinguish true improvements from fluctuations due to random chance. However, hypothesis testing is rarely done for Pearson and Spearman metrics — it’s not mentioned in the hitchhiker’s guide cited above, and it’s not supported by the standard ML libraries in Python and R. In this post, I describe how to do significance testing for a difference in Pearson / Spearman correlations, and give some references to the statistics literature.

Definitions and properties

The Pearson correlation coefficient is defined by:

r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}

The Spearman rank-order correlation is defined as the Pearson correlation between the ranks of the two variables, and measures the relative order between them. Both correlation coefficients range between -1 and 1.
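
For a quick sanity check, here’s a small R snippet (on simulated data) showing that the Spearman correlation is just the Pearson correlation computed on the ranks:

set.seed(123)
x <- rnorm(50)
y <- x^3 + rnorm(50)             # a monotone but non-linear relationship
cor(x, y, method = "spearman")   # Spearman correlation
cor(rank(x), rank(y))            # identical: Pearson correlation of the ranks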

Pearson’s correlation is simpler, has nicer statistical properties, and is the default option in most software packages. However, de Winter et al. (2016) argue that Spearman’s correlation handles non-normal data better and is more robust to outliers, so it is often preferable to Pearson’s correlation.

Significance testing

Suppose we have the predictions of model A and model B, and we wish to compute a p-value for whether their Pearson / Spearman correlation coefficients are different. Start by computing the correlation coefficients for both models against the ground truth.

Then, apply the Fisher transformation to each correlation coefficient:

z = \frac{1}{2} \log(\frac{1+r}{1-r})

This transforms r, which lies between -1 and 1, into z, which ranges over the whole real line. It turns out that z is approximately normally distributed, with a nearly constant variance that depends only on N (the number of data points) and not on r.

For the Pearson correlation, the standard deviation of the Fisher-transformed estimate z(\hat r_p) is given by:

\mathrm{SD}(z(\hat r_p)) = \sqrt{\frac{1}{N-3}}

For the Spearman rank-order correlation, the standard deviation of the Fisher-transformed estimate z(\hat r_s) is given by:

\mathrm{SD}(z(\hat r_s)) = \sqrt{\frac{1.060}{N-3}}

Now, we can compute the p-value because the difference of the two z values follows a normal distribution with known variance.

R implementation

The following R function computes a p-value for the two-tailed hypothesis test, given a ground truth vector and two model output vectors:

cor_significance_test <- function(truth, x1, x2, method="pearson") {
  n <- length(truth)
  # Correlation of each model's output with the ground truth
  cor1 <- cor(truth, x1, method=method)
  cor2 <- cor(truth, x2, method=method)
  # Fisher transformation of each correlation
  fisher1 <- 0.5*log((1+cor1)/(1-cor1))
  fisher2 <- 0.5*log((1+cor2)/(1-cor2))
  if(method == "pearson") {
    expected_sd <- sqrt(1/(n-3))
  } else if(method == "spearman") {
    expected_sd <- sqrt(1.060/(n-3))
  }
  # Two-tailed p-value from the normal approximation
  2*(1-pnorm(abs(fisher1-fisher2), sd=expected_sd))
}

Naturally, the one-tailed p-value is half of the two-sided one.
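
As a usage example, here’s how the function might be called on simulated data (the predictions below are made up for illustration; model B simply has noisier outputs than model A):

set.seed(0)
truth  <- rnorm(200)
pred_a <- truth + rnorm(200, sd = 1.0)   # hypothetical model A predictions
pred_b <- truth + rnorm(200, sd = 1.3)   # hypothetical, noisier model B predictions
cor_significance_test(truth, pred_a, pred_b, method = "spearman")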

For details of other similar computations involving Pearson and Spearman correlations (e.g., confidence intervals, unpaired hypothesis tests), I recommend the Handbook of Parametric and Nonparametric Statistical Procedures (Sheskin, 2000).

Caveats and limitations

The variance formula for the Pearson correlation is well established and very accurate. For Spearman, the constant 1.060 has no exact theoretical backing; it was derived empirically by Fieller et al. (1957), who ran simulations with variables drawn from a bivariate normal distribution. Fieller claimed that the approximation is accurate for correlations between -0.8 and 0.8. Borkowf (2002) warns that this approximation may be off if the distribution is far from bivariate normal.

The procedure here for Spearman correlation may not be appropriate if the correlation coefficient is very high (above 0.8) or if the data is not approximately normal. In that case, you might want to try permutation tests or bootstrapping methods — refer to Bishara and Hittner (2012) for a detailed discussion.

References

  1. Dror, Rotem, et al. “The hitchhiker’s guide to testing statistical significance in natural language processing.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.
  2. de Winter, Joost CF, Samuel D. Gosling, and Jeff Potter. “Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data.” Psychological methods 21.3 (2016): 273.
  3. Sheskin, David J. “Handbook of Parametric and Nonparametric Statistical Procedures.” Chapman & Hall/CRC: Boca Raton, FL (2000).
  4. Fieller, Edgar C., Herman O. Hartley, and Egon S. Pearson. “Tests for rank correlation coefficients. I.” Biometrika 44.3/4 (1957): 470-481.
  5. Borkowf, Craig B. “Computing the nonnull asymptotic variance and the asymptotic relative efficiency of Spearman’s rank correlation.” Computational statistics & data analysis 39.3 (2002): 271-286.
  6. Bishara, Anthony J., and James B. Hittner. “Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches.” Psychological methods 17.3 (2012): 399.

On Multiple Hypothesis Testing and the Bonferroni Correction

When you’re doing a statistical analysis, it’s easy to run into the multiple comparisons problem.

Imagine you’re analyzing a dataset. You perform a bunch of statistical tests, and one day you get a p-value of 0.02. This must be significant, right? Not so fast! If you ran a lot of tests, then you’ve fallen into the multiple comparisons fallacy: the more tests you run, the higher the chance that you get a p-value below 0.05 by pure chance.

Here’s an xkcd comic that illustrates this:

[xkcd comic: “Significant”]

They conducted 20 experiments and got a p-value < 0.05 on one of the experiments, thus concluding that green jelly beans cause acne. Later, other researchers will have trouble replicating their results — I wonder why?

What should they have done differently? Well, if they knew about the Bonferroni Correction, they would have divided the significance threshold 0.05 by the number of experiments, 20. Then only a p-value smaller than 0.0025 would count as evidence of a truly significant correlation between jelly beans and acne.

Let’s dive in to explain why this division makes sense.

Šidák Correction

Time for some basic probability. What’s the chance that the scenario in the xkcd comic would happen? In other words, if we conduct 20 experiments, each with probability 0.05 of producing a significant p-value, then how likely will at least one of the experiments produce a significant p-value? Assume all the experiments are independent.

The probability of an experiment not being significant is 1 - 0.05, so the probability of all 20 experiments not being significant is (1-0.05)^{20}. Therefore, the probability of at least 1 of the 20 experiments being significant is 1 - (1-0.05)^{20} = 0.64. Not too surprising now, is it?

We want the probability of getting at least one significant result by pure chance to be 0.05, not 0.64. So flip this around: we need to find an adjusted threshold p_{adj} that gives an overall significance level of 0.05:

1 - (1 - p_{adj})^{20} = 0.05

Solving for p_{adj}:

p_{adj} = 1 - 0.95^{1/20} \approx 0.00256

Okay, this is reasonably close to 0.0025. In general, if the desired overall significance level is p and we are correcting for N comparisons, then

p_{adj} = 1 - (1 - p)^{1/N}

This is known as the Šidák Correction in the literature.

Bonferroni Correction

Šidák’s method works great, but eventually people started complaining that Šidák’s name had too many diacritics and looked for something simpler (also, it used to be difficult to compute nth roots back when they didn’t have computers). How can we approximate this formula?

Approximate? Use Taylor series, of course!

Assume N is constant, and define:

f(p) = 1 - (1-p)^{1/N}

We take the first two terms of the Taylor series of f(p) centered at 0:

f(p) = f(0) + f'(0)p + O(p^2)

Now f(0) = 0 and f'(p) = \frac{1}{N} (1-p)^{-(N-1)/N} so f'(0) = \frac1N. Therefore,

f(p) = p_{adj} \approx \frac{p}{N}.

That’s the derivation for the Bonferroni Correction.

Since we only took the first two terms of the Taylor series, this produces a p_{adj} that’s slightly lower than necessary.

In the real world, p is close to zero, so in practice it makes little difference whether we use the exact Šidák Correction or the Bonferroni approximation.
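
To see just how little difference, here’s a quick R sketch comparing the two adjusted thresholds for the jelly bean scenario (overall threshold 0.05, N = 20 experiments):

alpha <- 0.05
N <- 20
sidak      <- 1 - (1 - alpha)^(1 / N)   # exact Šidák threshold, about 0.00256
bonferroni <- alpha / N                 # Bonferroni approximation, 0.0025
c(sidak = sidak, bonferroni = bonferroni)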

That’s it for now. Next time you do multiple comparisons, just remember to divide your significance threshold by N. Now you know why.

What’s the difference between Mathematics and Statistics?

Statistics has a sort of funny and peculiar relationship with mathematics. In a lot of universities, they’re lumped together and you have a “Department of Mathematics and Statistics”. Other times, statistics is grouped as a branch of applied math. Pure mathematicians tend to either think of it as an application of probability theory, or dislike it because it’s “not rigorous enough”.

After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of thinking.


Above: if pure math and statistics were like games

 

Definitions and Proofs

Math always follows a consistent definition-theorem-proof structure. No matter what branch of mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a mathematical argument is more or less the same.

You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and agree on which objects are wugs and which objects are not wugs.

Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and how they connect to other objects in the mathematical universe, and everyone is happy.

In statistics, it’s common to define things with intuition and examples, so “you know it when you see it”; things are rarely as black-and-white as in mathematics. This is borne out of necessity: statisticians work with real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.

Take for example the concept of an “outlier”. Many statistical methods behave badly when the data contains outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier? Well, that depends on many criteria, like how many data points you have, how far the point is from the rest of the points, and what kind of model you’re fitting.

[Scatterplot with two potential outliers]

In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove one of them? There’s no correct answer, and you have to use your judgment.


For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered statistically significant. But this value is merely a guideline, not a law – it’s not like 0.048 is definitely significant and 0.051 is not.

Now let’s say you run an A/B test and find that changing a button to blue results in higher clicks, with a p-value of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At what point does it become not significant? There is no correct answer; you have to use your judgment.


Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and if this assumption is violated then you’ll get wrong results, so you need to use a different model.

 

[Scatterplot of data that may be heteroscedastic]

Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer, you have to use your judgment.


Another example: consider a linear regression model with two variables. When you plot the points on a graph, you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly linear. But what if you get this:

[Scatterplot showing a slightly non-linear relationship]

There is some evidence of non-linearity, but how much “bendiness” can you accept before the data is definitely not “roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to use your judgment.


I think you see the pattern here. In both math and statistics, you have models that only work if certain assumptions are satisfied. However, unlike math, there is no universal procedure that can tell you whether your data satisfies these assumptions.

Here are some common things that statistical models assume:

  • A random variable is drawn from a normal (Gaussian) distribution
  • Two random variables are independent
  • Two random variables satisfy a linear relationship
  • Variance is constant

Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying in statistics goes: “all models are wrong, but some are useful”.

On the other hand, if your data deviates significantly from your model assumptions, then the model breaks down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is normally distributed, so at some point you have to step in and apply your judgment.

Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern and is very much like any other branch of math. Any proofs you see in a stats course likely belong in this category.

 

Classical vs Statistical Algorithms

You might be wondering: without rigorous definitions and proofs, how can you be sure anything you’re doing is correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of judging “correctness”.

Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes runtimes of programs by looking at how they behave as their inputs get infinitely large.

Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized algorithms. The best known approximation algorithm for the metric Traveling Salesman Problem has an approximation ratio of 1.5 – this means that even for the worst possible input, the algorithm gives a path that’s no more than 1.5 times longer than the optimal solution. It doesn’t make a difference if the algorithm performs a lot better than 1.5 for most practical inputs, because it’s always the worst case that we care about.

A statistical method is good if it can make inferences and predictions on real-world data. Broadly speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory will tell you whether a support vector machine is better than a decision tree classifier – the only way to find out is by running both on your data and seeing which one gives more accurate predictions.

Above: the winning neural network architecture for the ImageNet Challenge 2012. Currently, theory fails to explain why this method works so well.

In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically, the theory gives conditions under which the model eventually converges to the best one as you give it more and more data; it can also bound how much data you need for a desired accuracy, but those bounds are usually far too loose to be useful in practice.

This approach is highly theoretical and impractical for deciding which model works best for a particular dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a guideline; you still need cross-validation to determine the best hyperparameters.

 

Modelling the Real World

Both mathematics and statistics are tools we use to model and understand the world, but they do so in very different ways. Math creates an idealized model of reality where everything is clear and deterministic; statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the randomness. As for which approach is better – both approaches have their advantages and disadvantages.

Math is good for modelling domains where the rules are logical and can be expressed with equations. One example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve only observed apples falling from trees. On the other hand, math is awkward at dealing with error and uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.

Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces uncertainty. Every value has a confidence interval where you can expect it to be right about 95% of the time, but we can never be 100% sure about anything. But given enough data, the right model will separate the signal from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like modelling sociological phenomena or anything involving human decisions.

The downside is that statistics only works on the sample space where you have data; most models are bad at extrapolating past the range of data they’re trained on. In other words, if we fit a regression model on data of apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more fundamental level than statistics.

Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to understand how people behave, when the subjects are not always rational, learning from data is the way to go.

How a simple trick decreased my elevator waiting time by 33%

Last month, when I traveled to Hong Kong, I stayed at a guesthouse in a place called the Chungking Mansions. Located in Tsim Sha Tsui, it’s one of the most crowded, sketchiest, and cheapest places to stay in Hong Kong.

Chungking Mansions in Tsim Sha Tsui

Of the 17 floors, the first few are teeming with Indian and African restaurants and various questionable businesses. The rest of the floors are guesthouses and private residences. One thing that’s unusual about the building is the structure of its elevators.

The building is partitioned into five disjoint blocks, and each block has two elevators. One of the elevators only goes to the odd numbered floors, and the other elevator only goes to the even numbered floors. Neither elevator goes to the second floor because there are stairs.

Elevator schematic of Chungking Mansions

I lived on the 14th floor, and man, those elevators were slow! Because of the crazy population density of the building, the elevator would stop on several floors on the way up and down. What’s more, people often carried furniture on the elevators, which took a long time to load and unload.

To pass the time, I timed exactly how long it took between arriving at the elevator on the ground floor, waiting for the elevator to come, riding the elevator up, and getting off at the 14th floor. After several trials, the average time came out to be about 4 minutes. Clearly, 4 minutes is too long, especially when waiting in 35-degree weather without air conditioning, so I started to look for optimizations.

The bulk of the time is spent waiting for the elevator to come. The best case is when the elevator is on your floor and you get in, then the waiting time is zero. The worst case is when the elevator has just left and you have to wait a full cycle before you can get in. After you get in, it takes a fairly constant amount of time to reach your floor. Therefore, your travel time is determined by your luck with the elevator cycle. Assuming that the elevator takes 4 minutes to make a complete cycle (and you live on the top floor), the best case total elevator time is 2 minutes, the worst case is 6 minutes, and the average case is 4 minutes.

It occurred to me that just because I lived on the 14th floor, I don’t necessarily have to take the even numbered elevator! Instead, if the odd numbered elevator arrives first, it’s actually faster to take it to the 13th floor and climb the stairs to the 14th floor. Compared to the time spent waiting for the elevator, the time to climb one floor is negligible. I started doing this trick and timed how long it took. Empirically, this optimization seemed to speed up my trip by about 1 minute on average.

Being a mathematician at heart, I was unsatisfied with empirical results. Theoretically, exactly how big is this improvement?


Let us model the two elevators as random variables X_1 and X_2, both independently drawn from the uniform distribution on [0,1]. The random variables model the waiting time, with 0 being the best case and 1 being the worst case.

With the naive strategy of taking the even numbered elevator, our waiting time is X_1 with expected value E[X_1] = \frac{1}{2}. Using the improved strategy, our waiting time is \min(X_1, X_2). What is the expected value of this random variable?

For two elevators, the solution is straightforward: consider every possible value of X_1 and X_2 and find the average of \min(X_1, X_2). In other words, the expected value of \min(X_1, X_2) is

{\displaystyle \int_0^1 \int_0^1 \min(x_1, x_2) \mathrm{d} x_1 \mathrm{d} x_2}

Geometrically, this is equivalent to calculating the volume of the square pyramid with vertices at (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), and (1, 1, 1). Recall from geometry that the volume of a square pyramid with base area b and height h is \frac{1}{3} bh; here b = h = 1, so the volume is \frac{1}{3}.


Therefore, the expected value of \min(X_1, X_2) is \frac{1}{3}, which is a 33% improvement over the naive strategy with expected value \frac{1}{2}.


Forget about elevators for now; let’s generalize!

We know that the expected value of the minimum of two uniform [0,1] random variables is \frac{1}{3}, but what if we have n random variables? What is the expected value of the minimum of all of them?

I coded a quick simulation and it seemed that the expected value of the minimum of n random variables is \frac{1}{n+1}, but I couldn’t immediately see a simple proof of this. Searching online, I found a couple of proofs. The proof isn’t too hard, so I’ll summarize it here.
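
Here’s a quick R sketch of that kind of simulation (a reconstruction, not my original code), comparing the simulated mean of the minimum against 1/(n+1):

set.seed(1)
n_trials <- 100000
for (n in 1:4) {
  mins <- replicate(n_trials, min(runif(n)))
  cat(sprintf("n = %d: simulated %.4f, predicted 1/(n+1) = %.4f\n",
              n, mean(mins), 1 / (n + 1)))
}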

Lemma: Let M_n(x) be the c.d.f for \min(X_1, \cdots, X_n), where each X_i is i.i.d with uniform distribution [0,1]. Then the formula for M_n(x) is

M_n(x) = 1 - (1-x)^n

Proof:

\begin{array}{rl} M_n(x) & = P(\min(X_1, \cdots, X_n) < x) \\ & = 1 - P(X_1 \geq x, \cdots, X_n \geq x) \\ & = 1 - (1-x)^n \; \; \; \square \end{array}

Now to prove the main claim:

Claim: The expected value of \min(X_1, \cdots, X_n) is \frac{1}{n+1}

Proof:

Let m_n(x) be the p.d.f of \min(X_1, \cdots, X_n), so m_n(x) = M'_n(x) = n(1-x)^{n-1}. From this, the expected value is

\begin{array}{rl} {\displaystyle \int_0^1 x m_n(x) \mathrm{d}x} & = {\displaystyle \int_0^1 x n (1-x)^{n-1} \mathrm{d} x} \\ & = {\displaystyle \frac{1}{n+1}} \end{array}

This concludes the proof. I skipped a few steps in the evaluation of the integral (substituting u = 1-x gives n \int_0^1 (1-u) u^{n-1} \mathrm{d}u = 1 - \frac{n}{n+1} = \frac{1}{n+1}) because Wolfram Alpha did it for me.


For some people, this sort of travel frustration would lead to complaining and an angry Yelp review, but for me, it led me down this mathematical rabbit hole. Life is interesting, isn’t it?

I’m not sure if the locals employ this trick or not: it was pretty obvious to me, but on the other hand, I didn’t witness anybody else doing it during my stay. Anyhow, it’s a useful trick to know if you’re staying in the Chungking Mansions!

Read further discussion of this post on Reddit!

Neat Ideas in Coding Theory

It’s been a while since I’ve written anything mathematical on this blog, but one of my courses this term was really neat: Coding Theory (CO331). I liked it because it lies in the intersection between pure math and computer science: it applies abstract algebra to the concrete problem of sending messages.

Coding theory deals with encoding a message so that even if some errors occur during transmission and a few digits get flipped, the original message can still be recovered. This is not the same as cryptography — we’re not trying to hide the message, only correct errors.

What is a code?

How would you encode a message (assume it’s in binary), so that if one bit is flipped, you can still recover the message?

Well, an obvious way is to send each bit 3 times, so 0110 becomes 000111111000. Then to decode, we take the majority in each block of 3 bits; as long as each block has at most one error, we will always decode correctly. This is an example of a code, with the codewords being {000, 111}.
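
As a small illustration (an R sketch, not something from the course), here’s majority-vote decoding for this triplication code:

# Majority-vote decoding for the triplication code {000, 111}
decode_triplication <- function(bits) {
  blocks <- matrix(bits, nrow = 3)           # one column per block of 3 bits
  as.integer(colSums(blocks) >= 2)           # majority vote within each block
}
received <- c(0,1,0, 1,1,1, 1,0,1, 0,0,0)    # 000111111000 with two single-bit flips
decode_triplication(received)                # recovers 0 1 1 0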


In general, a code is a set of strings (codewords) of the same length. We send a codeword through the channel, where a few bits get flipped. Then, the decoder corrects the received word to the closest word in the code (nearest neighbor decoding).

The distance of a code is the minimum number of bit flips to transform one codeword to a different one. In order for a code to be able to correct a lot of errors, the distance should be high. In fact, if the distance is d, then the code can correct \lfloor \frac{d-1}{2} \rfloor errors.

We can visualize a code as a collection of spheres, centered around the codewords. Each sphere corresponds to the words that get corrected to the codeword in the middle.


A desirable code needs to have a high distance and a large number of codewords. This way, it can correct a lot of errors, without being too redundant. The rate of a code is

R = \frac{\log_2(\text{number of codewords})}{\text{length of codeword}}

This represents the efficiency: how much we’re paying to achieve the desired error correcting capability. The triplication code in the example has 2 codewords, each of length 3, so the rate is \frac{1}{3}.

Also, in coding theory, we’re not really concerned with how to transform a plaintext message into a stream of codewords. Given a code consisting of a set of codewords and a procedure to correct a received word to the nearest codeword, the rest is “just an engineering problem”.

These definitions set the scene for what’s coming next. As I said before, a good code should have a high distance as well as a large number of codewords. It also needs to be efficient to decode — with general codes, there is no better way to decode than to check every codeword to find the closest one to your received word — clearly, this is no good. To do better, we need to construct codes with algebraic structure.

Linear Codes

A linear code is a code where a linear combination of codewords is another codeword.

This is useful because now we can apply a lot of linear algebra machinery. The set of codewords forms a vector space, and from linear algebra, every vector space is spanned by a set of basis vectors. Then, to specify a code, we can write the basis vectors as rows of a matrix — this is called a generator matrix of the code.

Let’s do a concrete example. Consider a code where each word is 5 digits (each between 0 and 9), where the 5th digit is a “checksum” of the previous 4 digits modulo 10. For example, 98524 is valid because 9+8+5+2 = 24 \equiv 4 \pmod{10}. This code cannot correct any mistakes, but it can detect if a single typo has been made.

A generator matrix for this code is:

G = \left[\begin{array}{ccccc}1&0&0&0&1\\ 0&1&0&0&1 \\ 0&0&1&0&1 \\ 0&0&0&1&1 \end{array}\right]

If we take any linear combination xG of the rows of G (doing arithmetic modulo 10), we get a codeword. For example, if x = [9,8,5,2], then x G = [9,8,5,2,4].

For any linear code, we can find a parity-check matrix H, which lets us quickly check if a word is in the code or not: given a word r, if Hr^T = 0, then r is a codeword. In our case, a parity-check matrix is:

H = \left[ \begin{array}{cccccc} 1&1&1&1&-1 \end{array} \right]
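
To make this concrete, here’s a small R sketch (just an illustration of the matrices above) that encodes a message with G and verifies it with H, doing all arithmetic modulo 10:

G <- rbind(c(1,0,0,0,1),
           c(0,1,0,0,1),
           c(0,0,1,0,1),
           c(0,0,0,1,1))
H <- matrix(c(1,1,1,1,-1), nrow = 1)

x <- c(9, 8, 5, 2)
codeword <- (x %*% G) %% 10    # encode: 9 8 5 2 4
(H %*% t(codeword)) %% 10      # 0, so it is a valid codeword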

Using the parity-check matrix, we can easily detect errors, but correcting to the nearest codeword is still NP-hard for general linear codes. However, certain classes of linear codes have tractable decoding algorithms:

  • Hamming Codes: class of single-error correcting codes with distance 3.
  • Binary Golay Code: block length 24 and distance 8, so it can correct 3 errors.

Cyclic Codes

A cyclic code is a linear code where a rotation of any codeword is another codeword. (Since all cyclic codes are linear codes, any linear combination of codewords is still a codeword).

Cyclic codes are closely connected to polynomials, and we can write a codeword as coefficients of a polynomial: for example, 703001 is 7 + 3x^2 + x^5. A cyclic code is then the set of polynomial multiples of a generator polynomial g(x):

\langle g(x) \rangle = \{a(x) g(x)\}

Furthermore, g(x) must have some additional properties:

  • g(x) is an irreducible polynomial
  • g(x) | x^n-1 where n is the block length. For algebraic reasons that are difficult to explain, this ensures that x^n = 1, and we can always get rid of higher powers by replacing them with lower power terms.

For example, consider a binary cyclic code of block length 3, with generator polynomial g(x) = 1+x. Then the codewords are:

  • 0 (1+x) = 0 \to 000
  • 1 (1+x) = 1+x \to 110
  • x (1+x) = x+x^2 \to 011
  • (1+x) (1+x) = 1+x^2 \to 101

The codewords are {000, 110, 011, 101}. A rotation of a word to the right is the same as multiplying the corresponding polynomial by x, so cyclic codes are modeled nicely by polynomials.
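
Here’s a short R sketch (an illustration, not course code) that generates these codewords by multiplying every message polynomial a(x) by g(x) = 1 + x, with coefficients mod 2 and exponents wrapped around modulo 3:

# Multiply polynomials a(x) and g(x): coefficients mod 2, exponents mod n,
# i.e. multiplication modulo x^n - 1. Vectors list coefficients by ascending degree.
poly_mult_mod <- function(a, g, n = 3) {
  prod <- rep(0, n)
  for (i in seq_along(a)) {
    for (j in seq_along(g)) {
      k <- ((i - 1) + (j - 1)) %% n + 1
      prod[k] <- (prod[k] + a[i] * g[j]) %% 2
    }
  }
  prod
}
g <- c(1, 1, 0)                                      # g(x) = 1 + x
messages <- as.matrix(expand.grid(0:1, 0:1, 0:1))    # all a(x) of degree < 3
unique(t(apply(messages, 1, poly_mult_mod, g = g)))  # rows: 000, 110, 011, 101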

Cyclic codes are very good at correcting burst errors, where errors are assumed to occur consecutively, instead of distributed evenly across the entire word. However, cyclic codes do not always have high distance. Some special classes of cyclic codes do have high distance, and are more suitable for general error correction:

  • BCH Codes: cyclic codes constructed to have high distance. They can be decoded efficiently, but the algorithm is a bit complicated.
  • Reed-Solomon Codes: special class of BCH codes suitable for use in practice. These are widely used for error correction in CDs, DVDs, and QR codes.

A Brief Introduction to DNA Computing

DNA computing is the idea of using chemical reactions on biological molecules to perform computation, rather than silicon and electricity. We often hear about quantum computers, and there is a lot of discussion about whether they will actually work, whether they will crack RSA, and so on. In the domain of alternative computers, DNA computers are often overlooked, but they’re easy to understand (none of the quantum weirdness), and they have the potential to do massively parallel computations efficiently.

I first heard about DNA computers when doing my undergrad research project this term. I won’t bore you with the details, but it has to do with the theoretical aspects of DNA self-assembly which we will see is related to DNA computing.

The study of DNA computing is relatively new: the field was started by Leonard Adleman who published a breakthrough paper in Science in 1994. In this paper, he solved the directed Hamiltonian Path problem on 7 vertices using DNA reactions. This was the first time anything like this had been done. In this article, I will summarize this paper.

Operations on DNA

DNA is a complicated molecule with a lot of interesting properties, but we can view it as a string over a 4 letter alphabet (A, C, G, T). Each string has a Watson-Crick complement, where A is complement to T, and C is complement to G.

Without delving too deep into chemistry, I’ll describe some of the operations we can do with DNA.

1. Synthesis. We can use a machine to create a bunch of single DNA strands of any sequence we like. The technical term for these is oligonucleotides, but they’re just short DNA pieces. One limitation is that we can only reliably make strands of 20-25 nucleotides with current lab techniques.

2. Amplify. Given a test tube with only a few strands of DNA, we can amplify them into millions of strands using a process called polymerase chain reaction (PCR).

3. Annealing. Given a test tube with a lot of single stranded DNA, cooling it will cause complementary strands to attach to each other to form double strands.

4. Sort by length. By passing an electrical field through a solution, we can cause longer DNA strands to move to one side of the solution, a technique called gel electrophoresis. If desired, we can extract only strands of a certain length.

5. Extract pattern. Given a test tube of DNA, we can extract only those that contain a given pattern as a substring. To do this, put the complement of the pattern string into the solution and cause it to anneal. Only strands that contain the pattern will anneal, and the rest can be washed away.

This list is by no means exhaustive, but gives a sample of what operations are possible.

Solving Directed Hamiltonian Path with DNA

The Directed Hamiltonian Path problem asks: given a directed graph and two vertices s and t, does there exist a path from s to t that passes through every vertex exactly once?

For example, in a graph on vertices {1, 2, 3, 4} containing the edges 1->4, 4->2, and 2->3, if s=1 and t=3, then 1->4->2->3 is a directed Hamiltonian path.

This problem is related to the Travelling Salesman Problem, and is particularly interesting because it is NP-complete, so conventional computers can’t solve it efficiently. It would be really nice if DNA could solve it better than normal computers.

Here I’ll describe the process Leonard Adleman performed in 1994. He solved an instance of Directed Hamiltonian Path on 7 vertices, which is obviously trivial, and yet this took him 7 days of laboratory time. Early prototypes often tend to be laughably impractical.

Main Idea: we represent each vertex as a random string of 20 nucleotides, divided into two parts, each of 10 nucleotides. We represent a directed uv-edge by taking the second half of u and the first half of v, and taking the complement of all that.
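
To make the encoding concrete, here’s a hypothetical R sketch that builds vertex and edge strands as character strings (the helper names are my own for illustration; this only mimics the encoding, not the chemistry):

set.seed(42)
bases      <- c("A", "C", "G", "T")
complement <- c(A = "T", C = "G", G = "C", T = "A")

random_strand <- function(len) paste(sample(bases, len, replace = TRUE), collapse = "")
wc_complement <- function(s) paste(complement[strsplit(s, "")[[1]]], collapse = "")

# One random 20-mer per vertex
vertices <- setNames(replicate(4, random_strand(20)), paste0("v", 1:4))

# Strand for the edge u -> v: complement of (second half of u) + (first half of v)
edge_strand <- function(u, v) {
  wc_complement(paste0(substr(vertices[u], 11, 20), substr(vertices[v], 1, 10)))
}
edge_strand("v1", "v4")   # this strand glues the tail of v1 to the head of v4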

The idea is that now, a directed path corresponds to vertex strands lined up end to end, held together by edge strands in a brick wall pattern: each edge strand anneals to the second half of one vertex strand and the first half of the next.

When we put all the vertex and edge strands into a test tube, very quickly the solution will anneal (and not just one, but millions of copies of it). However, the test tube also contains all kinds of strands that don’t represent Hamiltonian paths at all. We have to do a tricky sequence of chemical reactions to filter out only the DNA strands representing valid solutions.

Step 1. Keep only paths that start on s and end on t. This is done by filtering only strands that start and end with a given sequence, and this is possible with a variation of PCR using primers.

Step 2. Sort the DNA by length, and only keep the ones that visit exactly n vertices. Since each vertex is encoded by a string of length 20, in our example we would filter for strands of length 80.

Step 3. For each vertex, perform an extract operation to filter only paths that visit this vertex. After doing this n times, we are left with paths that visit every vertex. This is the most time consuming step in the whole process.

Step 4. Any strands remaining at this point correspond to Hamiltonian paths, so we just amplify them with PCR, and detect if any DNA remain in the test tube. If yes, there exists a directed Hamiltonian path from s to t in the graph.

That’s it for the algorithm. Adleman went on to describe the incredible potential of DNA computers. A computer today can do about 10^9 operations a second, but you can easily have 10^{20} DNA molecules in a test tube.

DNA Computing since 1994

Shortly after Adleman’s paper, researchers applied similar ideas to solve other difficult problems, like 3-SAT, the maximal clique problem, the shortest common superstring problem, and even breaking DES. These schemes were usually difficult to implement in the lab: for example, Richard Lipton proposed a procedure to solve 3-SAT in 1995, but it was only in 2002 that Adleman solved an instance of 3-SAT with 20 variables in the lab.

On the theoretical side, there was much progress in formalizing rules and trying to construct “universal” DNA computers. Several different models of DNA computing were proven Turing complete (actually my research adviser Lila Kari came up with one of them). It has been difficult to build these computers, because some of the enzymes required for some operations don’t exist yet.

There has been progress on the practical side as well. Since Adleman, researchers have looked into other models of using biological molecules for computation, like solving 3-SAT with hairpin formation or solving the knight’s tour problem with RNA instead of DNA.

In 2006, a simplified DNA computer was built that could detect whether a combination of enzymes was present, and release medicine only if all of them were (indicating that the patient had a disease). In 2013, researchers built the “transcriptor”: DNA versions of logic gates. One reason these are important is that transcriptors are reusable, whereas previously all reagents had to be thrown away after each operation.

Current Limitations of DNA Computing

Clearly, the method I described is very time consuming and labor intensive. Each operation takes hours of lab work. This is not really a fundamental problem; in the future we might use robots to automate these lab operations.

The biggest barrier to solving large instances is that right now, we can’t synthesize arbitrary long strands of DNA (oligonucleotides). We can synthesize strands of 20-25 nucleotides with no problem, but as this number increases, the yield quickly becomes too low to be practical. The longest we can synthesize with current technology is a strand of length about 60. (Edit: Technology has improved since the papers I was looking at were written. According to my adviser, we can do 100 to a few thousand base pairs now in 2016).

Why do we need to synthesize long oligonucleotides? To represent larger problem instances, each vertex needs a unique encoding. If the encoding is too short, there will be a high probability of random sections overlapping by accident when they’re not supposed to, thereby ruining the experiment.

One promising direction of research is DNA self-assembly, so instead of painstakingly building oligonucleotides one base at a time, we put short strands in a test tube and let them self-assemble into the structures we want. My URA project this term deals with what kind of patterns can be constructed with self-assembly.

Today, if you need to solve a Hamiltonian path problem, like finding the optimal way to play Pokemon Go, you would still use a conventional computer. But don’t forget that within 100 years, computers have turned from impractical contraptions into devices that everyone carries in their pockets. I’ll bet that DNA computers will do the same.

References

  1. Adleman, Leonard. “Molecular computation of solutions to combinatorial problems.” Science, volume 266, 1994.
  2. Lipton, Richard. “DNA solution of hard computational problems.” Science, volume 268, 1995.

Visualizing Quaternions with Unity

How do you model the position and orientation of an airplane?

Position is easy, just represent it with a point in 3D space. But how do you specify its orientation — which direction it’s pointing?

At first glance, it seems a vector will do. After all, a vector points in some direction, right? If the plane is pointing east, represent its orientation by a unit vector pointing east.

Unfortunately, we quickly run into trouble when we try to roll. If we’re facing east, and we roll 90 degrees, we’re still facing east. Clearly we’re missing something.

Euler Angles

When real pilots talk about their orientation, they talk about roll, yaw, pitch. Pitch is going up or down, yaw is going left or right, roll is, well, roll.

Any change in orientation can be described by some combination of roll, yaw, pitch. This is the basis for Euler Angles. We use three angles to represent the airplane’s orientation.

This is all fine and dandy if we want to represent the orientation of a static object in space. But when we try to adjust our orientation, we start to run into problems.

You’re thinking, this should be simple! When we turn left or right, we just increment the yaw variable, right? Yes, it seems to work, at least initially. You can turn left and right, up and down, and roll around.

Implement it in Unity and play around a bit, however, and you begin to notice that things don’t quite behave the way you expect.

In this animation, I’m holding down the right button:

The plane does rotate to the right, but it’s not rotating relative to itself. Instead it’s rotating around some invisible y-axis. If it was rotating relative to itself, the green arrow shouldn’t be moving.

The problem becomes more and more severe when the pitch of the plane becomes higher and higher. The worst case is when the airplane is pointing straight up: then roll and yaw become the same thing! This is called gimbal lock: we have lost a degree of freedom and we can only rotate in 2 dimensions! Definitely not something desirable if we’re controlling a plane or spaceship.

It turns out that no matter what we do, we will suffer from some form of gimbal lock. As long as we use Euler Angles, there is one direction where if we turn too far, everything starts to screw up.

Practical Introduction to Quaternions

All is not lost, however. There is a way to represent orientation that represents all axes equally and does not suffer from gimbal lock. This mythical structure is called the quaternion. Unlike Euler Angles which describe your orientation relative to a fixed set of axes, quaternions do not rely on any fixed axis.

The drawback is that quaternions are unintuitive for humans. There is no way to “look” at a quaternion and visualize what rotation it represents. Fortunately for us, it’s not that difficult to make use of quaternions, even if we can’t visualize them.

There is a lot of theory behind how quaternions work, but in this article, I will gloss over the theory and give a quick primer to quaternions, just the most common facts you need to use them. At the same time, I will implement the operations I describe in C#, so I can integrate them with Unity. If you don’t know C#, you can freely ignore the code.

Definition

A quaternion is an ordered tuple of 4 real numbers (w,x,y,z). We write this as

w+xi+yj+zk

The letters i,j,k are not variables. Rather, they are independent axes. If you like, you can think of the quaternions as a 4 dimensional vector space.

The defining property of quaternions is:

i^2 = j^2 = k^2 = ijk = -1

Play around with it a bit and you can derive 6 more identities:

ij = k

jk = i

ki = j

ji = -k

kj = -i

ik = -j

If you’ve worked with complex numbers, this should seem familiar. Instead of 2 parts of a complex number (the real and imaginary parts), we have 4 parts for a quaternion.

The similarity doesn’t end here. Multiplying complex numbers represents a rotation in 2 dimensions. Similarly, multiplying by a quaternion represents a rotation in 3D.

One curious thing to note: we have ij=k and ji=-k. We switched around the terms and the product changed. This means that multiplying quaternions is kind of like multiplying matrices — the order matters. So multiplication is not commutative.

Here’s a framework for a quaternion in C#:

public class Quat{
	// Represents w + xi + yj + zk
	public float w, x, y, z;
	public Quat(float w, float x, float y, float z){
		this.w = w;
		this.x = x;
		this.y = y;
		this.z = z;
	}
}

Normalizing Quaternions

The norm of a quaternion is

N(\mathbf{q}) = \sqrt{w^2+x^2+y^2+z^2}

When we use quaternions to represent rotations, we typically want unit quaternions: quaternions with norm 1. This is straightforward: to normalize a quaternion, divide each component by the norm.

In C#:

public float Norm(){
  return Mathf.Sqrt (w * w + x * x + y * y + z * z);
}

public Quat Normalize(){
  float m = Norm ();
  return new Quat (w / m, x / m, y / m, z / m);
}

Multiplying Quaternions

Multiplying is simple, just a little tedious. If we have two quaternions:

(w_1 + x_1i + y_1j + z_1k) (w_2+x_2i+y_2j+z_2k)

Then their product is this ugly mess:

\begin{array}{l} w_1w_2-x_1x_2-y_1y_2-z_1z_2 \\ + (w_1x_2+x_1w_2+y_1z_2-z_1y_2)i \\ + (w_1y_2+y_1w_2-x_1z_2+z_1x_2) j \\ + (w_1z_2+z_1w_2+x_1y_2-y_1x_2) k \end{array}

In C#:

// Returns a*b
public static Quat Multiply(Quat a, Quat b){
  float w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
  float x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
  float y = a.w * b.y + a.y * b.w - a.x * b.z + a.z * b.x;
  float z = a.w * b.z + a.z * b.w + a.x * b.y - a.y * b.x;
  return new Quat (w,x,y,z).Normalize();
}

Since multiplication is not commutative, I made this function static to avoid confusing left and right multiplication. Also, I normalize the product so that floating point errors don’t accumulate.

Constructing Rotation Quaternions

Every rotation operation can be written as a rotation by some angle \theta around some unit vector (u_x, u_y, u_z).

The following formula gives a quaternion that represents this rotation:

\mathbf{q} = \cos \frac{\theta}{2} + (u_x i + u_y j + u_z k) \sin \frac{\theta}{2}

For our purposes, \theta is a very small number, say 0.01, and we use one of the three basis vectors to rotate around. For example, if we are rotating around (1,0,0) then our quaternion is

\cos \frac{0.01}{2} + \sin \frac{0.01}{2}i

That’s it: given any quaternion, multiplying on the left by our quaternion rotates it slightly around the x axis.

In C#, our code might look like this:

Quat qx = new Quat (Mathf.Cos (0.01f / 2), 0, 0, Mathf.Sin (0.01f / 2));
Quat qy = new Quat (Mathf.Cos (0.01f / 2), 0, Mathf.Sin (0.01f / 2), 0);
Quat qz = new Quat (Mathf.Cos (0.01f / 2), Mathf.Sin (0.01f / 2), 0, 0);

Putting it together

That’s all we need to do interesting things with quaternions. Let’s combine everything we have. Here’s our quaternion class thus far:

public class Quat{
	// Represents w + xi + yj + zk
	public float w, x, y, z;
	public Quat(float w, float x, float y, float z){
		this.w = w;
		this.x = x;
		this.y = y;
		this.z = z;
	}

	public float Norm(){
		return Mathf.Sqrt (w * w + x * x + y * y + z * z);
	}

	public Quat Normalize(){
		float m = Norm ();
		return new Quat (w / m, x / m, y / m, z / m);
	}

	// Returns a*b
	public static Quat Multiply(Quat a, Quat b){
		float w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
		float x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
		float y = a.w * b.y + a.y * b.w - a.x * b.z + a.z * b.x;
		float z = a.w * b.z + a.z * b.w + a.x * b.y - a.y * b.x;
		return new Quat (w,x,y,z).Normalize();
	}

	public Quaternion ToUnityQuaternion(){
		return new Quaternion (w, x, y, z);
	}
}

Now we just need to read the input, perform our calculations, and output the rotation quaternion to Unity:

public class Airplane : MonoBehaviour {
  public GameObject airplane;
  public Quat quat = new Quat (0, 0, 0, -1);
  public float speed = 0.01f;

  void FixedUpdate(){
    float inputX = Input.GetAxis("UpDown");
    float inputY = Input.GetAxis("LeftRight");
    float inputZ = Input.GetAxis("Roll");

    Quat qx = new Quat (Mathf.Cos (speed * inputX / 2), 0, 0, Mathf.Sin (speed * inputX / 2));
    Quat qy = new Quat (Mathf.Cos (speed * inputY / 2), 0, Mathf.Sin (speed * inputY / 2), 0);
    Quat qz = new Quat (Mathf.Cos (speed * inputZ / 2), Mathf.Sin (speed * inputZ / 2), 0, 0);

    quat = Quat.Multiply (qx, quat);
    quat = Quat.Multiply (qy, quat);
    quat = Quat.Multiply (qz, quat);

    airplane.transform.rotation = quat.ToUnityQuaternion ();
  }
}

In Unity, the input is not given to us as a single true/false value, but a float between -1 and 1. So holding right increases the LeftRight input gradually until it reaches 1, avoiding a sudden jump in movement.

What’s ToUnityQuaternion? Well, it turns out that Unity already has a Quaternion class that does everything here and much more, so all this could have literally been implemented in one line if we wanted.

Anyways, let’s see the result.

As you can see, holding right turns the plane relative to itself now, and the green arrow stays still. Hooray!

Beginner’s comparison of Computer Algebra Systems (Mathematica / Maxima / Maple)

I’ve never been very good at doing manual computations, and whenever I need to do a tedious computation for an assignment, I like to automate it by writing a computer program. Usually I implemented an ad-hoc solution in Haskell, either using a simple library or rolling my own implementation if the library didn’t have what I needed. But I found this approach unsatisfactory: my Haskell programs worked with integers and floating-point numbers, and I couldn’t easily generalize them to work with symbolic expressions. So I decided to learn a CAS (computer algebra system), so that in the future I won’t have to hack together buggy code for common math operations.

I have no experience with symbolic computing, so it wasn’t clear to me where to begin. To start off, there are many different competing computer algebra systems, all incompatible with each other, and it’s far from clear which one is best for my needs. I began to experiment with several systems, but after a few days I still couldn’t decide which one was the winner.

I narrowed it down to 3 platforms. Here’s my setup (all running on Windows 7):

  • Mathematica 8.0
  • Maxima 5.32 with wxMaxima 13.04
  • Maple 18.00

So I came up with a trial — I had a short (but nontrivial) problem representative of the type of problem I’d be looking at, and I would try to solve it in all 3 languages, to determine which one was easiest to work with.

The Problem

This problem came up as a part of a recent linear algebra assignment.

Let the field be \mathbb{Z}_5 (so all operations are taken modulo 5). Find all 2×2 matrices P such that

P^T \left( \begin{array}{cc} 2 & 0 \\ 0 & 3 \end{array} \right) P = I

We can break this problem into several steps:

  • Enumerate all lists of length 4 of values between 0 and 4, that is, [[0,0,0,0],[0,0,0,1],…,[4,4,4,4]]. We will probably do this with a cartesian product or list comprehension.
  • Figure out how to convert a list into a 2×2 matrix form that the system can perform matrix operations on. For example, [1,2,3,4] might become matrix([1,2],[3,4])
  • Figure out how to do control flow, either by looping over a list (procedural) or with a map and filter (functional)
  • Finally, multiply the matrices modulo 5 and check if it equals the identity matrix, and output.

This problem encompasses a lot of the challenges I have with CAS software: it uses mathematical functions (in this case, only matrix multiplication and transpose), yet at the same time requires nontrivial control flow. There are 5^4=625 matrices to check, so performance is not a concern; I am focusing on ease of use.

For reference, the answer to this problem is a set of 8 matrices that satisfy the desired property.

I have no prior experience in programming in any of the 3 languages, and I will try to solve this problem with the most straightforward way possible with each of the languages. I realize that my solutions will probably be redundant and inefficient because of my inexperience, but it will balance out in the end because I’m equally inexperienced in all of the languages.

Mathematica

I started with Mathematica, a proprietary system by Wolfram Research and the engine behind Wolfram Alpha. Mathematica is probably the most powerful out of the three, with capabilities with working with data well beyond what I’d expect from a CAS.

What I found most jarring about Mathematica is its syntax. I’ve worked with multiple procedural and functional languages before, and there are certain things that Mathematica simply does differently from everybody else. Here are a few I ran across:

  • To use a pure function (equivalent of a lambda expression), you refer to the argument as #, and the function must end with the & character
  • The preferred shorthand for Map is /@ (although you can write the longhand Map)
  • To create a cartesian product of a list with itself n times, the function is called Tuples, which I found pretty counterintuitive

Initially I wanted to convert my flat list into a nested list by pattern matching Haskell-style, i.e. f [a,b,c,d] = [[a,b],[c,d]], but I wasn’t sure how to do that, or whether the language supports pattern matching on lists. However, I ran across Partition[xs,2], which does the job, so I went with that.

Despite the language oddities, the functions are very well documented, so I was able to complete the task fairly quickly. The UI is fairly streamlined and intuitive, so I’m happy with that. I still can’t wrap my head around the syntax — I would like it more if it behaved more like traditional languages — but I suppose I’ll get the hang of it after a while.

Here’s the program I came up with:

SearchSpaceLists := Tuples[Range[0, 4], 4]
SearchSpaceMatrices := 
 Map[Function[xs, Partition[xs, 2]], SearchSpaceLists]
Middle := {{2, 0}, {0, 3}}
FilteredMatrices := 
 Select[SearchSpaceMatrices, 
  Mod[Transpose[#].Middle.#, 5] == IdentityMatrix[2] &]
MatrixForm[#] & /@ FilteredMatrices

Maxima

Maxima is a lightweight, open source alternative to Mathematica; I’ve had friends recommend it as being small and easy to use.

The syntax for Maxima is more natural, with things like lists and loops and lambda functions working more or less the way I expect. However, whenever I tried to do something with a function that isn’t the most common use case, I found the documentation lacking and often ended up combing through old forum posts.

Initially I tried to generate a list with a cartesian product like in my Mathematica version, but I couldn’t figure out how to do it; eventually I gave up and used 4 nested for loops because that approach was better documented.

Another thing I had difficulty with was transforming a nested list into a matrix using the matrix command. Normally you would create a matrix with matrix([1,2],[3,4]), that is, by passing in two parameters. The function doesn’t handle passing in matrix([[1,2],[3,4]]), so to get around that you need to invoke a macro: funmake('matrix,[[1,2],[3,4]]).

Overall I found that the lack of documentation made the system frustrating to work with. I would however use it for simpler computations that fall under the common use cases — these are usually intuitive in Maxima.

Here’s the program I came up with:

Middle:matrix([2,0],[0,3]);
Ident:identfor(Middle);
for a:0 thru 4 do
  for b:0 thru 4 do
    for c:0 thru 4 do
      for d:0 thru 4 do
        (P:funmake('matrix,[[a,b],[c,d]]),
         P2:transpose(P).Middle.P,
         if matrixmap(lambda([x],mod(x,5)),P2) = Ident then
             print(P));

Shortly after writing this, I realized I didn't actually need the funmake macro: there's no need to generate a nested list in the first place, since I could simply write matrix([a,b],[c,d]). Oh well, the point still stands.

Maple

Maple is a proprietary system developed by Maplesoft, a company based in Waterloo. Being a Waterloo student, I’ve had some contact with Maple: professors used it for demonstrations, some classes used it for grading. Hence I felt compelled to give Maple a shot.

At first I was pleasantly surprised that matrix multiplication in a finite field was easy — the code to calculate A*B in \mathbb{Z}_5 is simply A.B mod 5. But everything went downhill after that.

The UI for Maple feels very clunky. Some problems I encountered:

  • It's not clear how to halt a computation that's in an infinite loop. It doesn't seem to be possible within the UI, and the documentation suggests it's not possible in all cases (it recommends manually terminating the process). Of course, this loses all unsaved work, so I quickly learned to save before every computation.
  • I can’t figure out how to delete a cell without googling it. It turns out you have to select your cell and a portion of the previous cell, then hit Del.
  • Copying and pasting doesn't work as expected. When I tried to copy code written inside Maple into a text file, all the internal formatting and syntax highlighting information came with it.
  • Not a UI issue, but error reporting is poor. For example, the = operator works for integers, but when applied to matrices it silently returns false. You have to use Equal(a,b) to compare matrices (this is kind of like Java).

In the end, I managed to complete the task but the poor UI made the whole process fairly unpleasant. I don’t really see myself using Maple in the future; if I had to, I would try the command line.

Here’s the program I came up with:

with(LinearAlgebra):
with(combinat, cartprod):
L := [seq(0..4)]:
T := cartprod([L, L, L, L]):
Middle := <2,0;0,3>:
while not T[finished] do
  pre_matrix := T[nextvalue]();
  matr := Matrix(2,2,pre_matrix);
  if Equal(Transpose(matr).Middle.matr mod 5, IdentityMatrix(2)) then
    print(matr);
  end if
end do:

Conclusion

After the brief trial, there is still no clear winner, but I have enough data to form some personal opinions:

  • Mathematica is powerful and complete, but has a quirky syntax. It has the most potential — definitely the one I would go with if I were to invest more time into learning a CAS.
  • Maxima is lightweight and fairly straightforward, but because of its lack of documentation, it might not be the best tool for doing complicated things. I would keep it around for simpler calculations, though.
  • Maple may or may not be powerful compared to the other two; I don't know enough to judge. But its UI is clearly worse, and it would take a lot to compensate for that.

I've posted this on some discussion boards for Mathematica and Maple; you might find those threads informative.

Splitting utility costs between roommates is NP-Complete

Here’s an easy problem.

You share a house with 3 other people. For simplicity, I will call the four of you Andrei, Bai, Darin, and Young. One person pays for electricity, another pays for gas, another pays for water, and the last pays for internet. However, the utilities cost different amounts, and it is agreed that the total cost should be split equally.

The time has come to settle the bills. After tallying up the receipts, you find that Andrei has paid $650, Bai has paid $240, Darin has paid $190, and Young has paid $120. What transfers do you make to distribute the costs fairly?

Well that’s easy. Add up all the numbers and you find that the group paid $1200 in total. A quarter of that is $300 — that’s the amount each person should pay in the end. If you’ve already paid $240, then the difference, $60, is the amount you have to pay to compensate.

To see this even more clearly, let us define your balance as the difference between what you're supposed to pay and what you actually paid. From now on, a negative balance means you paid more than you were supposed to and are owed money; a positive balance means you owe money to others.
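As a quick sketch of this calculation (a Python sketch; the function name is my own):

def balances(paid):
    # Balance = fair share minus what you actually paid.
    # Negative: you are owed money. Positive: you owe money.
    share = sum(paid.values()) / len(paid)
    return {person: share - amount for person, amount in paid.items()}

print(balances({"Andrei": 650, "Bai": 240, "Darin": 190, "Young": 120}))
# {'Andrei': -350.0, 'Bai': 60.0, 'Darin': 110.0, 'Young': 180.0}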

In this case, it’s obvious how to balance the bills. Since Andrei is the only person with a negative balance, everyone simply transfers the correct sum of money to Andrei, problem solved.

But in general…

Being a computer science major, I was left wondering: what if I lived with 20 people? And what if, throughout the term, we lent each other money, so that multiple people have a negative balance, and multiple people have a positive balance? How do we solve the problem then?

For simplicity, from now on I will assume the preliminary calculations have been done, and we will work solely with the balance column. I will also assume that all values are integers.

One immediate observation is that the balances always add up to 0. So, given a list of integers that add up to 0, how do we find an efficient set of transfers to balance the bill?

What do we mean by efficient? Well, let’s explore several possibilities.

Roommate Problem, version 1

Given a list of balances that add up to 0, find the smallest number of transfers to balance the bill.

This seems at first glance to be the criterion we’re looking for. Writing cheques is a hassle, so we don’t want to write more than what is absolutely necessary.

But if you think about it, there’s a really cheap way to solve this problem:

Sort the list. Starting from the highest number, give everything to the second highest number, then have that person pass everything on to the third highest, and so on; that's n-1 transfers in total.
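Here is a rough Python sketch of that chain of transfers (representing people as list indices is my own choice):

def chain_transfers(balances):
    # Settle with exactly n-1 transfers: sort by balance, then pass
    # everything down the line; each person forwards their own balance
    # plus whatever they just received.
    order = sorted(range(len(balances)), key=lambda i: balances[i], reverse=True)
    transfers = []
    carry = 0
    for cur, nxt in zip(order, order[1:]):
        carry += balances[cur]
        transfers.append((cur, nxt, carry))
    return transfers

print(chain_transfers([-350, 60, 110, 180]))
# [(3, 2, 180), (2, 1, 290), (1, 0, 350)]

Note how the later transfers end up carrying almost the entire pot.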

Somehow this doesn’t feel very satisfying. If there are a lot of people, the people in the middle are going to be handling enormous amounts of money. Let’s try again.

Roommate Problem, version 2

Given a list of balances that add up to 0, minimize the total money transferred to balance the bill.

Perhaps what we really want is to minimize the money transferred? Maybe the bank charges $0.01 for each $1 you transfer?

Unfortunately, this problem can also be solved in a cheap way:

We don't care how many transfers we make, so let's just transfer $1 at a time! As long as we always transfer from a positive balance to a negative one, it doesn't matter how we do it: the total amount of money transferred is fixed (it has to equal the sum of the positive balances). Let's try again.

Roommate Problem, version 3

Given a list of balances that add up to 0, find the smallest set of transfers to balance the bill, with the limitation that transfers are only allowed from a positive to a negative balance.

This captures our intuition that a person should either be transferring money or receiving money, not both.

Version 3 doesn't fall immediately to a cheap trick like its two predecessors. Instances of this problem can get pretty tricky; here are a few examples with their optimal solutions:

I couldn't find an efficient algorithm to solve this problem. The best I could do was a greedy algorithm:

Assume the input is [-8,-4,5,7]. On each step, look for the number with the least absolute value (here, -4). Without loss of generality, assume this number is negative. Then 'zero' it by cancelling it against the smallest number of the opposite sign: transfer $4 from the 5 to the -4, giving us [-8,1,7]. Repeat this until all the numbers are zero.
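Here is a minimal Python sketch of that heuristic (it only tracks the amounts transferred, not who pays whom):

def greedy_settle(balances):
    # Repeatedly zero the balance with the smallest absolute value
    # against the smallest balance of the opposite sign.
    neg = sorted(-b for b in balances if b < 0)   # magnitudes of negative balances
    pos = sorted(b for b in balances if b > 0)
    transfers = []
    while neg and pos:
        if neg[0] <= pos[0]:
            amount = neg.pop(0)   # zero the smallest negative balance
            pos[0] -= amount
            if pos[0] == 0:
                pos.pop(0)
        else:
            amount = pos.pop(0)   # zero the smallest positive balance
            neg[0] -= amount
            if neg[0] == 0:
                neg.pop(0)
        transfers.append(amount)
    return transfers

print(greedy_settle([-8, -4, 5, 7]))  # [4, 1, 7]: three transfers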

How bad is this algorithm? Let’s say there are M negative numbers and N positive numbers. Then this algorithm requires at most M+N-1 transfers, since each step zeroes at least one number, and the last step zeroes two numbers.

The optimal solution takes at least max(M,N) transfers, since each transfer involves exactly one positive and one negative balance, so every person on the larger side needs a transfer of their own. This proves that my greedy algorithm never takes more than 2 times the optimal number of transfers. Not too bad, but not great either.
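Spelling out the guarantee: greedy uses at most M+N-1 transfers and any solution needs at least \max(M,N), so

\frac{M+N-1}{\max(M,N)} \le \frac{2\max(M,N)-1}{\max(M,N)} < 2

and the greedy solution is always strictly within a factor of 2 of optimal.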

Unable to progress any further, I asked around in the TopCoder forums. Surprisingly, I got an answer that hinted the problem was impossible to solve efficiently — it is NP-Complete!

NP-Complete by Reduction from SUBSET-SUM

To prove a problem can be solved efficiently, you simply describe an algorithm that solves the problem, then prove this algorithm is efficient. But how do you prove a problem cannot be solved efficiently?

There are certain problems in computer science that are known to be hard: one of them is the Subset Sum problem. Given a set of positive integers and a positive integer N, is it possible to find a subset that sums to exactly N? Return YES if this is possible, or NO otherwise.

For example, say our set is {3,5,7,8,11}. Can we make 16? The answer is YES, because 5+11=16. Can we make 17? The answer is NO — if you check all the possibilities, you discover that no subset sums to exactly 17.
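To make the problem statement concrete, here is a brute-force Subset Sum checker in Python (exponential in the size of the set, which is fine for a toy example but is exactly the blow-up that no known efficient algorithm avoids):

from itertools import combinations

def subset_sum(values, target):
    # Try every subset and check whether any of them sums to the target.
    return any(sum(combo) == target
               for r in range(len(values) + 1)
               for combo in combinations(values, r))

print(subset_sum([3, 5, 7, 8, 11], 16))  # True: 5 + 11 = 16
print(subset_sum([3, 5, 7, 8, 11], 17))  # False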

We can leverage the fact that the Subset Sum problem is hard using a proof by contradiction. Assume that there exists some efficient algorithm to solve the Roommate problem. In the diagram, I symbolize it with a black box.

Assume there is also a converter routine: an easy way to convert an input for the Subset Sum problem into an input for the Roommate problem. I’ll get to the details of this converter shortly; right now, assume it exists.

Then, by combining the Roommate solver with the converter, we have created a Subset Sum solver! If the Roommate solver is efficient, then this Subset Sum solver is also efficient. But, assuming P ≠ NP, no efficient Subset Sum solver exists. Ergo, no efficient Roommate solver exists either.

The only missing piece is to reduce an instance of the Subset Sum problem to an input to the Roommate problem.

Here's how. For each number in your set, create a roommate with that number as a positive balance. Then create a roommate with a balance of -N (the number you're trying to sum to). Finally, create one more roommate whose balance makes everything sum to 0.
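A small Python sketch of that converter (the function name is mine; the output is just the list of balances):

def subset_sum_to_roommates(values, target):
    # One roommate per set element (positive balance), one roommate owed
    # exactly the target, and one final roommate owed the rest, so that
    # the balances sum to 0.
    return list(values) + [-target, -(sum(values) - target)]

print(subset_sum_to_roommates([3, 5, 7, 8, 11], 16))
# [3, 5, 7, 8, 11, -16, -18]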

Here’s the input for {3,5,7,8,11} and N=16:

There are 5 numbers in the set, and the Roommate solver finds a solution requiring 5 transfers.

By contrast, here’s the input for {3,5,7,8,11} and N=17:

The Roommate solver can’t do better than 6 transfers.

So to solve the Subset Sum problem, plug the converted instance into the Roommate solver and count how many transfers it outputs. If it uses exactly one transfer for every element of your set, output YES. Otherwise, if it needs more transfers than there are elements in your set, output NO.

This proves that the Roommate problem is at least as hard as Subset Sum; since a proposed set of transfers is also easy to verify, the problem is NP-Complete.

Research in Existing Literature and Application to Biology

While researching this blog post, I came upon a 2006 research paper titled “On the Minimum Common Integer Partition Problem” by Xin Cheng, Lan Liu, Zheng Liu, and Tao Jiang.

They investigate a problem they call Minimum Common Integer Partition (MCIP). Given two lists of integers, say [4,8] and [5,7], find the smallest common partition — in this case, [3,4,5].

Compare this to the Roommate problem with input [-4,-8,5,7], and it's clear that the Roommate problem is identical to 2-MCIP. (The 2 just means we're finding the smallest common partition of 2 lists; the paper also investigates finding it for more than 2 lists.)

Skimming through the paper, I see that it derives an algorithm similar to my greedy algorithm, which approximates the problem within a factor of 2. Using more complicated techniques, it then produces an algorithm with a 5/4 approximation ratio.

A bit more searching turned up a more recent paper by David Woodruff that reduces the approximation ratio for 2-MCIP to 1.228, and an even better paper that brings it down to 1.125 using network flow techniques. At this point I was way too sidetracked from the original problem, so I didn't investigate the details.

What surprised me more was that this research was motivated not by roommates sharing utilities, but by biologists studying genome sequences! Biology is not my area of expertise, so I won’t comment further on that. But I’ll leave you these slides (taken from a presentation by the above-mentioned David Woodruff):

So in short, we can't expect to solve the Roommate problem exactly and efficiently, but with cutting-edge approximation algorithms, we're guaranteed to be off by no more than 12.5%!

Simple experimentation with jQuery

This term, I got hired for a co-op internship at a small software company in Kitchener.

The job posting primarily asked for Java programming, but the company uses a combination of Java (for the back end) and JavaScript (for the front end). I did not have much experience with JavaScript or web programming, so they asked me to learn jQuery and Ajax, along with a bunch of other things.

After a few days of playing with jQuery, this is what I came up with:

It’s a “Trivial Collatz Simulator”. The user types in a number, and the program simulates the Collatz procedure (with animations!) until we reach 1.

The program is written using jQuery. On each iteration, it makes an Ajax request to a local server (written in PHP), which does the arithmetic and returns the next number in the sequence. That's about it.
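The server-side arithmetic is just the Collatz step. As a sketch of that logic in Python (the real endpoint was PHP and isn't reproduced here):

def collatz_next(n):
    # One step of the Collatz procedure: halve even numbers, 3n+1 for odd ones.
    return n // 2 if n % 2 == 0 else 3 * n + 1

# e.g. 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1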