A few days ago I saw an interesting programming challenge on Proggit (more commonly known as /r/programming, or the programming subreddit). The problem is found here.

This is how the problem goes:

You are given a rectangular grid, with houses scattered across it:

The objective is to place *bacon dispensers* (I’ll call them bacons from now on) at various places so the people in the houses can get the bacon.

I have no clue why they chose *bacon*, out of all objects to choose from. Alas, that is not the point.

So given a limited number of bacons, you must distribute them effectively among the houses by putting them on empty squares. In the example, you have *three *bacons to place.

For each house, the score is the distance to the nearest bacon (using the Manhattan, not Euclidean metric). Your total score is the sum of the scores for each house. Thus, you are trying to minimize your total score.

### Optimal solutions

Here is the optimal solution for the problem:

If you add them up, you can see that the total score for this configuration is 10.

Question is, how do you arrive at this configuration?

It turns out that this isn’t as easy as it looks. This problem is **NP-Hard**, meaning there is no algorithm that can solve it both quickly and optimally. By “quickly”, it’s understood to mean *polynomial *or *non-exponential* complexity; if this is impossible then the best algorithm is not significantly better than just brute force.

In order to solve the problem in a reasonable amount of time, we have to trade optimality for speed and rely on less than optimal, probabilistic approaches.

### Introducing the k-means algorithm

We will now transform the problem into a *data clustering* problem.

If we have *k* bacons to place, then we must find *k* distinct *clusters*. After this, we can place the bacons in the **centroid** of each cluster to achieve the optimal score. In other words, we are finding clusters such that the distance from a point to the center of a cluster is minimized.

The best known algorithm for this problem is **Lloyd’s algorithm**, more commonly referred to as the **k-means algorithm**. Let’s try an example to demonstrate this algorithm.

Suppose we want to find *two *clusters in these points:

We start by choosing two centers *randomly *from the sample space, let’s make them green and red:

We assign each point to its nearest center:

Then, we move each center to the **centroid **of its cluster:

Notice now how some of the points are closer to a different center than the center they’re assigned now. Indeed, they belong to a different cluster.

So we reassign the clusters:

Again we calculate the centroids:

We repeat these steps as many times as we need to, usually until the clusters do not change anymore. Depending on the data it may take more or less iterations, but it normally converges fairly quickly.

This method, unfortunately, does not always achieve an optimal result. Technically it always converges on a **local optimum**, which is not always the **global optimum**. The local optimum can be arbitrarily worse than the global optimum.

Take note of how the result of the algorithm depends entirely on the results of the random starting positions of the clusters.

If you’re very very lucky, they might as well end up at exactly the optimal locations.

If you’re really unlucky, however, they may end up all in a corner of the map; and the result configuration would be far from optimal. We might even end up with most of the clusters completely empty. The thing is that they’re assigned completely randomly.

We can do better than that.

### Improving the k-means: introducing the k-means++ algorithm

The k-means++ algorithm addresses some of the problems with the k-means algorithm, by seeking better starting clusters. Its results are almost always better than the standard algorithm.

Let’s try this.

The first thing we do is put a cluster right on top of a random point:

For each point that doesn’t already have a cluster on it, calculate the distance to the nearest cluster (which is not always the same cluster):

Next we assign a probability to each of the points, **proportional to the squares of the distances**:

The next cluster is chosen with this weighted probability. We repeat this until we have all *k* clusters distributed on top of *k* different points.

Then, we proceed with the regular k-means algorithm.

The result of this way of choosing is that the starting clusters tend to be spread out more evenly; moreover there’s no empty clusters. Notice how a point twice as far from the nearest cluster is four times more likely to be chosen for the next cluster.

Although this drastically improves the k-means algorithm, it still does notĀ guaranteeĀ an optimal configuration.

### Repeated simulation

There is one more thing we can do to increase our score. Being a probabilistic algorithm, the results depend heavily on the random numbers generated. Using different random numbers would achieve better or worse results.

To get the better results, we can run the algorithm multiple times, each time with a different set of random numbers. As the number of iterations increase, the score will get closer and closer to the optimum.

### Implementation in Haskell

It took me about two days to write a program for this task; I’ve submitted the program to the challenge. There the source code is available, as well as various benchmarks.

Looking through and running some of the other entries, it seems that my program beats most of them. One exception is the entry (entries) by *idevelop*, which produces considerably better scores than mine for the extremely large input sets. On the other hand, my program does better on most other inputs (medium and large) by repeating the simulation a few hundred times, (usually) arriving at the optimum solution.