One year of math blogging

February 11, 2011

One year ago on February 11, 2010, I created this blog. In one year, this blog has had:

  • 68 posts
  • 741 spam comments
  • 111 actual comments
  • 30,000 hits
A few months ago I installed the live traffic feed (visible on the right of the page). For about two weeks afterwards I kept track of the countries that visitors came from (inspired by a post by Brian Bi), and I got to 110 of them before I got bored:
Algeria
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahrain
Bangladesh
Belarus
Belgium
Belize
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Brunei
Bulgaria
Cambodia
Canada
Chile
China
Colombia
Costa Rica
Croatia
Cyprus
Czech Republic
Denmark
Dominican Republic
El Salvador
Eritrea
Estonia
Ethiopia
Fiji
Finland
France
Georgia
Germany
Ghana
Greece
Guadeloupe
Guatemala
Hong Kong
Hungary
Iceland
India
Iran
Ireland
Israel
Italy
Jamaica
Japan
Kazakhstan
Kenya
Kuwait
Latvia
Lebanon
Lithuania
Luxembourg
Macedonia
Malawi
Malaysia
Malta
Martinique
Mauritius
Moldova
Mongolia
Montenegro
Morocco
Namibia
Nepal
Netherlands
New Zealand
Nigeria
Norway
Pakistan
Peru
Philippines
Poland
Portugal
Puerto Rico
Qatar
Romania
Russia
Saudi Arabia
Serbia
Singapore
Slovakia
Slovenia
South Africa
South Korea
Spain
Sri Lanka
Sudan
Sweden
Switzerland
Syria
Taiwan
Tajikistan
Thailand
Trinidad and Tobago
Turkey
U.K.
U.S.A.
US Virgin Islands
Uganda
Ukraine
United Arab Emirates
Venezuela
Vietnam
Zimbabwe

Obviously there is some geographical diversity among visitors.

I’m not much of a statistics person, but I like looking at a few graphs of pageviews over time:

Apparently there’s been a slight decrease in pageviews as I’ve written less over the past few months, but I seem to get around 3000 to 4000 hits a month. Yay.


The Proggit Bacon Challenge: a probabilistic and functional approach

July 14, 2010

A few days ago I saw an interesting programming challenge on Proggit (more commonly known as /r/programming, or the programming subreddit). The problem is found here.

This is how the problem goes:

You are given a rectangular grid, with houses scattered across it:

The objective is to place bacon dispensers (I’ll call them bacons from now on) at various places so the people in the houses can get the bacon.

I have no clue why they chose bacon, out of all objects to choose from. Alas, that is not the point.

So given a limited number of bacons, you must distribute them effectively among the houses by putting them on empty squares. In the example, you have three bacons to place.

For each house, the score is the distance to the nearest bacon (using the Manhattan, not Euclidean metric). Your total score is the sum of the scores for each house. Thus, you are trying to minimize your total score.
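The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the tiny layout are made up for the example, and cells are assumed to be (row, column) pairs.

```python
def manhattan(a, b):
    """Manhattan (taxicab) distance between two grid cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_score(houses, bacons):
    """Sum, over all houses, of the distance to the nearest bacon."""
    return sum(min(manhattan(h, b) for b in bacons) for h in houses)


# Hypothetical layout, not the grid from the challenge.
houses = [(0, 0), (0, 3), (4, 4)]
bacons = [(0, 1), (3, 4)]
print(total_score(houses, bacons))  # 1 + 2 + 1 = 4
```

Lower is better, so comparing two placements is just a matter of comparing their `total_score`.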

Optimal solutions

Here is the optimal solution for the problem:

If you add them up, you can see that the total score for this configuration is 10.

Question is, how do you arrive at this configuration?

It turns out that this isn’t as easy as it looks. This problem is NP-hard, meaning that no known algorithm solves it both quickly and optimally. By “quickly”, it’s understood to mean in polynomial time; unless P = NP, no such algorithm exists, and any exact method is expected to take exponential time in the worst case, much like brute force.

In order to solve the problem in a reasonable amount of time, we have to trade optimality for speed and rely on less than optimal, probabilistic approaches.
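To get a feel for why exhaustive search stops being practical, here is a minimal brute-force sketch in Python. It assumes cells are (row, column) pairs and that bacons may only go on empty cells; the grid sizes and house positions are invented for the example.

```python
from itertools import combinations
from math import comb


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_score(houses, bacons):
    """Sum of each house's distance to its nearest bacon."""
    return sum(min(manhattan(h, b) for b in bacons) for h in houses)


def brute_force(width, height, houses, k):
    """Try every placement of k bacons on empty cells; return (best score, placement)."""
    occupied = set(houses)
    empty = [(r, c) for r in range(height) for c in range(width)
             if (r, c) not in occupied]
    return min((total_score(houses, p), p) for p in combinations(empty, k))


# Hypothetical 4x4 grid with three houses (not the grid from the challenge).
houses = [(0, 0), (0, 3), (3, 3)]
score, placement = brute_force(4, 4, houses, 2)
print(score, placement)

# The number of candidate placements grows quickly with the grid size:
print(comb(13, 2))    # 4x4 grid, 3 houses, 2 bacons: 78 placements
print(comb(370, 3))   # 20x20 grid, 30 houses, 3 bacons: 8373840 placements
```

The search is exact, but the number of placements is a binomial coefficient in the number of empty cells, so it is only usable on small inputs.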

Introducing the k-means algorithm

We will now transform the problem into a data clustering problem.

If we have k bacons to place, then we must find k distinct clusters. Once we have the clusters, we place a bacon at the centroid of each one. In other words, we are looking for clusters such that the distance from each point to the center of its cluster is minimized.

The best-known algorithm for this problem is Lloyd’s algorithm, more commonly referred to as the k-means algorithm. Let’s try an example to demonstrate this algorithm.

Suppose we want to find two clusters in these points:

We start by choosing two centers randomly from the sample space, let’s make them green and red:

We assign each point to its nearest center:

Then, we move each center to the centroid of its cluster:

Notice that some of the points are now closer to a different center than the one they’re currently assigned to. Indeed, they belong to a different cluster.

So we reassign the clusters:

Again we calculate the centroids:

We repeat these steps as many times as we need to, usually until the clusters do not change anymore. Depending on the data it may take more or fewer iterations, but it normally converges fairly quickly.
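The assign-and-recenter loop described above can be sketched as follows. This is the textbook Euclidean version on points in the plane, written in Python for illustration; it is not the Haskell program mentioned later, and the function names, the sample points, and the rule that an empty cluster keeps its old center are all assumptions of the sketch.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def centroid(points):
    """Mean position of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def assign(points, centers):
    """Group the points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def lloyd(points, centers, max_iter=100):
    """Alternate assignment and centroid steps until the centers stop moving."""
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # An empty cluster keeps its old center (one of several possible conventions).
        new_centers = [centroid(c) if c else old
                       for c, old in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical data: two obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
rng = random.Random(0)
# Two random starting centers somewhere in the sample space.
start = [(rng.uniform(0, 11), rng.uniform(0, 11)) for _ in range(2)]
centers, clusters = lloyd(points, start)
print(centers)
print(clusters)
```

Passing the starting centers in as an argument keeps the loop itself deterministic, which makes it easy to try different ways of choosing them.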

This method, unfortunately, does not always achieve an optimal result. Technically it always converges on a local optimum, which is not always the global optimum. The local optimum can be arbitrarily worse than the global optimum.

Take note of how the result of the algorithm depends entirely on the random starting positions of the centers.

If you’re very, very lucky, they might end up at exactly the optimal locations.

If you’re really unlucky, however, they may all end up in a corner of the map, and the resulting configuration would be far from optimal. We might even end up with most of the clusters completely empty. The thing is that they’re assigned completely randomly.

We can do better than that.

Improving the k-means: introducing the k-means++ algorithm

The k-means++ algorithm addresses some of the problems with the k-means algorithm, by seeking better starting clusters. Its results are almost always better than the standard algorithm.

Let’s try this.

The first thing we do is put a cluster right on top of a random point:

For each point that doesn’t already have a cluster on it, calculate the distance to the nearest cluster (which is not always the same cluster):

Next we assign a probability to each of the points, proportional to the squares of the distances:

The next cluster is chosen with this weighted probability. We repeat this until we have all k clusters distributed on top of k different points.

Then, we proceed with the regular k-means algorithm.

The result of this way of choosing is that the starting clusters tend to be spread out more evenly; moreover, there are no empty clusters. Notice how a point twice as far from the nearest cluster is four times more likely to be chosen for the next cluster.

Although this drastically improves the k-means algorithm, it still does not guarantee an optimal configuration.
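The seeding procedure can be sketched in Python as below. The function names and sample points are made up for the example, distances are Euclidean, and the sketch assumes the data contains at least k distinct points (otherwise every weight eventually becomes zero and `random.choices` raises an error).

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def seed_weights(points, centers):
    """Weight of each point: squared distance to its nearest chosen center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeans_pp_seeds(points, k, rng):
    """Pick k starting centers from `points` with k-means++ weighting."""
    centers = [rng.choice(points)]  # first center: a uniformly random point
    while len(centers) < k:
        weights = seed_weights(points, centers)
        # A point that already hosts a center has weight 0, so it is never picked again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# A point twice as far away gets four times the weight:
print(seed_weights([(0, 0), (1, 0), (2, 0)], [(0, 0)]))  # [0, 1, 4]

# Hypothetical data: two groups of two points each.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_seeds(points, 3, random.Random(1)))
```

The returned seeds would then be handed to an ordinary k-means loop as its starting centers.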

Repeated simulation

There is one more thing we can do to increase our score. Being a probabilistic algorithm, the results depend heavily on the random numbers generated. Using different random numbers would achieve better or worse results.

To get better results, we can run the algorithm multiple times, each time with a different set of random numbers, and keep the best result. As the number of runs increases, the best score found gets closer and closer to the optimum.
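A minimal sketch of this restart strategy, in Python: run k-means several times from different random starts and keep the run with the lowest cost. The names `kmeans_once` and `best_of` are hypothetical, the cost used here is the usual k-means sum of squared Euclidean distances rather than the challenge's Manhattan score, and each run simply starts from k randomly chosen data points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run starting from k randomly chosen data points."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[i].append(p)
        # Move each center to its cluster's centroid; empty clusters stay put.
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else old
               for c, old in zip(clusters, centers)]
        if new == centers:
            break
        centers = new
    return centers


def best_of(points, k, runs, seed=0):
    """Repeat k-means `runs` times and return (lowest cost, its centers)."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centers = kmeans_once(points, k, rng)
        candidate = (cost(points, centers), centers)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Hypothetical data: two groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
score, centers = best_of(points, 2, runs=20)
print(score, centers)
```

Because only the best run is kept, adding more runs can never make the reported score worse; it only costs more time.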

Implementation in Haskell

It took me about two days to write a program for this task; I’ve submitted the program to the challenge. The source code is available there, as well as various benchmarks.

Looking through and running some of the other entries, it seems that my program beats most of them. One exception is the entry (entries) by idevelop, which produces considerably better scores than mine for the extremely large input sets. On the other hand, my program does better on most other inputs (medium and large) by repeating the simulation a few hundred times, (usually) arriving at the optimum solution.

