Point in a polygon

A basic, yet important problem in computational geometry is the following task: given a polygon (a sequence of points) and a point, determine whether the point is inside or outside the polygon.

To a human eye, this problem is trivial. After all, it seems obvious that point A is outside the polygon, whereas point B is inside it.

Computers, on the other hand, lack this simple intuition. It is not entirely obvious how to go about creating an algorithm that correctly determines this.

Triangles containing the origin

Let us first focus on an easier version of the problem. Instead of any arbitrary polygon, let our polygon be limited to three points, that is, a triangle. Instead of any point in space, let us determine if the triangle contains the origin.

We observe that we can draw a ray from our point outwards. Any direction will do. If the point is inside the triangle, then the ray will eventually hit a side of it.

On the other hand, if the point is outside the triangle, the ray may hit two sides of the triangle, or it may hit none. However, it will never hit exactly one side:

Aha; this is the approach. Now we can transfer our diagram onto the Cartesian plane. For simplicity, let our ray be the line going directly upwards from the origin, that is, the part of the y-axis above the origin.

Our algorithm now takes the three line segments of the triangle, and counts how many of them intersect the y-axis above the origin.

There are three cases. In the first case, the triangle doesn’t intersect the y-axis above the origin at all, thus the triangle doesn’t contain the origin:

In the second scenario, the triangle intersects the y-axis twice above the origin, in which case the triangle still does not contain the origin:

In the last case, the triangle intersects the y-axis once above the origin, in which case the origin is in the triangle:

So the algorithm is, for every line segment (pair of coordinates) in the triangle:

  1. Calculate its y-intercept, say, b.
  2. If b<0, this line segment doesn’t cross the y-axis above the origin, so throw it away. Otherwise, continue.
  3. Make sure that the line segment actually crosses the y-axis at all; that is, check that its two endpoints are on opposite sides of the y-axis.

If exactly one segment passes both checks, the triangle contains the origin.

Here’s my implementation of the algorithm in C++:


// Slope of a line
double slope(double x1, double y1, double x2, double y2){
  return (y1-y2) / (x1-x2);
}

// Y intercept given slope and point
double yint(double px, double py, double m){
  return py - m*px;
}

// Determine if a line crosses the y axis
bool cyax(double ax, double bx){
  if(ax<0 && bx>0) return true;
  if(bx<0 && ax>0) return true;
  return false;
}

// Contains origin, or not?
// Input is (A_x, A_y, B_x, B_y, C_x, C_y).
bool cto(double ax, double ay, double bx, double by, double cx, double cy){

  // Find slopes
  double mAB = slope(ax,ay,bx,by);
  double mBC = slope(bx,by,cx,cy);
  double mAC = slope(ax,ay,cx,cy);

  // Find y intercepts
  double yAB = yint(ax,ay,mAB);
  double yBC = yint(bx,by,mBC);
  double yAC = yint(cx,cy,mAC);

  // Count how many sides cross the y-axis above the origin
  int c = 0;

  if(yAB>0 && cyax(ax,bx)) c++;
  if(yBC>0 && cyax(bx,cx)) c++;
  if(yAC>0 && cyax(ax,cx)) c++;

  return c==1;
}

Generalizing the algorithm

We can now generalize the algorithm beyond triangles containing the origin. First, for the problem of determining whether an arbitrary point is inside a triangle, we can reduce it to the case we just solved:

Translate all the points so that the test point lies on the origin. The position of this point relative to the triangle is exactly the same, so we can proceed with the y-intercept algorithm.

For a polygon with more than 3 points, the same algorithm applies. Sort of. With more complicated polygons, we may encounter this edge case:

Here the ray crosses the polygon three times.

There’s an easy fix to this problem, however. It turns out that if the ray crosses the polygon an odd number of times, then the point is inside the polygon.
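Putting the translation step and the odd-crossings rule together, here is a rough sketch of the generalized test, reusing the slope, yint and cyax helpers from the triangle code above. The name pinp and the vector-of-pairs interface are my own, and edge cases (vertical edges, a point lying exactly on an edge or directly below a vertex) are ignored.

#include <vector>
#include <utility>

// Sketch: does a polygon (vertices in order) contain the point (px, py)?
bool pinp(const std::vector<std::pair<double,double> >& poly, double px, double py){
  int n = poly.size();
  int c = 0;
  for(int i = 0; i < n; i++){
    // Translate the edge so that the test point sits at the origin
    double x1 = poly[i].first - px,       y1 = poly[i].second - py;
    double x2 = poly[(i+1)%n].first - px, y2 = poly[(i+1)%n].second - py;
    if(!cyax(x1, x2)) continue;           // edge doesn't straddle the y-axis
    double m = slope(x1, y1, x2, y2);
    if(yint(x1, y1, m) > 0) c++;          // crossing above the origin
  }
  return c % 2 == 1;                      // odd number of crossings => inside
}

For a triangle this is the same test as cto above: with only three sides the crossing count is 0, 1 or 2, so “exactly one” and “an odd number” coincide.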

The Proggit Bacon Challenge: a probabilistic and functional approach

A few days ago I saw an interesting programming challenge on Proggit (more commonly known as /r/programming, or the programming subreddit). The problem is found here.

This is how the problem goes:

You are given a rectangular grid, with houses scattered across it:

The objective is to place bacon dispensers (I’ll call them bacons from now on) at various places so the people in the houses can get the bacon.

I have no clue why they chose bacon, out of all objects to choose from. Alas, that is not the point.

So given a limited number of bacons, you must distribute them effectively among the houses by putting them on empty squares. In the example, you have three bacons to place.

For each house, the score is the distance to the nearest bacon (using the Manhattan metric, not the Euclidean one). Your total score is the sum of the scores for each house. Thus, you are trying to minimize your total score.
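To make the scoring rule concrete, here is a small sketch of how a placement might be scored; the names are mine, not anything from the challenge.

#include <vector>
#include <utility>
#include <cstdlib>
#include <climits>

// Score of a placement: for each house, the Manhattan distance to the
// nearest bacon; the total score is the sum over all houses.
int totalScore(const std::vector<std::pair<int,int> >& houses,
               const std::vector<std::pair<int,int> >& bacons){
  int total = 0;
  for(size_t i = 0; i < houses.size(); i++){
    int best = INT_MAX;
    for(size_t j = 0; j < bacons.size(); j++){
      int d = std::abs(houses[i].first  - bacons[j].first)
            + std::abs(houses[i].second - bacons[j].second);
      if(d < best) best = d;
    }
    total += best;
  }
  return total;
}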

Optimal solutions

Here is the optimal solution for the problem:

If you add them up, you can see that the total score for this configuration is 10.

Question is, how do you arrive at this configuration?

It turns out that this isn’t as easy as it looks. The problem is NP-hard, meaning there is no known algorithm that can solve it both quickly and optimally. By “quickly”, we mean polynomial (non-exponential) time; if no such algorithm exists, then the best exact algorithm is not significantly better than brute force.

In order to solve the problem in a reasonable amount of time, we have to trade optimality for speed and rely on less than optimal, probabilistic approaches.

Introducing the k-means algorithm

We will now transform the problem into a data clustering problem.

If we have k bacons to place, then we must find k distinct clusters of houses. After this, we can place a bacon at the centroid of each cluster to get a good score. In other words, we are looking for clusters such that the total distance from each house to the center of its cluster is minimized.

The best known algorithm for this problem is Lloyd’s algorithm, more commonly referred to as the k-means algorithm. Let’s try an example to demonstrate this algorithm.

Suppose we want to find two clusters in these points:

We start by choosing two centers randomly from the sample space, let’s make them green and red:

We assign each point to its nearest center:

Then, we move each center to the centroid of its cluster:

Notice how some of the points are now closer to a different center than the one they’re assigned to. Indeed, they belong in a different cluster.

So we reassign the clusters:

Again we calculate the centroids:

We repeat these steps as many times as we need to, usually until the clusters no longer change. Depending on the data it may take more or fewer iterations, but it normally converges fairly quickly.
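To make the iteration concrete, here is a minimal sketch of Lloyd’s algorithm in C++, purely for illustration (my actual program, described later, is written in Haskell). It uses squared Euclidean distance and mean centroids for simplicity, even though the bacon problem itself scores with the Manhattan metric.

#include <vector>

struct Pt { double x, y; };

// Assign every point to its nearest center, then move each center to the
// mean of its assigned points, and repeat for a fixed number of rounds.
void kmeans(const std::vector<Pt>& pts, std::vector<Pt>& centers, int iters){
  int k = centers.size();
  for(int it = 0; it < iters; it++){
    std::vector<double> sx(k, 0.0), sy(k, 0.0);
    std::vector<int> cnt(k, 0);
    for(size_t i = 0; i < pts.size(); i++){
      int best = 0;
      double bestd = 1e18;
      for(int j = 0; j < k; j++){
        double dx = pts[i].x - centers[j].x;
        double dy = pts[i].y - centers[j].y;
        double d = dx*dx + dy*dy;
        if(d < bestd){ bestd = d; best = j; }
      }
      sx[best] += pts[i].x; sy[best] += pts[i].y; cnt[best]++;
    }
    for(int j = 0; j < k; j++){
      if(cnt[j] > 0){ centers[j].x = sx[j]/cnt[j]; centers[j].y = sy[j]/cnt[j]; }
    }
  }
}

Notice that a cluster which ends up with no points simply keeps its old position here; that is exactly the empty-cluster problem discussed below.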

This method, unfortunately, does not always achieve an optimal result. Technically it always converges on a local optimum, which is not always the global optimum. The local optimum can be arbitrarily worse than the global optimum.

Take note that the result of the algorithm depends entirely on the random starting positions of the clusters.

If you’re very, very lucky, they might end up at exactly the optimal locations.

If you’re really unlucky, however, they may all end up in one corner of the map, and the resulting configuration would be far from optimal. We might even end up with most of the clusters completely empty. After all, the starting positions are assigned completely at random.

We can do better than that.

Improving the k-means: introducing the k-means++ algorithm

The k-means++ algorithm addresses some of the problems with the k-means algorithm, by seeking better starting clusters. Its results are almost always better than the standard algorithm.

Let’s try this.

The first thing we do is put a cluster right on top of a random point:

For each point that doesn’t already have a cluster on it, calculate the distance to the nearest cluster (not necessarily the same cluster for every point):

Next we assign a probability to each of the points, proportional to the squares of the distances:

The next cluster is chosen with this weighted probability. We repeat this until we have all k clusters distributed on top of k different points.

Then, we proceed with the regular k-means algorithm.

The result of this way of choosing is that the starting clusters tend to be spread out more evenly; moreover, there are no empty clusters. Notice how a point twice as far from the nearest cluster is four times more likely to be chosen for the next cluster.
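As an illustration, here is a rough sketch of just the seeding step, building on the Pt struct from the k-means sketch above; again, the names are mine.

#include <vector>
#include <random>

// k-means++ seeding: the first center is a uniformly random point; each
// later center is a point picked with probability proportional to its
// squared distance from the nearest center chosen so far.
std::vector<Pt> seedPlusPlus(const std::vector<Pt>& pts, int k, std::mt19937& rng){
  std::vector<Pt> centers;
  std::uniform_int_distribution<int> uni(0, (int)pts.size() - 1);
  centers.push_back(pts[uni(rng)]);
  while((int)centers.size() < k){
    std::vector<double> d2(pts.size());
    for(size_t i = 0; i < pts.size(); i++){
      double best = 1e18;
      for(size_t j = 0; j < centers.size(); j++){
        double dx = pts[i].x - centers[j].x;
        double dy = pts[i].y - centers[j].y;
        if(dx*dx + dy*dy < best) best = dx*dx + dy*dy;
      }
      d2[i] = best;  // 0 for points already chosen, so they can't repeat
    }
    // Weighted choice: points far from every existing center are favoured
    std::discrete_distribution<int> pick(d2.begin(), d2.end());
    centers.push_back(pts[pick(rng)]);
  }
  return centers;
}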

Although this drastically improves the k-means algorithm, it still does not guarantee an optimal configuration.

Repeated simulation

There is one more thing we can do to improve our results. Being a probabilistic algorithm, the results depend heavily on the random numbers generated; different random numbers may produce better or worse results.

To get better results, we can run the algorithm multiple times, each time with a different set of random numbers, and keep the best outcome. As the number of runs increases, the score gets closer and closer to the optimum.
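A sketch of this restart loop, reusing the seedPlusPlus and kmeans sketches from above; the cost function and the iteration counts here are arbitrary placeholders, not what my submission actually uses.

#include <vector>
#include <cmath>
#include <random>

// Manhattan cost of a placement: each point pays the distance to its
// nearest center.
double placementCost(const std::vector<Pt>& pts, const std::vector<Pt>& centers){
  double total = 0;
  for(size_t i = 0; i < pts.size(); i++){
    double best = 1e18;
    for(size_t j = 0; j < centers.size(); j++){
      double d = std::fabs(pts[i].x - centers[j].x) + std::fabs(pts[i].y - centers[j].y);
      if(d < best) best = d;
    }
    total += best;
  }
  return total;
}

// Run the seed-and-cluster pipeline several times and keep the best result.
std::vector<Pt> bestOfN(const std::vector<Pt>& pts, int k, int runs){
  std::mt19937 rng(12345);                 // any seed source will do
  std::vector<Pt> best;
  double bestCost = 1e18;
  for(int r = 0; r < runs; r++){
    std::vector<Pt> centers = seedPlusPlus(pts, k, rng);
    kmeans(pts, centers, 50);              // 50 Lloyd iterations, arbitrary
    double c = placementCost(pts, centers);
    if(c < bestCost){ bestCost = c; best = centers; }
  }
  return best;
}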

Implementation in Haskell

It took me about two days to write a program for this task; I’ve submitted it to the challenge. The source code is available there, as well as various benchmarks.

Looking through and running some of the other entries, it seems that my program beats most of them. One exception is the entry (entries) by idevelop, which produces considerably better scores than mine for the extremely large input sets. On the other hand, my program does better on most other inputs (medium and large) by repeating the simulation a few hundred times, usually arriving at the optimal solution.