Roboroast: upload your photo to get an algorithmically generated insult!

I’d like to share a side project I’ve been working on for the past few weeks. Roboroast is an app that automatically generates humorous insults for you or a friend based on how you look. It was written in collaboration with my friend Andrei Danciulescu.

The basic operation is as follows. There’s a subreddit called /r/RoastMe where random people post a picture of themselves, and other people proceed to “roast” the person with funny comments making fun of their appearance.

Our app takes your photo and uses a face recognition algorithm to find a poster in /r/RoastMe who looks like you. Then we display the comments for your closest matches.

You can try it at roboroast.tk.

Sample Results

Here are some roasts of me:

Here are some for Andrei:

High Level Overview

The project consists of roughly three parts:

Part 1 is the Reddit scraper. We use the PRAW API to go through all posts on the /r/RoastMe subreddit, saving comments to MongoDB and saving images to the filesystem.
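
Here’s a rough sketch of what this step looks like with PRAW and pymongo. The credentials, collection names, and fields are placeholders for illustration, not the app’s actual schema, and the real scraper walks the whole subreddit rather than one batch.

```python
import praw
import pymongo
import requests

reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="roboroast-scraper")
db = pymongo.MongoClient()["roboroast"]

# the real scraper walks the whole subreddit; this just grabs a recent batch
for submission in reddit.subreddit("RoastMe").new(limit=500):
    if not submission.url.endswith((".jpg", ".jpeg", ".png")):
        continue  # skip posts that aren't direct image links

    # save the picture to disk so it can be uploaded to Face++ later
    with open(f"images/{submission.id}.jpg", "wb") as img:
        img.write(requests.get(submission.url).content)

    # flatten the comment tree and store the roasts in MongoDB
    submission.comments.replace_more(limit=0)
    for comment in submission.comments:
        db.comments.insert_one({
            "submission_id": submission.id,
            "title": submission.title,
            "body": comment.body,
            "score": comment.score,
        })
```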

Part 2 is the Face++ uploader. Face++ is a cloud service with a REST API that handles our face matching. To use it, we upload all the images from part 1 into a “faceset” which we can query later.
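
The upload step looks roughly like this. The endpoint paths below follow the Face++ v3 REST API as I understand it (check the current docs), and the API keys and the “roastme” faceset name are placeholders.

```python
import glob
import requests

API = "https://api-us.faceplusplus.com/facepp/v3"
AUTH = {"api_key": "...", "api_secret": "..."}

# create the faceset once; "roastme" is a placeholder name
requests.post(f"{API}/faceset/create", data={**AUTH, "outer_id": "roastme"})

for path in glob.glob("images/*.jpg"):
    # detect the face in each scraped image to get a face_token
    with open(path, "rb") as f:
        detected = requests.post(f"{API}/detect", data=AUTH,
                                 files={"image_file": f}).json()
    if not detected.get("faces"):
        continue  # skip posts where no face was found
    face_token = detected["faces"][0]["face_token"]
    # add the face to the faceset so it can be searched against later
    requests.post(f"{API}/faceset/addface",
                  data={**AUTH, "outer_id": "roastme", "face_tokens": face_token})
```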

The first two components only need to run periodically, perhaps once a month, to update the faceset with new posts from Reddit. Part 3 is the web app, which is the user-facing component. It accepts user uploads, searches for matches using the Face++ API, and renders a list of insults to the user.
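
A minimal sketch of that request flow is below. It assumes two hypothetical helpers, search_faceset() (which would wrap the Face++ search endpoint) and roasts_for() (which would look up the stored comments for a matched submission); the real app’s routes and templates differ.

```python
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        photo = request.files["photo"]
        matches = search_faceset(photo)  # hypothetical: Face++ search by photo
        roasts = [r for m in matches for r in roasts_for(m)]  # hypothetical: MongoDB lookup
        return render_template("results.html", roasts=roasts)
    return render_template("upload.html")
```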

Technology Stack

As mentioned before, we used a couple of third-party services and libraries: PRAW for scraping Reddit posts, and Face++ for face recognition.

All the backend code is written in Python. The web app uses the Flask web framework and sits behind Gunicorn and NGINX, which handle connections and serve static files. We use MongoDB for the database.

The frontend is built with Bootstrap. We also use the JavaScript libraries jQuery and Handlebars.js.

The whole thing is hosted on a single AWS EC2 instance.

How good is the face matching?

The face matching is actually decent. Face++ produces reasonable matches most of the time.

To see the matching results for yourself, you can append ?r=1 to the end of the URL (on the results page). This is hidden by default.

Do the insults make sense?

Although the face matching does a decent job, we found that the quality of the results was somewhat hit-or-miss.

When we envisioned the concept for this app, we assumed that most insults would make fun of the subject’s face. However, many insults refer to non-facial appearance, clothing, or objects in the background. Since we only match faces, these comments make no sense for a different person’s photo.

Other times, comments will refer to the title of the post — in other words, an insult depends on both the submission title and the picture. Again, these make no sense with only the picture.

We attempt to mitigate this with heuristics that analyze the comment and exclude roasts referring to the title or to articles of clothing. This approach has had limited success, because natural language processing is hard.
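
The heuristics are roughly of this flavor; the word list below is made up for illustration and isn’t the one the app actually uses.

```python
import re

# words that suggest the roast depends on context we can't match on
CONTEXT_WORDS = {"shirt", "hoodie", "hair", "background", "room",
                 "title", "caption", "username", "wall", "poster"}

def probably_face_only(comment: str) -> bool:
    """Keep a roast only if it doesn't mention obvious non-face context."""
    words = set(re.findall(r"[a-z']+", comment.lower()))
    return not (words & CONTEXT_WORDS)
```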

Conclusion

When Andrei initially proposed this idea for an app, I thought the concept was pretty cool and unique. In a month or so we had a prototype, and I spent a few more weeks polishing the project for release. The quality of results you get is still highly variable, but we’re working on improving our algorithms.

In any case, it was my first time working with a lot of these technologies, and I had fun and learned a lot building it.

The Proggit Bacon Challenge: a probabilistic and functional approach

A few days ago I saw an interesting programming challenge on Proggit (more commonly known as /r/programming, or the programming subreddit). The problem is found here.

This is how the problem goes:

You are given a rectangular grid, with houses scattered across it:

The objective is to place bacon dispensers (I’ll call them bacons from now on) at various places so the people in the houses can get the bacon.

I have no clue why they chose bacon, out of all objects to choose from. Alas, that is not the point.

So given a limited number of bacons, you must distribute them effectively among the houses by putting them on empty squares. In the example, you have three bacons to place.

For each house, the score is the distance to the nearest bacon (using the Manhattan metric, not the Euclidean one). Your total score is the sum of the scores for all the houses, and you are trying to minimize it.
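
In code, scoring a candidate placement would look something like this (a sketch in Python for illustration, not the challenge’s reference implementation):

```python
def total_score(houses, bacons):
    # Manhattan distance from each house to its nearest bacon, summed
    return sum(min(abs(hx - bx) + abs(hy - by) for bx, by in bacons)
               for hx, hy in houses)

# two houses and one bacon: distances 2 and 2, so the total score is 4
print(total_score([(0, 0), (3, 1)], [(1, 1)]))
```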

Optimal solutions

Here is the optimal solution for the problem:

If you add them up, you can see that the total score for this configuration is 10.

Question is, how do you arrive at this configuration?

It turns out that this isn’t as easy as it looks. The problem is NP-hard, meaning no known algorithm can solve it both quickly and optimally. By “quickly” we mean in polynomial (non-exponential) time; without such an algorithm, the best exact approach is not significantly better than brute force.

In order to solve the problem in a reasonable amount of time, we have to trade optimality for speed and rely on less than optimal, probabilistic approaches.
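
To see why brute force is hopeless, here is what an exact solver would look like (reusing the total_score sketch from above): it tries every way of choosing k empty squares, and the number of combinations explodes as the grid grows.

```python
from itertools import combinations

def brute_force(houses, empty_squares, k):
    # tries all C(len(empty_squares), k) placements -- infeasible on large grids
    return min(combinations(empty_squares, k),
               key=lambda bacons: total_score(houses, bacons))
```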

Introducing the k-means algorithm

We will now transform the problem into a data clustering problem.

If we have k bacons to place, then we must find k distinct clusters of houses, and then place a bacon at the centroid of each cluster. In other words, we are looking for clusters that minimize the total distance from each point to the center of its cluster.

The best known algorithm for this problem is Lloyd’s algorithm, more commonly referred to as the k-means algorithm. Let’s try an example to demonstrate this algorithm.

Suppose we want to find two clusters in these points:

We start by choosing two centers randomly from the sample space, let’s make them green and red:

We assign each point to its nearest center:

Then, we move each center to the centroid of its cluster:

Notice how some of the points are now closer to a different center than the one they’re assigned to; they really belong to a different cluster.

So we reassign the clusters:

Again we calculate the centroids:

We repeat these steps as many times as we need to, usually until the clusters stop changing. Depending on the data it may take more or fewer iterations, but it normally converges fairly quickly.
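
Here is a compact sketch of the whole loop in Python (the challenge entry itself is in Haskell). Points and centers are (x, y) tuples, assignment uses the standard squared Euclidean distance, and empty clusters simply keep their previous center; for the bacon problem you’d swap in the Manhattan metric.

```python
import random

def closest(point, centers):
    # index of the nearest center, by squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: (point[0] - centers[i][0]) ** 2 +
                             (point[1] - centers[i][1]) ** 2)

def centroid(points):
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))

def k_means(points, k, iterations=100):
    centers = random.sample(points, k)  # naive seeding; see k-means++ below
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # empty clusters keep their previous center
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers
```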

This method, unfortunately, does not always achieve an optimal result. Technically it always converges on a local optimum, which is not always the global optimum. The local optimum can be arbitrarily worse than the global optimum.

Take note of how the result of the algorithm depends entirely on the random starting positions of the centers.

If you’re very lucky, they might end up at exactly the optimal locations.

If you’re really unlucky, however, they may all end up in one corner of the map, and the resulting configuration will be far from optimal. We might even end up with most of the clusters completely empty, since the starting positions are chosen completely at random.

We can do better than that.

Improving the k-means: introducing the k-means++ algorithm

The k-means++ algorithm addresses some of the problems with the k-means algorithm by choosing better starting clusters. Its results are almost always better than those of the standard algorithm.

Let’s try this.

The first thing we do is put a cluster right on top of a random point:

For each point that doesn’t already have a cluster on it, we calculate the distance to its nearest cluster (not necessarily the same cluster for every point):

Next we assign a probability to each of the points, proportional to the squares of the distances:

The next cluster is chosen with this weighted probability. We repeat this until we have all k clusters distributed on top of k different points.

Then, we proceed with the regular k-means algorithm.

The result of this way of choosing is that the starting clusters tend to be spread out more evenly, and no cluster starts out empty. Notice how a point twice as far from the nearest cluster is four times as likely to be chosen as the next cluster.
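
The seeding step described above looks roughly like this in Python (again, for illustration only; the actual entry is in Haskell). Each new center is drawn with probability proportional to the squared distance to the nearest already-chosen center.

```python
import random

def seed_plus_plus(points, k):
    centers = [random.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center;
        # points already chosen get weight 0 and can't be picked again
        weights = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
                   for px, py in points]
        centers.append(random.choices(points, weights=weights)[0])
    return centers
```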

Although this drastically improves the k-means algorithm, it still does not guarantee an optimal configuration.

Repeated simulation

There is one more thing we can do to improve our score. Being a probabilistic algorithm, the results depend heavily on the random numbers generated, and different random numbers will produce better or worse results.

To get better results, we can run the algorithm multiple times, each time with a different set of random numbers, and keep the best result. As the number of runs increases, the score gets closer and closer to the optimum.
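
In terms of the sketches above, the restart loop is just this (reusing the hypothetical k_means and total_score helpers; in the real problem each center would still have to be snapped to a nearby empty square):

```python
def best_of(runs, houses, k):
    # run the clustering several times and keep the lowest-scoring placement
    placements = (k_means(houses, k) for _ in range(runs))
    return min(placements, key=lambda centers: total_score(houses, centers))
```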

Implementation in Haskell

It took me about two days to write a program for this task, which I submitted to the challenge. The source code is available there, along with various benchmarks.

Looking through and running some of the other entries, it seems that my program beats most of them. One exception is the entry (entries) by idevelop, which produces considerably better scores than mine on the extremely large input sets. On the other hand, my program does better on most other inputs (medium and large) by repeating the simulation a few hundred times, usually arriving at the optimal solution.