Simple models in Kaggle competitions

This week I participated in the Porto Seguro Kaggle competition. Basically, you’re asked to predict a binary variable — whether or not an insurance claim will be filed — based on a bunch of numerical and categorical variables.

With over 5000 teams entering the competition, it was the largest Kaggle competition ever. I guess this is because it’s a fairly well-understood problem (binary classification) with a reasonably sized dataset, making it accessible to beginning data scientists.

Kaggle is a machine learning competition platform filled with thousands of smart data scientists, machine learning experts, and statistics PhDs, and I am not one of them. Still, I was curious to see how my relatively simple tools would fare against the sophisticated techniques on the leaderboard.

The first thing I tried was logistic regression. All you had to do was load the data into memory, invoke the glm() function in R, and output the predictions. Initially my logistic regression wasn’t working properly and I got a negative score. It took a day or so to figure out how to do logistic regression properly, which got me a score of 0.259 on the public leaderboard.

Next, I tried gradient boosted decision trees, which I had learned about in a stats class but never actually used before. In R, this is simple — I just needed to change the glm() call to gbm() and fit the model again. This improved my score to 0.265. It was near the end of the competition so I stopped here.

At this point, the top submission had a score of 0.291, and 0.288 was enough to get a gold medal. Yet despite being within 10% of the top submission in overall accuracy, I was still in the bottom half of the leaderboard, ranking in the 30th percentile.

The public leaderboard looked like this:

Rplot.pngAbove: Public leaderboard of the Porto Seguro Kaggle competition two days before the deadline. Line in green is my submission, scoring 0.265.

This graph illustrates the nature of this competition. At first, progress is easy, and pretty much anyone who submitted anything that was not “predict all zeros” got over 0.200. From there, you make steady, incremental progress until about 0.280 or so, but afterwards, any further improvements is limited.

The top of the leaderboard is very crowded, with over 1000 teams having the score of 0.287. Many teams used ensembles of XGBoost and LightGBM models with elaborate feature engineering. In the final battle for the private leaderboard, score differences of less than 0.001 translated to hundreds of places on the leaderboard and spelled the difference between victory and defeat. To run 90% as fast as Usain Bolt, you need to run 100 meters in 10.5 seconds. To get 90% of the winning score in Kaggle, you just need to call glm().

This pattern is common in Kaggle and machine learning — often, a simple model can do quite well, at least the same order of magnitude as a highly optimized solution. It’s quite remarkable that you can get a decent solution with a day or two of work, and then, 5000 smart people working for 2 months can only improve it by 10%. Perhaps this is obvious to someone doing machine learning long enough, but we should look back and consider how rare this is. The same does not apply to most activities. You cannot play piano for two days and become 90% as good as a concert pianist. Likewise, you cannot train for two days and run 90% as fast as Usain Bolt.

Simple models won’t win you Kaggle competitions, but we shouldn’t understate their effectiveness. Not only are they quick to develop, but they are also easier to interpret, and can be trained in a few seconds rather than hours. It’s comforting to see how far you can get with simple solutions — the gap between the best and the rest isn’t so big after all.

Read further discussion of this post on theĀ Kaggle forums!