What’s the hardest bug you’ve ever debugged?

In a recent interview, I was asked: “What’s the most difficult bug you’ve encountered, and how did you fix it?” I thought this was an interesting question because there are so many answers you could give, and the sort of answer you give demonstrates your level of experience with developing software.

I thought for a moment, recalling the countless bugs I had seen and fixed. Which one was the most difficult and interesting? In this article I’m going to describe my most difficult bug to date.

It was an iOS app. I was working as a four-month intern at the time. “We’ve been seeing reports from our users that the app randomly displays a black screen,” my boss explained one afternoon. “No error message, no crash log, nothing. The app is simply stuck at a black screen until you kill it.”

“Fair enough. How do I reproduce it?”

He shrugged. “I don’t know. Users are reporting it happens randomly. Here’s what you gotta do: grab an iPad, download the game off the app store. Create an account and play the game until you hit the bug.”

So I did. I was reduced to one of those typewriter monkeys, banging away mindlessly at the keyboard until I stumbled, by sheer coincidence, upon the sequence of button presses that would trigger the undiscovered bug.

For an afternoon I monkeyed away, but no matter what buttons I pressed, the mythical black screen would not appear. I left the office, defeated and mentally exhausted.

The next morning I checked into the office, picked up the iPad, and resumed my monkeying. But this time my fortune was different: within 15 minutes, lo and behold, the screen flashed white, followed by an unrepentant screen of black.

What did I do to trigger this? I retraced my steps, trying to repeat the miracle. It happened again. Methodically I searched for a deterministic sequence of actions that brought our app to its knees. Go to the profile page. Hit button X. Go to page Y and back to the profile page. Hit button Z. The screen flickered for a millisecond, then black. Ten times out of ten.

With a sigh of relief, I jotted down this strange choreography and went for a walk. Returning with a fresh mind ready to tackle the next stage of the problem, I executed the sequence one more time, just to make sure. But the bug was nowhere to be seen.

I racked my brain for an explanation. The same sequence of actions now produced different results, I reasoned. Which meant something must have changed. But what?

It occurred to me that the page looked a little different now from when I was able to reproduce the bug. In the morning, when I came in, there was a little countdown timer in the corner of the screen that indicated the time until an upcoming event. The timer was not there anymore. Could it be the culprit?

To test this hypothesis, I produced a build that pointed the game to the dev server, and fired up a system event. The timer appeared. I executed my sequence — profile, tap, home screen, back to profile, tap, and sure enough, with a flicker the black screen appeared. I turned off the timer, repeated the sequence — profile, tap, home, profile, tap — no black screen. I had finally discovered the heart of the matter. There was some strange interaction going on between the timer and other things on the page.

At this point, with 100% reproducibility, the worst was over. It took a few more hours for me to investigate the issue and come up with a fix. My patch was quickly rolled out to production, and users stopped complaining about random black screens. Then my team went out for some celebratory beer.

I will now describe exactly what happened, and why a timer caused such an insidious bug.

The timer widget was implemented using an NSTimer that fired a callback every second. To do this, the timer holds a strong reference to its target, in this case the parent view that contains it. This is not too unusual, and is generally innocent and harmless, until you combine it with the way Objective-C manages memory.

Objective-C on iOS manages memory with reference counting rather than a tracing garbage collector. I’ll remind you what this means. The runtime maintains, for each object, a count of how many strong references point to it. When this count reaches zero, it means your object is dead, since nothing in the system holds on to it anymore. The object is then deallocated right away.

This breaks down for reference cycles, though, and our NSTimer created one. When two objects hold strong references to each other, their reference counts remain at least 1, which means neither can ever be deallocated. In our app, this meant that whenever the view with the timer went off screen, it didn’t get disposed of, but remained in the background forever. A memory leak.
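
To make the cycle concrete, here is a minimal sketch in Swift, which uses the same reference counting model as Objective-C. It is not the app’s actual code: `FakeTimer`, `ProfileView`, `DeallocCounter`, and `showAndDismiss` are hypothetical names, and `FakeTimer` is only a stand-in that imitates the one NSTimer behavior that matters here, namely holding a strong reference to its target.

```swift
// Hypothetical stand-in for NSTimer: like a scheduled NSTimer, it keeps a
// strong reference to its target.
final class FakeTimer {
    let target: AnyObject
    init(target: AnyObject) { self.target = target }
}

// Counts how many views have actually been deallocated.
final class DeallocCounter { var count = 0 }

final class ProfileView {
    var timer: FakeTimer?
    private let counter: DeallocCounter
    init(counter: DeallocCounter) { self.counter = counter }
    deinit { counter.count += 1 }   // runs only when the reference count hits zero
}

// Creates a view, optionally attaches a timer, then drops the only outside
// reference, as happens when the user navigates away.
func showAndDismiss(withTimer: Bool, counter: DeallocCounter) {
    var view: ProfileView? = ProfileView(counter: counter)
    if withTimer, let v = view {
        v.timer = FakeTimer(target: v)   // view -> timer -> view: a cycle
    }
    view = nil
}

let counter = DeallocCounter()
showAndDismiss(withTimer: false, counter: counter)
print(counter.count)   // 1: the count reached zero and the view was freed
showAndDismiss(withTimer: true, counter: counter)
print(counter.count)   // still 1: the second view leaked
```

In the second call nothing outside the pair refers to the view anymore, yet the view and the timer keep each other’s counts above zero, so `deinit` never runs.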

A memory leak, by itself, can go unnoticed for a long time with no impact. The last part of the puzzle that brought everything crashing down had to do with the way a certain button was implemented. This button, when pressed, broadcast a message, which would then be received by the profile view.

When the timer is active, it is possible to get the system into a state with two profile views — a real one and a zombie one kept alive by a reference cycle with the timer.

Then when the message is broadcast, both the real and zombie views receive it. The button logic is executed twice in rapid succession, which understandably causes the whole system to give in.
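
The double delivery can be sketched in a few lines of Swift. Everything here is an assumption made for illustration: `MessageBus`, `MessageReceiver`, `Log`, and `pressButton` are hypothetical names rather than the app’s real broadcast mechanism, and the `zombies` array simply stands in for the timer’s reference cycle by keeping the first view alive after the user has left it.

```swift
// Anything that can receive a broadcast (hypothetical protocol).
protocol MessageReceiver: AnyObject {
    func handle(_ message: String)
}

// Hypothetical message bus. Observers are held weakly, so the bus by itself
// never keeps a view alive.
final class MessageBus {
    private struct WeakBox { weak var receiver: MessageReceiver? }
    private var observers: [WeakBox] = []

    func subscribe(_ receiver: MessageReceiver) {
        observers.append(WeakBox(receiver: receiver))
    }

    func broadcast(_ message: String) {
        // Deallocated observers read as nil and are skipped.
        for box in observers { box.receiver?.handle(message) }
    }
}

final class Log { var entries: [String] = [] }

final class ProfileView: MessageReceiver {
    private let name: String
    private let log: Log

    init(name: String, bus: MessageBus, log: Log) {
        self.name = name
        self.log = log
        bus.subscribe(self)
    }

    func handle(_ message: String) {
        log.entries.append("\(name) view handled \(message)")
    }
}

// Visits the profile page twice, then presses the button once.
func pressButton(firstViewLeaks: Bool) -> [String] {
    let bus = MessageBus()
    let log = Log()
    var zombies: [ProfileView] = []   // stands in for the timer's reference cycle

    var first: ProfileView? = ProfileView(name: "first", bus: bus, log: log)
    if firstViewLeaks, let view = first { zombies.append(view) }
    first = nil                       // the user leaves the profile page

    let second = ProfileView(name: "second", bus: bus, log: log)
    // Keep both objects alive until the broadcast has been delivered.
    withExtendedLifetime((second, zombies)) {
        bus.broadcast("button press")
    }
    return log.entries
}

print(pressButton(firstViewLeaks: false))
// ["second view handled button press"]
print(pressButton(firstViewLeaks: true))
// ["first view handled button press", "second view handled button press"]
```

In this sketch the two handlers run one after the other, and a single button press ends up running the button logic twice.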

With this mechanism in mind, the fix was easy. Just invalidate the timer when the view goes off screen. Invalidating an NSTimer makes it release its reference to the target, which breaks the cycle. Without the reference cycle, the profile view is disposed of correctly and all is well again.
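
Here is a rough Swift sketch of that fix, under stated assumptions. `FakeTimer` is a hypothetical stand-in whose `invalidate()` imitates the documented NSTimer behavior of dropping the strong reference to its target; `didAppear`, `didDisappear`, and `visitProfile` are made-up names for illustration, not the app’s real lifecycle hooks.

```swift
// Hypothetical stand-in for NSTimer. invalidate() drops the strong reference
// to the target, which is what breaks the cycle.
final class FakeTimer {
    private(set) var target: AnyObject?
    init(target: AnyObject) { self.target = target }
    func invalidate() { target = nil }
}

// Counts how many views have actually been deallocated.
final class DeallocCounter { var count = 0 }

final class ProfileView {
    private var timer: FakeTimer?
    private let counter: DeallocCounter

    init(counter: DeallocCounter) { self.counter = counter }
    deinit { counter.count += 1 }

    // The view starts its countdown timer when it comes on screen.
    func didAppear() { timer = FakeTimer(target: self) }

    // The fix: stop the timer when the view goes off screen.
    func didDisappear() {
        timer?.invalidate()
        timer = nil
    }
}

// Shows the profile page once, then navigates away.
func visitProfile(invalidateOnExit: Bool, counter: DeallocCounter) {
    var view: ProfileView? = ProfileView(counter: counter)
    view?.didAppear()
    if invalidateOnExit { view?.didDisappear() }
    view = nil
}

let counter = DeallocCounter()
visitProfile(invalidateOnExit: false, counter: counter)
print(counter.count)   // 0: without the fix, the view leaks
visitProfile(invalidateOnExit: true, counter: counter)
print(counter.count)   // 1: with the fix, the view is deallocated
```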

I think this story demonstrates a fundamental truth about debugging: in order to debug effectively, you need to have a deep understanding of your technology stack. This is not always true of programming in general. Quite often you can write code that works without really understanding what it’s doing. When developing a feature in an unfamiliar technology, the typical workflow is this: if you don’t know how to do something, copy something similar from StackOverflow or a different part of your code base, then make changes until it works. And that’s a fine way to do things.

But debugging requires a more structured methodology. When many things are breaking in haphazard ways, you need to narrow down the problem to its very core, to identify precisely which component is broken. In this case it was a reference cycle that wouldn’t get released. The core of the problem may be buried within layers upon layers of an API, even an API you believe to be bulletproof. It might require digging into assembly code, even hardware.

To find that core requires an understanding of a mind-boggling stack of technologies that software today sits upon. That’s what it takes to become a master debugger.

So, what’s the hardest bug you’ve ever debugged?
