Space probe software bug

NASA lost contact with the space probe Deep Impact in 2013 because of a single software bug.

Tempel 1 comet from Deep Impact space probe
NASA/JPL-Caltech/UMD, Public domain, via Wikimedia Commons

Deep space probes present a unique programming challenge. One error can send a multimillion-dollar piece of equipment spinning off into the void. It’s no trivial feat to send software updates across millions of kilometres. And every mission is unique, which makes field testing the software rather tricky too.

Modern spacecraft include fault management processes, which are designed to detect problems and solve them on the fly. But this is reliant on software too, and in the case of one deep space probe, the fault protection software itself became the problem.

The primary mission of the Deep Impact probe was to blast a hole in a comet and see what came out. Soon after it launched in 2005, the probe’s fault protection software kicked in and put the probe into safe mode. It was a simple problem to fix – the software’s temperature limit was too low, so standard conditions sent it into a panic – but it wasn’t the final problem.

Fast forward eight years. Deep Impact’s primary mission was a success, and it was on its way to observe other comets. But then, suddenly, we lost contact with the probe. What happened?

It was a software bug. More specifically, it was a software bug in the fault protection software. Even more specifically, it was a software bug in the fault protection software’s internal clock. And we only worked that out because of one extremely telling number: we lost contact with the probe exactly 429,496,729.6 seconds after midnight, January 1, 2000.

Now, to me and (probably) you, that number doesn’t mean much. But multiply it by 10 and convert it to binary, and it becomes much clearer: 1 00000000 00000000 00000000 00000000. That’s 1 followed by 32 zeroes, or 2^32.
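If you'd rather not count zeroes by hand, the arithmetic can be checked in a few lines of Python. This is only a sketch: the names `EPOCH`, `tenths_at_loss` and `loss_time` are made up for illustration, and the epoch is treated as a plain midnight with no time zone attached, since none is stated here.

```python
from datetime import datetime, timedelta

# Midnight, January 1, 2000 (time zone left unspecified: an assumption).
EPOCH = datetime(2000, 1, 1)

# 429,496,729.6 seconds multiplied by 10, kept as an integer
# so no floating-point rounding sneaks in.
tenths_at_loss = 4_294_967_296

binary = bin(tenths_at_loss)[2:]          # strip Python's "0b" prefix
zero_count = binary.count("0")            # how many zeroes follow the 1
is_power_of_two = tenths_at_loss == 2**32

# One tenth of a second is 100 milliseconds.
loss_time = EPOCH + timedelta(milliseconds=tenths_at_loss * 100)

print(binary)
print(zero_count, is_power_of_two)
print(loss_time)
```

The last line works out to just after half past midnight on 11 August 2013, which fits the "eight years after launch" timeline.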

This is what happened. The fault protection software’s internal clock recorded time in tenths of a second since the year 2000, and it recorded it as a 32-bit integer. That meant the time value could only go up to 4,294,967,295, which is one less than 2^32. When the count passed that limit, it probably reset back to zero, and that probably set off a perpetual reboot cycle that paralysed the space probe – permanently.
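To see what "reset back to zero" looks like, here is a minimal Python simulation of an unsigned 32-bit counter. It is not the probe's flight software, which isn't described here. The names `tick` and `clock_looks_sane` are hypothetical, and the sanity check is just one plausible way a wrapped clock could confuse the code that reads it.

```python
UINT32_MAX = 2**32 - 1  # 4,294,967,295, the largest unsigned 32-bit value


def tick(counter: int) -> int:
    """Advance a simulated unsigned 32-bit counter by one tenth of a second.

    Python integers never overflow on their own, so the mask imitates
    fixed-width hardware by throwing away everything above bit 31.
    """
    return (counter + 1) & UINT32_MAX


def clock_looks_sane(previous: int, current: int) -> bool:
    """Hypothetical check: time should only ever move forward."""
    return current > previous


before = UINT32_MAX
after = tick(before)

print(before, "->", after)              # the counter wraps around
print(clock_looks_sane(before, after))  # the naive check now fails
```

Any code that assumes the clock only ever increases will suddenly see time jump backwards by more than thirteen years, and a fault protection system that treats that as a fault has a reason to reboot over and over.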

(This, by the way, is very similar to the Year 2038 problem back here on Earth. So we have that to look forward to.)
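For the curious, the Year 2038 deadline can be worked out the same way. The sketch below assumes the classic setup behind that problem: a signed 32-bit count of whole seconds since midnight UTC on January 1, 1970. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Unix time counts whole seconds from this moment.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
INT32_MIN = -(2**31)    # where a wrapped signed counter lands next

last_good = UNIX_EPOCH + timedelta(seconds=INT32_MAX)
after_wrap = UNIX_EPOCH + timedelta(seconds=INT32_MIN)

print(last_good)   # the last second a signed 32-bit clock can represent
print(after_wrap)  # where the clock appears to jump one second later
```

The difference from Deep Impact is mostly one of units and sign: whole seconds instead of tenths, and a signed counter that jumps to a large negative value rather than back to zero.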
