New Horizons' Radiation-Hardened PlayStation

Following GameSpot's report, the hive mind squinted a little: turns out that New Horizons has (more or less...) the same CPU as the one used in the original PlayStation. Of course, with space being like Chernobyl on a really bad day and all, New Horizons has a radiation-hardened version of the MIPS R3000. Ever wondered what that "radiation-hardened" stuff means?

Confusa deus ex machina

The foundation of modern computing is, in a way, remarkably fragile: it essentially relies on precisely controlling how electrons get shoved through carefully laid-out crystals. Carefully laid-out is something you should take very literally: we're talking "if you look from above, you should see atoms stacked up one over the other, and in between the stacks you can see the ground below"-carefully laid out. Unexpectedly, the part we suck at most is precisely controlling how electrons move. We're actually not that precise at it -- but the digital abstraction takes care of that, in that it doesn't matter exactly how many electrons are in a place at any one time: we just have to be able to meaningfully distinguish between "a lot" and "very few" electrons (the performance implication, of course, is that it takes some time to go from "a lot" to "very few", and the more "a lot" means, the longer it takes).
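
If you want to see that abstraction written down, here's a toy sketch in C. The 0.8 V / 2.0 V thresholds are just example numbers in the TTL ballpark, not any real chip's spec; the point is only that we ask "which band is it in?", never "exactly how much charge is there?".

```c
/* Toy illustration of the digital abstraction: we only care whether a
 * node is clearly in the "low" band or clearly in the "high" band.
 * Thresholds below are example numbers, roughly TTL-style. */
#include <stdio.h>

typedef enum { LOGIC_LOW, LOGIC_HIGH, LOGIC_UNDEFINED } logic_level;

static logic_level read_pin(double volts)
{
    if (volts <= 0.8) return LOGIC_LOW;     /* solidly "very few" */
    if (volts >= 2.0) return LOGIC_HIGH;    /* solidly "a lot" */
    return LOGIC_UNDEFINED;                 /* in between, the abstraction makes no promises */
}

int main(void)
{
    printf("0.3 V -> %d, 3.1 V -> %d, 1.4 V -> %d\n",
           read_pin(0.3), read_pin(3.1), read_pin(1.4));
    return 0;
}
```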

This implies that there are two kinds of things that can go bad in a chip: either the crystal lattice is damaged (over-pretentious slang for "crystal gets messed up and now we have stuff where we shouldn't have stuff"), or a glitch happens in the careful control of electrons (pretentious slang for "stuff's moving when it shouldn't be").

The former -- crystal lattice damage -- tends to lead to long-lasting damage; obviously, if the crystal lattice isn't laid out the way we expected it to be when we manufactured the chip, controlling the electron flow through it isn't going to work reliably anymore. The latter -- glitches due to unexpected particle motion under external influences -- tends to lead to transient failures: we ask the CPU to do something, but it does something else without the control circuitry inside it really knowing.

The effect is similar: the little man that does all the calculations inside the machine gets confused, and the fundamental assumption of our model of computation, that the machine sequentially executes instructions and modifies data only through executing instructions that work on it, is violated.

Let that sink in for a moment: some step in your algorithm may or may not get executed -- and your algorithm may sometimes end up having random extra steps. It's basically like giving detailed instructions to a teenager who got high on The Sex Pistols' supply on how to clean their room. Over the phone. And the view through their room's window is literally an asteroid, 'cause they just passed Mars. How the hell do people manage that?

All these things can happen when the chip is exposed to radiation. The underlying mechanism is similar: high-energy particles (which is what "radiation" really means) run into the chip. Depending on just what that particle is and on how high its energy is, it can either change the silicon lattice somehow, or cause electrons to move, appear or disappear where they shouldn't be.

Crystal protection

When people think about radiation hardening, they usually go straight to ECC and redundancy, which are essential, but do rely on at least a reasonable degree of protection for the crystal lattice. In the most extreme case, it should be obvious that no amount of error correction will help if your chip is, literally, fried.

The most obvious starting point is a radiation shield: a reasonably thick layer of metal can offer some degree of protection. The most typical solution is a layer of aluminum and tungsten, which can stop an unexpectedly high amount of radiation. Nonetheless, further measures to protect the crystal lattice can be taken.

A fairly popular solution is to use a special passivation layer made of depleted boron. The passivation layer is a protective coating grown over the surface of the semiconductor; you want one anyway, but it helps in particular if you can use a layer largely composed of boron-11. Generally, passivation layers do use boron, but the more usual boron-10 undergoes fission when hit by a neutron and produces (among other things) a lithium ion, so it can introduce additional charges into the underlying semiconductor. Boron-11 is not affected by neutron radiation (which occurs either "naturally", or as a result of cosmic radiation hitting the metal case of the instruments and the ship itself), and it's technologically convenient, too, because boron-11 is a natural byproduct of the nuclear industry.

The problem can also be approached from another angle: instead of ensuring that the crystal doesn't get damaged, it can be easier to just make it so that it can withstand more damage. Larger, bulkier features, where a few atoms getting shuffled around doesn't cause that much damage, can make the system more robust. However, by doing so, you naturally sacrifice some of the system's performance, so it's an engineering trade-off that is not easily made. "Larger, bulkier features" means "more charge carriers that you have to shuffle around" -- and, naturally, more energy required to pass them around at a given speed, slower transitions and so on.

A somewhat less compromising option is to use materials with a larger bandgap, such as GaN: the large bandgap means that there's less of a chance for a high-energy particle to damage the lattice in such a way that "new" charge carriers are spontaneously created (a phenomenon we'll discuss a little further down in more detail). Thus, they allow us to retain the small size, because fewer potentially-disastrous changes in the lattice can occur for a given level of radiation (compared to traditional CMOS devices, that is). However, wide-bandgap materials bring along their own technological problems; while they're fairly widely-used in high-frequency and optoelectronic ICs, I don't know if there are any truly large-scale integrated circuits, such as microprocessors, built with wide-bandgap materials.

Single- and transient-event tolerance

Assuming we're at that point where we can be reasonably sure that, as long as no cosmic event of catastrophic proportions occurs, a chip will survive for the duration of a mission, we can also start thinking about how to ensure correct execution even when events which influence the chip -- but do not destroy it -- do happen.

By and large, there are two major categories of protective measures: we can either ensure that, when a high-energy particle hits the chip, it doesn't influence its behaviour, or we can try to ensure that, when a high-energy particle does hit the chip and does influence the behaviour of some element, this influence can be detected and corrected. The former category consists largely of technological, physical-level solutions, whereas the latter generally includes logic-level solutions.

In terms of technology, the most useful solution is to use something called an insulating substrate. Most modern digital integrated circuits (ICs) are built by taking a single, large, pure silicon crystal, diffusing various impurities into it, and adding several layers of material on top of those. The large hunk of silicon into which we diffuse impurities is called the substrate, and we generally pretend that the various miniature components that we create by diffusing those impurities (transistors, diodes and so on) do not influence each other.

We diffuse those impurities in order to "create" free charge carriers that we can move around (or, truer in physical terms, in order to propel electrons into an energy state where they can hop around the lattice, so that when they all hop in one direction, they alter how the charge is distributed; or, we make sure that, in some given region, there are more places to sit than there are electrons, so that when they all start picking their places in one direction, they alter how the charge is distributed).

The problem with silicon is that additional charge carriers can be "created" by high-energy particles, too. When a high-energy particle hits the lattice, it can propel an electron past the bandgap, from a state where it can't hop around to a state where it can, a phenomenon referred to as "charge collection". That's actually one of the principles used by particle detectors: the mysterious particle hits the lattice and we know that happened due to a momentary increase in current (i.e. the electron hopping around). If you're curious about the physics behind it, there's some good introductory-ish material here.

Obviously, we don't want electrons spontaneously hopping around in a digital IC: if we fix the output of a logic gate to a low state, we don't want it to flip independently of its inputs just because a star went nova a thousand years ago and we got the news only now. And the most straightforward way to ensure that doesn't happen is to not use silicon for the substrate: instead of diffusing impurities into a huge chunk of silicon, we grow a thin layer of silicon over a hunk of insulator (usually sapphire or silicon dioxide), and diffuse our impurities in this thin layer. The effect is that there's now a lot less chance of charge collection happening, because there's a lot less silicon where it could happen. Honeywell has a sketchy application note with drawings that make the cause of this improvement very obvious.

Architectural solutions that minimize vulnerability to such events when they do occur are also employed. For instance, capacitor-based DRAM can sometimes be replaced with more expensive (but also less vulnerable) SRAM. There's a small crowd that's convinced MRAM -- which is not based on wiggling charges around, but on altering magnetic properties -- is the future for such applications, but technological difficulties are making this rather unlikely in my opinion.

Logic-level solutions -- which rely on detecting and correcting errors -- are what most people think about when they hear about radiation hardening. Transient events can affect logic in two ways: they can either alter instruction operands (or, in more down-to-Earth terms, screw up data held in RAM or data transmitted through the data paths), or alter the instructions or their execution themselves (a bit is toggled back to high after it was set to low, for example; we'll leave the discussion of "what category does toggling a bit in an instruction opcode fall into?" to Haskell fans). Naturally, there are two major classes of approaches.

The first one attempts to ensure there are no data errors. This is where error-correcting codes and CRC checksums come into play. Data read from RAM, for instance, is checked for parity -- and not only when the data is accessed, but continuously: a module called a "scrubber" sweeps the RAM, reading data, checking its parity, and writing it back (possibly corrected). This has to be done periodically because random errors (i.e. random bits being flipped by particles randomly flying into the RAM array) accumulate, and an error-correcting code can only detect and correct so many errors per word. If too much time passes between accesses to a memory region (and, consequently, too many errors accumulate), the data there can become completely garbled.
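
To make the scrubbing idea a bit more concrete, here's a toy version in C: each 4-bit value is stored as a Hamming(7,4) codeword, which can correct any single flipped bit. Real flight memories use wider SECDED codes over whole words and the scrubber is usually a piece of hardware, not a software loop; every name below is invented for illustration.

```c
/* Toy scrubber: every 4-bit value lives in "RAM" as a Hamming(7,4)
 * codeword, which can correct any single flipped bit. */
#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits (d3..d0) into a 7-bit Hamming codeword. */
static uint8_t hamming74_encode(uint8_t d)
{
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p0 = d0 ^ d1 ^ d3;   /* parity over codeword positions 1,3,5,7 */
    uint8_t p1 = d0 ^ d2 ^ d3;   /* parity over codeword positions 2,3,6,7 */
    uint8_t p2 = d1 ^ d2 ^ d3;   /* parity over codeword positions 4,5,6,7 */
    /* bit layout, LSB = position 1:  p0 p1 d0 p2 d1 d2 d3 */
    return p0 | (p1 << 1) | (d0 << 2) | (p2 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

/* Decode a codeword, correcting a single flipped bit in place. */
static uint8_t hamming74_decode(uint8_t *cw)
{
    uint8_t c = *cw;
    uint8_t s0 = ((c >> 0) ^ (c >> 2) ^ (c >> 4) ^ (c >> 6)) & 1;
    uint8_t s1 = ((c >> 1) ^ (c >> 2) ^ (c >> 5) ^ (c >> 6)) & 1;
    uint8_t s2 = ((c >> 3) ^ (c >> 4) ^ (c >> 5) ^ (c >> 6)) & 1;
    uint8_t syndrome = s0 | (s1 << 1) | (s2 << 2);   /* 1-based position of the bad bit */
    if (syndrome)
        *cw = c ^ (1u << (syndrome - 1));            /* flip the bad bit back */
    c = *cw;
    return ((c >> 2) & 1) | (((c >> 4) & 1) << 1) |
           (((c >> 5) & 1) << 2) | (((c >> 6) & 1) << 3);
}

#define MEM_WORDS 1024
static uint8_t mem[MEM_WORDS];   /* stand-in for the ECC-protected RAM array */

/* One scrubbing pass: read, correct, write back. */
static void scrub(void)
{
    for (int i = 0; i < MEM_WORDS; i++)
        (void)hamming74_decode(&mem[i]);
}

int main(void)
{
    mem[42] = hamming74_encode(0xA);
    mem[42] ^= 1u << 5;              /* simulate a single-event upset */
    scrub();
    printf("word 42 after scrubbing: 0x%X\n", (unsigned)hamming74_decode(&mem[42]));
    return 0;
}
```

The important bit is the periodic sweep: the code can only fix one flipped bit per word, so the scrubber's whole job is to make sure no word sits untouched long enough to pick up a second one.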

This approach can ensure, with reasonable success, that data reaching the CPU's execution unit(s) is correct. After that, though, all bets are off. You can't meaningfully (and efficiently) employ error correction and detection checks for data as it's shuffled around the CPU's execution units and operations are performed on it -- not to mention the aforementioned metaphysical problem of "what happens when the wrong instruction is performed on the right data because neutrons turned that ADD into a CMN".
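
To put a concrete (if simplified) face on the "wrong instruction" problem, here's what a single flipped bit can do to an R3000 instruction word: in the standard MIPS I encoding, add and sub differ by exactly one bit of the funct field. The register choices below are arbitrary, picked just for the demo.

```c
/* One flipped bit is enough to turn an add into a sub on the R3000:
 * R-type instructions end in a 6-bit funct field, and add (0x20) and
 * sub (0x22) differ only in bit 1. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* add $t0, $t1, $t2 -> opcode 0, rs=$t1(9), rt=$t2(10), rd=$t0(8),
     * shamt 0, funct 0x20 */
    uint32_t insn = (0u << 26) | (9u << 21) | (10u << 16) |
                    (8u << 11) | (0u << 6) | 0x20u;

    uint32_t upset = insn ^ (1u << 1);   /* flip bit 1 of the funct field */

    printf("before: funct = 0x%02X (add)\n", (unsigned)(insn  & 0x3Fu));
    printf("after:  funct = 0x%02X (sub)\n", (unsigned)(upset & 0x3Fu));
    return 0;
}
```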

This is where redundancy comes into play. Multiple elements are used -- either at a system level (e.g. multiple CPUs, multiple computers), or at a circuit level (e.g. multiple bistables for the same bit, multiple latches for the same latched output) -- all of them executing the same operations, and the result is considered correct only if all elements (or a majority of elements, if a "voting scheme" is employed) agree on it. If not all elements agree on it, the ones that disagree with the majority can be asked to recompute the result, or the whole thing can be scrapped and computed again by all units. If one unit consistently disagrees (e.g. because none of the things in the "Crystal Protection" section worked and now a chip or a module of a chip is fried), the system eventually marks it as broken and shuts it down completely.
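
Here's a minimal sketch of what voting looks like, assuming the simplest possible scheme: three copies of a value, a bitwise majority, and some bookkeeping to spot a copy that keeps getting outvoted. Real systems vote on whole buses, register files or computer outputs, and the "mark it broken" policy is considerably more careful than a counter; the names and thresholds here are made up.

```c
/* Minimal bitwise triple modular redundancy (TMR) sketch. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t copy[3];           /* the three redundant copies */
    unsigned disagreements[3];  /* how often each copy was outvoted */
} tmr_word;

/* Bitwise majority: a result bit is 1 iff at least two copies say 1. */
static uint32_t tmr_vote(tmr_word *w)
{
    uint32_t a = w->copy[0], b = w->copy[1], c = w->copy[2];
    uint32_t majority = (a & b) | (a & c) | (b & c);

    for (int i = 0; i < 3; i++) {
        if (w->copy[i] != majority) {
            w->disagreements[i]++;   /* bookkeeping for "is this unit dying?" */
            w->copy[i] = majority;   /* re-sync the outvoted copy */
        }
    }
    return majority;
}

int main(void)
{
    tmr_word w = { .copy = { 0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEF } };
    w.copy[1] ^= 1u << 7;            /* simulate an upset in copy 1 */
    printf("voted value: 0x%08X\n", (unsigned)tmr_vote(&w));
    printf("copy 1 disagreements so far: %u\n", w.disagreements[1]);
    return 0;
}
```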

Redundancy looks like a free meal -- we could, ideally, just shove a lot of computers in there and not really care about physical hardening anymore. If one or two of them die or start going awry, we just take them offline. That's part of why this whole cloud thing is so successful after all.

Except that doesn't really work in space, where there's not much room available for your systems to occupy, and your systems have to survive on very little power. Redundancy takes additional space (on chips, mainboards or instrument racks) without adding speed or performance, so a conscious trade-off is being made there, and -- since each "thing" needs "electricity" to run -- more things mean more power, which you don't usually have too much of. So there's only so much redundancy that can be sustained by a system.

Disaster recovery

Needless to say, none of these techniques are bullet-proof. They either minimize the chances of things going wrong, or try to alleviate the effect of things going wrong, which means that more things can go wrong before the system crashes, or becomes physically damaged to the point where it can't work anymore.

Physical damage to the point where a system can't work anymore is generally the end of the line (at least for that system; it happened to Galileo's cameras, for instance -- but it was the only system that broke down, so Galileo continued to be useful after that). But crashes don't have to be.

Ironically, the best technique we have for dealing with crashes is flipping the reboot button. That's done using a watchdog timer: a module inside the CPU that counts down from a given number, very quickly. If it reaches 0, the system reboots. A correctly-running system, however, will periodically reset it to its initial value before it gets the chance to expire. If the system hangs, the watchdog timer's count reaches 0 and the CPU is promptly reset.
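
Here's the watchdog pattern in miniature, simulated in plain C so it runs anywhere. On a real flight computer the countdown lives in a hardware peripheral and the "reboot" is a physical reset line, not an exit(); the constants and names below are made up for the demo.

```c
/* Software simulation of a watchdog timer: a countdown that the main
 * loop must reload ("kick") before it hits zero. */
#include <stdio.h>
#include <stdlib.h>

#define WDT_RELOAD 5          /* ticks allowed between kicks */
static int wdt_counter = WDT_RELOAD;

static void wdt_kick(void) { wdt_counter = WDT_RELOAD; }

/* Called on every timer tick; in hardware this keeps happening whether
 * the CPU is healthy or wedged -- that's the whole point. */
static void wdt_tick(void)
{
    if (--wdt_counter <= 0) {
        puts("watchdog expired -- rebooting");
        exit(1);              /* stand-in for a hardware reset */
    }
}

int main(void)
{
    for (int tick = 0; tick < 20; tick++) {
        wdt_tick();
        int hung = (tick >= 12);   /* pretend the software wedges at tick 12 */
        if (!hung)
            wdt_kick();            /* healthy loop: reload before expiry */
    }
    puts("(not reached in this run)");
    return 0;
}
```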

This is, literally, the best we have right now. There's some research going on into alternative, maybe less intrusive approaches, which would allow us to see what went wrong and correct it, rather than just panic that something went wrong and reset, but I doubt this will catch on in space technology too soon. It's useful down here on the ground, where we can rely on logic going the way it should go; up there, where logic getting randomly messed up is problematic, it's pretty obvious that adding logic to see where other logic is getting randomly messed up is probably not a sound approach.

A quick note on bugs

Part of the reason why NASA is using such an old processor is verification. Of course, back in 2006, when New Horizons was launched, the MIPS R3000 was not quite as dated as it is now -- but it wasn't exactly new, either: the MIPS R3000 made its debut in 1988.

Processors have bugs (Intel has a particularly bad reputation for that) and you cannot really live-patch them. Sometimes you can work around them in software, but that's not always the case. Furthermore, an older design can generally be guaranteed to have a stable and well-tested toolchain, good static- and timing-analysis tools (which are vital in real-time systems) and so on. That's why seemingly legacy stuff is useful in such systems: it offers sufficient assurance that the one kind of bug which you most definitely cannot solve by uploading new code (because the bug is not in the code) is not present, and it has a rich and well-understood ecosystem of tools which you can use to test, profile and make solid predictions about your code.

An improbable fight

This is, in very broad terms, a basic overview of what radiation hardening means. The costs of such systems are, by the way, huge. Back in 2002, a radiation-hardened computer board (that's board, yes, because a radiation-hardened CPU all by itself doesn't do much) could cost 200,000 USD. The components themselves can be expensive (because they use special technologies and aren't mass-produced), but there are a lot of associated costs, too -- extensive testing, careful documentation and so on. It's a very improbable -- but remarkably interesting -- kind of wrestling with the physical world.