ECC Memory


ECC memory is something you often hear about, but what is it? Every so often, RAM will suffer an unrecoverable error; this is simply unavoidable. With 16 GB of RAM there are a total of 137,438,953,472 bits (that's over 137 billion) which can each be either one or zero, meaning that a single incorrect bit is an error rate of only 0.00000000073%. As insignificant as this may seem, it is enough to cause an application to crash or a blue screen of death. If the error is in a non-critical area, you might not even notice it in day-to-day use.
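If you want to check the arithmetic yourself, a couple of lines of Python will do it (just the numbers from the paragraph above, nothing ECC-specific):

  bits = 16 * 1024**3 * 8      # 16 GB of RAM expressed in bits
  print(bits)                  # 137438953472
  print(100 / bits)            # one bad bit as a percentage: roughly 7.3e-10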

These errors are most often caused by electrical, magnetic or even cosmic interference, and result in a bit spontaneously changing its state. As an example, the string 00101110 (which is binary for 46) could have a bit flip resulting in 10101110 (which is binary for 174).
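Simulating such a flip is trivial: XOR-ing a value with a mask toggles exactly the bits set in the mask. A minimal Python sketch:

  value = 0b00101110            # 46
  flipped = value ^ (1 << 7)    # XOR toggles the chosen bit: here the leftmost one
  print(value, flipped)         # 46 174
  print(bin(flipped))           # 0b10101110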

In an enterprise environment, that error could result in something as major as an incorrect amount being transferred between bank accounts, or even the banking system grinding to a halt while the server restarts. In a less critical environment it could result in a 3D render that has been running for days or weeks failing and needing to be started from scratch – with time being money, that can still have dire consequences. Either way, this is something that could result in losses of millions of dollars. In the case of computers controlling air traffic, and even the planes themselves, the effects could be fatal.

You might not think that the average user has any use for ECC memory, but we actually rely on it every day that we interact with the world electronically. Cloud services, banking services, weather services and more all rely on ECC memory to keep our data safe, our transactions in check, and our predictions accurate. Even though we don't use it directly, we all make use of ECC memory indirectly in some way.

The servers and supercomputers used for this type of work need a way of recovering from such errors, and that is where ECC (error-correcting code, or sometimes incorrectly "error checking and correcting") memory comes into play. ECC memory works by storing extra bits alongside every word of data (typically eight extra bits per 64-bit, eight-byte word) that are used for parity, together with an algorithm known as Hamming code for working out where an error occurred. You get two types of parity: even parity, which ensures that the total number of ones is always even, and odd parity, which ensures that it is always odd.

To explain how these two types of parity work, let's look at an example with one bit of parity for every seven bits of data. With the string 0000000, the total number of ones is zero, so even parity would result in a parity bit of 0 and odd parity would result in a parity bit of 1. If we take the string 0001101, there are three ones, so even parity would give a parity bit of 1 and odd parity a parity bit of 0.
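Here is a minimal Python sketch of both schemes (the function names are our own, purely for illustration):

  def even_parity(data: str) -> str:
      # parity bit chosen so the total number of ones (data + parity) is even
      return str(data.count("1") % 2)

  def odd_parity(data: str) -> str:
      # parity bit chosen so the total number of ones (data + parity) is odd
      return str(1 - data.count("1") % 2)

  print(even_parity("0000000"), odd_parity("0000000"))   # 0 1
  print(even_parity("0001101"), odd_parity("0001101"))   # 1 0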

If a system using even parity sees a total (including the parity bit) that comes out odd (or vice versa for odd parity), it knows that an error has occurred. There isn't enough information in the parity bit alone to identify where the error is, just that there is one. All of the information covered by that specific parity bit would have to be retransferred, and with a noisy signal or a bad bit, that could take a very long time or never succeed at all.
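Checking a received word (data plus parity bit) is just a matter of counting the ones again; a rough sketch, again assuming even parity and our own function names:

  def parity_error(word: str) -> bool:
      # under even parity, a healthy word (data + parity bit) has an even count of ones
      return word.count("1") % 2 != 0

  print(parity_error("00011011"))   # False: 0001101 plus its parity bit 1, intact
  print(parity_error("01011011"))   # True: one flipped bit makes the count odd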

In order to account for this and remove the need to retransfer the information altogether, an algorithm known as Hamming code is used. Hamming code ensures that a single flipped bit is easily identified and can be corrected. To explain how that works, let's assume we are working with a two-bit dataset. There are four possible combinations for a two-bit dataset:

  • 00
  • 01
  • 10
  • 11

What is needed is a way of storing these values so that they are distinct enough that a single flipped bit doesn't corrupt them, and that's exactly what Hamming code provides. Here is an example of a Hamming code:

  • 00 = 11010
  • 01 = 00000
  • 10 = 01111
  • 11 = 10101

No matter how you manipulate those four codewords, any two of them differ in at least three bit positions. If we take 10101 and change one of the bits so it becomes 11101, we can easily see that it is closest to 10101, as it has very little in common with the other three possibilities. If we flip two of the bits so it becomes 11011, the first and last digits still point towards 10101, but there is also the possibility that 11010 changed into 11011, so the result is ambiguous. As such, a code like this can always detect and correct single bit errors; with one extra overall parity bit added (the extended Hamming code used in practice, known as SECDED), it can also reliably detect, though not correct, two bit errors.
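Decoding by picking the nearest codeword is simple enough to sketch in a few lines of Python (the codebook is the example above; this illustrates the principle, not how real ECC hardware is wired):

  CODEBOOK = {"00": "11010", "01": "00000", "10": "01111", "11": "10101"}

  def distance(a: str, b: str) -> int:
      # number of positions where the two strings differ (the Hamming distance)
      return sum(x != y for x, y in zip(a, b))

  def decode(received: str) -> str:
      # pick the data word whose codeword is closest to what was received
      return min(CODEBOOK, key=lambda data: distance(CODEBOOK[data], received))

  print(decode("10101"))   # 11 (no error)
  print(decode("11101"))   # 11 (the single flipped bit is corrected away)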

A study by Google showed that the average error rate of RAM is around five errors per hour per gigabyte of memory, so the chances of getting two errors simultaneously in the same word of data are vanishingly small. Using the 16 GB example at the top of this article, the expected error rate is 80 errors per hour, and the chance of two of them landing in the same word at the same time is close to zero.

Furthermore, such systems can normally monitor and log all correctable errors (CE) and uncorrectable errors (UE), making it possible to spot failing modules that need replacement. All of this extra protection comes at a higher cost and a small performance penalty compared to non-ECC memory, but in a mission-critical environment those costs are most certainly worthwhile.
