Enabling error resilience throughout the embedded system
Decreasing semiconductor device geometries allow ever higher levels of integration in System-on-Chip (SoC) devices. In the domain of FPGAs, this results in very high-capacity programmable hardware devices. At 28 nm, the latest trend in FPGAs is to combine FPGA fabric with a high-performance SoC. Dubbed an “SoC FPGA”, these devices contain a dual-core ARM Cortex-A9 processor, level 2 cache, a rich set of peripherals, up to four memory controllers, high-speed transceivers, and a low-power, low-cost 28-nm FPGA fabric. Such a concentration of computational performance drives embedded systems toward abundant memory capacity. Several gigabytes of DDR memory is no longer exceptional, and with that much memory, more attention must be paid to the probability and avoidance of soft errors.
What are soft errors?
Commonly used memory bit cells retain their programmed value in the form of an electrical charge. Writing a memory bit cell consists of reprogramming and forcing the electrical charge to represent the new desired value. Memory bit cells will retain their value indefinitely, as long as basic requirements are met, e.g. power is applied, and – for dynamic memory types – a refresh method is active.
The stored charge can be disturbed by the injection of a charge foreign to the memory device. Cosmic radiation may affect a memory bit cell, as the Earth's atmosphere is a significant but not flawless barrier. Alpha particles are emitted by the decay of materials, and while chip packaging is engineered for very low emission rates, the problem cannot be ignored entirely.
The event in which an external energy injection inadvertently modifies the value of a memory bit cell is referred to as a single event upset (SEU). Errors of this class are called soft errors, because the error is not caused by a defect in the device, but by the device being subject to an outside disturbance. If the correct data is subsequently rewritten, it is not likely to undergo the same upset. The likelihood of such an event for any individual bit cell is extremely small, but the likelihood that some cell in the system is upset increases with growing memory capacity.
The acceptability of an SEU rate depends on the application domain. Developers of applications used at high altitudes will be concerned with higher soft error rates (SER) due to cosmic rays. Military, automotive, high-performance computing, communication, and industrial customers will be concerned with degradation of safety, security and reliability.
Based on heuristic probabilities of soft errors, identified as low and high boundaries, Figure 1 shows expected soft error rates for a number of memory capacities. As an example, an embedded system with one gigabyte of dynamic memory is expected to experience soft errors at a rate ranging from a few times per year to once every few years. In other words, its mean time between failures (MTBF) lies between a few months and a few years.
Figure 1. Expected soft error rates for different memory capacities.
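The scaling of soft error rate with memory capacity can be sketched with a few lines of arithmetic. The sketch below assumes the error rate is expressed in FIT per megabit (one FIT is one failure per billion device hours) and that errors in different bit cells are independent, so the rate grows linearly with capacity. The function names and the rate passed in the usage example are hypothetical placeholders; they are not the values behind Figure 1.

```python
HOURS_PER_YEAR = 24 * 365
BITS_PER_MBIT = 1 << 20


def soft_errors_per_year(capacity_bytes: int, fit_per_mbit: float) -> float:
    """Expected soft errors per year for a memory of the given size.

    fit_per_mbit is an assumed rate: failures per 1e9 device hours per Mbit.
    """
    mbits = capacity_bytes * 8 / BITS_PER_MBIT
    failures_per_hour = mbits * fit_per_mbit / 1e9
    return failures_per_hour * HOURS_PER_YEAR


def mtbf_years(capacity_bytes: int, fit_per_mbit: float) -> float:
    """Mean time between soft errors, in years (inverse of the yearly rate)."""
    return 1.0 / soft_errors_per_year(capacity_bytes, fit_per_mbit)


if __name__ == "__main__":
    ONE_GIB = 1 << 30
    placeholder_fit = 10.0  # arbitrary illustrative rate, not a measured value
    for gib in (1, 2, 4):
        rate = soft_errors_per_year(gib * ONE_GIB, placeholder_fit)
        print(f"{gib} GiB: {rate:.2f} errors/year, "
              f"MTBF {mtbf_years(gib * ONE_GIB, placeholder_fit):.2f} years")
```

Whatever per-bit rate is assumed, doubling the memory capacity doubles the expected number of errors and halves the MTBF.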
Implications of soft errors
Memory data corruption is often fatal to the operation of an embedded system. In a processor-based system, memory errors result in incorrect values in either instruction or data streams. Modern processors will detect illegal instructions, commonly forcing a reboot of the system. Errors in data streams may cause the program flow to derail, which often results in illegal access to protected memory. These events have their equivalent in the desktop world as a “blue screen of death” or a “core dump.”
While a crash is undesirable in embedded systems, the alternative is worse. Errors that are not immediately detected can linger in the system for an extended period of time. Undetected memory errors can multiply as the faulty data is used to calculate new data. Once faulty data has been detected, the originating point and the subsequent induced damage may be difficult to correct or even identify. Embedded systems often operate for extended periods of time and are not frequently rebooted as one may see with desktop computers. This gives embedded systems the additional disadvantage that errors will accumulate over time.
The effects of data corruption or system crashes are numerous. Misbehaving systems will annoy users and make customers unhappy. Maintenance costs may increase, as customer complaints trigger expensive investigations into error sources that cannot be reproduced. A sudden system failure may cause an unsafe environment around heavy machinery, and errors in secure systems may provide access via unintended backdoor entry methods. Often, the maintenance cost and the number of unsatisfied customers are key factors driving the need for a solution.
Adding resilience to soft errors
Because soft errors are unavoidable, methods have been developed to make systems resilient to many such errors. That is, when an error occurs, it can be detected, corrected, and the corrected value passed on, and thus the system continues uninterrupted. This feat is accomplished by adding bits to memory data words, whereby the widened word carries sufficient information to detect and correct errors. The more bits are added to a data word, the more errors in a word can be corrected. This makes error correction a function of cost and desired reliability.
A method that allows correction of a single error and detection of two errors in a word is both cost-effective and proven to provide excellent error resilience in embedded systems. This technology, widely deployed in the industry, is referred to as error correction code (ECC).
Basic implementation of ECC
ECC is implemented by making the memory wider and adding a limited amount of combinatorial logic in the path to and from that extra memory. The logic required for ECC encoding is based on well-established polynomial Hamming algorithms. An ECC bit generator creates the ECC bits from the data being stored and stores them together with the regular data. An ECC detection and correction logic function is inserted at the output of the memory. When the memory is read, this function checks the combination of ECC data and regular data. If no error is detected, it passes the regular data through unchanged. If a single-bit error is detected, it corrects the erroneous bit and passes the regular data through, now with all bits correct. If two errors are detected, it raises a flag, allowing the system to respond gracefully to the event.
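The generator and the detection/correction function described above can be sketched in software. The example below is a minimal illustration of a single-error-correcting, double-error-detecting (SECDED) extended Hamming code on an 8-bit word. It is not the logic used in any particular device; the function names, the bit layout, and the 8-bit word size are assumptions chosen for readability.

```python
def secded_encode(data: int, data_bits: int = 8) -> list[int]:
    """Return a codeword as a list of bits.

    Index 0 holds an overall parity bit; indices 1..n hold a Hamming code
    with check bits at the power-of-two positions (1, 2, 4, ...).
    """
    r = 0
    while (1 << r) < data_bits + r + 1:  # number of Hamming check bits
        r += 1
    n = data_bits + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):  # each check bit makes its parity group even
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity makes the whole word even
    return code


def secded_decode(code: list[int]):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        code[syndrome] ^= 1  # syndrome is the position of the single error
        status = "corrected"
    else:
        return None, "uncorrectable"  # double errors end up here
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return data, status


if __name__ == "__main__":
    word = secded_encode(0xA5)
    word[6] ^= 1  # simulate a single event upset
    print(secded_decode(word))
```

In hardware the same parity computations are a handful of XOR trees, which is why the added logic is small compared with the memory itself.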
ECC logic can indicate a system health status. As mentioned, for any single bit error in a word, the ECC logic will correct the error. In addition, it signals a failure status to the processor, and the operator can take measures relevant to the required reliability of that system. This method turns system degradation into a maintenance task that can be scheduled, as opposed to a response to an unexpected fatal system error condition.
Based on the heuristic probabilities of soft errors referenced earlier, the following table shows that the addition of ECC effectively increases the MTBF from being shorter than the lifetime of the product to longer than the lifetime of the universe.
In order to add ECC capability to a memory, one needs to provide additional space to store the ECC data, plus the additional input generator and output detection/correction logic. Altera’s SoC FPGA devices have this additional memory and logic built-in for many of the on-chip memories, thereby strongly supporting error resilience.
The generator and detection/correction logic needed to support ECC on external memory is already integrated in the SoC FPGA device. The only change required is to ensure that storage space for the ECC bits is available. For example, on a 32-bit-wide data bus to external memory, widening the data bus to 40 bits is all that is needed. In this scenario, one commonly sees two 16-bit-wide memory devices for data and one additional 8-bit-wide device for ECC storage. Alternatively, the ECC bits are frequently stored in a third device of the same 16-bit type as used for regular data, in which case half of that device's 16 bits go unused.
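The reason an 8-bit-wide ECC lane suffices for a 32-bit data bus can be checked with a short calculation. The sketch below assumes a SECDED extended Hamming code, which the article does not confirm as the exact code used by the memory controller; the function name is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for single-error correction, double-error detection.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(f"{width}-bit data: {secded_check_bits(width)} check bits")
```

Under this assumption a 32-bit word needs 7 check bits, which fits in the 8 extra bits of a 40-bit bus.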
External NAND flash can be connected to the SoC FPGA device, both as one of the possible boot configuration memories and as storage for a file system. NAND flash is rather sensitive to soft errors, so ECC protection is strongly desirable. NAND flash devices provide additional storage buffers in the device to retain ECC data bits. ECC is performed on larger data blocks of 512 or 1024 bytes, and the SoC FPGA device has built-in logic to correct up to 24 bits of errors in such a block.
Altera built ECC functionality into its SoC FPGA device for a large number of on-chip memory instances. The level 2 cache, the scratch RAM, memory inside the FPGA fabric, and memories that serve as data buffers in peripherals are each widened and equipped with ECC generator and correction logic. As these are built in, using the ECC feature carries no additional cost.
Data and instruction caches are relatively small in their physical size and thus less prone to soft errors. They do need to operate at high performance levels, and to avoid additional latency when reading the level 1 caches, a simple parity check method is used.
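A parity check is much simpler than ECC: a single extra bit per word detects any odd number of flipped bits but cannot correct them. The following is a minimal sketch assuming even parity on a 32-bit word; the function names are hypothetical and the actual cache parity scheme may differ.

```python
def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Return the word together with its parity bit, as written to memory."""
    return word, even_parity(word)


def check(word: int, parity: int) -> bool:
    """True if the word read back is consistent with its stored parity bit."""
    return even_parity(word) == parity


if __name__ == "__main__":
    word, parity = store(0xDEADBEEF)
    print(check(word, parity))             # consistent
    print(check(word ^ (1 << 5), parity))  # single bit flip is detected
```

On a parity error the line cannot be corrected in place, so a typical response is to treat the cached value as invalid and fetch it again from the next memory level.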
The configuration bits inside the FPGA fabric are not organized in wide data words and do not lend themselves to an ECC implementation. Instead, the FPGA fabric has a built-in hardware engine that cyclically checks the correctness of the configuration bits and raises a flag when an error is detected. This method is referred to as “scrubbing,” and is an industry-standard method for this class of devices.
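The cyclic check can be illustrated in software. The sketch below is only an analogy for the hardware engine: it assumes the configuration is split into fixed-size frames, that a reference CRC-32 per frame was recorded when the device was configured, and that one pass recomputes and compares every CRC. The frame layout, the function names, and the use of CRC-32 are illustrative assumptions.

```python
import zlib


def make_reference(frames: list[bytes]) -> list[int]:
    """Record one CRC-32 per configuration frame at configuration time."""
    return [zlib.crc32(frame) for frame in frames]


def scrub_pass(frames: list[bytes], reference: list[int]) -> list[int]:
    """Recompute every frame CRC and return the indices that mismatch."""
    return [i for i, frame in enumerate(frames)
            if zlib.crc32(frame) != reference[i]]


if __name__ == "__main__":
    frames = [bytes([i] * 16) for i in range(4)]
    reference = make_reference(frames)

    corrupted = bytearray(frames[2])
    corrupted[7] ^= 0x01  # simulate a single event upset in frame 2
    frames[2] = bytes(corrupted)

    print(scrub_pass(frames, reference))  # indices of frames to flag
```

In a real system such a pass would be repeated continuously in the background, with a flagged frame triggering whatever recovery action the application requires.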