There are two sub-areas of on-chip data protection: “data at rest” and “data in-flight”. Protecting data at rest requires some mechanism to ensure that data stored in a memory array doesn’t change while “resting” in that array. In the early days of on-chip SRAMs, failure rates and random error rates were high, so designers included protection mechanisms such as parity and/or redundancy to guard against unintended data changes. As CMOS processes matured, these concerns lessened, and designers in many markets chose to accept unprotected SRAMs rather than pay the area overhead of protecting against increasingly unlikely error events. However, with the rapid shrinking of silicon geometries and the change from planar to FinFET transistors, concern over such “soft” or “random” errors appears to be growing again. Fortunately, the increased gate counts possible with modern silicon processes make more advanced techniques such as Error Correcting Code (ECC) feasible. For automotive applications, ECC is arguably mandatory, as it provides much stronger protection against data corruption.
While precise details vary by the ECC chosen, today’s SoC designer should be able to get full Single Error Correct, Double Error Detect (SECDED) protection at a cost of around 8 bits of additional storage for every 64 bits of data. The additional logic complexity is outweighed by the system’s new ability to survive single-bit errors. It is particularly important for the automotive SoC designer to ensure that both correctable and uncorrectable errors are logged and reported to software. When both the failed data bit(s) and the SRAM line address are logged, application or diagnostic software has the information it needs to identify potentially failing hardware from patterns that emerge over time, even among soft errors. In PCI Express designs, data at rest is generally in transition from one layer to the next, so the SoC designer will find no benefit in writing corrected data values back into their originating memory: once the data has been passed to the next layer, the original memory locations will be reused for a later packet.
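The 8-bits-per-64 figure and the logging requirement above can be illustrated with a small software model. The following is a minimal sketch, assuming an extended Hamming (72,64) code; the bit layout, the function names `encode` and `decode`, and the log entry fields are all hypothetical choices made for illustration, and a real design would implement the equivalent logic in hardware with whatever code the memory vendor or IP provider specifies.

```python
from typing import List, Optional, Tuple

# Codeword layout (an assumption for this sketch): bit 0 is an overall parity
# bit, bits 1..71 form a Hamming code whose check bits sit at powers of two.
CHECK_POSITIONS = [1 << i for i in range(7)]  # 1, 2, 4, ..., 64
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]  # 64 slots


def _syndrome(code: int) -> int:
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (code >> pos) & 1:
            s ^= pos
    return s


def encode(data: int) -> int:
    """Spread 64 data bits into a 72-bit SECDED codeword."""
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code |= 1 << pos
    # Set check bits so that the syndrome of a clean codeword is zero.
    s = _syndrome(code)
    for i, pos in enumerate(CHECK_POSITIONS):
        if (s >> i) & 1:
            code |= 1 << pos
    # Bit 0 makes the total number of set bits even.
    code |= bin(code).count("1") & 1
    return code


def decode(code: int, address: int, log: List[dict]) -> Tuple[str, Optional[int]]:
    """Check a codeword read from `address`; log every error that is seen."""
    s = _syndrome(code)
    odd = bin(code).count("1") & 1
    if odd and s <= 71:
        # Single-bit error at codeword bit s (s == 0 is the overall parity bit).
        code ^= 1 << s
        log.append({"address": address, "status": "corrected", "bit": s})
        status = "corrected"
    elif s != 0 or odd:
        # Even parity with a non-zero syndrome means two bits flipped; an
        # out-of-range syndrome means more. Neither can be corrected.
        log.append({"address": address, "status": "uncorrectable", "syndrome": s})
        return "uncorrectable", None
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (code >> pos) & 1:
            data |= 1 << i
    return status, data
```

Note that the sketch returns the corrected value to the consumer but never writes it back to the memory, matching the observation above that the location will shortly be reused for a later packet. The logged `bit` is a codeword bit position, not a data bit index.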
Protecting data in-flight is the process of ensuring that correct data is carried through the various non-storage data paths of the SoC. For designers using ECC on their memories, carrying the ECC code along with the data certainly accomplishes the desired protection, but the additional ECC checks may not be desirable because of their area cost or their impact on timing closure. Given that even cutting-edge FinFET flip-flops are considered to be fairly reliable, the industry practice of carrying simple parity is likely sufficient, even in automotive applications.
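To make the trade-off concrete, here is a minimal sketch of carrying a single even-parity bit alongside a data word and checking it at the far end of a data path. The function names `parity_bit`, `launch`, and `check` are hypothetical, and the word width is left to the caller. Parity costs one bit per word and can only detect an odd number of flipped bits; it cannot correct anything.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def launch(word: int) -> tuple:
    """Source side: send the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, parity: int) -> bool:
    """Destination side: True when the word still matches its parity bit."""
    return parity_bit(word) == parity
```

For example, a word that arrives with one bit flipped fails `check`, while a word with two bits flipped passes it, which is the limitation accepted in exchange for the lower area and timing cost.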
When uncorrectable errors are detected anywhere on the outbound path to the PCI Express link, SoC designers must implement some type of error recovery handshake with the application logic. Because packets are often pipelined, simply invalidating an outbound packet and notifying the application logic may not prevent a subsequent packet from being transmitted. In the worst case, that subsequent packet might carry a higher-level protocol “successful completion” message related to the corrupted data. Even though the bad packet was never transmitted, the system memory that the now-invalidated packet was intended to update will not hold valid data, so receiving a “success” message would be catastrophic.
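One possible shape for such a handshake is sketched below as a behavioral model. It assumes a simple policy in which the outbound path halts as soon as it sees an uncorrectable error and stays halted until the application logic acknowledges the failure and names any queued packets that depend on the lost data. The class `OutboundPath`, its methods, and the packet tags are hypothetical and are not taken from the PCI Express specification or any particular controller.

```python
from collections import deque


class OutboundPath:
    """Behavioral model of a pipelined outbound packet queue."""

    def __init__(self):
        self.queue = deque()    # packets waiting to be transmitted
        self.sent = []          # tags of packets that reached the link
        self.halted = False     # True while waiting for the application
        self.failed_tag = None  # tag of the packet that was invalidated

    def submit(self, tag, uncorrectable=False):
        """Application side: enqueue a packet (flagged if its data is bad)."""
        self.queue.append((tag, uncorrectable))

    def step(self):
        """Try to transmit one packet; return its tag, or None if nothing was sent."""
        if self.halted or not self.queue:
            return None
        tag, bad = self.queue.popleft()
        if bad:
            # Invalidate the packet and stop the pipeline so that nothing
            # queued behind it can slip out before the application reacts.
            self.halted = True
            self.failed_tag = tag
            return None
        self.sent.append(tag)
        return tag

    def acknowledge(self, drop_tags=()):
        """Application side: discard dependent packets, then resume."""
        self.queue = deque(p for p in self.queue if p[0] not in drop_tags)
        self.halted = False
        self.failed_tag = None
```

In this model, if a corrupted write is followed in the queue by its “successful completion” message, the completion is held while the path is halted and can be discarded by the application in `acknowledge` before transmission resumes, so the success message never reaches the link.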