SWAT Home
The SWAT project devises low-overhead reliability solutions that handle multiple sources of failure. As devices continue to scale, shipped hardware is increasingly likely to fail in the field due to hardware faults. Because traditional redundancy-based hardware reliability solutions are too expensive to deploy broadly, recent research has focused on low-overhead alternatives. One approach is to keep the always-on detection techniques cheap, so that errors from multiple sources of failure are detected with low overhead, and to pay a higher overhead only for diagnosis, which is rarely invoked.

To this end, we are developing SWAT (SoftWare Anomaly Treatment), a low-cost firmware-level reliability solution that effectively handles multiple sources of faults. We observe that detecting the effects of faults by monitoring for anomalous software behavior yields a low-overhead detection scheme capable of handling multiple sources of hardware faults. Our results indicate that such simple detection schemes detect over 95% of the permanent hardware faults in the underlying processor. We also show that more sophisticated detection schemes, such as those using software-level invariants, further improve this coverage by reducing the incidence of silent data corruption (SDC) events.

After detection, diagnosis is required to isolate the type of fault (hardware or software). In the case of a hardware fault, diagnosis is also responsible for identifying the faulty microarchitectural component so that the entire core need not be decommissioned. We present such a diagnosis framework, which exploits a checkpoint/replay-based recovery mechanism to identify the faulty unit. We show that over 96% of the detected faults can be correctly diagnosed, enabling fine-grained reconfiguration or repair.

We continue to expand the SWAT infrastructure to multicore and multithreaded environments to enhance the resilience of such systems as well.
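As a rough illustration of how replaying from a checkpoint can help separate fault types, here is a toy Python sketch. It is not the SWAT implementation. The `Core` model, the stuck-at-one adder fault, the `lookup` workload, and the three-way classification rule are all assumptions made for this example. The rule is: if the symptom does not recur on replay, treat it as transient; if it also recurs on a fault-free core, treat it as a software bug; otherwise treat it as a permanent hardware fault.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Optional


class Symptom(Exception):
    """Anomalous software behavior (hypothetical stand-in for a trap or hang)."""


@dataclass
class Core:
    """Toy core model; the stuck-at-one adder bit is an assumed fault model."""
    name: str
    stuck_at_one_bit: Optional[int] = None

    def add(self, a: int, b: int) -> int:
        result = a + b
        if self.stuck_at_one_bit is not None:
            result |= 1 << self.stuck_at_one_bit  # permanent fault corrupts the sum
        return result


TABLE = list(range(8))


def lookup(core: Core, state: dict) -> int:
    """Hypothetical workload: compute an index on the core, then read a table."""
    index = core.add(state["base"], state["offset"])
    if not 0 <= index < len(TABLE):
        # The detector only sees the software-visible effect, not the fault itself.
        raise Symptom(f"out-of-bounds index {index} on {core.name}")
    return TABLE[index]


def shows_symptom(workload: Callable, core: Core, checkpoint: dict) -> bool:
    """Replay the workload from a private copy of the checkpointed state."""
    try:
        workload(core, copy.deepcopy(checkpoint))
    except Symptom:
        return True
    return False


def diagnose(workload: Callable, checkpoint: dict, suspect: Core, reference: Core) -> str:
    """Classify a detected symptom by replaying from the checkpoint (assumed rule)."""
    if not shows_symptom(workload, suspect, checkpoint):
        return "transient"            # symptom did not recur on the same core
    if shows_symptom(workload, reference, checkpoint):
        return "software"             # symptom recurs even on a fault-free core
    return "permanent-hardware"       # symptom is tied to the suspect core


good = Core("core0")
bad = Core("core1", stuck_at_one_bit=6)

print(diagnose(lookup, {"base": 1, "offset": 2}, bad, good))   # permanent-hardware
print(diagnose(lookup, {"base": 5, "offset": 9}, good, good))  # software
print(diagnose(lookup, {"base": 1, "offset": 2}, good, good))  # transient
```

The sketch stops at deciding between hardware and software. Narrowing a hardware fault down to a specific microarchitectural unit, as described above, would need finer-grained comparison of the two replays than this toy model provides.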
We envision that future systems will require low-overhead schemes such as SWAT to provide heightened resilience as failure modes multiply.