SWAT: Designing Reliable Hardware
Sarita Adve's Research Group
University of Illinois at Urbana-Champaign
 
   
 


Overview

The SWAT (SoftWare Anomaly Treatment) project at Illinois has developed a novel and extremely low-cost design paradigm for affordable hardware reliability across all segments of the computing market.

Devices in shipped chips are increasingly expected to fail for many reasons, threatening future improvements in functionality and performance of computing systems.  At the same time, computers are becoming pervasive and society increasingly depends on their reliable operation. Previous solutions for hardware resiliency required significant redundancy and are too expensive to be widely deployed. In contrast, SWAT enables reliable operation at near-zero cost, making reliability affordable and pervasive.



 

Instead of heavyweight, always-on redundant computing, SWAT employs extremely low-cost monitors that look for anomalous software behavior as a symptom of hardware faults. In the relatively rare event that such a symptom is detected, SWAT invokes a more sophisticated operation to rescue the system and recover it from the impact of the fault. Since this rescue happens rarely, it can be done largely in software at very low cost. This is analogous to the deployment of a Special Weapons And Tactics (S.W.A.T.) team, which remains latent in the common case and is called in only for high-risk situations. A SWAT-enabled system is thus equipped with very simple, low-cost symptom detectors, a specialized diagnosis procedure that can identify the source of the problem, and a recovery mechanism that can seamlessly return the system to a fault-free state.

The project has attracted significant interest from industry. Last year, the Semiconductor Research Corporation (SRC), a consortium of semiconductor companies, provided funding for prototyping SWAT as a step towards validation and potential transfer of the technology to industry. Additionally, a SWAT student will be spending several months at Intel to demonstrate SWAT for systems and fault models relevant to Intel. In recognition of the potential of the work, another student received a fellowship from Intel and a scholarship from IBM. We were also able to win a Computing Innovations fellowship grant through which a female postdoctoral student has now joined the SWAT team.

In summary, SWAT addresses a critical challenge, its novel solution strategy is attracting much industry attention, and it has the potential to be a game-changing innovation not only for the microprocessor industry but also for the computing industry at large.


SWAT Framework Components


SWAT has the following framework components:
1. Detection: SWAT detects hardware faults by monitoring software for misbehavior.
2. Recovery: SWAT relies on a checkpointing mechanism to recover the state of the system. On detection, the system is rolled back to a pristine state.
3. Diagnosis: After a fault is detected, a diagnosis procedure is invoked that repeatedly replays the fault-activating trace on the faulty hardware to identify the source of the fault.
4. Repair: Once the source of the fault is diagnosed, an appropriate repair action is performed, depending on the availability of redundant hardware components.
All of these components are controlled by a flexible firmware layer.
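
The way the four components fit together can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not SWAT's actual implementation: `ToyCore`, `run`, `swat_firmware`, the unit names `alu0`/`alu1`, and the use of an out-of-range memory address as the software symptom are all hypothetical names and simplifications chosen for illustration.

```python
import copy

MEM_SIZE = 8  # size of the toy memory (illustrative assumption)


class ToyCore:
    """Hypothetical stand-in for hardware with two redundant units."""

    def __init__(self, faulty_unit=None):
        self.enabled = {"alu0": True, "alu1": True}
        self.faulty_unit = faulty_unit  # permanent fault injected for the demo

    def compute_address(self, unit, addr):
        # A permanent fault corrupts every address the faulty unit computes.
        return addr + 1000 if unit == self.faulty_unit else addr

    def pick_unit(self, preferred):
        # Fall back to any enabled unit if the preferred one was disabled.
        if self.enabled[preferred]:
            return preferred
        return next(u for u, ok in self.enabled.items() if ok)


def run(core, state, trace, force_unit=None):
    """Execute the trace; return False as soon as a symptom is observed."""
    for unit, addr in trace:
        unit = force_unit or core.pick_unit(unit)
        real = core.compute_address(unit, addr)
        if not 0 <= real < MEM_SIZE:  # Detection: anomalous software behavior
            return False
        state["mem"][real] += 1
    return True


def swat_firmware(core, state, trace):
    """Toy firmware loop: detect, recover, diagnose, repair, then re-execute."""
    checkpoint = copy.deepcopy(state)  # taken before execution starts
    log = []
    if run(core, state, trace):
        return state, log  # common case: no symptom, no extra work
    log.append("detected")
    state = copy.deepcopy(checkpoint)  # Recovery: roll back to the checkpoint
    log.append("recovered")
    # Diagnosis: replay the trace on each unit in turn to find the culprit.
    suspects = [u for u in core.enabled
                if not run(core, copy.deepcopy(checkpoint), trace, force_unit=u)]
    if len(suspects) == len(core.enabled):
        raise RuntimeError("symptom reproduces on every unit; not repairable here")
    log.append("diagnosed:" + ",".join(suspects))
    for u in suspects:  # Repair: disable the faulty unit, rely on the spare
        core.enabled[u] = False
    log.append("repaired")
    if not run(core, state, trace):
        raise RuntimeError("symptom persists after repair")
    return state, log


core = ToyCore(faulty_unit="alu1")
trace = [("alu0", 1), ("alu1", 2), ("alu0", 1)]
final_state, log = swat_firmware(core, {"mem": [0] * MEM_SIZE}, trace)
print(log)
print(final_state["mem"])
```

In this sketch the expensive steps (rollback, replay, and reconfiguration) run only after a symptom fires, so the fault-free path costs one checkpoint copy and one range check per operation.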

SWAT Framework