Goals and Applications of Fault Tolerant Computing – Reliability, Availability, Safety, Dependability, Long Life, Critical Computation, High Availability Applications, Fault Tolerance as a Design Objective. Fault Models – Faults, Errors, and Failures, Causes and Characteristics of Faults, Logical and Physical Faults, Error Models.
Fault Tolerant Design Techniques: Hardware Redundancy, Software Redundancy, Time Redundancy, and Information Redundancy. Checkpointing, Fault Tolerant Networks, Reconfiguration-Based Fault Tolerance.
Reliability Evaluation Techniques – Failure Rate, Mean Time to Repair, Mean Time Between Failures, Reliability Modelling, Fault Coverage, M-of-N Systems, Markov Models, Safety, Maintainability, Availability. Case Studies of Fault Tolerant Systems and Current Research Issues – Space Shuttle, Tandem 16 NonStop System, Recovery-Oriented Computing, Fault Tolerant Platforms for Automotive Safety-Critical Applications, Reliability and Fault Tolerance in Collective Robot Systems.
Suggested Lab Sessions:
· Overview of MATLAB or an equivalent tool for implementing models of reliability, availability, safety, dependability, etc.
· Implementation of error models and failure analysis models
· Implementation of a fault tolerant design for a relevant application
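
As a starting point for the first lab session, the following is a minimal sketch in Python (used here as one possible MATLAB equivalent) of three models named in the syllabus: reliability under a constant failure rate, steady-state availability, and M-of-N system reliability. The function names, the constant-failure-rate assumption, the use of mean time to failure (MTTF) in the availability formula, and the assumption of independent, identical modules are illustrative choices, not part of the course specification.

```python
import math


def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR), with both times in the same unit."""
    return mttf / (mttf + mttr)


def m_of_n_reliability(m, n, r):
    """Probability that at least m of n modules work.

    Assumes the modules are independent and each has reliability r,
    so the number of working modules follows a binomial distribution.
    """
    return sum(
        math.comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(m, n + 1)
    )


if __name__ == "__main__":
    lam = 1e-4  # illustrative failure rate, in failures per hour
    r = reliability(lam, 1000)  # module reliability after 1000 hours
    print("Module reliability:", r)
    print("2-of-3 (TMR) reliability:", m_of_n_reliability(2, 3, r))
    print("Availability:", steady_state_availability(mttf=1 / lam, mttr=10))
```

Calling `m_of_n_reliability(2, 3, r)` models triple modular redundancy with a perfect voter. Fault coverage and voter failures are left out of this sketch and would need a Markov model or an extra reliability term.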