Fault tolerance

Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems.

Fault tolerance specifically refers to a system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance.

The term is most often applied to computer systems, where it means the overall system remains functional despite hardware or software issues. Non-computing examples include structures that retain their integrity despite damage from fatigue, corrosion or impact.

The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonín Svoboda. Its basic design used magnetic drums connected via relays, with a voting method of memory error detection (triple modular redundancy). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories, among them long-life, no-maintenance machines and hyper-dependable computers.

Most of the development in so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for Project Apollo and other research efforts. NASA's first machine went into a space observatory, and its second attempt, the JSTAR computer, was used in Voyager. This computer kept backup memory arrays for its memory recovery methods, and was therefore called the Jet Propulsion Laboratory Self-Testing-And-Repairing computer. It could detect its own errors and either fix them or switch to redundant modules as needed. The computer was still working as of early 2022.
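The detect-then-repair behaviour described above can be illustrated with a minimal Python sketch of standby sparing. This is not the actual design of any NASA machine: the `Module` and `SelfRepairingUnit` classes, the `self_test` callback, and the module names are all hypothetical, chosen only to show the idea of testing the active unit and swapping in a spare when the test fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Module:
    """A hypothetical replaceable unit with a built-in self-test."""
    name: str
    self_test: Callable[[], bool]  # returns True when the unit is healthy


@dataclass
class SelfRepairingUnit:
    """Runs on one active module and keeps the rest as spares."""
    active: Module
    spares: List[Module] = field(default_factory=list)
    retired: List[str] = field(default_factory=list)

    def check_and_repair(self) -> bool:
        """Self-test the active module; swap in spares until one passes.

        Returns True if a healthy module is active afterwards, False if
        the active module failed and no working spare remains.
        """
        while not self.active.self_test():
            self.retired.append(self.active.name)
            if not self.spares:
                return False
            self.active = self.spares.pop(0)
        return True


unit = SelfRepairingUnit(
    active=Module("memory-A", lambda: False),  # simulated fault
    spares=[Module("memory-B", lambda: True)],
)
recovered = unit.check_and_repair()
print(recovered, unit.active.name, unit.retired)
```

In this sketch the faulty module is recorded in `retired` so that an operator, if one exists, can see what was replaced; a long-life system with no maintenance would simply run until its spares are exhausted.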

Hyper-dependable computers were pioneered mostly by aircraft manufacturers, nuclear power companies, and the railroad industry in the United States. These entities needed computers with massive amounts of uptime that would fail gracefully enough during a fault to allow continued operation, while relying on constant human monitoring of computer output to detect faults. IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets. Later, BNSF, Unisys, and General Electric built their own.

Much work was done in the field during the 1970s. For instance, the F-14 CADC (Central Air Data Computer) had built-in self-test and redundancy.

In general, early fault-tolerant designs focused mainly on internal diagnosis: a fault would indicate that something was failing, and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that, to be fully effective, the system had to be self-repairing and self-diagnosing – isolating a fault and then switching to a redundant backup while signalling the need for repair. This is known as N-modular redundancy, where faults trigger automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
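As a rough illustration of masking a fault with redundant modules while warning the operator, the following Python sketch takes a majority vote over the outputs of N modules and reports which ones disagreed. The `vote` function and the sample values are assumptions made for this example, not part of any historical system; with three modules it corresponds to the triple modular redundancy mentioned for SAPO.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Majority-vote over the outputs of N redundant modules.

    Returns the majority value and the indices of modules that disagreed
    with it. Raises RuntimeError when no value has a strict majority.
    """
    if not outputs:
        raise ValueError("at least one module output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


# Three modules compute the same quantity; the third one is faulty.
result, suspects = vote([42, 42, 17])
if suspects:
    print(f"warning: modules {suspects} disagree; schedule repair")
```

The caller continues with `result` as if nothing had happened, which is the fault-masking part, while the list of suspect modules plays the role of the warning to the operator. The vote only works while a strict majority of modules is still correct.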
