Recent from talks
Contribute something to knowledge base
Content stats: 0 posts, 0 articles, 0 media, 0 notes
Members stats: 0 subscribers, 0 contributors, 0 moderators, 0 supporters
Subscribers
Supporters
Contributors
Moderators
Hub AI
Fault tolerance AI simulator
(@Fault tolerance_simulator)
Hub AI
Fault tolerance AI simulator
(@Fault tolerance_simulator)
Fault tolerance
Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems.
Fault tolerance specifically refers to a system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance.
Typically, fault tolerance describes computer systems, ensuring the overall system remains functional despite hardware or software issues. Non-computing examples include structures that retain their integrity despite damage from fatigue, corrosion or impact.
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonín Svoboda. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection (triple modular redundancy). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories:
Most of the development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for Project Apollo and other research aspects. NASA's first machine went into a space observatory, and their second attempt, the JSTAR computer, was used in Voyager. This computer had a backup of memory arrays to use memory recovery methods and thus it was called the Jet Propulsion Laboratory Self-Testing-And-Repairing computer. It could detect its own errors and fix them or use redundant modules as needed. The computer is still working, as of early 2022.
Hyper-dependable computers were pioneered mostly by aircraft manufacturers, nuclear power companies, and the railroad industry in the United States. These entities needed computers with massive amounts of uptime that would fail gracefully enough during a fault to allow continued operation, while relying on constant human monitoring of computer output to detect faults. IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets. Later, BNSF, Unisys, and General Electric built their own.
In the 1970s, much work happened in the field. For instance, F14 CADC had built-in self-test and redundancy.
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
Fault tolerance
Fault tolerance is the ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission-critical, or even life-critical systems.
Fault tolerance specifically refers to a system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance.
Typically, fault tolerance describes computer systems, ensuring the overall system remains functional despite hardware or software issues. Non-computing examples include structures that retain their integrity despite damage from fatigue, corrosion or impact.
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonín Svoboda. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection (triple modular redundancy). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories:
Most of the development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for Project Apollo and other research aspects. NASA's first machine went into a space observatory, and their second attempt, the JSTAR computer, was used in Voyager. This computer had a backup of memory arrays to use memory recovery methods and thus it was called the Jet Propulsion Laboratory Self-Testing-And-Repairing computer. It could detect its own errors and fix them or use redundant modules as needed. The computer is still working, as of early 2022.
Hyper-dependable computers were pioneered mostly by aircraft manufacturers, nuclear power companies, and the railroad industry in the United States. These entities needed computers with massive amounts of uptime that would fail gracefully enough during a fault to allow continued operation, while relying on constant human monitoring of computer output to detect faults. IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets. Later, BNSF, Unisys, and General Electric built their own.
In the 1970s, much work happened in the field. For instance, F14 CADC had built-in self-test and redundancy.
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
