Reliability engineering
Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability is defined as the probability that a product, system, or service will perform its intended function adequately for a specified period of time; or will operate in a defined environment without failure.[1] Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.
The reliability function is theoretically defined as the probability of success. In practice, it is calculated using different techniques, and its value ranges between 0 and 1, where 0 indicates no probability of success while 1 indicates definite success. This probability is estimated from detailed (physics of failure) analysis, previous data sets, or through reliability testing and reliability modeling. Availability, testability, maintainability, and maintenance are often defined as a part of "reliability engineering" in reliability programs. Reliability often plays a key role in the cost-effectiveness of systems.
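As a minimal illustration of a reliability function taking values between 0 and 1, the sketch below assumes a constant failure rate (the exponential model). That model is an assumption made here for illustration, not something the text prescribes, and the names `reliability_exponential` and `failure_rate` are hypothetical.

```python
import math

def reliability_exponential(t_hours: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t): probability of surviving to time t,
    assuming a constant failure rate (an assumption, not a general law)."""
    if t_hours < 0 or failure_rate < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate * t_hours)

# Hypothetical item averaging 1 failure per 10,000 operating hours.
r = reliability_exponential(1000.0, 1e-4)
print(f"R(1000 h) = {r:.4f}")  # prints R(1000 h) = 0.9048
```

In practice the failure rate itself is the uncertain quantity, so the output is only as good as that input.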
Reliability engineering deals with the prediction, prevention, and management of high levels of "lifetime" engineering uncertainty and risks of failure. Although stochastic parameters define and affect reliability, reliability is not only achieved by mathematics and statistics.[2][3] "Nearly all teaching and literature on the subject emphasize these aspects and ignore the reality that the ranges of uncertainty involved largely invalidate quantitative methods for prediction and measurement."[4] For example, it is easy to represent "probability of failure" as a symbol or value in an equation, but it is almost impossible to predict its true magnitude in practice, which is massively multivariate, so having the equation for reliability does not begin to equal having an accurate predictive measurement of reliability.
Reliability engineering relates closely to quality engineering, safety engineering, and system safety, in that they use common methods for their analysis and may require input from each other. It can be said that a system must be reliably safe.
Reliability engineering focuses on the costs of failure caused by system downtime, cost of spares, repair equipment, personnel, and cost of warranty claims.[5]
History
The word reliability can be traced back to 1816 and is first attested in the writing of the poet Samuel Taylor Coleridge.[6] Before World War II the term was linked mostly to repeatability; a test (in any type of science) was considered "reliable" if the same results would be obtained repeatedly. In the 1920s, product improvement through the use of statistical process control was promoted by Dr. Walter A. Shewhart at Bell Labs,[7] around the time that Waloddi Weibull was working on statistical models for fatigue. The development of reliability engineering was at this point on a parallel path with quality. The modern use of the word reliability was defined by the U.S. military in the 1940s, characterizing a product that would operate when expected and for a specified period.
In World War II, many reliability issues were due to the inherent unreliability of electronic equipment available at the time, and to fatigue issues. In 1945, M.A. Miner published a seminal paper titled "Cumulative Damage in Fatigue" in an ASME journal. A main application for reliability engineering in the military was for the vacuum tube as used in radar systems and other electronics, for which reliability proved to be very problematic and costly. The IEEE formed the Reliability Society in 1948. In 1950, the United States Department of Defense formed a group called the "Advisory Group on the Reliability of Electronic Equipment" (AGREE) to investigate reliability methods for military equipment.[8] This group recommended three main ways of working:
- Improve component reliability.
- Establish quality and reliability requirements for suppliers.
- Collect field data and find root causes of failures.
In the 1960s, more emphasis was given to reliability testing on component and system levels. The famous military standard MIL-STD-781 was created at that time. Around this period, the much-used predecessor to Military Handbook 217 (MIL-HDBK-217) was also published by RCA and was used for the prediction of failure rates of electronic components. The emphasis on component reliability and empirical research (e.g. MIL-HDBK-217) alone slowly decreased. More pragmatic approaches, as used in the consumer industries, were being used. In the 1980s, televisions were increasingly made up of solid-state semiconductors. Automobiles rapidly increased their use of semiconductors with a variety of microcomputers under the hood and in the dash. Large air conditioning systems developed electronic controllers, as did microwave ovens and a variety of other appliances. Communications systems began to adopt electronics to replace older mechanical switching systems. Bellcore issued the first consumer prediction methodology for telecommunications, and SAE developed a similar document, SAE870050, for automotive applications. The nature of predictions evolved during the decade, and it became apparent that die complexity wasn't the only factor that determined failure rates for integrated circuits (ICs).
Kam Wong published a paper questioning the bathtub curve[9]—see also reliability-centered maintenance. During this decade, the failure rate of many components dropped by a factor of 10. Software became important to the reliability of systems. By the 1990s, the pace of IC development was picking up. Wider use of stand-alone microcomputers was common, and the PC market helped keep IC densities following Moore's law and doubling about every 18 months. Reliability engineering was now changing as it moved towards understanding the physics of failure. Failure rates for components kept dropping, but system-level issues became more prominent. Systems thinking has become more and more important. For software, the CMM model (Capability Maturity Model) was developed, which gave a more qualitative approach to reliability. ISO 9000 added reliability measures as part of the design and development portion of certification.
The expansion of the World Wide Web created new challenges of security and trust. The older problem of too little reliable information available had now been replaced by too much information of questionable value. Consumer reliability problems could now be discussed online in real-time using data. New technologies such as micro-electromechanical systems (MEMS), handheld GPS, and hand-held devices that combine cell phones and computers all represent challenges to maintaining reliability. Product development time continued to shorten through this decade and what had been done in three years was being done in 18 months. This meant that reliability tools and tasks had to be more closely tied to the development process itself. In many ways, reliability has become part of everyday life and consumer expectations.
Overview
Reliability is the probability of a product performing its intended function under specified operating conditions in a manner that meets or exceeds customer expectations.[10]
Objective
The objectives of reliability engineering, in decreasing order of priority, are:[11]
- To apply engineering knowledge and specialist techniques to prevent or to reduce the likelihood or frequency of failures.
- To identify and correct the causes of failures that do occur despite the efforts to prevent them.
- To determine ways of coping with failures that do occur, if their causes have not been corrected.
- To apply methods for estimating the likely reliability of new designs, and for analysing reliability data.
The reason for the priority emphasis is that it is by far the most effective way of working, in terms of minimizing costs and generating reliable products. The primary skills that are required, therefore, are the ability to understand and anticipate the possible causes of failures, and knowledge of how to prevent them. It is also necessary to know the methods that can be used for analyzing designs and data.
Scope and techniques
Reliability engineering for "complex systems" requires a different, more elaborate systems approach than for non-complex systems. Reliability engineering may in that case involve:
- System availability and mission readiness analysis and related reliability and maintenance requirement allocation
- Functional system failure analysis and derived requirements specification
- Inherent (system) design reliability analysis and derived requirements specification for both hardware and software design
- System diagnostics design
- Fault tolerant systems (e.g. by redundancy)
- Predictive and preventive maintenance (e.g. reliability-centered maintenance)
- Human factors / human interaction / human errors
- Manufacturing- and assembly-induced failures (effect on the detected "0-hour quality" and reliability)
- Maintenance-induced failures
- Transport-induced failures
- Storage-induced failures
- Use (load) studies, component stress analysis, and derived requirements specification
- Software (systematic) failures
- Failure / reliability testing (and derived requirements)
- Field failure monitoring and corrective actions
- Spare parts stocking (availability control)
- Technical documentation, caution and warning analysis
- Data and information acquisition/organisation (creation of a general reliability development hazard log and FRACAS system)
- Chaos engineering
Effective reliability engineering requires understanding of the basics of failure mechanisms for which experience, broad engineering skills and good knowledge from many different special fields of engineering are required,[12] for example:
- Tribology
- Stress (mechanics)
- Fracture mechanics / fatigue
- Thermal engineering
- Fluid mechanics / shock-loading engineering
- Electrical engineering
- Chemical engineering (e.g. corrosion)
- Material science
Definitions
Reliability may be defined in the following ways:
- The idea that an item is fit for a purpose
- The capacity of a designed, produced, or maintained item to perform as required
- The capacity of a population of designed, produced or maintained items to perform as required
- The resistance to failure of an item
- The probability of an item to perform a required function under stated conditions
- The durability of an object
Basics of a reliability assessment
Many engineering techniques are used in reliability risk assessments, such as reliability block diagrams, hazard analysis, failure mode and effects analysis (FMEA),[13] fault tree analysis (FTA), Reliability Centered Maintenance, (probabilistic) load and material stress and wear calculations, (probabilistic) fatigue and creep analysis, human error analysis, manufacturing defect analysis, reliability testing, etc. These analyses must be done properly and with much attention to detail to be effective. Because of the large number of reliability techniques, their expense, and the varying degrees of reliability required for different situations, most projects develop a reliability program plan to specify the reliability tasks (statement of work (SoW) requirements) that will be performed for that specific system.
Consistent with the creation of safety cases, for example per ARP4761, the goal of reliability assessments is to provide a robust set of qualitative and quantitative evidence that the use of a component or system will not be associated with unacceptable risk. The basic steps to take[14] are to:
- Thoroughly identify relevant unreliability "hazards", e.g. potential conditions, events, human errors, failure modes, interactions, failure mechanisms, and root causes, by specific analysis or tests.
- Assess the associated system risk, by specific analysis or testing.
- Propose mitigation, e.g. requirements, design changes, detection logic, maintenance, and training, by which the risks may be lowered and controlled at an acceptable level.
- Determine the best mitigation and get agreement on final, acceptable risk levels, possibly based on cost/benefit analysis.
The risk here is the combination of probability and severity of the failure incident (scenario) occurring. The severity can be looked at from a system safety or a system availability point of view. Reliability for safety can be thought of as a very different focus from reliability for system availability. Availability and safety can exist in dynamic tension as keeping a system too available can be unsafe. Forcing an engineering system into a safe state too quickly can force false alarms that impede the availability of the system.
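One simple way to combine probability and severity is to rank hazards by their product. The sketch below assumes that convention, along with a hypothetical `Hazard` record, an ordinal 1 to 4 severity scale, and made-up example entries; none of these come from the text, and real programs often use a risk matrix with agreed classes instead of a bare product.

```python
from dataclasses import dataclass

@dataclass
class Hazard:
    name: str
    probability: float  # assumed probability of occurrence per mission
    severity: int       # assumed ordinal class: 1 (minor) to 4 (catastrophic)

def rank_by_risk(hazards):
    """Sort hazards by probability x severity, highest risk index first."""
    return sorted(hazards, key=lambda h: h.probability * h.severity, reverse=True)

# Hypothetical entries from a hazard log.
hazards = [
    Hazard("pressure vessel rupture", 1e-5, 4),
    Hazard("sensor drift", 1e-2, 1),
    Hazard("pump seizure", 1e-3, 3),
]
ranked = rank_by_risk(hazards)
for h in ranked:
    print(f"{h.name}: risk index {h.probability * h.severity:.1e}")
```

A product of this kind treats a frequent nuisance and a rare catastrophe as interchangeable, which is one reason the safety and availability viewpoints are usually assessed separately.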
In a de minimis definition, the severity of failures includes the cost of spare parts, man-hours, logistics, damage (secondary failures), and downtime of machines which may cause production loss. A more complete definition of failure also can mean injury, dismemberment, and death of people within the system (witness mine accidents, industrial accidents, space shuttle failures) and the same to innocent bystanders (witness the citizenry of cities like Bhopal, Love Canal, Chernobyl, or Sendai, and other victims of the 2011 Tōhoku earthquake and tsunami)—in this case, reliability engineering becomes system safety. What is acceptable is determined by the managing authority or customers or the affected communities. Residual risk is the risk that is left over after all reliability activities have finished, and includes the unidentified risk—and is therefore not completely quantifiable.
Measures such as improvements of design and materials, planned inspections, fool-proof design, and backup redundancy add complexity to a technical system; they decrease risk and increase cost. The risk can be decreased to ALARA (as low as reasonably achievable) or ALAPA (as low as practically achievable) levels.
Reliability and availability program plan
Implementing a reliability program is not simply a software purchase; it is not just a checklist of items that must be completed that ensure one has reliable products and processes. A reliability program is a complex learning and knowledge-based system unique to one's products and processes. It is supported by leadership, built on the skills that one develops within a team, integrated into business processes, and executed by following proven standard work practices.[15]
A reliability program plan is used to document exactly what "best practices" (tasks, methods, tools, analysis, and tests) are required for a particular (sub)system, as well as clarify customer requirements for reliability assessment. For large-scale complex systems, the reliability program plan should be a separate document. Resource determination for manpower and budgets for testing and other tasks is critical for a successful program. In general, the amount of work required for an effective program for complex systems is large.
A reliability program plan is essential for achieving high levels of reliability, testability, maintainability, and the resulting system availability, and is developed early during system development and refined over the system's life cycle. It specifies not only what the reliability engineer does, but also the tasks performed by other stakeholders. An effective reliability program plan must be approved by top program management, which is responsible for the allocation of sufficient resources for its implementation.
A reliability program plan may also be used to evaluate and improve the availability of a system by the strategy of focusing on increasing testability and maintainability and not on reliability. Improving maintainability is generally easier than improving reliability. Maintainability estimates (repair rates) are also generally more accurate. However, because the uncertainties in the reliability estimates are in most cases very large, they are likely to dominate the availability calculation (prediction uncertainty problem), even when maintainability levels are very high. When reliability is not under control, more complicated issues may arise, like manpower (maintainers/customer service capability) shortages, spare part availability, logistic delays, lack of repair facilities, extensive retrofit and complex configuration management costs, and others. The problem of unreliability may also be increased by the "domino effect" of maintenance-induced failures after repairs. Focusing only on maintainability is therefore not enough. If failures are prevented, none of the other issues are of any importance, and therefore reliability is generally regarded as the most important part of availability. Reliability needs to be evaluated and improved in relation to both availability and the total cost of ownership (TCO) due to the cost of spare parts, maintenance man-hours, transport costs, storage costs, part obsolescence risks, etc. But, as GM and Toyota have belatedly discovered, TCO also includes the downstream liability costs when reliability calculations have not sufficiently or accurately addressed customers' bodily risks. Often a trade-off is needed between availability and cost of ownership. There might be a maximum ratio between availability and cost of ownership. The testability of a system should also be addressed in the plan, as this is the link between reliability and maintainability.
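The prediction uncertainty problem can be made concrete with a small calculation. The sketch below assumes the common steady-state formula A = MTBF / (MTBF + MTTR), which the text does not state explicitly, and uses hypothetical numbers: an MTBF known only to within a factor of 10 either way, and an MTTR known to within about an hour.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Inherent availability A = MTBF / (MTBF + MTTR) (assumed model)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical nominal estimates.
nominal = steady_state_availability(1000.0, 4.0)

# MTBF uncertain by a factor of 10; MTTR uncertain by about one hour.
mtbf_spread = [steady_state_availability(m, 4.0) for m in (100.0, 10000.0)]
mttr_spread = [steady_state_availability(1000.0, r) for r in (5.0, 3.0)]

print(f"nominal availability: {nominal:.4f}")
print(f"range due to MTBF uncertainty: {mtbf_spread[0]:.4f} to {mtbf_spread[1]:.4f}")
print(f"range due to MTTR uncertainty: {mttr_spread[0]:.4f} to {mttr_spread[1]:.4f}")
```

Under these assumed numbers the spread caused by the MTBF estimate is far wider than the spread caused by the MTTR estimate, which is the sense in which reliability uncertainty dominates the availability calculation.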
The maintenance strategy can influence the reliability of a system (e.g., by preventive and/or predictive maintenance), although it can never bring it above the inherent reliability.
The reliability plan should clearly provide a strategy for availability control. Whether only availability or also cost of ownership is more important depends on the use of the system. For example, a system that is a critical link in a production system—e.g., a big oil platform—is normally allowed to have a very high cost of ownership if that cost translates to even a minor increase in availability, as the unavailability of the platform results in a massive loss of revenue which can easily exceed the high cost of ownership. A proper reliability plan should always address RAMT analysis in its total context. RAMT stands for reliability, availability, maintainability/maintenance, and testability in the context of the customer's needs.
Reliability requirements
For any system, one of the first tasks of reliability engineering is to adequately specify the reliability and maintainability requirements allocated from the overall availability needs and, more importantly, derived from proper design failure analysis or preliminary prototype test results. Clear requirements (able to be designed to) should constrain the designers from designing particular unreliable items/constructions/interfaces/systems. Setting only availability, reliability, testability, or maintainability targets (e.g., max. failure rates) is not appropriate. This is a widespread misunderstanding about Reliability Requirements Engineering. Reliability requirements address the system itself, including test and assessment requirements, and associated tasks and documentation. Reliability requirements are included in the appropriate system or subsystem requirements specifications, test plans, and contract statements. The creation of proper lower-level requirements is critical.[16] The provision of only quantitative minimum targets (e.g., Mean Time Between Failure (MTBF) values or failure rates) is not sufficient for several reasons. One reason is that a full validation (related to correctness and verifiability in time) of a quantitative reliability allocation (requirement spec) on lower levels for complex systems often cannot be made, because (1) the requirements are probabilistic, (2) extremely high levels of uncertainty are involved in showing compliance with all these probabilistic requirements, and (3) reliability is a function of time, and accurate estimates of a (probabilistic) reliability number per item are available only very late in the project, sometimes even after many years of in-service use. Compare this problem with the continuous (re-)balancing of, for example, lower-level-system mass requirements in the development of an aircraft, which is already often a big undertaking.
Notice that in this case, masses differ by only a few percent, are not a function of time, and the data is non-probabilistic and already available in CAD models. In the case of reliability, the levels of unreliability (failure rates) may change by orders of magnitude (factors of 10) as a result of very minor deviations in design, process, or anything else.[17] The information is often not available without huge uncertainties within the development phase. This makes this allocation problem almost impossible to solve in a useful, practical, valid manner that does not result in massive over- or under-specification. A pragmatic approach is therefore needed, for example the use of general levels/classes of quantitative requirements depending only on the severity of failure effects. Also, the validation of results is a far more subjective task than for any other type of requirement. (Quantitative) reliability parameters, in terms of MTBF, are by far the most uncertain design parameters in any design.
Furthermore, reliability design requirements should drive a (system or part) design to incorporate features that prevent failures from occurring, or limit consequences from failure in the first place. Not only would this aid in some predictions, it would also keep the engineering effort from being diverted into a kind of accounting work. A design requirement should be precise enough so that a designer can "design to" it and can also prove, through analysis or testing, that the requirement has been achieved, if possible within a stated confidence. Any type of reliability requirement should be detailed and could be derived from failure analysis (Finite-Element Stress and Fatigue analysis, Reliability Hazard Analysis, FTA, FMEA, Human Factor Analysis, Functional Hazard Analysis, etc.) or any type of reliability testing. Also, requirements are needed for verification tests (e.g., required overload stresses) and test time needed. To derive these requirements in an effective manner, a systems engineering-based risk assessment and mitigation logic should be used. Robust hazard log systems must be created that contain detailed information on why and how systems could fail or have failed. Requirements are to be derived and tracked in this way. These practical design requirements shall drive the design and not be used only for verification purposes. These requirements (often design constraints) are in this way derived from failure analysis or preliminary tests. Understanding of this difference compared to only purely quantitative (logistic) requirement specification (e.g., Failure Rate / MTBF target) is paramount in the development of successful (complex) systems.[18]
The maintainability requirements address the costs of repairs as well as repair time. Testability (not to be confused with test requirements) requirements provide the link between reliability and maintainability and should address detectability of failure modes (on a particular system level), isolation levels, and the creation of diagnostics (procedures). As indicated above, reliability engineers should also address requirements for various reliability tasks and documentation during system development, testing, production, and operation. These requirements are generally specified in the contract statement of work and depend on how much leeway the customer wishes to provide to the contractor. Reliability tasks include various analyses, planning, and failure reporting. Task selection depends on the criticality of the system as well as cost. A safety-critical system may require a formal failure reporting and review process throughout development, whereas a non-critical system may rely on final test reports. The most common reliability program tasks are documented in reliability program standards, such as MIL-STD-785 and IEEE 1332. Failure reporting analysis and corrective action systems are a common approach for product/process reliability monitoring.
Reliability culture / human errors / human factors
In practice, most failures can be traced back to some type of human error, for example in:
- Management decisions (e.g. in budgeting, timing, and required tasks)
- Systems Engineering: Use studies (load cases)
- Systems Engineering: Requirement analysis / setting
- Systems Engineering: Configuration control
- Assumptions
- Calculations / simulations / FEM analysis
- Design
- Design drawings
- Testing (e.g. incorrect load settings or failure measurement)
- Statistical analysis
- Manufacturing
- Quality control
- Maintenance
- Maintenance manuals
- Training
- Classifying and ordering of information
- Feedback of field information (e.g. incorrect or too vague)
- etc.
However, humans are also very good at detecting such failures, correcting them, and improvising when abnormal situations occur. Therefore, policies that completely rule out human actions in design and production processes to improve reliability may not be effective. Some tasks are better performed by humans and some are better performed by machines.[19]
Furthermore, human errors in management, the organization of data and information, or the misuse or abuse of items may also contribute to unreliability. This is the core reason why high levels of reliability for complex systems can only be achieved by following a robust systems engineering process with proper planning and execution of the validation and verification tasks. This also includes the careful organization of data and information sharing and creating a "reliability culture", in the same way that having a "safety culture" is paramount in the development of safety-critical systems.
Reliability prediction and improvement
Reliability prediction combines:
- creation of a proper reliability model (see further on this page)
- estimation (and justification) of input parameters for this model (e.g. failure rates for a particular failure mode or event and the mean time to repair the system for a particular failure)
- estimation of output reliability parameters at system or part level (i.e. system availability or frequency of a particular functional failure)
The emphasis on quantification and target setting (e.g. MTBF) might imply there is a limit to achievable reliability; however, there is no inherent limit, and development of higher reliability does not need to be more costly. In addition, it has been argued that prediction of reliability from historic data can be very misleading, with comparisons only valid for identical designs, products, manufacturing processes, and maintenance with identical operating loads and usage environments. Even minor changes in any of these could have major effects on reliability. Furthermore, the most unreliable and important items (i.e. the most interesting candidates for a reliability investigation) are most likely to have been modified and re-engineered since historical data was gathered, making the standard (re-active or pro-active) statistical methods and processes used in e.g. medical or insurance industries less effective. Another argument is that to be able to accurately predict reliability by testing, the exact mechanisms of failure must be known and therefore, in most cases, could be prevented. Following the incorrect route of trying to quantify and solve a complex reliability engineering problem in terms of MTBF or probability using an incorrect (for example, the re-active) approach is referred to by Barnard as "Playing the Numbers Game" and is regarded as bad practice.[20]
For existing systems, it is arguable that any attempt by a responsible program to correct the root cause of discovered failures may render the initial MTBF estimate invalid, as new assumptions (themselves subject to high error levels) of the effect of this correction must be made. Another practical issue is the general unavailability of detailed failure data, with those available often featuring inconsistent filtering of failure (feedback) data, and ignoring statistical errors (which are very high for rare events like reliability-related failures). Very clear guidelines must be present to count and compare failures related to different types of root causes (e.g. manufacturing-, maintenance-, transport-, system-induced or inherent design failures). Comparing different types of causes may lead to incorrect estimations and incorrect business decisions about the focus of improvement.
To perform a proper quantitative reliability prediction for systems may be difficult and very expensive if done by testing. At the individual part level, reliability results can often be obtained with comparatively high confidence, as testing of many sample parts might be possible using the available testing budget. However, these tests may lack validity at a system level due to assumptions made at part-level testing. Some authors have emphasized the importance of initial part- or system-level testing until failure, and of learning from such failures to improve the system or part. The general conclusion is drawn that an accurate and absolute prediction of reliability, by either field-data comparison or testing, is in most cases not possible. An exception might be failures due to wear-out problems such as fatigue failures. In the introduction of MIL-STD-785 it is written that reliability prediction should be used with great caution, if not used solely for comparison in trade-off studies.
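The size of the statistical error for rare events can be illustrated with a confidence bound on a failure rate. The sketch below assumes a constant failure rate, a fixed observation period, and Poisson-distributed failure counts; these modeling choices and the function names are assumptions for illustration, and the field data are hypothetical.

```python
import math

def poisson_cdf(k: int, mean: float) -> float:
    """P(N <= k) for a Poisson count with the given mean."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total

def failure_rate_upper_bound(failures: int, total_hours: float,
                             confidence: float = 0.90) -> float:
    """One-sided upper confidence bound on a constant failure rate,
    found by bisection on the expected number of failures."""
    if failures < 0 or total_hours <= 0 or not 0 < confidence < 1:
        raise ValueError("invalid input")
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(failures, hi) > target:  # grow until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi / total_hours

# Hypothetical field data: 10,000 operating hours observed.
bound_0 = failure_rate_upper_bound(0, 10_000.0)
bound_2 = failure_rate_upper_bound(2, 10_000.0)
print(f"0 failures: MTBF lower bound ~ {1 / bound_0:.0f} h")
print(f"2 failures: MTBF lower bound ~ {1 / bound_2:.0f} h")
```

Under these assumptions, even 10,000 failure-free hours support an MTBF of only roughly 4,300 hours at 90% confidence, and two recorded failures lower the bound considerably further.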
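Why part-level demonstration is feasible while system-level demonstration is expensive can be seen from the sample size of a zero-failure ("success run") test. The sketch below assumes independent pass/fail trials with no failures allowed, giving n = ln(1 - C) / ln(R); the function name and the example targets are hypothetical.

```python
import math

def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Number of units that must all pass to demonstrate `reliability`
    at `confidence`, assuming independent trials and zero failures."""
    if not (0 < reliability < 1 and 0 < confidence < 1):
        raise ValueError("reliability and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

# Hypothetical demonstration targets at 90% confidence.
sizes = {r: success_run_sample_size(r, 0.90) for r in (0.90, 0.99, 0.999)}
for target, n in sizes.items():
    print(f"R >= {target}: test {n} units with zero failures")
```

Testing a few hundred inexpensive parts may fit within a budget; testing thousands of complete systems rarely does.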
Design for reliability
Design for Reliability (DfR) is a process that encompasses tools and procedures to ensure that a product meets its reliability requirements, under its use environment, for the duration of its lifetime. DfR is implemented in the design stage of a product to proactively improve product reliability.[21] DfR is often used as part of an overall Design for Excellence (DfX) strategy.
Statistics-based approach (i.e. MTBF)
Reliability design begins with the development of a (system) model. Reliability and availability models use block diagrams and Fault Tree Analysis to provide a graphical means of evaluating the relationships between different parts of the system. These models may incorporate predictions based on failure rates taken from historical data. While the (input data) predictions are often not accurate in an absolute sense, they are valuable to assess relative differences in design alternatives. Maintainability parameters, for example Mean time to repair (MTTR), can also be used as inputs for such models.
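A reliability block diagram of series and parallel blocks can be evaluated with two small functions. The sketch below assumes statistically independent blocks, and the block reliabilities and the two design alternatives are hypothetical numbers chosen only to show a relative comparison.

```python
def series(*reliabilities: float) -> float:
    """All blocks must work: product of block reliabilities
    (independence between blocks is assumed)."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result

def parallel(*reliabilities: float) -> float:
    """At least one block must work: 1 - product of block unreliabilities."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail

# Hypothetical alternatives: a controller (0.99) feeding either
# two redundant pumps (0.90 each) or one better pump (0.95).
design_a = series(0.99, parallel(0.90, 0.90))
design_b = series(0.99, 0.95)
print(f"design A: {design_a:.4f}, design B: {design_b:.4f}")
```

Even if the input values are inaccurate in an absolute sense, a comparison like this can still indicate which alternative is relatively better, provided both are affected by the input errors in a similar way.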
The most important fundamental initiating causes and failure mechanisms are to be identified and analyzed with engineering tools. A diverse set of practical guidance as to performance and reliability should be provided to designers so that they can generate low-stressed designs and products that protect, or are protected against, damage and excessive wear. Proper validation of input loads (requirements) may be needed, in addition to verification for reliability "performance" by testing.

One of the most important design techniques is redundancy. This means that if one part of the system fails, there is an alternate success path, such as a backup system. The reason why this is the ultimate design choice is that high-confidence reliability evidence for new parts or systems is often not available, or is extremely expensive to obtain. By combining redundancy with a high level of failure monitoring and the avoidance of common cause failures, even a system with relatively poor single-channel (part) reliability can be made highly reliable at a system level (up to mission-critical reliability). No reliability testing is required for this. In conjunction with redundancy, the use of dissimilar designs or manufacturing processes (e.g. via different suppliers of similar parts) for single independent channels can provide less sensitivity to quality issues (e.g. early-life failures at a single supplier), allowing very high levels of reliability to be achieved at all moments of the development cycle (from early life to long-term). Redundancy can also be applied in systems engineering by double checking requirements, data, designs, calculations, software, and tests to overcome systematic failures.
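The effect of common cause failures on redundancy can be sketched with a simplified beta-factor model, in which an assumed fraction `beta` of each channel's failure probability is shared by all channels. The model form, the function name, and the numbers below are illustrative assumptions, not values from the text.

```python
def redundant_failure_probability(q_channel: float, n_channels: int,
                                  beta: float = 0.0) -> float:
    """Approximate failure probability of n identical redundant channels.

    beta is the assumed fraction of channel failures that are common-cause
    and defeat every channel at once; the remaining independent failures
    must occur in all channels for the system to fail."""
    if not (0.0 <= q_channel <= 1.0 and 0.0 <= beta <= 1.0) or n_channels < 1:
        raise ValueError("invalid input")
    independent = ((1.0 - beta) * q_channel) ** n_channels
    common = beta * q_channel
    # Union of the two failure paths, treated as independent events.
    return independent + common - independent * common

# Hypothetical triple-redundant system built from poor channels (q = 0.1).
p_ideal = redundant_failure_probability(0.1, 3)             # no common cause
p_common = redundant_failure_probability(0.1, 3, beta=0.05)
print(f"independent channels: {p_ideal:.2e}")
print(f"with 5% common cause: {p_common:.2e}")
```

With these assumed numbers, a small common-cause fraction dominates the system failure probability, which is why dissimilar designs and suppliers are used alongside redundancy.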
Another effective way to deal with reliability issues is to perform analysis that predicts degradation, enabling the prevention of unscheduled downtime events / failures. RCM (Reliability Centered Maintenance) programs can be used for this.
Physics-of-failure-based approach
For electronic assemblies, there has been an increasing shift towards a different approach called physics of failure. This technique relies on understanding the physical static and dynamic failure mechanisms. It accounts for variation in load, strength, and stress that lead to failure with a high level of detail, made possible with the use of modern finite element method (FEM) software programs that can handle complex geometries and mechanisms such as creep, stress relaxation, fatigue, and probabilistic design (Monte Carlo Methods/DOE). The material or component can be re-designed to reduce the probability of failure and to make it more robust against such variations. Another common design technique is component derating: i.e. selecting components whose specifications significantly exceed the expected stress levels, such as using heavier gauge electrical wire than might normally be specified for the expected electric current.
Common tools and techniques
Many of the tasks, techniques, and analyses used in Reliability Engineering are specific to particular industries and applications, but can commonly include:
- Physics of failure (PoF)
- Built-in self-test (BIT or BIST) (testability analysis)
- Failure mode and effects analysis (FMEA)
- Reliability hazard analysis
- Reliability block-diagram analysis
- Dynamic reliability block-diagram analysis[22]
- Fault tree analysis
- Root cause analysis
- Statistical engineering, design of experiments – e.g. on simulations / FEM models or with testing
- Sneak circuit analysis
- Accelerated testing
- Reliability growth analysis (re-active reliability)
- Weibull analysis (for testing or mainly "re-active" reliability)
- Hypertabastic survival models
- Thermal analysis by finite element analysis (FEA) and / or measurement
- Thermal induced, shock and vibration fatigue analysis by FEA and / or measurement
- Electromagnetic analysis
- Avoidance of single point of failure (SPOF)
- Functional analysis and functional failure analysis (e.g., function FMEA, FHA or FFA)
- Predictive and preventive maintenance: reliability centered maintenance (RCM) analysis
- Testability analysis
- Failure diagnostics analysis (normally also incorporated in FMEA)
- Human error analysis
- Operational hazard analysis
- Preventative/Planned Maintenance Optimization (PMO)
- Manual screening
- Integrated logistics support
Results from these methods are presented during reviews of part or system design, and logistics. Reliability is just one requirement among many for a complex part or system. Engineering trade-off studies are used to determine the optimum balance between reliability requirements and other constraints.
The importance of language
Reliability engineers, whether using quantitative or qualitative methods to describe a failure or hazard, rely on language to pinpoint the risks and enable issues to be solved. The language used must help create an orderly description of the function/item/system and its complex surrounding as it relates to the failure of these functions/items/systems. Systems engineering is very much about finding the correct words to describe the problem (and related risks), so that they can be readily solved via engineering solutions. Jack Ring said that a systems engineer's job is to "language the project." (Ring et al. 2000)[23] For part/system failures, reliability engineers should concentrate more on the "why and how", rather than predicting "when". Understanding "why" a failure has occurred (e.g. due to over-stressed components or manufacturing issues) is far more likely to lead to improvement in the designs and processes used[4] than quantifying "when" a failure is likely to occur (e.g. via determining MTBF). To do this, first the reliability hazards relating to the part/system need to be classified and ordered (based on some form of qualitative and quantitative logic if possible) to allow for more efficient assessment and eventual improvement. This is partly done in pure language and proposition logic, but also based on experience with similar items. This can for example be seen in descriptions of events in fault tree analysis, FMEA analysis, and hazard (tracking) logs. In this sense language and proper grammar (part of qualitative analysis) play an important role in reliability engineering, just as they do in safety engineering or in systems engineering in general.
Correct use of language can also be key to identifying or reducing the risks of human error, which are often the root cause of many failures. This can include proper instructions in maintenance manuals, operation manuals, emergency procedures, and others to prevent systematic human errors that may result in system failures. These should be written by trained or experienced technical authors using so-called simplified English or Simplified Technical English, where words and structure are specifically chosen and created so as to reduce ambiguity or risk of confusion (e.g. "replace the old part" could ambiguously refer to swapping a worn-out part with a non-worn-out part, or to replacing a part with one using a more recent and hopefully improved design).
Reliability modeling
Reliability modeling is the process of predicting or understanding the reliability of a component or system prior to its implementation. Two types of analysis that are often used to model a complete system's availability behavior including effects from logistics issues like spare part provisioning, transport and manpower are fault tree analysis and reliability block diagrams. At a component level, the same types of analyses can be used together with others. The input for the models can come from many sources including testing; prior operational experience; field data; as well as data handbooks from similar or related industries. Regardless of source, all model input data must be used with great caution, as predictions are only valid in cases where the same product was used in the same context. As such, predictions are often only used to help compare alternatives.

For part level predictions, two separate fields of investigation are common:
- The physics of failure approach uses an understanding of physical failure mechanisms involved, such as mechanical crack propagation or chemical corrosion degradation or failure;
- The parts stress modelling approach is an empirical method for prediction based on counting the number and type of components of the system, and the stress they undergo during operation.
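A minimal sketch of the counting idea behind the parts stress approach is shown below. It assumes constant failure rates, a series system (any part failure fails the system), and a single multiplicative stress factor per part type. The part names, base rates, and stress factors are invented for illustration and are not taken from any handbook.

```python
def system_failure_rate(parts: dict[str, tuple[int, float, float]]) -> float:
    """Sum quantity * base rate * stress factor over all part types.

    Rates are in failures per million hours; series logic is assumed.
    """
    return sum(qty * base * stress for qty, base, stress in parts.values())


# Hypothetical bill of materials:
# part type -> (quantity, base failure rate per 1e6 h, stress factor)
bill_of_materials = {
    "resistor": (40, 0.002, 1.0),
    "capacitor": (25, 0.010, 2.0),
    "microcontroller": (1, 0.150, 1.5),
}

rate = system_failure_rate(bill_of_materials)
print("failures per million hours:", rate)
print("predicted MTBF in hours:", 1e6 / rate)
```

As the surrounding text cautions, a figure produced this way is better suited to comparing design alternatives than to predicting field behavior in absolute terms.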
Reliability theory
Reliability is defined as the probability that a device will perform its intended function during a specified period of time under stated conditions. Mathematically, this may be expressed as
$$R(t) = \int_{t}^{\infty} f(x)\,dx,$$
where $f(x)$ is the failure probability density function and $t$ is the length of the period of time (which is assumed to start from time zero).
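As a worked illustration, write the reliability as the integral of the failure probability density from $t$ to infinity and assume, purely for the sake of example, a constant failure rate $\lambda$. The density is then exponential and the integral has a closed form:

```latex
f(x) = \lambda e^{-\lambda x}, \qquad
R(t) = \int_{t}^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}
```

With an assumed $\lambda = 10^{-4}$ failures per hour, for instance, $R(1000\ \text{h}) = e^{-0.1} \approx 0.905$. Real components rarely have a truly constant failure rate, so this should be read as a simplification.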
There are a few key elements of this definition:
- Reliability is predicated on "intended function:" Generally, this is taken to mean operation without failure. However, even if no individual part of the system fails, but the system as a whole does not do what was intended, then it is still charged against the system reliability. The system requirements specification is the criterion against which reliability is measured.
- Reliability applies to a specified period of time. In practical terms, this means that a system has a specified chance that it will operate without failure before time $t$. Reliability engineering ensures that components and materials will meet the requirements during the specified time. Note that units other than time may sometimes be used (e.g. "a mission", "operation cycles").
- Reliability is restricted to operation under stated (or explicitly defined) conditions. This constraint is necessary because it is impossible to design a system for unlimited conditions. A Mars rover will have different specified conditions than a family car. The operating environment must be addressed during design and testing. That same rover may be required to operate in varying conditions requiring additional scrutiny.
- Two notable references on reliability theory and its mathematical and statistical foundations are Barlow, R. E. and Proschan, F. (1982) and Samaniego, F. J. (2007).
Quantitative system reliability parameters—theory
Quantitative requirements are specified using reliability parameters. The most common reliability parameter is the mean time to failure (MTTF), which can also be specified as the failure rate (this is expressed as a frequency or conditional probability density function (PDF)) or the number of failures during a given period. These parameters may be useful for higher system levels and systems that are operated frequently (e.g. vehicles, machinery, and electronic equipment). Reliability increases as the MTTF increases. The MTTF is usually specified in hours, but can also be used with other units of measurement, such as miles or cycles. Using MTTF values on lower system levels can be very misleading, especially if they do not specify the associated Failure Modes and Mechanisms (The F in MTTF).[17]
In other cases, reliability is specified as the probability of mission success. For example, reliability of a scheduled aircraft flight can be specified as a dimensionless probability or a percentage, as often used in system safety engineering.
A special case of mission success is the single-shot device or system. These are devices or systems that remain relatively dormant and only operate once. Examples include automobile airbags, thermal batteries and missiles. Single-shot reliability is specified as a probability of one-time success or is subsumed into a related parameter. Single-shot missile reliability may be specified as a requirement for the probability of a hit. For such systems, the probability of failure on demand (PFD) is the reliability measure – this is actually an "unavailability" number. The PFD is derived from failure rate (a frequency of occurrence) and mission time for non-repairable systems.
For repairable systems, it is obtained from failure rate, mean-time-to-repair (MTTR), and test interval. This measure may not be unique for a given system as this measure depends on the kind of demand. In addition to system level requirements, reliability requirements may be specified for critical subsystems. In most cases, reliability parameters are specified with appropriate statistical confidence intervals.
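The two PFD cases can be sketched numerically. The formulas below are common first-order simplifications, not a full treatment: they assume a constant failure rate, a failure that stays hidden until the next demand or proof test, and (for the repairable case) a failure rate times test interval much smaller than 1. Function names and example values are hypothetical.

```python
import math


def pfd_non_repairable(failure_rate: float, mission_time: float) -> float:
    """Probability that the item has already failed when demanded at mission_time."""
    return 1.0 - math.exp(-failure_rate * mission_time)


def pfd_avg_periodic_test(failure_rate: float, test_interval: float,
                          mttr: float = 0.0) -> float:
    """Approximate average PFD of a periodically tested, repairable item.

    On average a hidden failure exists for half a test interval before it is
    found, plus the repair time. Valid only for failure_rate * test_interval << 1.
    """
    return failure_rate * (test_interval / 2.0 + mttr)


# Hypothetical values: rates per hour, times in hours.
print(pfd_non_repairable(1e-5, 1000.0))
print(pfd_avg_periodic_test(1e-6, 8760.0, mttr=8.0))
```

Shortening the test interval lowers the average PFD roughly in proportion, which is one reason the test interval appears alongside failure rate and MTTR as an input.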
Reliability testing
The purpose of reliability testing or reliability verification is to discover potential problems with the design as early as possible and, ultimately, provide confidence that the system meets its reliability requirements. The reliability of the product in all environments such as expected use, transportation, or storage during the specified lifespan should be considered.[10] Testing exposes the product to natural or artificial environmental conditions in order to evaluate its performance under the conditions of actual use, transportation, and storage, and to analyze the degree of influence of environmental factors and their mechanisms of action.[24] Environmental test equipment is used to simulate high temperature, low temperature, high humidity, and temperature changes in the climate environment, accelerating the product's response to its use environment in order to verify whether it reaches the quality expected from R&D, design, and manufacturing.[25]
Reliability verification is also called reliability testing, which refers to the use of modeling, statistics, and other methods to evaluate the reliability of the product based on the product's life span and expected performance.[26] Most products on the market require reliability testing, including automotive products, integrated circuits, heavy machinery used to mine natural resources, and aircraft auto software.[27][28]
Reliability testing may be performed at several levels and there are different types of testing. Complex systems may be tested at component, circuit board, unit, assembly, subsystem and system levels.[29] (The test level nomenclature varies among applications.) For example, performing environmental stress screening tests at lower levels, such as piece parts or small assemblies, catches problems before they cause failures at higher levels. Testing proceeds during each level of integration through full-up system testing, developmental testing, and operational testing, thereby reducing program risk. However, testing does not mitigate unreliability risk.
With each test both statistical type I and type II errors could be made, depending on sample size, test time, assumptions and the needed discrimination ratio. There is risk of incorrectly rejecting a good design (type I error) and the risk of incorrectly accepting a bad design (type II error).
It is not always feasible to test all system requirements. Some systems are prohibitively expensive to test; some failure modes may take years to observe; some complex interactions result in a huge number of possible test cases; and some tests require the use of limited test ranges or other resources. In such cases, different approaches to testing can be used, such as (highly) accelerated life testing, design of experiments, and simulations.
The desired level of statistical confidence also plays a role in reliability testing. Statistical confidence is increased by increasing either the test time or the number of items tested. Reliability test plans are designed to achieve the specified reliability at the specified confidence level with the minimum number of test units and test time. Different test plans result in different levels of risk to the producer and consumer. The desired reliability, statistical confidence, and risk levels for each side influence the ultimate test plan. The customer and developer should agree in advance on how reliability requirements will be tested.
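The trade-off between reliability, confidence, and number of test units can be illustrated with a zero-failure ("success run") test plan. This sketch assumes independent pass/fail trials and a plan in which no failures are allowed; it is only one of many possible test plans, and the function names are hypothetical.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest number of units that must all pass to demonstrate
    the given reliability at the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


def demonstrated_confidence(reliability: float, units: int) -> float:
    """Confidence achieved when `units` items are tested with zero failures."""
    return 1.0 - reliability ** units


for target in (0.90, 0.99):
    n = zero_failure_sample_size(target, 0.90)
    print(f"R={target}: {n} units, confidence {demonstrated_confidence(target, n):.3f}")
```

Raising the reliability target from 0.90 to 0.99 at the same confidence increases the required sample roughly tenfold, which shows why customer and developer need to agree on the test plan in advance.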
A key aspect of reliability testing is to define "failure". Although this may seem obvious, there are many situations where it is not clear whether a failure is really the fault of the system. Variations in test conditions, operator differences, weather and unexpected situations create differences between the customer and the system developer. One strategy to address this issue is to use a scoring conference process. A scoring conference includes representatives from the customer, the developer, the test organization, the reliability organization, and sometimes independent observers. The scoring conference process is defined in the statement of work. Each test case is considered by the group and "scored" as a success or failure. This scoring is the official result used by the reliability engineer.
As part of the requirements phase, the reliability engineer develops a test strategy with the customer. The test strategy makes trade-offs between the needs of the reliability organization, which wants as much data as possible, and constraints such as cost, schedule and available resources. Test plans and procedures are developed for each reliability test, and results are documented.
Reliability testing is common in the Photonics industry. Examples of reliability tests of lasers are life test and burn-in. These tests consist of the highly accelerated aging, under controlled conditions, of a group of lasers. The data collected from these life tests are used to predict laser life expectancy under the intended operating characteristics.[30]
Reliability test requirements
There are many criteria to test, depending on the product or process being tested. The five most common components are:[31][32]
- Product life span
- Intended function
- Operating Condition
- Probability of Performance
- User exceptions[33]
The product life span can be split into four different categories for analysis. Useful life is the estimated economic life of the product, defined as the time it can be used before the cost of repair no longer justifies its continued use. Warranty life is the period within which the product should perform its function. Design life is the life span targeted during design, where the designer takes into consideration the lifetime of competing products and customer desires to ensure that the product does not result in customer dissatisfaction.[34][35]
Reliability test requirements can follow from any analysis for which the first estimate of failure probability, failure mode or effect needs to be justified. Evidence can be generated with some level of confidence by testing. With software-based systems, the probability is a mix of software and hardware-based failures. Testing reliability requirements is problematic for several reasons. A single test is in most cases insufficient to generate enough statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impractical, and environmental conditions can be hard to predict over a system's life cycle.
Reliability engineering is used to design a realistic and affordable test program that provides empirical evidence that the system meets its reliability requirements. Statistical confidence levels are used to address some of these concerns. A certain parameter is expressed along with a corresponding confidence level: for example, an MTBF of 1000 hours at 90% confidence level. From this specification, the reliability engineer can, for example, design a test with explicit criteria for the number of hours and number of failures until the requirement is met or failed. Different sorts of tests are possible.
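To make the "MTBF of 1000 hours at 90% confidence" example concrete, the sketch below computes how many total test hours are needed for a time-terminated test that allows a fixed number of failures. It assumes a constant failure rate (so the number of failures is Poisson distributed) and uses only the standard library; the function names are hypothetical and other test designs are possible.

```python
import math


def poisson_cdf(max_failures: int, expected_failures: float) -> float:
    """P(N <= max_failures) for a Poisson-distributed failure count N."""
    term = math.exp(-expected_failures)
    total = term
    for k in range(1, max_failures + 1):
        term *= expected_failures / k
        total += term
    return total


def required_test_hours(mtbf_hours: float, confidence: float,
                        allowed_failures: int = 0) -> float:
    """Total test hours so that passing (at most allowed_failures failures)
    demonstrates the MTBF at the requested confidence."""
    target = 1.0 - confidence
    low, high = 0.0, 1.0
    while poisson_cdf(allowed_failures, high) > target:
        high *= 2.0  # expand until the pass probability is small enough
    for _ in range(200):  # bisection on the expected number of failures
        mid = (low + high) / 2.0
        if poisson_cdf(allowed_failures, mid) > target:
            low = mid
        else:
            high = mid
    return high * mtbf_hours


print(required_test_hours(1000.0, 0.90, allowed_failures=0))
print(required_test_hours(1000.0, 0.90, allowed_failures=1))
```

Allowing one failure instead of none lengthens the test considerably, which is the kind of cost and risk trade-off the test plan has to settle.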
The combination of required reliability level and required confidence level greatly affects the development cost and the risk to both the customer and producer. Care is needed to select the best combination of requirements—e.g. cost-effectiveness. Reliability testing may be performed at various levels, such as component, subsystem and system. Also, many factors must be addressed during testing and operation, such as extreme temperature and humidity, shock, vibration, or other environmental factors (like loss of signal, cooling or power; or other catastrophes such as fire, floods, excessive heat, physical or security violations or other myriad forms of damage or degradation). For systems that must last many years, accelerated life tests may be needed.
Testing method
A systematic approach to reliability testing is to first determine the reliability goal, then perform tests that are linked to performance and determine the reliability of the product.[36] In modern industries, a reliability verification test should clearly establish how it relates to the product's overall reliability performance and how individual tests affect warranty cost and customer satisfaction.[37]
Accelerated testing
The purpose of accelerated life testing (ALT) is to induce field failure in the laboratory at a much faster rate by providing a harsher, but nonetheless representative, environment. In such a test, the product is expected to fail in the lab just as it would have failed in the field, but in much less time. The main objective of an accelerated test is either of the following:
- To discover failure modes
- To predict the normal field life from the high stress lab life
An accelerated testing program can be broken down into the following steps:
- Define objective and scope of the test
- Collect required information about the product
- Identify the stress(es)
- Determine level of stress(es)
- Conduct the accelerated test and analyze the collected data.
Common ways to determine a life stress relationship are:
- Arrhenius model
- Eyring model
- Inverse power law model
- Temperature–humidity model
- Temperature non-thermal model
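As an illustration of the first model in the list, the Arrhenius relationship gives an acceleration factor between a use temperature and a higher stress temperature. The sketch below assumes a single thermally activated failure mechanism; the activation energy of 0.7 eV and both temperatures are hypothetical example values, since the correct activation energy depends on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev: float,
                                  use_temp_c: float,
                                  stress_temp_c: float) -> float:
    """Ratio of life at use temperature to life at stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


af = arrhenius_acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=125.0)
print("acceleration factor:", af)
print("field hours represented by 1000 lab hours:", 1000.0 * af)
```

The extrapolation is only meaningful if the elevated temperature triggers the same failure mechanism as field use, which is why the stress must remain representative.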
Software reliability
Software reliability is a special aspect of reliability engineering. It focuses on foundations and techniques to make software more reliable, i.e., resilient to faults. System reliability, by definition, includes all parts of the system, including hardware, software, supporting infrastructure (including critical external interfaces), operators and procedures. Traditionally, reliability engineering focuses on critical hardware parts of the system. Since the widespread use of digital integrated circuit technology, software has become an increasingly critical part of most electronics and, hence, nearly all present day systems. Therefore, software reliability has gained prominence within the field of system reliability.
There are significant differences, however, in how software and hardware behave. Most hardware unreliability is the result of a component or material failure that results in the system not performing its intended function. Repairing or replacing the hardware component restores the system to its original operating state. However, software does not fail in the same sense that hardware fails. Instead, software unreliability is the result of unanticipated results of software operations. Even relatively small software programs can have astronomically large combinations of inputs and states that are infeasible to exhaustively test. Restoring software to its original state only works until the same combination of inputs and states results in the same unintended result. Software reliability engineering must take this into account.
Despite this difference in the source of failure between software and hardware, several software reliability models based on statistics have been proposed to quantify what we experience with software: the longer software is run, the higher the probability that it will eventually be used in an untested manner and exhibit a latent defect that results in a failure (Shooman 1987), (Musa 2005), (Denney 2005).
As with hardware, software reliability depends on good requirements, design and implementation. Software reliability engineering relies heavily on a disciplined software engineering process to anticipate and design against unintended consequences. There is more overlap between software quality engineering and software reliability engineering than between hardware quality and reliability. A good software development plan is a key aspect of the software reliability program. The software development plan describes the design and coding standards, peer reviews, unit tests, configuration management, software metrics and software models to be used during software development.
A common reliability metric is the number of software faults per line of code (FLOC), usually expressed as faults per thousand lines of code. This metric, along with software execution time, is key to most software reliability models and estimates. The theory is that the software reliability increases as the number of faults (or fault density) decreases. Establishing a direct connection between fault density and mean-time-between-failure is difficult, however, because of the way software faults are distributed in the code, their severity, and the probability of the combination of inputs necessary to encounter the fault. Nevertheless, fault density serves as a useful indicator for the reliability engineer. Other software metrics, such as complexity, are also used. This metric remains controversial, since changes in software development and verification practices can have dramatic impact on overall defect rates.
Software testing is an important aspect of software reliability. Even the best software development process results in some software faults that are nearly undetectable until tested. Software is tested at several levels, starting with individual units, through integration and full-up system testing. During all phases of testing, software faults are discovered, corrected, and re-tested. Reliability estimates are updated based on the fault density and other metrics. At a system level, mean-time-between-failure data can be collected and used to estimate reliability. Unlike hardware, performing exactly the same test on exactly the same software configuration does not provide increased statistical confidence. Instead, software reliability uses different metrics, such as code coverage.
The Software Engineering Institute's capability maturity model is a common means of assessing the overall software development process for reliability and quality purposes.
Structural reliability
Structural reliability or the reliability of structures is the application of reliability theory to the behavior of structures. It is used in both the design and maintenance of different types of structures including concrete and steel structures.[38][39] In structural reliability studies both loads and resistances are modeled as probabilistic variables. Using this approach the probability of failure of a structure is calculated.
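A simple special case shows how a failure probability follows from probabilistic load and resistance. The sketch assumes that load and resistance are independent and normally distributed, which is a textbook simplification; real structural studies often use other distributions and numerical methods. All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def failure_probability(mean_resistance: float, sd_resistance: float,
                        mean_load: float, sd_load: float) -> tuple[float, float]:
    """Return (reliability index beta, probability that load exceeds resistance)."""
    margin_mean = mean_resistance - mean_load
    margin_sd = math.hypot(sd_resistance, sd_load)  # sd of (resistance - load)
    beta = margin_mean / margin_sd
    return beta, NormalDist().cdf(-beta)


# Hypothetical member: resistance ~ N(300, 30), load ~ N(200, 40), same units.
beta, pf = failure_probability(300.0, 30.0, 200.0, 40.0)
print("reliability index:", beta)
print("probability of failure:", pf)
```

Failure here means that the load exceeds the resistance, so reducing the scatter in either variable lowers the failure probability even when the mean values stay the same.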
Comparison to safety engineering
Reliability for safety and reliability for availability are often closely related. Lost availability of an engineering system can cost money. If a subway system is unavailable the subway operator will lose money for each hour the system is down. The subway operator will lose more money if safety is compromised. The definition of reliability is tied to a probability of not encountering a failure. A failure can cause loss of safety, loss of availability or both. It is undesirable to lose safety or availability in a critical system.
Reliability engineering is concerned with overall minimisation of failures that could lead to financial losses for the responsible entity, whereas safety engineering focuses on minimising a specific set of failure types that in general could lead to loss of life, injury or damage to equipment.
Reliability hazards could transform into incidents leading to a loss of revenue for the company or the customer, for example due to direct and indirect costs associated with: loss of production due to system unavailability; unexpected high or low demands for spares; repair costs; man-hours; re-designs or interruptions to normal production.[40]
Safety engineering is often highly specific, relating only to certain tightly regulated industries, applications, or areas. It primarily focuses on system safety hazards that could lead to severe accidents including: loss of life; destruction of equipment; or environmental damage. As such, the related system functional reliability requirements are often extremely high. Although it deals with unwanted failures in the same sense as reliability engineering, it, however, has less of a focus on direct costs, and is not concerned with post-failure repair actions. Another difference is the level of impact of failures on society, leading to a tendency for strict control by governments or regulatory bodies (e.g. nuclear, aerospace, defense, rail and oil industries).[40]
Fault tolerance
Safety can be increased using a "2oo2" (2 out of 2) cross-checked redundant system. Availability can be increased by using "1oo2" (1 out of 2) redundancy at a part or system level. If the two redundant elements disagree, the more permissive element will maximize availability. A 1oo2 system should never be relied on for safety. Fault-tolerant systems often rely on additional redundancy (e.g. 2oo3 voting logic) where multiple redundant elements must agree on a potentially unsafe action before it is performed. This increases both availability and safety at a system level. This is common practice in aerospace systems that need continued availability and do not have a fail-safe mode. For example, aircraft may use triple modular redundancy for flight computers and control surfaces (including occasionally different modes of operation e.g. electrical/mechanical/hydraulic) as these need to always be operational, because there are no "safe" default positions for control surfaces such as rudders or ailerons when the aircraft is flying.
Basic reliability and mission reliability
The above example of a 2oo3 fault tolerant system increases both mission reliability and safety. However, the "basic" reliability of the system will in this case still be lower than that of a non-redundant (1oo1) or 2oo2 system. Basic reliability engineering covers all failures, including those that might not result in system failure, but do result in additional cost due to: maintenance repair actions; logistics; spare parts etc. For example, replacement or repair of 1 faulty channel in a 2oo3 voting system (the system is still operating, although with one failed channel it has actually become a 2oo2 system) contributes to basic unreliability but not mission unreliability. As an example, the failure of the tail-light of an aircraft will not prevent the plane from flying (and so is not considered a mission failure), but it does need to be remedied (with a related cost, and so does contribute to the basic unreliability levels).
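The difference between mission reliability and basic reliability can be made concrete with a small calculation. This sketch assumes identical, independent channels with a hypothetical reliability of 0.9 each. Mission reliability is taken as the probability that at least k of n channels work, and basic reliability as the probability that no channel fails at all.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, channel_reliability: float) -> float:
    """Probability that at least k of n independent channels work."""
    r = channel_reliability
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


def basic_reliability(n: int, channel_reliability: float) -> float:
    """Probability that none of the n channels fails (no repair action needed)."""
    return channel_reliability ** n


r = 0.9  # hypothetical single-channel reliability
for name, k, n in (("1oo1", 1, 1), ("2oo2", 2, 2), ("2oo3", 2, 3)):
    print(name,
          "mission:", round(k_out_of_n_reliability(k, n, r), 3),
          "basic:", round(basic_reliability(n, r), 3))
```

With these numbers the 2oo3 arrangement has the highest mission reliability but the lowest basic reliability, because three channels give more opportunities for a failure that needs maintenance.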
Detectability and common cause failures
When using fault tolerant (redundant) systems or systems that are equipped with protection functions, detectability of failures and avoidance of common cause failures become paramount for safe functioning and/or mission reliability.
Reliability versus quality (Six Sigma)
Quality often focuses on manufacturing defects during the warranty phase. Reliability looks at the failure intensity over the whole life of a product or engineering system from commissioning to decommissioning. Six Sigma has its roots in statistical control in quality of manufacturing. Reliability engineering is a specialty part of systems engineering. The systems engineering process is a discovery process that is often unlike a manufacturing process. A manufacturing process is often focused on repetitive activities that achieve high quality outputs with minimum cost and time.[41]
The everyday usage term "quality of a product" is loosely taken to mean its inherent degree of excellence. In industry, a more precise definition of quality as "conformance to requirements or specifications at the start of use" is used. Assuming the final product specification adequately captures the original requirements and customer/system needs, the quality level can be measured as the fraction of product units shipped that meet specifications.[42] Manufactured goods quality often focuses on the number of warranty claims during the warranty period.
Quality is a snapshot at the start of life through the warranty period and is related to the control of lower-level product specifications. This includes time-zero defects i.e. where manufacturing mistakes escaped final Quality Control. In theory the quality level might be described by a single fraction of defective products. Reliability, as a part of systems engineering, acts as more of an ongoing assessment of failure rates over many years. Theoretically, all items will fail over an infinite period of time.[43] Defects that appear over time are referred to as reliability fallout. To describe reliability fallout a probability model that describes the fraction fallout over time is needed. This is known as the life distribution model.[42] Some of these reliability issues may be due to inherent design issues, which may exist even though the product conforms to specifications. Even items that are produced perfectly will fail over time due to one or more failure mechanisms (e.g. due to human error or mechanical, electrical, and chemical factors). These reliability issues can also be influenced by acceptable levels of variation during initial production.
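As a sketch of what a life distribution model looks like, the snippet below uses the Weibull distribution, a common but by no means the only choice. The shape and scale values are hypothetical. A shape below 1 represents a decreasing failure rate (early-life fallout), a shape of 1 a constant rate, and a shape above 1 wear-out.

```python
import math


def weibull_fraction_failed(time: float, shape: float, scale: float) -> float:
    """Cumulative fraction of units that have failed by `time`."""
    return 1.0 - math.exp(-((time / scale) ** shape))


# Hypothetical populations with the same scale (characteristic life, hours).
for shape in (0.5, 1.0, 2.0):
    fallout = [weibull_fraction_failed(t, shape, 1000.0) for t in (100.0, 1000.0)]
    print("shape", shape, "fallout at 100 h and 1000 h:", fallout)
```

Whatever the shape, about 63.2% of units have failed at the characteristic life, while the fallout at early times differs strongly between the populations.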
Quality and reliability are, therefore, related to manufacturing. Reliability is more targeted towards clients who are focused on failures throughout the whole life of the product such as the military, airlines or railroads. Items that do not conform to product specification will generally do worse in terms of reliability (having a lower MTTF), but this does not always have to be the case. The full mathematical quantification (in statistical models) of this combined relation is in general very difficult or even practically impossible. In cases where manufacturing variances can be effectively reduced, six sigma tools have been shown to be useful to find optimal process solutions which can increase quality and reliability. Six Sigma may also help to design products that are more robust to manufacturing induced failures and infant mortality defects in engineering systems and manufactured product.
In contrast with Six Sigma, reliability engineering solutions are generally found by focusing on reliability testing and system design. Solutions are found in different ways, such as by simplifying a system to allow more of the mechanisms of failure involved to be understood; performing detailed calculations of material stress levels allowing suitable safety factors to be determined; finding possible abnormal system load conditions and using this to increase robustness of a design to manufacturing variance related failure mechanisms. Furthermore, reliability engineering uses system-level solutions, like designing redundant and fault-tolerant systems for situations with high availability needs (see Reliability engineering vs Safety engineering above).
Note: A "defect" in Six Sigma/quality literature is not the same as a "failure" (a field failure, e.g. a fractured item) in reliability. A Six Sigma/quality defect generally refers to non-conformance with a requirement (e.g. basic functionality or a key dimension). Items can, however, fail over time, even if these requirements are all fulfilled. Quality is generally not concerned with asking the crucial question "are the requirements actually correct?", whereas reliability is.
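The life distribution model mentioned in this section can be illustrated with a short sketch. The example below assumes a Weibull distribution, which is only one common choice of life distribution; the function name and the shape and scale values are hypothetical and chosen purely for illustration, not taken from any data set.

```python
import math


def weibull_fraction_failed(t: float, shape: float, scale: float) -> float:
    """Cumulative fraction of units failed by time t (Weibull CDF).

    Assumption: fallout follows F(t) = 1 - exp(-(t / scale) ** shape).
    """
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


# Hypothetical parameters, for illustration only.
SHAPE = 1.5   # shape > 1 models wear-out (failure rate increases with age)
SCALE = 10.0  # characteristic life in years

if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        fallout = weibull_fraction_failed(years, SHAPE, SCALE)
        print(f"after {years:>2} years: {fallout:.3f} of units failed")
```

By contrast, a quality level in the sense described above would be a single time-zero fraction defective, whereas the life distribution gives a fraction failed that grows with time.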
Reliability operational assessment
Once systems or parts are being produced, reliability engineering attempts to monitor, assess, and correct deficiencies. Monitoring includes electronic and visual surveillance of critical parameters identified during the fault tree analysis design stage. Data collection is highly dependent on the nature of the system. Most large organizations have quality control groups that collect failure data on vehicles, equipment and machinery. Consumer product failures are often tracked by the number of returns. For systems in dormant storage or on standby, it is necessary to establish a formal surveillance program to inspect and test random samples. Any changes to the system, such as field upgrades or recall repairs, require additional reliability testing to ensure the reliability of the modification. Since it is not possible to anticipate all the failure modes of a given system, especially ones with a human element, failures will occur. The reliability program also includes a systematic root cause analysis that identifies the causal relationships involved in the failure such that effective corrective actions may be implemented. When possible, system failures and corrective actions are reported to the reliability engineering organization.
Some of the most common methods to apply to a reliability operational assessment are failure reporting, analysis, and corrective action systems (FRACAS). This systematic approach develops a reliability, safety, and logistics assessment based on failure/incident reporting, management, analysis, and corrective/preventive actions. Organizations today are adopting this method and utilizing commercial systems (such as Web-based FRACAS applications) that enable them to create a failure/incident data repository from which statistics can be derived to view accurate and genuine reliability, safety, and quality metrics.
It is extremely important for an organization to adopt a common FRACAS system for all end items. Also, it should allow test results to be captured in a practical way. Failure to adopt one easy-to-use (in terms of ease of data-entry for field engineers and repair shop engineers) and easy-to-maintain integrated system is likely to result in a failure of the FRACAS program itself.
Some of the common outputs from a FRACAS system include Field MTBF, MTTR, spares consumption, reliability growth, failure/incidents distribution by type, location, part no., serial no., and symptom.
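As a rough sketch of how a few of these outputs could be derived from a failure/incident repository, the example below computes field MTBF, MTTR, and a failure distribution by type. The record fields, function names, and sample data are hypothetical and do not reflect any particular FRACAS product; field MTBF is taken here as cumulative operating hours divided by the number of recorded failures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """One failure report (field names are hypothetical)."""
    serial_no: str
    failure_type: str
    repair_hours: float


def field_mtbf(total_operating_hours: float, records: list[FailureRecord]) -> float:
    """Field MTBF: cumulative operating hours per recorded failure."""
    if not records:
        return float("inf")  # no failures observed in the period
    return total_operating_hours / len(records)


def mttr(records: list[FailureRecord]) -> float:
    """Mean time to repair: average repair duration over all records."""
    if not records:
        return 0.0
    return sum(r.repair_hours for r in records) / len(records)


def failures_by_type(records: list[FailureRecord]) -> Counter:
    """Failure distribution by type."""
    return Counter(r.failure_type for r in records)


# Made-up sample data, for illustration only.
records = [
    FailureRecord("SN-001", "connector", 2.0),
    FailureRecord("SN-007", "power supply", 6.0),
    FailureRecord("SN-001", "connector", 4.0),
]

if __name__ == "__main__":
    print("Field MTBF (h):", field_mtbf(12_000.0, records))
    print("MTTR (h):", mttr(records))
    print("By type:", dict(failures_by_type(records)))
```

The same grouping could be applied to location, part number, serial number, or symptom by counting a different field.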
The use of past data to predict the reliability of new comparable systems/items can be misleading as reliability is a function of the context of use and can be affected by small changes in design/manufacturing.
Reliability organizations
Systems of any significant complexity are developed by organizations of people, such as a commercial company or a government agency. The reliability engineering organization must be consistent with the company's organizational structure. For small, non-critical systems, reliability engineering may be informal. As complexity grows, the need arises for a formal reliability function. Because reliability is important to the customer, the customer may even specify certain aspects of the reliability organization.
There are several common types of reliability organizations. The project manager or chief engineer may employ one or more reliability engineers directly. In larger organizations, there is usually a product assurance or specialty engineering organization, which may include reliability, maintainability, quality, safety, human factors, logistics, etc. In such case, the reliability engineer reports to the product assurance manager or specialty engineering manager.
In some cases, a company may wish to establish an independent reliability organization. This is desirable to ensure that the system reliability, which is often expensive and time-consuming, is not unduly slighted due to budget and schedule pressures. In such cases, the reliability engineer works for the project day-to-day, but is actually employed and paid by a separate organization within the company.
Because reliability engineering is critical to early system design, it has become common for reliability engineers, however the organization is structured, to work as part of an integrated product team.
Education
Some universities offer graduate degrees in reliability engineering. Other reliability professionals typically have an engineering, statistics, math, or physics degree from a university or college program. Many engineering programs offer reliability courses, and some universities have entire reliability engineering programs. A reliability engineer must be registered as a professional engineer by the state or province by law, but not all reliability professionals are engineers. Reliability engineers are required in systems where public safety is at risk. There are many professional conferences and industry training programs available for reliability engineers. Several professional organizations exist for reliability engineers, including the American Society for Quality Reliability Division (ASQ-RD),[44] the IEEE Reliability Society, the American Society for Quality (ASQ),[45] and the Society of Reliability Engineers (SRE).[46]
See also
- Dependability – Measure in systems engineering
- Durable good – Good that has long term use
- Factor of safety – System strength beyond planned load
- Failing badly – Fails with a catastrophic result or without warning
- Failure mode and effects analysis (FMEA) – Analysis of potential system failures
- Fracture mechanics – Study of propagation of cracks in materials
- Highly accelerated life test – Stress testing methodology for enhancing product reliability
- Highly accelerated stress test
- Human reliability – Factor in safety, ergonomics and system resilience
- Industrial engineering – Branch of engineering which deals with the optimization of complex processes or systems
- Institute of Industrial and Systems Engineers – Professional society for the support of the industrial engineering profession
- Logistics engineering – Field of engineering
- Performance engineering – Encompasses the techniques applied during a systems development life cycle
- Performance indicator – Measurement that evaluates the success of an organization
- Product certification – Performance and quality assurance
- Product testing – Line of work testing consumer products
- Overall equipment effectiveness – Measure of how well a manufacturing operation is utilized
- RAMS – Engineering characterization of a product or system
- Reliability, availability and serviceability – Quality of robustness of computer hardware
- Reliability theory of aging and longevity – Biophysics theory
- Risk-based inspection
- Robustness validation
- Security engineering – Process of incorporating security controls into an information system
- Software reliability testing
- Solid mechanics – Branch of mechanics concerned with solid materials and their behaviors
- Spurious trip level – Measure of incorrect activations of a safety or alarm system
- Strength of materials
- Stress–strength analysis
- Structural fracture mechanics – Field of structural engineering
- Temperature cycling – Chemical process
- Weibull distribution – Continuous probability distribution
References
- ^ "American Society for Quality". American Society for Quality. 24 July 2024.
- ^ RCM II, Reliability Centered Maintenance, Second edition 2008, pages 250–260, the role of Actuarial analysis in Reliability
- ^ Why You Cannot Predict Electronic Product Reliability (PDF). 2012 ARS, Europe. Warsaw, Poland.
- ^ a b O'Connor, Patrick D. T. (2002), Practical Reliability Engineering (Fourth Ed.), John Wiley & Sons, New York. ISBN 978-0-4708-4462-5.
- ^ Aven, Terje (1 June 2017). "Improving the foundation and practice of reliability engineering". Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability. 231 (3): 295–305. doi:10.1177/1748006X17699478. ISSN 1748-006X.
- ^ Saleh, J.H. and Marais, Ken, "Highlights from the Early (and pre-) History of Reliability Engineering", Reliability Engineering and System Safety, Volume 91, Issue 2, February 2006, pages 249–256
- ^ Juran, Joseph and Gryna, Frank, Quality Control Handbook, Fourth Edition, McGraw-Hill, New York, 1988, p.24.3
- ^ Reliability of military electronic equipment;report. Washington: United States Department of Defense. 4 June 1957. hdl:2027/mdp.39015013918332.
- ^ Wong, Kam, "Unified Field (Failure) Theory-Demise of the Bathtub Curve", Proceedings of Annual RAMS, 1981, pp 402–408
- ^ a b Tang, Jianfeng; Chen, Jie; Zhang, Chun; Guo, Qing; Chu, Jie (1 March 2013). "Exploration on process design, optimization and reliability verification for natural gas deacidizing column applied to offshore field". Chemical Engineering Research and Design. 91 (3): 542–551. Bibcode:2013CERD...91..542T. doi:10.1016/j.cherd.2012.09.018. ISSN 0263-8762.
- ^ Practical Reliability Engineering, P. O'Connor – 2012
- ^ "Articles – Where Do Reliability Engineers Come From? – ReliabilityWeb.com: A Culture of Reliability". Archived from the original on 30 December 2015. Retrieved 12 October 2014.
- ^ Using Failure Modes, Mechanisms, and Effects Analysis in Medical Device Adverse Event Investigations, S. Cheng, D. Das, and M. Pecht, ICBO: International Conference on Biomedical Ontology, Buffalo, NY, July 26–30, 2011, pp. 340–345
- ^ Federal Aviation Administration (19 March 2013). System Safety Handbook. U.S. Department of Transportation. Retrieved 2 June 2013.
- ^ Reliability Hotwire – July 2015
- ^ Reliability Maintainability and Risk Practical Methods for Engineers Including Reliability Centred Maintenance and Safety – David J. Smith (2011)
- ^ a b Practical Reliability Engineering, O'Connor, 2001
- ^ System Reliability Theory, second edition, Rausand and Hoyland – 2004
- ^ The Blame Machine, Why Human Error Causes Accidents – Whittingham, 2007
- ^ Barnard, R.W.A. (2008). "What is wrong with Reliability Engineering?" (PDF). Lambda Consulting. Retrieved 30 October 2014.
- ^ "Best Practices in Design for Reliability" (PDF). Archived from the original (PDF) on 17 November 2017.
- ^ Salvatore Distefano, Antonio Puliafito: Dependability Evaluation with Dynamic Reliability Block Diagrams and Dynamic Fault Trees. IEEE Trans. Dependable Sec. Comput. 6(1): 4–17 (2009)
- ^ The Seven Samurais of Systems Engineering, James Martin (2008) Archived 1 December 2023 at the Wayback Machine
- ^ Zhang, J.; Geiger, C.; Sun, F. (January 2016). "A system approach to reliability verification test design". 2016 Annual Reliability and Maintainability Symposium (RAMS). pp. 1–6. doi:10.1109/RAMS.2016.7448014. ISBN 978-1-5090-0249-8. S2CID 24770411.
- ^ Dai, Wei; Maropoulos, Paul G.; Zhao, Yu (2 January 2015). "Reliability modelling and verification of manufacturing processes based on process knowledge management". International Journal of Computer Integrated Manufacturing. 28 (1): 98–111. doi:10.1080/0951192X.2013.834462. ISSN 0951-192X. S2CID 32995968.
- ^ "Reliability Verification for AI and ML Processors - White Paper". www.allaboutcircuits.com. Retrieved 11 December 2020.
- ^ Weber, Wolfgang; Tondok, Heidemarie; Bachmayer, Michael (1 July 2005). "Enhancing software safety by fault trees: experiences from an application to flight critical software". Reliability Engineering & System Safety. Safety, Reliability and Security of Industrial Computer Systems. 89 (1): 57–70. doi:10.1016/j.ress.2004.08.007. ISSN 0951-8320.
- ^ Ren, Yuanqiang; Tao, Jingya; Xue, Zhaopeng (January 2020). "Design of a Large-Scale Piezoelectric Transducer Network Layer and Its Reliability Verification for Space Structures". Sensors. 20 (15): 4344. Bibcode:2020Senso..20.4344R. doi:10.3390/s20154344. PMC 7435873. PMID 32759794.
- ^ Ben-Gal I., Herer Y. and Raz T. (2003). "Self-correcting inspection procedure under inspection errors" (PDF). IIE Transactions on Quality and Reliability, 34(6), pp. 529–540. Archived from the original (PDF) on 13 October 2013. Retrieved 10 January 2014.
- ^ "Yelo Reliability Testing". Archived from the original on 5 March 2016. Retrieved 6 November 2014.
- ^ Matheson, Granville J. (24 May 2019). "We need to talk about reliability: making better use of test-retest studies for study design and interpretation". PeerJ. 7 e6918. doi:10.7717/peerj.6918. ISSN 2167-8359. PMC 6536112. PMID 31179173.
- ^ Pronskikh, Vitaly (1 March 2019). "Computer Modeling and Simulation: Increasing Reliability by Disentangling Verification and Validation". Minds and Machines. 29 (1): 169–186. doi:10.1007/s11023-019-09494-7. ISSN 1572-8641. OSTI 1556973. S2CID 84187280.
- ^ Halamay, D. A.; Starrett, M.; Brekken, T. K. A. (2019). "Hardware Testing of Electric Hot Water Heaters Providing Energy Storage and Demand Response Through Model Predictive Control". IEEE Access. 7: 139047–139057. Bibcode:2019IEEEA...7m9047H. doi:10.1109/ACCESS.2019.2932978. ISSN 2169-3536.
- ^ Chen, Jing; Wang, Yinglong; Guo, Ying; Jiang, Mingyue (19 February 2019). "A metamorphic testing approach for event sequences". PLOS ONE. 14 (2) e0212476. Bibcode:2019PLoSO..1412476C. doi:10.1371/journal.pone.0212476. ISSN 1932-6203. PMC 6380623. PMID 30779769.
- ^ Bieńkowska, Agnieszka; Tworek, Katarzyna; Zabłocka-Kluczka, Anna (January 2020). "Organizational Reliability Model Verification in the Crisis Escalation Phase Caused by the COVID-19 Pandemic". Sustainability. 12 (10): 4318. Bibcode:2020Sust...12.4318B. doi:10.3390/su12104318.
- ^ Jenihhin, M.; Lai, X.; Ghasempouri, T.; Raik, J. (October 2018). "Towards Multidimensional Verification: Where Functional Meets Non-Functional". 2018 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC). pp. 1–7. arXiv:1908.00314. doi:10.1109/NORCHIP.2018.8573495. ISBN 978-1-5386-7656-1. S2CID 56170277.
- ^ Rackwitz, R. (21 February 2000). "Optimization — the basis of code-making and reliability verification". Structural Safety. 22 (1): 27–60. doi:10.1016/S0167-4730(99)00037-5. ISSN 0167-4730.
- ^ Piryonesi, Sayed Madeh; Tavakolan, Mehdi (9 January 2017). "A mathematical programming model for solving cost-safety optimization (CSO) problems in the maintenance of structures". KSCE Journal of Civil Engineering. 21 (6): 2226–2234. Bibcode:2017KSJCE..21.2226P. doi:10.1007/s12205-017-0531-z. S2CID 113616284.
- ^ Okasha, N. M., & Frangopol, D. M. (2009). Lifetime-oriented multi-objective optimization of structural maintenance considering system reliability, redundancy and life-cycle cost using GA. Structural Safety, 31(6), 460–474.
- ^ a b Reliability and Safety Engineering – Verma, Ajit Kumar, Ajit, Srividya, Karanki, Durga Rao (2010)
- ^ "INCOSE SE Guidelines". Archived from the original on 30 December 2014. Retrieved 20 February 2015.
- ^ a b "8.1.1.1. Quality versus reliability".
- ^ "The Second Law of Thermodynamics, Evolution, and Probability".
- ^ American Society for Quality Reliability Division (ASQ-RD)
- ^ American Society for Quality (ASQ)
- ^ Society of Reliability Engineers (SRE)
- N. Diaz, R. Pascual, F. Ruggeri, E. López Droguett (2017). "Modeling age replacement policy under multiple time scales and stochastic usage profiles". International Journal of Production Economics. 188: 22–28. doi:10.1016/j.ijpe.2017.03.009.
Further reading
- Barlow, R. E. and Proschan, F. (1981) Statistical Theory of Reliability and Life Testing, To Begin With Press, Silver Springs, MD.
- Blanchard, Benjamin S. (1992), Logistics Engineering and Management (Fourth Ed.), Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
- Breitler, Alan L. and Sloan, C. (2005), Proceedings of the American Institute of Aeronautics and Astronautics (AIAA) Air Force T&E Days Conference, Nashville, TN, December, 2005: System Reliability Prediction: towards a General Approach Using a Neural Network.
- Ebeling, Charles E., (1997), An Introduction to Reliability and Maintainability Engineering, McGraw-Hill Companies, Inc., Boston.
- Denney, Richard (2005) Succeeding with Use Cases: Working Smart to Deliver Quality. Addison-Wesley Professional Publishing. ISBN. Discusses the use of software reliability engineering in use case driven software development.
- Gano, Dean L. (2007), "Apollo Root Cause Analysis" (Third Edition), Apollonian Publications, LLC., Richland, Washington
- Holmes, Oliver Wendell Sr. The Deacon's Masterpiece
- Horsburgh, Peter (2018), "5 Habits of an Extraordinary Reliability Engineer", Reliability Web
- Kapur, K.C., and Lamberson, L.R., (1977), Reliability in Engineering Design, John Wiley & Sons, New York.
- Kececioglu, Dimitri, (1991) "Reliability Engineering Handbook", Prentice-Hall, Englewood Cliffs, New Jersey
- Trevor Kletz (1998) Process Plants: A Handbook for Inherently Safer Design CRC ISBN 1-56032-619-0
- Leemis, Lawrence, (1995) Reliability: Probabilistic Models and Statistical Methods, 1995, Prentice-Hall. ISBN 0-13-720517-1
- Lees, Frank (2005). Loss Prevention in the Process Industries (3rd ed.). Elsevier. ISBN 978-0-7506-7555-0.
- MacDiarmid, Preston; Morris, Seymour; et al., (1995), Reliability Toolkit: Commercial Practices Edition, Reliability Analysis Center and Rome Laboratory, Rome, New York.
- Modarres, Mohammad; Kaminskiy, Mark; Krivtsov, Vasiliy (1999), Reliability Engineering and Risk Analysis: A Practical Guide, CRC Press, ISBN 0-8247-2000-8.
- Musa, John (2005) Software Reliability Engineering: More Reliable Software Faster and Cheaper, 2nd. Edition, AuthorHouse. ISBN
- Neubeck, Ken (2004) "Practical Reliability Analysis", Prentice Hall, New Jersey
- Neufelder, Ann Marie, (1993), Ensuring Software Reliability, Marcel Dekker, Inc., New York.
- O'Connor, Patrick D. T. (2002), Practical Reliability Engineering (Fourth Ed.), John Wiley & Sons, New York. ISBN 978-0-4708-4462-5.
- Samaniego, Francisco J. (2007) "System Signatures and their Applications in Engineering Reliability", Springer (International Series in Operations Research and Management Science), New York.
- Shooman, Martin, (1987), Software Engineering: Design, Reliability, and Management, McGraw-Hill, New York.
- Tobias, Trindade, (1995), Applied Reliability, Chapman & Hall/CRC, ISBN 0-442-00469-9
- Springer Series in Reliability Engineering
- Nelson, Wayne B., (2004), Accelerated Testing—Statistical Models, Test Plans, and Data Analysis, John Wiley & Sons, New York, ISBN 0-471-69736-2
- Bagdonavicius, V., Nikulin, M., (2002), "Accelerated Life Models. Modeling and Statistical analysis", CHAPMAN&HALL/CRC, Boca Raton, ISBN 1-58488-186-0
- Todinov, M. (2016), "Reliability and Risk Models: setting reliability requirements", Wiley, 978-1-118-87332-8.
US standards, specifications, and handbooks
- Aerospace Report Number: TOR-2007(8583)-6889 Reliability Program Requirements for Space Systems, The Aerospace Corporation (10 July 2007)
- DoD 3235.1-H (3rd Ed) Test and Evaluation of System Reliability, Availability, and Maintainability (A Primer), U.S. Department of Defense (March 1982).
- NASA GSFC 431-REF-000370 Flight Assurance Procedure: Performing a Failure Mode and Effects Analysis, National Aeronautics and Space Administration Goddard Space Flight Center (10 August 1996).
- IEEE 1332–1998 IEEE Standard Reliability Program for the Development and Production of Electronic Systems and Equipment, Institute of Electrical and Electronics Engineers (1998).
- JPL D-5703 Reliability Analysis Handbook, National Aeronautics and Space Administration Jet Propulsion Laboratory (July 1990).
- MIL-STD-785B Reliability Program for Systems and Equipment Development and Production, U.S. Department of Defense (15 September 1980). (*Obsolete, superseded by ANSI/GEIA-STD-0009-2008 titled Reliability Program Standard for Systems Design, Development, and Manufacturing, 13 Nov 2008)
- MIL-HDBK-217F Reliability Prediction of Electronic Equipment, U.S. Department of Defense (2 December 1991).
- MIL-HDBK-217F (Notice 1) Reliability Prediction of Electronic Equipment, U.S. Department of Defense (10 July 1992).
- MIL-HDBK-217F (Notice 2) Reliability Prediction of Electronic Equipment, U.S. Department of Defense (28 February 1995).
- MIL-STD-690D Failure Rate Sampling Plans and Procedures, U.S. Department of Defense (10 June 2005).
- MIL-HDBK-338B Electronic Reliability Design Handbook, U.S. Department of Defense (1 October 1998).
- MIL-HDBK-2173 Reliability-Centered Maintenance (RCM) Requirements for Naval Aircraft, Weapon Systems, and Support Equipment, U.S. Department of Defense (30 January 1998); (superseded by NAVAIR 00-25-403).
- MIL-STD-1543B Reliability Program Requirements for Space and Launch Vehicles, U.S. Department of Defense (25 October 1988).
- MIL-STD-1629A Procedures for Performing a Failure Mode Effects and Criticality Analysis, U.S. Department of Defense (24 November 1980).
- MIL-HDBK-781A Reliability Test Methods, Plans, and Environments for Engineering Development, Qualification, and Production, U.S. Department of Defense (1 April 1996).
- NSWC-06 (Part A & B) Handbook of Reliability Prediction Procedures for Mechanical Equipment, Naval Surface Warfare Center (10 January 2006).
- SR-332 Reliability Prediction Procedure for Electronic Equipment, Telcordia Technologies (January 2011).
- FD-ARPP-01 Automated Reliability Prediction Procedure, Telcordia Technologies (January 2011).
- GR-357 Generic Requirements for Assuring the Reliability of Components Used in Telecommunications Equipment, Telcordia Technologies (March 2001).
- SAE JA1000/1 Reliability Program Standard Implementation Guide, http://standards.sae.org/ja1000/1_199903/
UK standards
In the UK, there are more up to date standards maintained under the sponsorship of UK MOD as Defence Standards. The relevant Standards include:
DEF STAN 00-40 Reliability and Maintainability (R&M)
- PART 1: Issue 5: Management Responsibilities and Requirements for Programmes and Plans
- PART 4: (ARMP-4)Issue 2: Guidance for Writing NATO R&M Requirements Documents
- PART 6: Issue 1: IN-SERVICE R & M
- PART 7 (ARMP-7) Issue 1: NATO R&M Terminology Applicable to ARMP's
DEF STAN 00-42 RELIABILITY AND MAINTAINABILITY ASSURANCE GUIDES
- PART 1: Issue 1: ONE-SHOT DEVICES/SYSTEMS
- PART 2: Issue 1: SOFTWARE
- PART 3: Issue 2: R&M CASE
- PART 4: Issue 1: Testability
- PART 5: Issue 1: IN-SERVICE RELIABILITY DEMONSTRATIONS
DEF STAN 00-43 RELIABILITY AND MAINTAINABILITY ASSURANCE ACTIVITY
- PART 2: Issue 1: IN-SERVICE MAINTAINABILITY DEMONSTRATIONS
DEF STAN 00-44 RELIABILITY AND MAINTAINABILITY DATA COLLECTION AND CLASSIFICATION
- PART 1: Issue 2: MAINTENANCE DATA & DEFECT REPORTING IN THE ROYAL NAVY, THE ARMY AND THE ROYAL AIR FORCE
- PART 2: Issue 1: DATA CLASSIFICATION AND INCIDENT SENTENCING—GENERAL
- PART 3: Issue 1: INCIDENT SENTENCING—SEA
- PART 4: Issue 1: INCIDENT SENTENCING—LAND
DEF STAN 00-45 Issue 1: RELIABILITY CENTERED MAINTENANCE
DEF STAN 00-49 Issue 1: RELIABILITY AND MAINTAINABILITY MOD GUIDE TO TERMINOLOGY DEFINITIONS
These can be obtained from DSTAN. There are also many commercial standards, produced by many organisations including the SAE, MSG, ARP, and IEE.
French standards
- FIDES [1]. The FIDES methodology (UTE-C 80-811) is based on the physics of failures and supported by the analysis of test data, field returns and existing modelling.
- UTE-C 80–810 or RDF2000 [2] Archived 17 July 2011 at the Wayback Machine. The RDF2000 methodology is based on the French telecom experience.
International standards
- TC 56 Standards: Dependability Archived 10 September 2019 at the Wayback Machine
External links
- Media related to Reliability engineering at Wikimedia Commons
- John P. Rankin Collection, The University of Alabama in Huntsville Archives and Special Collections: NASA reliability engineering research on sneak circuits.
Reliability engineering
Introduction
Overview
Reliability engineering is the application of engineering principles, techniques, and methodologies to predict, analyze, and enhance the reliability of systems, components, and processes, ensuring they perform their intended functions under specified conditions for a predetermined period without failure.[4] This discipline defines reliability as the probability that a product, system, or service will satisfy its intended function adequately for a specified duration under stated environmental conditions.[2] The core objective is to minimize failure rates, optimize system uptime, and reduce lifecycle costs, particularly in high-stakes sectors such as aerospace, manufacturing, and electronics where downtime or malfunctions can have severe consequences.[4]
The field evolved from early 20th-century statistical quality control practices, pioneered by Walter Shewhart in the 1920s at Bell Laboratories, which emphasized process consistency and defect reduction.[5] World War II demands for robust electronics accelerated progress, but reliability engineering emerged as a distinct discipline in the 1950s, driven by military needs and formalized through efforts like the U.S. Advisory Group on Reliability of Electronic Equipment (AGREE) report in 1957, which established foundational standards for reliability prediction and testing.[5]
In practice, reliability engineering has profound real-world impacts, such as in aerospace where analysis of avionics failure data over decades has enabled proactive detection of wearout trends, preventing widespread system failures in commercial aircraft and enhancing overall fleet safety.[6] Similarly, in power generation, particularly nuclear facilities like liquid metal fast breeder reactors (LMFBRs), reliability programs employing fault tree analysis and failure modes assessment have minimized risks of core disruptive accidents by bolstering shutdown and heat removal systems, thereby averting potential catastrophic events.[7]
Objectives and Scope
Reliability engineering primarily seeks to achieve specified reliability levels for products and systems, ensuring they perform their intended functions without failure for a predetermined duration under defined conditions. This involves applying engineering principles and specialized methods to prevent or minimize the occurrence of failures during the design phase.[8] A key objective is to reduce lifecycle costs by proactively addressing potential issues, thereby decreasing downtime, repair expenses, and overall ownership costs associated with unreliable performance.[2] Furthermore, it emphasizes ensuring system dependability under operational stresses, including environmental factors, mechanical loads, and varying usage demands, to maintain consistent functionality in real-world scenarios.[9]
The scope of reliability engineering extends to hardware components, software systems, and human-system interactions, integrating these elements to optimize overall system performance. It applies across diverse industries such as automotive, where it addresses electronic and mechanical reliability in vehicles; telecommunications, focusing on network uptime and service continuity; and defense, ensuring robust operation of mission-critical equipment under extreme conditions.[10][11][12] This broad application underscores its role in fostering a reliability culture that influences organizational practices from design to operation.[13]
In distinction from maintenance engineering, which centers on reactive repairs and periodic upkeep to restore functionality after failures, reliability engineering prioritizes preventive strategies embedded in the initial design and development processes to avoid failures altogether.[14]
An overview of techniques in reliability engineering includes probabilistic analysis, which models failure probabilities based on statistical data, and failure mode identification methods that systematically evaluate potential weak points in a system. These approaches provide a foundation for informed decision-making without overlapping into detailed implementation covered elsewhere.[8]
Historical Development
Early Foundations
The origins of reliability engineering trace back to the 1920s and 1930s, when industrial demands for consistent product performance spurred the development of statistical quality control (SQC) as a foundational approach. At Bell Laboratories, physicist Walter A. Shewhart pioneered SQC by introducing control charts in 1924, enabling manufacturers to monitor process variations and reduce defects through statistical analysis rather than inspection alone.[15] His seminal 1931 book, Economic Control of Quality of Manufactured Product, formalized these methods, emphasizing economic benefits of variability control in production, which laid the groundwork for assessing component and system dependability.[16]
Concurrently, W. Edwards Deming, influenced by Shewhart, advanced SQC in the 1930s through lectures and collaborations. Shewhart's Statistical Method from the Viewpoint of Quality Control, published in 1939 with editorial assistance from Deming, extended these principles to broader scientific and industrial applications.[17] While U.S. industrial efforts at Bell Labs were pivotal, reliability concepts had earlier roots in 19th-century European engineering for machinery durability.[5]
Early failure analysis practices in telegraphy and nascent electronics influenced reliability concepts, as engineers addressed intermittent breakdowns in communication systems, such as wire fractures or voltage fluctuations, to maintain service continuity.[5] These analyses, often conducted at institutions like Bell Labs, shifted focus from reactive repairs to proactive identification of failure modes in electrical components, setting precedents for reliability in complex networks.
A pivotal development occurred in the 1930s with the widespread adoption of vacuum tube technology in radios, where frequent tube failures due to filament burnout and gas leaks highlighted the need for rigorous reliability testing. The unreliability of vacuum tubes drove advancements in component testing at firms like Bell Labs and others.[16][5] This era's emphasis remained on individual component testing, such as burn-in procedures for tubes, rather than holistic system design, as the unreliability of these core elements directly impacted radio performance and consumer adoption.[5] These pre-World War II efforts established reliability as an engineering discipline rooted in data-driven quality assurance.
Post-World War II Advancements
The exigencies of World War II catalyzed the formalization of reliability engineering within the U.S. military, as high failure rates plagued complex electronic systems like radar and early missile technologies. Over 50% of airborne electronics failed while in storage, and shipboard systems experienced up to 50% downtime due to unreliable components such as vacuum tubes, prompting the military to establish dedicated reliability programs in the 1940s to mitigate these issues and ensure operational readiness.[5] These efforts marked a shift from ad hoc maintenance to systematic design and testing protocols, driven by the need for dependable performance in high-stakes combat environments.
In the 1950s, institutional advancements solidified reliability engineering as a distinct discipline, with the establishment of the Advisory Group on Reliability of Electronic Equipment (AGREE) in 1950 by the Department of Defense and industry partners. AGREE's seminal 1957 report defined reliability as "the probability of a product performing without failure a specified function under given conditions for a specified period of time," and introduced standardized approaches to reporting, including field data collection and environmental testing protocols that evolved into Military Standard 781.[5] These milestones emphasized component quality control and predictive modeling, laying the groundwork for broader application in military electronics.
Key figures like Z.W. Birnbaum advanced the statistical foundations of reliability during this era; at the University of Washington, he founded the Laboratory of Statistical Research in 1948 with support from the Office of Naval Research, contributing probabilistic inequalities, nonparametric estimation methods, and life distribution models essential for reliability analysis.[5] Birnbaum's work, including the 1969 Birnbaum-Saunders fatigue-life model, provided tools for assessing failure probabilities in complex systems.[5]
The post-war period also saw a pivotal shift toward system-level reliability in aerospace, exemplified by NASA's role in the 1960s Apollo program, which prioritized zero-failure design through integrated risk management and redundancy. Apollo's success, including 100% reliability in all 13 Saturn V launches via the "all-up" testing concept (fully assembling and launching vehicles from the first flight), demonstrated this approach, with reliability goals setting crew safety probabilities 100 times higher than mission success rates.[18] Extensive testing, comprising nearly 50% of development efforts, and techniques like failure mode and effects analysis underscored NASA's emphasis on holistic system dependability over isolated component fixes.[18]
Fundamental Concepts
Key Definitions
Reliability in engineering is defined as the probability that a system or component will perform its required functions under stated conditions for a specified period of time.[19] This concept focuses on the likelihood of failure-free operation within predefined environmental and operational constraints. Closely related terms include availability, which measures the proportion of time a system is in an operable and committable state, often expressed as the ratio of uptime to total time.[19] Maintainability refers to the ease and speed with which a system can be restored to operational condition after a failure, typically quantified by metrics like mean time to repair.[19] Dependability is used as an umbrella term to encompass core attributes such as reliability, availability, maintainability, and maintenance support performance that ensure trustworthy system performance.[20]
Failures in reliability engineering are categorized into types based on their nature and onset. Catastrophic failures occur suddenly and completely, rendering the system inoperable without warning, often due to overload or defect.[21] In contrast, degradational failures develop gradually through wear, corrosion, or fatigue, allowing potential detection and intervention before total breakdown.[22]
A fundamental mathematical representation of reliability is the reliability function $R(t)$, which gives the probability of survival beyond time $t$. Under the assumption of a constant failure rate $\lambda$, this follows the exponential distribution, where:
$$R(t) = e^{-\lambda t}$$
This derivation stems from the survival function of the exponential distribution, where the cumulative distribution function is $F(t) = 1 - e^{-\lambda t}$, so $R(t) = 1 - F(t) = e^{-\lambda t}$, reflecting the memoryless property and constant hazard rate in non-repairable systems.[23]
Basic Principles of Reliability Assessment
Reliability assessment in engineering begins with a systematic evaluation of a system's ability to perform its intended functions without failure under specified conditions over a designated period. This process involves foundational steps that ensure potential issues are identified and mitigated early, drawing on probabilistic and statistical principles to quantify uncertainties and predict outcomes. Central to these principles is the recognition that reliability is not inherent but engineered through iterative analysis and validation, often starting during the design phase to minimize costs and risks later in the lifecycle. The primary steps in reliability assessment include identifying failure modes, quantifying associated risks, predicting system performance, and validating predictions through empirical data. Failure modes are identified using structured methods like Failure Mode and Effects Analysis (FMEA), a technique that systematically examines components and subsystems to list potential failures, their causes, and effects on overall system function.[24] This step involves breaking down the system into functional blocks and assessing each for weaknesses, such as mechanical wear or electrical shorts, to prioritize high-impact issues. Risks are then quantified by assigning severity ratings—ranging from catastrophic to minor—and estimating occurrence probabilities, often through criticality analysis in FMECA (Failure Modes, Effects, and Criticality Analysis), which ranks modes based on their potential to cause mission failure.[24] Performance prediction builds on these identifications by modeling expected behavior over time, incorporating concepts like the bathtub curve, which illustrates the typical failure rate profile of systems or components. 
The bathtub curve consists of three phases: an initial high-failure "infant mortality" period due to manufacturing defects, a stable "useful life" phase with constant random failures, and a rising "wear-out" phase from material degradation.[5] Originating from 1950s military electronics studies, this model guides engineers in anticipating failure patterns and scheduling maintenance, such as burn-in testing to eliminate early defects.
Probabilistic methods further enhance predictions by calculating mission success probability, defined as the likelihood of performing required functions without failure for a specified duration.[25] These methods employ tools like fault trees and event trees to model failure scenarios and integrate component reliability data, such as failure rates, to yield overall system probabilities, often expressed as $R(t) = e^{-\lambda t}$ for constant failure rates in exponential models, where $R(t)$ is reliability, $\lambda$ is the failure rate, and $t$ is time.[25]
Validation of these assessments occurs through data collection and testing, ensuring predictions align with real-world performance. This involves accelerated life testing, field data analysis, and feedback loops like Failure Reporting, Analysis, and Corrective Action Systems (FRACAS) to confirm or refine models.[26] For instance, operational data from prototypes can reveal discrepancies in predicted failure rates, prompting design adjustments.
Reliability assessment is integrated across the full product lifecycle, from concept to disposal, to enable early detection of weaknesses and continuous improvement.
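The constant-failure-rate model used for mission success probability, $R(t) = e^{-\lambda t}$, can be illustrated with a minimal sketch. The failure rate and mission time below are assumed values chosen only for illustration, and the function name is hypothetical.

```python
import math


def exponential_reliability(failure_rate: float, t: float) -> float:
    """Return R(t) = exp(-failure_rate * t) for a constant failure rate."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


# Illustrative inputs (assumed, not taken from any data set).
failure_rate = 2e-4   # failures per operating hour
mission_time = 500.0  # hours

r_mission = exponential_reliability(failure_rate, mission_time)
print(f"R({mission_time:.0f} h) = {r_mission:.4f}")  # prints "R(500 h) = 0.9048"

# Memoryless property: surviving 500 more hours is equally likely
# whether the item is new or has already survived 1,000 hours.
conditional = (exponential_reliability(failure_rate, 1500.0)
               / exponential_reliability(failure_rate, 1000.0))
print(math.isclose(conditional, r_mission))  # prints "True"
```

The second check shows why this model only suits the flat "useful life" region of the bathtub curve: under a constant hazard, age has no effect on the remaining survival probability.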
In the concept phase, initial analyses like reliability block diagrams assess feasibility; during design and production, FMECA and testing verify requirements; and in operations, ongoing monitoring tracks performance against predictions.[26] This lifecycle approach, emphasized in military and aerospace standards, shifts focus from reactive fixes to proactive enhancements, reducing lifecycle costs by addressing issues before full deployment.[26]
Reliability Programs and Requirements
Program Planning
Program planning in reliability engineering involves developing a structured framework to ensure that reliability objectives are systematically integrated into the overall project lifecycle, guiding organizational efforts to achieve dependable system performance. This process begins with defining clear goals aligned with mission requirements, such as establishing quantitative targets for system uptime and failure rates, to direct all subsequent activities.[27] Key elements of a reliability program plan include goal setting, where specific, measurable objectives like target mean time between failures (MTBF) are outlined based on operational environments and performance needs; resource allocation, encompassing personnel, budget, and tools dedicated to reliability tasks; and milestone establishment, such as preliminary design reviews (PDR) and critical design reviews (CDR), to track progress against timelines.[27] Integration with broader project management is essential, ensuring reliability considerations influence design, production, and logistics phases without silos, often through coordinated schedules and shared documentation.[27] These elements form a cohesive plan that supports efficient implementation while adapting to program constraints. 
Modern DoD programs also align with manuals like DoDM 4151.25 (as of 2024) for reliability-centered maintenance integration across the lifecycle.[28] Reliability programs typically align with established standards to provide a robust framework; for instance, the Department of Defense's Best Practices to Achieve Better Reliability and Maintainability (R&M) Estimates (February 2025) outlines requirements for program plans in defense systems, emphasizing tailored tasks and management oversight, while ISO 9001 offers a quality management structure that incorporates reliability planning through clauses on organizational context, leadership, and resource planning.[27][29]
The program unfolds in distinct phases: planning, where requirements are derived and resources committed; execution, involving task implementation like analyses and testing preparations; and review, featuring assessments at milestones to evaluate adherence and adjust strategies.[30] Success metrics focus on comparing achieved performance against targets, such as actual MTBF versus planned values, to quantify reliability growth and inform corrective actions, ensuring the program's effectiveness in meeting objectives.[27]
Cross-functional teams, comprising experts from design, testing, operations, and quality assurance, are vital for holistic input, fostering collaboration to address reliability across disciplines and mitigate risks early.[27] This team-based approach enhances program outcomes by integrating diverse perspectives, though it requires clear roles and communication protocols as defined in the plan. Reliability requirements, briefly, serve as the foundation for these goals, linking them to specific system targets detailed elsewhere.[30]
Establishing Reliability Requirements
Establishing reliability requirements begins with translating operational mission profiles into quantifiable targets that reflect the system's intended use, environment, and performance expectations. This process involves analyzing user needs, such as those outlined in capability documents, to derive specific metrics like mean time between failures (MTBF) or availability percentages, often adjusting for uncontrollable failure modes like early-life defects or random occurrences. For instance, a system might be targeted for 95% reliability over a 5-year operational period based on mission duration and failure rate estimates derived from historical data.[27] These goals ensure the system meets sustainment key performance parameters while balancing feasibility during design.[27] Reliability allocation methods distribute these system-level targets to subsystems and components, primarily through top-down and bottom-up approaches. In the top-down method, requirements are apportioned from the overall system goal to lower levels using weighting factors based on component complexity, criticality, or historical failure rates, often assuming a series system configuration for initial estimates. This approach is particularly useful in early design phases where detailed component data is limited, as seen in methods like the AGREE allocation that employs factors such as module count and environmental stress.[31] Conversely, the bottom-up method aggregates predicted reliabilities from individual parts—derived from physics-of-failure models or life testing—upward to validate or refine the system target, optimizing for constraints like cost minimization through mathematical programming.[31] These methods are often iterated to reconcile discrepancies, ensuring alignment across the hierarchy.[27] Several factors influence the setting of reliability requirements, including cost implications, organizational risk tolerance, and adherence to regulatory standards. 
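The top-down apportionment described above can be sketched for a series system. This is a simple weighted scheme, not the AGREE method itself: each subsystem receives $R_i = R_{sys}^{w_i}$ with normalized weights $w_i$, so the product of the allocations reproduces the system target. The subsystem names and weights are hypothetical; a larger weight means a larger share of the allowed failure rate, for example for a more complex subsystem.

```python
import math


def allocate_series_targets(r_system: float, weights: dict) -> dict:
    """Apportion a series-system reliability target to subsystems.

    Each subsystem gets R_i = r_system ** (w_i / sum(w)), so the
    product of all allocated values equals r_system.
    """
    total = sum(weights.values())
    if not 0.0 < r_system <= 1.0:
        raise ValueError("r_system must be in (0, 1]")
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return {name: r_system ** (w / total) for name, w in weights.items()}


# 95% system target, as in the example in the text; weights are assumed.
weights = {"power": 3.0, "control": 5.0, "sensors": 2.0}
allocation = allocate_series_targets(0.95, weights)

for name, r in allocation.items():
    print(f"{name:8s} target reliability = {r:.4f}")

# The series product of the allocations recovers the system target.
print(math.isclose(math.prod(allocation.values()), 0.95))  # prints "True"
```

A bottom-up pass would run the other way: multiply predicted subsystem reliabilities and compare the product with the system target.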
Overly stringent targets can constrain design trade-offs and inflate lifecycle costs, such as through excessive spares or maintenance, prompting engineers to incorporate uncertainty buffers like 40-60% increases in failure rate estimates to account for data variability.[27] Risk tolerance dictates adjustments for potential field performance gaps, while regulations enforce minimum thresholds; for example, in aviation, the Federal Aviation Administration's Advisory Circular 120-17B (as of 2018) guides operators in establishing reliability programs to monitor metrics like MTBF and adjust maintenance intervals without compromising safety, as required under 14 CFR parts 91, 119, 121, and 135.[32]
A key tool for apportioning targets is the reliability block diagram (RBD), a graphical model representing system architecture as blocks in series, parallel, or hybrid configurations to calculate overall reliability and identify allocation needs. RBDs facilitate top-down distribution by modeling how component reliabilities contribute to system success. For example, combining a switch's MTBF of 5,000 hours with a fan's L10 life of 1,000 hours in series gives an assembly-level target of roughly 74% reliability at 1,000 hours: the switch contributes $e^{-1000/5000} \approx 0.82$ under an exponential life model, and the fan contributes 0.90 by the definition of L10 life.[27] This visual and analytical approach highlights weak points and supports iterative refinement during requirement establishment.[31]
Human Factors in Reliability
Reliability Culture
Reliability culture refers to an organizational environment where focus, proaction, and priority guide efforts to prevent failures and achieve consistent performance, shifting from reactive fixes to preventive measures.[33] This culture is built on leadership commitment, where senior executives establish a clear vision, allocate resources, and model proactive behaviors to integrate reliability into core operations.[33] Employee training plays a pivotal role, addressing skill gaps through hands-on programs in areas like root cause analysis and precision maintenance, reinforced by supervisory involvement to ensure practical application.[33] Such training fosters a shared understanding that reliability is a collective responsibility, enhancing overall organizational resilience.[34]
Key practices in reliability culture include robust incident reporting systems, such as Failure Reporting, Analysis, and Corrective Action Systems (FRACAS), which encourage employees to document and analyze failures without fear of reprisal, enabling early identification of chronic issues.[33] Continuous improvement loops, often through methods like root cause failure analysis, target recurring problems (responsible for up to 80% of operational losses) and promote incremental enhancements in processes and equipment precision.[33] Incentives, including recognition programs and rewards for proactive contributions, such as identifying potential risks or achieving error-free milestones, motivate teams and reinforce positive behaviors, like those seen in monthly awards for safety-focused innovations.[35] These practices create feedback mechanisms that drive iterative learning and cultural embedding of reliability principles.[34]
A notable case study is Boeing's response to the 737 MAX incidents in 2018 and 2019, which exposed cultural shortcomings prioritizing production over safety.[36] After the incidents, Boeing implemented comprehensive reforms, including a Safety Management System (SMS) overhaul with proactive risk identification through data analytics and phased audits to mitigate hazards across the product lifecycle.[36] Leadership emphasized cultural change via mandatory Positive Safety Culture training for over 160,000 employees and managers, while enhancing the Speak Up reporting channel, resulting in a 220% increase in safety reports from 2023 to 2024, signaling greater proactive risk awareness and transparency.[36]
Cultural health in reliability engineering is assessed through metrics like error or incident reporting rates, where higher voluntary reporting, such as reports per employee, indicates a non-punitive environment that promotes learning from near-misses.[37] Training completion rates also serve as key indicators, measuring the organization's investment in skill-building; for instance, full participation in reliability-focused programs correlates with reduced failure recurrence.[36] These metrics, tracked via surveys and system data, help gauge the shift toward a proactive culture, with benchmarks like increasing reporting volumes demonstrating improved employee engagement and risk mitigation effectiveness.[38]
Human Errors and Mitigation
Human errors represent a significant contributor to system failures in reliability engineering, often stemming from cognitive and behavioral limitations during operation, maintenance, or design phases. In complex systems, these errors can propagate through interconnected components, leading to cascading failures that undermine overall reliability. According to established models, human errors are categorized into slips, which involve unintended actions due to attentional failures; lapses, characterized by memory or attention deficits resulting in omissions; and mistakes, which arise from flawed planning or decision-making processes.[39] These distinctions, derived from cognitive psychology, highlight that slips and lapses typically occur in routine, skill-based tasks, while mistakes involve higher-level knowledge or rule-based judgments.[39] Empirical data underscores the prevalence of human errors in high-stakes environments. For instance, in nuclear power plants, approximately 70-80% of reported events and incidents are attributed to human factors, including errors in procedure execution or oversight during monitoring.[40] This statistic reflects the challenges of maintaining reliability in sociotechnical systems where human performance interfaces with automated controls and safety barriers. To systematically investigate these errors, frameworks like the Human Factors Analysis and Classification System (HFACS) provide a structured taxonomy for root cause analysis. Developed originally for aviation but widely adopted in reliability engineering, HFACS organizes errors into levels—unsafe acts, preconditions for unsafe acts, unsafe supervision, and organizational influences—enabling identification of latent contributors beyond immediate operator actions. Mitigation strategies in reliability engineering emphasize proactive design and procedural interventions to reduce error likelihood. 
Human factors engineering (HFE) integrates ergonomic principles into system design, ensuring interfaces and workflows align with human capabilities to minimize cognitive overload and perceptual mismatches.[41] Complementary approaches include error-proofing techniques such as poka-yoke, which embed physical or logical safeguards to prevent errors at the source, like mismatched connectors that inhibit incorrect assembly.[42] Usability testing further supports mitigation by evaluating user interactions with prototypes or systems under realistic conditions, identifying potential error traps before deployment and quantifying error rates to inform iterative improvements.[43] These methods collectively foster resilient systems by addressing human fallibility as an inherent design parameter rather than an anomaly.
Design for Reliability
Prediction Methods
Prediction methods in reliability engineering enable engineers to forecast the performance and longevity of systems during the design phase, allowing for proactive mitigation of potential failures before production or deployment. These methods primarily fall into two categories: statistics-based approaches, which rely on historical data and empirical models, and physics-of-failure techniques, which examine underlying physical mechanisms driving degradation. By estimating metrics such as failure rates and mean time between failures (MTBF), designers can allocate reliability budgets, select components, and refine architectures to meet specified targets. Statistics-based prediction methods use aggregated failure data from past systems to estimate reliability parameters, often assuming constant failure rates under the exponential distribution model. A key metric is the mean time between failures (MTBF), defined as the reciprocal of the constant failure rate λ, expressed as MTBF = 1/λ, where λ represents the average number of failures per unit time.[44] This approach facilitates quick assessments by summing component-level failure rates to predict system-level reliability. Handbooks like MIL-HDBK-217 provide empirical failure rate models for electronic parts, incorporating factors such as quality levels, operating environments, and stress ratings to calculate λ for individual components.[45] For instance, the failure rate for a resistor might be derived from base rates adjusted by temperature and power stress multipliers, enabling bottom-up system predictions.[46] These methods are particularly useful for early-stage comparisons of design alternatives but can overestimate failures in modern systems due to outdated databases. 
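The bottom-up statistical prediction described above, summing component failure rates and taking the reciprocal, can be sketched as follows. The part list and failure rates are invented for illustration and are not taken from MIL-HDBK-217 or any other handbook. The sketch assumes a series system with constant failure rates, expressed in failures per million hours (FPMH).

```python
def predict_system_mtbf(parts: dict) -> tuple:
    """Sum part failure rates and convert the total to an MTBF.

    `parts` maps a part name to (quantity, failure rate in FPMH).
    Returns (system failure rate in FPMH, system MTBF in hours),
    assuming every part failure causes a system failure (series model).
    """
    total_fpmh = sum(qty * rate for qty, rate in parts.values())
    if total_fpmh <= 0:
        raise ValueError("total failure rate must be positive")
    return total_fpmh, 1e6 / total_fpmh


# Hypothetical bill of materials with assumed failure rates.
bill_of_materials = {
    "resistor": (40, 0.002),
    "capacitor": (25, 0.005),
    "integrated circuit": (6, 0.050),
    "connector": (4, 0.020),
}

lam, mtbf = predict_system_mtbf(bill_of_materials)
print(f"system failure rate = {lam:.3f} FPMH")
print(f"predicted MTBF      = {mtbf:,.0f} hours")
```

In a handbook-based prediction, each base rate would first be multiplied by adjustment factors for quality, environment, and stress before being summed.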
Modern alternatives like 217Plus™ address these limitations by incorporating updated field data.[47]
In contrast, physics-of-failure (PoF) methods focus on identifying and modeling the root causes of degradation, such as material fatigue, corrosion, or electromigration, to predict failure under specific operating conditions. This approach analyzes how stresses like thermal cycling, vibration, or humidity interact with a product's materials and geometry to initiate and propagate damage. For thermal stress, the Arrhenius equation models the acceleration factor for extrapolating high-temperature test data to normal use conditions, given by:
$$AF = \exp\left[\frac{E_a}{k}\left(\frac{1}{T_{use}} - \frac{1}{T_{test}}\right)\right]$$
where $AF$ is the acceleration factor, $E_a$ is the activation energy, $k$ is Boltzmann's constant, $T_{use}$ is the absolute use temperature, and $T_{test}$ is the absolute test temperature in Kelvin.[48] For example, this allows estimation of how elevated temperatures accelerate solder joint fatigue in electronics. For mechanical fatigue, PoF employs damage accumulation models like Miner's rule to quantify cumulative wear from cyclic loads.[49] By simulating these mechanisms using finite element analysis or probabilistic tools, PoF provides mechanistic insights that guide design modifications to enhance endurance. Additionally, as of 2025, AI and machine learning are increasingly integrated into PoF for enhanced simulation and prediction accuracy.[50]
A complementary tool in prediction is Failure Modes and Effects Analysis (FMEA), which systematically identifies potential failure modes, their causes, and effects to prioritize risks during design.
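Returning to the Arrhenius model for thermal stress, a minimal sketch of the acceleration factor $AF = \exp[(E_a/k)(1/T_{use} - 1/T_{test})]$ is given below. The activation energy of 0.7 eV and the two temperatures are assumed values for illustration only; in practice $E_a$ depends on the specific failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev: float,
                                  use_temp_c: float,
                                  test_temp_c: float) -> float:
    """Return AF = exp[(Ea / k) * (1 / T_use - 1 / T_test)].

    Temperatures are given in degrees Celsius and converted to Kelvin.
    """
    t_use = use_temp_c + 273.15
    t_test = test_temp_c + 273.15
    if t_use <= 0 or t_test <= 0:
        raise ValueError("temperatures must be above absolute zero")
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_test)
    return math.exp(exponent)


# Assumed scenario: 0.7 eV mechanism, 55 C field use, 125 C oven test.
af = arrhenius_acceleration_factor(0.7, use_temp_c=55.0, test_temp_c=125.0)
print(f"acceleration factor = {af:.1f}")
print(f"1,000 test hours represent about {1000 * af:,.0f} use hours")
```

Because the result is exponential in $E_a$, small errors in the assumed activation energy change the extrapolated life substantially, which is one reason such estimates are usually reported with the assumption stated.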
FMEA assigns severity, occurrence, and detection ratings to each mode, yielding a risk priority number (RPN), the product of the three ratings, to focus efforts on high-impact areas, such as vibration-induced cracks in structural components.[51] This qualitative-to-quantitative process integrates with both statistical and PoF methods to refine predictions by highlighting vulnerabilities not captured in aggregate data.[52]
Compared to statistics-based methods, which offer rapid estimates using generic data for initial screening, PoF excels in root-cause prevention by tailoring predictions to specific designs and environments, leading to more accurate and actionable outcomes.[50] While statistical approaches like MIL-HDBK-217 are efficient for legacy systems, PoF reduces over-design and supports innovation in complex products by addressing emerging failure mechanisms.[53]
Improvement Techniques
Improvement techniques in reliability engineering focus on applying insights from predictive analyses to iteratively refine designs, thereby enhancing system performance and longevity. These methods aim to mitigate potential failure modes identified during the design phase, ensuring that products meet or exceed reliability targets without excessive cost increases. By integrating such techniques early, engineers can achieve robust systems that perform consistently under varying conditions. Key techniques include redundancy, derating, and robust design optimization. Redundancy involves incorporating duplicate or backup components to ensure continued operation if a primary element fails, thereby increasing overall system availability.[54] Derating complements this by operating components below their maximum specified ratings—such as voltage, temperature, or current—to reduce stress and extend service life.[55] Robust design optimization seeks to minimize sensitivity to environmental variations and manufacturing tolerances, creating systems that maintain performance despite external perturbations. This approach, grounded in principles like axiomatic design, systematically allocates reliability across subsystems to optimize the entire architecture.[56] Common tools for implementing these techniques include the Taguchi methods and Quality Function Deployment (QFD). 
The Taguchi methods employ statistical experimental designs to identify control factors that reduce variability in product performance, effectively making designs more robust against noise factors like temperature fluctuations or material inconsistencies; this has been shown to lower development costs by streamlining the identification of optimal parameters.[57] QFD, meanwhile, translates customer reliability needs, such as mean time between failures, into technical specifications through a structured matrix, ensuring that design decisions align with end-user expectations and prioritize high-impact features.[58]
Clear and unambiguous language in specifications is crucial for effective reliability improvements, as it prevents misinterpretation during design and testing phases. For instance, explicitly defining "failure" as any degradation beyond a specified threshold (e.g., a 10% drop in output) avoids subjective assessments and enables precise measurement of reliability metrics.[59]
A representative case involves enhancing automotive electronics reliability against vibration-induced failures through targeted material selection. In electronic control units exposed to road vibrations, selecting potting materials with high damping coefficients, such as silicone-based compounds, reduces stress on solder joints and components under simulated automotive conditions.
Reliability Modeling
Theoretical Foundations
Reliability theory forms the probabilistic foundation for analyzing the performance and failure of engineering systems over time. It draws heavily from stochastic processes to model the random nature of failures and survival analysis to quantify the probability that a system or component will function without failure under stated conditions for a specified period. Survival analysis, which originated in biostatistics but has been adapted to engineering contexts, treats failure times as realizations of stochastic processes, enabling the estimation of hazard functions that describe the instantaneous failure rate. These frameworks allow engineers to predict and mitigate risks by characterizing uncertainty in system lifetimes through probability distributions and process models.
A cornerstone of reliability modeling is the Weibull distribution, widely used for its flexibility in representing various failure patterns, from infant mortality to wear-out phases. Introduced by Waloddi Weibull in his seminal 1951 paper, it provides a versatile tool for failure time analysis across materials and mechanical systems. The probability density function of the two-parameter Weibull distribution is:
$$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{-(t/\eta)^{\beta}}, \quad t \ge 0$$
where $\beta$ is the shape parameter influencing the failure rate's behavior (e.g., $\beta < 1$ indicates decreasing hazard, $\beta = 1$ constant hazard, $\beta > 1$ increasing hazard), and $\eta$ is the scale parameter representing the characteristic life. The corresponding reliability function, or survival function, is $R(t) = e^{-(t/\eta)^{\beta}}$, which gives the probability of survival beyond time $t$. Because a single Weibull distribution has a monotonic hazard, each region of the bathtub curve is represented by a different value of $\beta$; this ability to cover all three regions makes the distribution essential for life data analysis in reliability engineering.[60]
For non-repairable systems composed of multiple components, reliability is often assessed using combinatorial structures like series and parallel configurations, assuming component independence.
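The effect of the Weibull shape parameter on the hazard rate, as described above, can be checked with a short sketch. It evaluates $R(t) = e^{-(t/\eta)^{\beta}}$ and the hazard $h(t) = (\beta/\eta)(t/\eta)^{\beta-1}$ for three shape values; the characteristic life of 1,000 hours is an assumed figure.

```python
import math


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """R(t) = exp(-(t / scale) ** shape) for the two-parameter Weibull."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1), defined for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


scale = 1000.0  # characteristic life in hours (illustrative)

for shape in (0.5, 1.0, 3.0):
    early = weibull_hazard(100.0, shape, scale)
    late = weibull_hazard(900.0, shape, scale)
    if late < early:
        trend = "decreasing"
    elif late == early:
        trend = "constant"
    else:
        trend = "increasing"
    print(f"shape={shape}: h(100)={early:.5f}, h(900)={late:.5f} -> {trend} hazard")

# At t equal to the scale parameter, R = exp(-1) (about 0.368) for any shape.
print(f"R(scale) = {weibull_reliability(scale, 2.0, scale):.3f}")  # prints "R(scale) = 0.368"
```

The last line shows why the scale parameter is called the characteristic life: about 63.2% of units are expected to have failed by that time regardless of the shape.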
In a series system, where the system fails if any component fails, the overall reliability is the product of the individual component reliabilities: $R_s = \prod_{i=1}^{n} R_i$. Conversely, in a parallel system, where the system functions as long as at least one component operates, the reliability is $R_p = 1 - \prod_{i=1}^{n} (1 - R_i)$. These formulas, derived from basic probability principles, extend to more complex networks via minimal path or cut sets, providing a theoretical basis for system-level predictions.
Repairable systems, which can transition between operational and failed states through maintenance, are modeled using continuous-time Markov chains to capture dynamic behavior. In these models, states represent system conditions (e.g., fully operational, degraded, or failed), and transition rates between states reflect failure and repair intensities, often assumed constant in basic formulations. The steady-state availability, or long-run proportion of time the system is operational, is computed from the balance equations of the Markov process, such as solving $\pi Q = 0$ with $\sum_i \pi_i = 1$, where $\pi$ is the stationary distribution and $Q$ the infinitesimal generator matrix. This approach accounts for time dependencies absent in static reliability functions, enabling analysis of maintainability and downtime.
Fundamental assumptions underpin these theoretical models to ensure tractability. Component failures are typically assumed independent, meaning the failure of one does not influence others, which simplifies probability calculations but may not hold in interconnected systems. Additionally, basic models often posit constant hazard rates, aligning with the exponential distribution as a special case of Weibull ($\beta = 1$), implying memoryless failures where the probability of failure is independent of age. These assumptions facilitate analytical solutions but require validation or extension (e.g., via time-varying rates) for real-world applications.
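The series and parallel formulas, and the simplest repairable case, can be sketched as follows. The component reliabilities and rates are assumed values. The availability function covers only a two-state (up/down) Markov model with constant failure rate $\lambda$ and repair rate $\mu$, for which the balance equations give $\pi_{up} = \mu / (\lambda + \mu)$; larger state spaces would need a linear solver.

```python
import math


def series_reliability(reliabilities) -> float:
    """System works only if every independent component works."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities) -> float:
    """System works if at least one independent component works."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def two_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state probability of the 'up' state: mu / (lambda + mu)."""
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate > 0")
    return repair_rate / (failure_rate + repair_rate)


components = [0.95, 0.99, 0.90]  # illustrative component reliabilities

print(f"series   = {series_reliability(components):.5f}")    # about 0.84645
print(f"parallel = {parallel_reliability(components):.5f}")  # about 0.99995

# Assumed rates: one failure per 1,000 h, one repair per 10 h.
print(f"availability = {two_state_availability(0.001, 0.1):.4f}")  # about 0.9901
```

The series result is lower than the weakest component, while the parallel result is higher than the strongest, which is the basic argument for redundancy under the independence assumption.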
Quantitative parameters like mean time to failure build on these foundations, as detailed in subsequent analyses.
Quantitative Parameters
Quantitative parameters in reliability engineering provide measurable indicators for assessing the performance and dependability of systems, derived from probabilistic models of failure and repair processes. These metrics quantify the likelihood and timing of failures, enabling engineers to predict system behavior under specified conditions. Central to this are the mean time to failure (MTTF) and mean time between failures (MTBF), which represent expected operational durations for non-repairable and repairable systems, respectively.
The MTTF is defined as the expected lifetime of a non-repairable system, calculated as the integral of the reliability function over time:
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt$$
where $R(t)$ is the probability that the system survives beyond time $t$.[61] For repairable systems, the MTBF extends this by incorporating repair time, given by $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$, where MTTR is the mean time to repair.[61] The constant failure rate $\lambda$, often assumed in exponential distributions for constant hazard scenarios, relates inversely to MTTF as $\lambda = 1/\mathrm{MTTF}$, representing the instantaneous probability of failure per unit time.[23] Availability $A$, a key measure of system uptime, is the steady-state proportion of time the system is operational, expressed as $A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})$.[61]
Mission reliability extends these parameters to time-dependent scenarios, defined as the probability that a system successfully completes a specified mission profile, which may involve varying operational phases and durations.
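The MTTF, MTBF, and availability relations can be checked numerically with a small sketch. It integrates an exponential $R(t)$ with the trapezoidal rule to recover $\mathrm{MTTF} = 1/\lambda$, then applies $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$ and $A = \mathrm{MTTF}/(\mathrm{MTTF} + \mathrm{MTTR})$, following the convention used in this section. The failure rate and repair time are assumed values.

```python
import math


def mttf_numeric(reliability, t_max: float, steps: int = 100_000) -> float:
    """Trapezoidal estimate of the integral of R(t) from 0 to t_max.

    t_max should be large enough that R(t_max) is negligible.
    """
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


failure_rate = 1.0 / 2000.0  # assumed: one failure per 2,000 hours
mttr = 8.0                   # assumed mean time to repair, hours

# Integrate out to 20 times the expected life, where R(t) is about 2e-9.
mttf = mttf_numeric(lambda t: math.exp(-failure_rate * t), t_max=40_000.0)
mtbf = mttf + mttr
availability = mttf / mtbf

print(f"MTTF         = {mttf:.1f} h (analytic value 1/lambda = 2000.0 h)")
print(f"MTBF         = {mtbf:.1f} h")
print(f"availability = {availability:.4f}")
```

The same `mttf_numeric` helper accepts any reliability function, so a Weibull or empirical survival curve could be substituted where no closed form for the integral is convenient.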
Mission reliability incorporates time-dependent reliability functions to account for mission-specific stresses, such as phased operations in aerospace systems, where success requires fault-free performance over the entire required timeframe at the mandated performance level.[62] For instance, in non-repairable systems under exponential failure assumptions, mission reliability simplifies to $R(t_m) = e^{-\lambda t_m}$ for a mission of duration $t_m$, but more complex profiles use cumulative distribution functions tailored to the mission timeline.[63]
At the system level, reliability aggregates component reliabilities, particularly for k-out-of-n configurations where the system functions if at least k of n independent, identically reliable components succeed. Assuming binary component states and constant reliability $R$, the system reliability follows the binomial distribution:
$$R_{sys} = \sum_{i=k}^{n} \binom{n}{i} R^{i} (1 - R)^{n - i}$$
This formula captures redundancy effects, with $R$ often derived from individual MTTF values via $R = e^{-t/\mathrm{MTTF}}$ for a given mission time $t$.[63] For example, in a 2-out-of-3 system the sum reduces to $R_{sys} = 3R^2(1 - R) + R^3 = 3R^2 - 2R^3$, which exceeds $R$ whenever $R > 0.5$, illustrating how redundancy boosts overall dependability.[63]
Sensitivity analysis quantifies how variations in these parameters influence overall system reliability, identifying critical factors for design prioritization. It involves computing derivatives of reliability metrics with respect to inputs like failure rates or component reliabilities, often using adjoint methods or direct differentiation to assess impacts on $R_{sys}$ or availability. For instance, a 10% increase in $\lambda$ for a key component can reduce system reliability by several percentage points in redundant setups, guiding targeted improvements without exhaustive re-modeling.[64] This approach, rooted in extending theoretical models, ensures parameters are evaluated for robustness across operational uncertainties.
Reliability Testing
Test Planning and Requirements
Test planning in reliability engineering establishes the framework for verifying that systems or components meet specified reliability goals, involving the definition of clear objectives, resource allocation, and procedural steps to ensure efficient and effective testing. This phase begins with identifying the reliability targets, such as mean time between failures (MTBF) or failure rates, and aligning them with project requirements to guide subsequent test execution.[65]
Key requirements include sample size determination, which relies on statistical methods to achieve desired confidence in reliability estimates. For zero-failure demonstration tests, the non-parametric binomial approach calculates sample size using the formula $n = \ln(\beta)/\ln(R)$, rounded up to the next integer, where $R$ is the target reliability and $\beta$ is the consumer's risk (1 - confidence level); for instance, demonstrating 90% reliability at 90% confidence gives $n = \ln(0.1)/\ln(0.9) \approx 21.9$, so 22 samples with no failures are required.[66] Success criteria are defined as the number of allowable failures or survival probabilities that confirm the reliability target, often set via the binomial distribution to balance Type I and Type II errors.[67] These requirements must align with standards like IEC 61508, which requires demonstration of hardware reliability targets using methods such as failure modes, effects, and diagnostic analysis (FMEDA), environmental stress simulations, and proof-of-design tests to meet architectural constraints (Routes 1H or 2H) for safety integrity levels (SILs) 2 or 3.[68][69][70]
Planning steps emphasize risk-based prioritization to focus resources on high-impact areas, employing probabilistic risk analysis (PRA) and fault tree models to rank test cases by potential failure consequences and likelihood.[71] Test environments are set up to replicate operational conditions, including temperature, vibration, and humidity controls, to ensure test results reflect real-world performance.[69] Test duration is determined based on confidence levels, such as planning for 90% confidence in 95% reliability, which may require extended exposure until the statistical threshold is met, often using tables from hypergeometric distributions for finite populations to avoid underestimation.[72]
Reliability tests are categorized into qualification testing, which verifies that the design meets reliability specifications under accelerated stresses to uncover failure mechanisms, and acceptance testing, which screens production lots via sampling to confirm manufacturing consistency and compliance with customer requirements.[73]
Resource considerations involve cost-benefit analysis to justify investments in test fixtures, instrumentation, and data collection systems, weighing testing costs against in-service failure penalties using Bayesian growth models; for example, optimal test durations minimize total expected costs by balancing hourly testing expenses (e.g., £500) with fault correction values (e.g., £50,000) and risk multipliers.[74] This approach ensures economical planning without compromising demonstration of reliability objectives.
Methods and Accelerated Approaches
Reliability testing methods encompass a range of techniques designed to evaluate product endurance under controlled conditions, with acceleration strategies employed to compress timelines and reveal potential weaknesses more rapidly than standard use conditions. Constant stress testing involves applying a fixed level of stress—such as elevated temperature or voltage—throughout the duration of the test to all specimens, allowing for the observation of failure times under steady-state acceleration.[75] This approach is particularly useful for estimating mean time to failure (MTTF) when failure mechanisms are expected to remain consistent at the applied stress level.[76] Step-stress testing, in contrast, incrementally increases the stress on test units at predetermined intervals or upon reaching a specified number of failures, starting from a baseline and escalating to higher levels while holding each step constant for a defined period.[77] This method efficiently uncovers failure thresholds by simulating progressive degradation, often used when resources limit the number of test samples available for parallel constant-stress runs.[78] Highly Accelerated Life Testing (HALT) pushes products beyond operational limits using rapid, multi-axis stressors like temperature extremes, vibration, and humidity in a "test-fail-fix" cycle to identify design weaknesses early in development.[79] HALT typically employs small sample sizes and aggressive step stresses to provoke about 85% of field-relevant failure modes, facilitating iterative improvements without exhaustive statistical validation.[80] Accelerated testing leverages environmental factors such as temperature and voltage to expedite failure occurrences while preserving the underlying physics of degradation. 
Temperature acceleration is commonly modeled using the Arrhenius equation, which relates reaction rates to thermal energy; the acceleration factor (AF) quantifies how much faster failures occur at test conditions compared to use conditions. The formula is given by: $AF = \exp\!\left[\frac{E_a}{k}\left(\frac{1}{T_{\text{use}}} - \frac{1}{T_{\text{test}}}\right)\right]$, where $E_a$ is the activation energy (in eV), $k$ is Boltzmann's constant ($8.617 \times 10^{-5}$ eV/K), $T_{\text{use}}$ is the use temperature (in Kelvin), and $T_{\text{test}}$ is the elevated test temperature (in Kelvin).[75] For voltage acceleration in electronic components, an inverse power law model is often applied, where AF scales with the ratio of stresses raised to a power exponent, typically derived empirically.[48] These factors enable extrapolation of test data to predict long-term reliability, assuming the dominant failure mechanisms do not change under acceleration.
Data analysis in these tests relies on statistical methods to interpret failure times, accounting for incomplete observations through censoring techniques. Right-censoring occurs when tests end before all units fail, such as due to time constraints or reaching a quota of failures, providing partial information on surviving units.[81] Weibull plotting is a graphical method for estimating distribution parameters like shape ($\beta$) and scale ($\eta$), where failure data are plotted on Weibull probability paper; a straight line indicates a good fit, and its slope estimates the shape parameter, with values below 1 pointing to infant mortality and values above 1 to wear-out.[81] For censored data, adjusted plotting positions incorporate survival probabilities to avoid bias in parameter estimation, enabling reliable predictions of reliability metrics like the B10 life (time to 10% failure).[82]
A key limitation of accelerated approaches is the risk of introducing extraneous failure modes not representative of field use, particularly if stresses exceed operational relevance and trigger atypical mechanisms like material phase changes or unintended interactions.[83] Validation through failure mode analysis and comparison to known use-level behaviors is essential to ensure extrapolation validity, as over-acceleration can undermine the test's predictive power.[84]
Specialized Applications
Software Reliability
Software reliability engineering applies principles of reliability to software systems, focusing on predicting, measuring, and improving the probability that software operates without failure under specified conditions for a given time period. Unlike hardware, software exhibits non-degradational failures, meaning defects do not wear out over time but remain latent until triggered, leading to sudden, unpredictable failures. Additionally, software can be replicated indefinitely without physical degradation, yet it often suffers from high defect density due to the complexity of code and human error in development, with typical densities ranging from 0.1 to 5 defects per thousand lines of code (KLOC) depending on development stage and complexity, and mature high-reliability systems achieving below 1 per KLOC.[85] These challenges necessitate specialized models and techniques tailored to software's intangible and deterministic nature.
A foundational model in software reliability is the Jelinski-Moranda model, introduced in 1972, which assumes that software contains an initial number of faults $N$, each equally likely to cause failure, and that faults are removed upon detection without introducing new ones. The failure rate after the $(i-1)$th failure is given by $\lambda_i = \phi\,(N - i + 1)$, where $\phi$ is the hazard rate per remaining fault and $i$ indexes the failures during debugging. The model treats the times between failures as independent exponential variables whose rate drops by $\phi$ with each fix, enabling estimation of remaining faults and reliability growth as testing progresses. It has been widely adopted for its simplicity and as a basis for subsequent models, though it assumes perfect debugging, which limits its applicability in imperfect environments.[86][87]
Software reliability growth models like the Musa-Okumoto logarithmic model, developed in 1984, extend these ideas to operational profiles during development, predicting failure intensity based on execution time rather than calendar time.
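The stepwise hazard of the Jelinski-Moranda model can be sketched in a few lines of Python. The function names and the values of $N$ and $\phi$ below are hypothetical and serve only to show how the expected time between failures grows as faults are removed; a real analysis would estimate both parameters from observed failure data.

```python
def jm_hazard(n_faults, phi, i):
    """Hazard rate in force before the i-th failure: phi * (N - i + 1)."""
    if not 1 <= i <= n_faults:
        raise ValueError("i must lie between 1 and N")
    return phi * (n_faults - i + 1)


def jm_expected_intervals(n_faults, phi):
    """Expected time to each successive failure, 1 / lambda_i, for i = 1..N."""
    return [1.0 / jm_hazard(n_faults, phi, i) for i in range(1, n_faults + 1)]


# Hypothetical parameters, for illustration only.
N_FAULTS = 5      # assumed initial number of faults
PHI = 0.02        # assumed hazard contribution per remaining fault

intervals = jm_expected_intervals(N_FAULTS, PHI)
for index, interval in enumerate(intervals, start=1):
    print(f"failure {index}: expected wait {interval:.2f} time units")
```

Under these assumed values the expected intervals lengthen from 10 to 50 time units as faults are removed, which is the reliability growth the model predicts. The Musa-Okumoto model introduced above replaces this stepwise hazard with a continuously decreasing failure intensity.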
The Musa-Okumoto model assumes failures follow a logarithmic Poisson process, where the cumulative expected failures m(t) increase logarithmically with operational usage, reflecting decreasing failure rates as faults are exposed and removed under realistic workloads. This approach is particularly useful for time-constrained projects, allowing predictions of operational reliability before full deployment by incorporating factors like testing effort and fault detection rates. It has influenced standards for software reliability assessment in mission-critical systems.[88]
Key techniques for enhancing software reliability include fault injection, which deliberately introduces faults into the system to evaluate error detection and recovery mechanisms, thereby validating fault tolerance under simulated adverse conditions. Code coverage testing measures the proportion of code executed during tests, such as branch or statement coverage, to ensure comprehensive fault exposure and reduce undetected defects. Metrics like defect density, calculated as the number of faults per KLOC, provide a quantitative indicator of code quality, with lower densities correlating to higher reliability; for instance, benchmarks suggest under 1 defect per KLOC for high-reliability software. These methods, often integrated into development lifecycles, support iterative improvements without relying on hardware-specific degradation analysis.[89][90]
Structural Reliability
Structural reliability engineering applies probabilistic methods to evaluate the performance of civil and mechanical structures under various loads and material variabilities, ensuring they withstand environmental and operational stresses over their intended lifespan. Central to this field is the probability of failure, defined as $P_f = P(R \le S)$, where $R$ represents the structure's resistance (e.g., material strength or capacity) and $S$ denotes the applied load (e.g., dead, live, or environmental forces). This formulation captures the uncertainty in both resistance and load, often modeled as random variables with statistical distributions, allowing engineers to quantify the likelihood of exceedance and design for acceptable risk levels.[91]
Load and resistance factor design (LRFD) is a widely adopted approach that integrates structural reliability principles into practice by applying load factors to amplify expected loads and resistance factors to reduce nominal capacities, achieving a target reliability index typically around 3.0 for common structures.[92] Developed from probabilistic calibrations, LRFD ensures consistent safety margins across different load types and materials, such as steel and concrete, by aligning designs with a low probability of failure over 50-year reference periods.[93] This method contrasts with allowable stress design by explicitly accounting for variabilities, promoting more efficient use of materials while maintaining reliability.[94]
To assess reliability amid uncertainties, methods like Monte Carlo simulation are employed, generating thousands of random samples from distributions of variables such as concrete compressive strength or wind load intensities to estimate failure probabilities.[95] For instance, in analyzing elevated water tanks, simulations incorporate wind speed variability and material properties to compute system-level risks, providing robust estimates even for complex, nonlinear responses.[96] These simulations are particularly valuable for capturing tail-end events in load spectra, offering higher accuracy than analytical approximations for rare failure scenarios.
Standards such as ASCE 7-22 provide the framework for seismic reliability in building design, specifying load combinations and response spectra calibrated to achieve uniform reliability targets across seismic hazard levels.[97] Within this, first-order second-moment (FOSM) approximations are used to efficiently compute reliability indices by linearizing limit state functions around the mean values of loads and resistances and propagating their variances, facilitating quick assessments during code calibration.[98] ASCE 7-22's provisions target a collapse probability of about 1% in 50 years for structures in high-seismic zones, informed by probabilistic seismic hazard analysis.[97]
In applications like bridge and building design, structural reliability addresses long-term degradation from fatigue and corrosion, which progressively reduce resistance over decades of service.[99] For steel bridges, fatigue reliability models account for cyclic traffic loads, using fracture mechanics to predict crack growth and set inspection intervals that keep failure risks below $10^{-4}$ annually.[100] In reinforced concrete buildings, corrosion-induced section loss is modeled stochastically, incorporating chloride ingress and environmental exposure to evaluate time-dependent reliability and inform protective measures like coatings or cathodic protection.[101] These considerations ensure structures remain safe against cumulative damage, balancing initial design costs with lifecycle maintenance.
Comparisons and Distinctions
Versus Safety Engineering
Reliability engineering primarily focuses on the statistical prediction and avoidance of failures to ensure that systems perform their intended functions over a specified period, often quantified through metrics like mean time between failures (MTBF).[102] In contrast, safety engineering emphasizes the prevention of hazardous events that could cause harm to people, property, or the environment, prioritizing the elimination of risks associated with system malfunctions rather than overall operational consistency.[103] While both disciplines aim to mitigate failures, reliability targets the probability of successful operation under normal conditions, whereas safety addresses worst-case scenarios where failures could lead to accidents, such as in aerospace or nuclear systems.[104] In terms of fault tolerance, reliability engineering employs redundancy—such as duplicate components or parallel systems—to maintain functionality and extend operational life when individual elements fail, thereby improving overall system availability.[105] Safety engineering, however, incorporates fail-safe mechanisms designed to detect faults and transition the system to a benign state, like emergency shutdowns in chemical plants or parachutes in aircraft, to avert harm even if full functionality is lost.[103] For instance, a redundant power supply in a data center enhances reliability by ensuring continuous operation, but a fail-safe circuit breaker in an industrial machine prioritizes safety by halting operations to prevent electrical hazards.[102] Mission reliability in reliability engineering encompasses the probability of a system completing its operational objectives within a defined mission profile, accounting for environmental stresses and usage patterns, as seen in NASA's space missions.[106] Basic reliability, by comparison, focuses on inherent component durability without mission-specific contexts. 
Safety engineering, however, prioritizes hazard elimination across all phases, ensuring that even mission-critical systems do not compromise human or environmental safety, such as through inherent design features that avoid single points of failure leading to catastrophes.[104]
Both fields address common cause failures (CCFs), where a single event impacts multiple components, using the beta-factor model to quantify the fraction (β) of total failure rates attributable to CCFs, typically ranging from 0.01 to 0.25 based on empirical data from nuclear and aerospace applications.[107] This model is shared in reliability assessments for predicting system unavailability and in safety analyses for risk quantification, but safety engineering additionally incorporates detectability and recovery factors, such as staggered testing or human intervention, to reduce CCF impacts in probabilistic risk assessments.[108] For example, in redundant safety systems like emergency diesel generators, the beta-factor helps estimate CCF probabilities, with safety protocols emphasizing post-failure diagnostics to enhance overall hazard control.[107]
Versus Quality Engineering
Reliability engineering focuses on predicting and ensuring the long-term performance of systems and products over their intended lifespan, emphasizing the probability that a product will function as required under specified conditions for a given duration.[2] In contrast, quality engineering primarily addresses short-term conformance to specifications, such as achieving defect-free production and meeting immediate customer requirements at the point of delivery.[2] This distinction underscores that while quality ensures initial functionality, reliability extends to sustained performance amid degradation, environmental stresses, and operational use. In methodologies like Six Sigma, quality engineering leverages the DMAIC framework (Define, Measure, Analyze, Improve, Control) to reduce process variation and defects in manufacturing and service delivery.[109] Reliability engineering integrates these tools but augments them with life-cycle modeling techniques, such as Weibull analysis and accelerated life testing, to address failure mechanisms beyond mere process variability and predict performance across the product's operational phases.[110] For instance, while Six Sigma targets sigma levels for immediate output quality, reliability efforts incorporate probabilistic models to forecast mean time between failures (MTBF) and mission reliability.[110] Metrics in quality engineering often include process capability indices like Cp and Cpk, which quantify how well a process meets specification limits based on variation and centering, with values above 1.33 indicating capable processes. 
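The capability indices mentioned above can be computed directly from sample data. The sketch below assumes the usual textbook definitions, $C_p = (\mathrm{USL} - \mathrm{LSL})/(6\sigma)$ and $C_{pk} = \min(\mathrm{USL} - \mu,\ \mu - \mathrm{LSL})/(3\sigma)$, which the text does not spell out; the measurements, specification limits, and function name are hypothetical.

```python
import statistics


def process_capability(samples, lsl, usl):
    """Return (Cp, Cpk) for measurements against lower/upper specification limits."""
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)   # sample standard deviation
    cp = (usl - lsl) / (6 * sigma)      # potential capability (spread only)
    cpk = min(usl - mean, mean - lsl) / (3 * sigma)  # penalizes an off-center mean
    return cp, cpk


# Hypothetical measurements and specification limits, for illustration only.
measurements = [9.9, 10.0, 10.1, 10.0, 10.0]
cp, cpk = process_capability(measurements, lsl=9.6, usl=10.3)

print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
print("capable" if cpk > 1.33 else "not capable", "by the 1.33 rule of thumb")
```

Because the assumed process mean sits closer to the upper limit than to the lower one, Cpk comes out lower than Cp; the two indices coincide only for a perfectly centered process.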
Reliability engineering, however, employs survival analysis metrics such as the reliability function R(t), representing the probability of no failure by time t, or hazard rates derived from life data to model time-dependent risks.[2]
Both disciplines overlap in the use of Failure Mode and Effects Analysis (FMEA) to identify potential failure modes and prioritize risks through severity, occurrence, and detection ratings.[52] However, quality-focused FMEA typically examines process or design conformance at production, whereas reliability engineering extends FMEA to incorporate usage stresses, environmental factors, and long-term degradation, often evolving it into Failure Modes, Effects, and Criticality Analysis (FMECA) for quantitative risk assessment over the product lifecycle.[111]
Operational and Organizational Aspects
Operational Assessment
Operational assessment in reliability engineering involves the systematic evaluation of system performance in real-world conditions through the collection and analysis of field data, enabling organizations to quantify achieved reliability and inform ongoing improvements. This process relies on post-deployment data from operational environments, such as failure reports, usage logs, and maintenance records, to validate or adjust initial reliability predictions derived from testing. Unlike controlled laboratory assessments, operational evaluation accounts for diverse stressors like environmental variations and human factors, providing a more accurate picture of long-term reliability.[112]
A key method for analyzing field failure data is Weibull analysis, which models the distribution of failure times to identify patterns such as infant mortality or wear-out phases. In applications like compressor valve reliability, Weibull plots of field data with slopes (shape parameters) less than one indicate a decreasing hazard rate, typical of early-life failures due to manufacturing defects, and enable predictions of cumulative failure probabilities over operational hours. For instance, analysis of retrofitted compressor reeds showed projected failure rates of 8-9% at 24,000 hours, guiding decisions on further modifications.
Trend tests complement this by detecting whether reliability is improving or degrading over time in field data sets. The Laplace trend test, for example, assesses deviations from a homogeneous Poisson process by comparing the mean of the observed failure times with the midpoint of the observation interval; the null hypothesis of a constant failure rate is rejected when the resulting test statistic exceeds the critical percentile at the chosen significance level (commonly 5% or 10%).
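A minimal sketch of the Laplace trend statistic follows, assuming the common textbook form for a time-truncated observation window, $U = (\bar{t} - T/2) / (T\sqrt{1/(12n)})$, where $\bar{t}$ is the mean of the $n$ failure times and $T$ is the end of the window. In this form $U$ is compared with standard normal percentiles. The failure times and the function name are hypothetical.

```python
import math


def laplace_statistic(failure_times, observation_end):
    """Laplace trend statistic for failure times observed on (0, observation_end]."""
    n = len(failure_times)
    if n == 0:
        raise ValueError("at least one failure time is required")
    mean_time = sum(failure_times) / n
    midpoint = observation_end / 2
    return (mean_time - midpoint) / (observation_end * math.sqrt(1 / (12 * n)))


# Hypothetical cumulative failure times (hours) over a 1000 h window.
times = [400, 600, 700, 800, 850, 900, 950, 980]
u = laplace_statistic(times, observation_end=1000)

# 1.96 is the two-sided 5% critical value of the standard normal distribution.
print(f"U = {u:.2f}")
print("trend detected" if abs(u) > 1.96 else "no significant trend")
```

A positive statistic means failures cluster late in the window (deteriorating reliability), and a negative one means they cluster early (improving reliability). The assumed data give a value of about 2.67, which would reject a constant failure rate at the 5% level.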
These tests are essential for repairable systems, where increasing or decreasing failure intensities signal the need for interventions.[113][112] Common metrics in operational assessment include achieved mean time between failures (MTBF) derived from warranty claims and probability of failure (PoF) curves. Achieved MTBF is calculated by dividing total operational exposure time—estimated from warranty claim durations and unit sales—by the number of reported failures, offering a practical measure of field performance that accounts for varying usage patterns.[114] Warranty data analysis can incorporate Weibull methods to estimate reliability metrics. PoF curves, often constructed via lifetime variability models, plot failure probabilities over time, incorporating statistical distributions updated with operational data to reduce uncertainty and prioritize inspections for high-risk assets. These curves provide more precise forecasts than conservative standards like API 581, tightening as reliable data accumulates.[115][116] Feedback loops integrate operational data into design iterations, fostering continuous reliability enhancement. In aviation, fleet monitoring programs collect data on nonroutine events and maintenance from aircraft systems, analyzing trends to adjust design parameters, such as component stress levels or task intervals, within continuous airworthiness maintenance frameworks. This approach, as outlined in Federal Aviation Administration guidance, uses root cause analysis of fleet-wide data to refine designs, ensuring sustained operational reliability without safety compromises.[32] Challenges in operational assessment often stem from data quality issues, including underreporting of failures, which can bias reliability estimates toward overly optimistic values. Underreporting arises from incomplete logging or threshold-based incident criteria, leading to sparse field data sets. 
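The achieved-MTBF calculation described above, and the optimistic bias that underreporting introduces, can be illustrated with a short Python sketch. The fleet size, usage hours, failure count, and reporting fraction are hypothetical figures chosen for illustration, and `achieved_mtbf` is an illustrative helper rather than a standard function.

```python
def achieved_mtbf(total_exposure_hours, failures):
    """Point estimate of field MTBF: total exposure time divided by failure count."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for a point estimate")
    return total_exposure_hours / failures


# Hypothetical fleet data, for illustration only.
units_sold = 2_000
mean_hours_per_unit = 1_500.0
true_failures = 60
reporting_fraction = 0.75   # assumed share of failures that reach the warranty system

exposure = units_sold * mean_hours_per_unit            # total operating hours
reported_failures = round(true_failures * reporting_fraction)

mtbf_true = achieved_mtbf(exposure, true_failures)
mtbf_reported = achieved_mtbf(exposure, reported_failures)

print(f"MTBF from all failures:      {mtbf_true:,.0f} h")
print(f"MTBF from reported failures: {mtbf_reported:,.0f} h")
```

With a quarter of the failures missing from the records, the estimate rises from 50,000 to roughly 66,700 hours, which shows how incomplete field data makes a system look more reliable than it is.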
Bayesian updates address this by incorporating prior distributions—derived from historical or expert knowledge—to refine posterior estimates of failure probabilities, particularly effective with limited observations. Hierarchical Bayesian models, for instance, using beta-binomial distributions, demonstrate that informative priors significantly improve predictions in small samples (e.g., 10-40 units), converging toward accurate reliability assessments as data volume increases.[117]
Organizations and Education
Several professional organizations play a pivotal role in advancing reliability engineering through standards development, knowledge dissemination, and community building. The American Society for Quality (ASQ), founded in 1946, promotes reliability practices via its Reliability and Risk Division, which is the world's largest volunteer group focused on risk analysis and reliability training, offering resources, conferences, and certification programs to enhance professional competencies.[118] The Society of Reliability Engineers (SRE), established in 1966, provides a forum for professionals across industries to address shared challenges in reliability, emphasizing practical applications and networking opportunities.[119] The IEEE Reliability Society, a technical society within the Institute of Electrical and Electronics Engineers (IEEE), supports engineers in ensuring system reliability through technical publications, symposia like the annual RAMS conference, and educational initiatives spanning reliability modeling and prognostics. Educational pathways in reliability engineering typically include graduate degrees that build on foundational engineering principles, integrating statistics, failure analysis, and system design. Universities such as the University of Maryland offer Master of Science (M.S.), Master of Engineering (M.Eng.), and Ph.D. programs in Reliability Engineering, administered through the Center for Risk and Reliability, which emphasize multidisciplinary approaches to failure prediction, risk assessment, and reliability optimization for working professionals via on-campus and online formats.[120] Other institutions, including the University of Tennessee and UCLA, provide similar graduate programs focusing on reliability and maintainability, often with concentrations in data-driven techniques and industry applications.[121][122] Certifications validate expertise and are essential for career advancement in the field. 
The Certified Reliability Engineer (CRE) credential, administered by ASQ since 1964, certifies professionals in performance evaluation, prediction, and improvement of product systems reliability. The certification requires examination on topics like reliability fundamentals, risk management, and statistical methods, with eligibility based on experience and education; its Body of Knowledge was updated effective January 2025.[123]
Training programs complement formal education by offering hands-on skill development in specialized tools and methodologies. Workshops and courses, such as those using ReliaSoft software (now part of HBK), cover reliability analysis from basic concepts like Weibull distribution modeling to advanced system simulations with tools like BlockSim and Weibull++, enabling practitioners to apply quantitative methods in real-world scenarios through webinars, online modules, and in-person sessions.[124] Curricula in these trainings progress from introductory reliability engineering principles to sophisticated topics including accelerated life testing and probabilistic risk assessment, ensuring comprehensive preparation for industry demands.[125]
On a global scale, the International Council on Systems Engineering (INCOSE) facilitates the integration of reliability engineering within broader systems engineering practices, promoting interdisciplinary frameworks that embed reliability considerations into system design, verification, and lifecycle management through handbooks, working groups, and international symposia.[126]
References
- https://sebokwiki.org/wiki/Human_Systems_Integration
- https://sebokwiki.org/wiki/System_Reliability%2C_Availability%2C_and_Maintainability
