Recent from talks
Nothing was collected or created yet.
Failure rate
View on WikipediaFailure rate is the frequency with which any system or component fails, expressed in failures per unit of time. It thus depends on the system conditions, time interval, and total number of systems under study.[1] It can describe electronic, mechanical, or biological systems, in fields such as systems and reliability engineering, medicine and biology, or insurance and finance. It is usually denoted by the Greek letter (lambda).
In real-world applications, the failure probability of a system usually differs over time; failures occur more frequently in early-life ("burning in"), or as a system ages ("wearing out"). This is known as the bathtub curve, where the middle region is called the "useful life period".
Mean time between failures (MTBF)
[edit]The mean time between failures (MTBF, ) is often reported instead of the failure rate, as numbers such as "2,000 hours" are more intuitive than numbers such as "0.0005 per hour".
However, this is only valid if the failure rate is actually constant over time, such as within the flat region of the bathtub curve. In many cases where MTBF is quoted, it refers only to this region; thus it cannot be used to give an accurate calculation of the average lifetime of a system, as it ignores the "burn-in" and "wear-out" regions.
MTBF appears frequently in engineering design requirements, and governs the frequency of required system maintenance and inspections. A similar ratio used in the transport industries, especially in railways and trucking, is "mean distance between failures" (MDBF) - allowing maintenance to be scheduled based on distance travelled, rather than at regular time intervals.
Mathematical definition
[edit]The simplest definition of failure rate is simply the number of failures per time interval :
which would depend on the number of systems under study, and the conditions over the time period.
Failures over time
[edit]
To accurately model failures over time, a cumulative failure distribution, must be defined, which can be any cumulative distribution function (CDF) that gradually increases from to . In the case of many identical systems, this may be thought of as the fraction of systems failing over time , after all starting operation at time ; or in the case of a single system, as the probability of the system having its failure time before time :
As CDFs are defined by integrating a probability density function, the failure probability density is defined such that:

where is a dummy integration variable. Here can be thought of as the instantaneous failure rate, i.e. the probability of failure in the time interval between and , as tends towards :
Hazard rate
[edit]A concept closely related but different[2] to instantaneous failure rate is the hazard rate (or hazard function), . In the many-system case, this is defined as the proportional failure rate of the systems still functioning at time – as opposed to , which is the expressed as a proportion of the initial number of systems.
For convenience we first define the reliability (or survival function) as:
then the hazard rate is simply the instantaneous failure rate, scaled by the fraction of surviving systems at time :
In the probabilistic sense for a single system, this can be interpreted as the instantaneous failure rate under the conditional probability that the system or component has already survived to time :
Conversion to cumulative failure rate
[edit]To convert between and , we can solve the differential equation
with initial condition , which yields[2]
Thus for a collection of identical systems, only one of hazard rate , failure probability density , or cumulative failure distribution need be defined.
Confusion can occur as the notation for "failure rate" often refers to the function rather than [3]
Constant hazard rate model
[edit]There are many possible functions that could be chosen to represent failure probability density or hazard rate , based on empirical or theoretical evidence, but the most common and easily-understandable choice is to set
- ,
an exponential function with scaling constant . As seen in the figures above, this represents a gradually decreasing failure probability density.
The CDF is then calculated as:
which can be seen to gradually approach as representing the fact that eventually all systems under study will fail.
The hazard rate function is then:
In other words, in this particular case only, the hazard rate is constant over time.
This illustrates the difference in hazard rate and failure probability density - as the number of systems surviving at time gradually reduces, the total failure rate also reduces, but the hazard rate remains constant. In other words, the probabilities of each individual system failing do not change over time as the systems age - they are "memory-less".
Other models
[edit]
For many systems, a constant hazard function may not be a realistic approximation; the chance of failure of an individual component may depend on its age. Therefore, other distributions are often used.
For example, the deterministic distribution increases hazard rate over time (for systems where wear-out is the most important factor), while the Pareto distribution decreases it (for systems where early-life failures are more common). The commonly used Weibull distribution combines both of these effects, as do the log-normal and hypertabastic distributions.
After modelling a given distribution and parameters for , the failure probability density and cumulative failure distribution can be predicted using the given equations.
Measuring failure rate
[edit]Failure rate data can be obtained in several ways. The most common means are:
- Estimation
- From field failure rate reports, statistical analysis techniques can be used to estimate failure rates. For accurate failure rates the analyst must have a good understanding of equipment operation, procedures for data collection, the key environmental variables impacting failure rates, how the equipment is used at the system level, and how the failure data will be used by system designers.
- Historical data about the device or system under consideration
- Many organizations maintain internal databases of failure information on the devices or systems that they produce, which can be used to calculate failure rates for those devices or systems. For new devices or systems, the historical data for similar devices or systems can serve as a useful estimate.
- Government and commercial failure rate data
- Handbooks of failure rate data for various components are available from government and commercial sources. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, is a military standard that provides failure rate data for many military electronic components. Several failure rate data sources are available commercially that focus on commercial components, including some non-electronic components.
- Prediction
- Time lag is one of the serious drawbacks of all failure rate estimations. Often by the time the failure rate data are available, the devices under study have become obsolete. Due to this drawback, failure-rate prediction methods have been developed. These methods may be used on newly designed devices to predict the device's failure rates and failure modes. Two approaches have become well known, Cycle Testing and FMEDA.
- Life Testing
- The most accurate source of data is to test samples of the actual devices or systems in order to generate failure data. This is often prohibitively expensive or impractical, so that the previous data sources are often used instead.
- Cycle Testing
- Mechanical movement is the predominant failure mechanism causing mechanical and electromechanical devices to wear out. For many devices, the wear-out failure point is measured by the number of cycles performed before the device fails, and can be discovered by cycle testing. In cycle testing, a device is cycled as rapidly as practical until it fails. When a collection of these devices are tested, the test will run until 10% of the units fail dangerously.
- FMEDA
- Failure modes, effects, and diagnostic analysis (FMEDA) is a systematic analysis technique to obtain subsystem / product level failure rates, failure modes and design strength. The FMEDA technique considers:
- All components of a design,
- The functionality of each component,
- The failure modes of each component,
- The effect of each component failure mode on the product functionality,
- The ability of any automatic diagnostics to detect the failure,
- The design strength (de-rating, safety factors) and
- The operational profile (environmental stress factors).
Given a component database calibrated with field failure data that is reasonably accurate,[4] the method can predict product level failure rate and failure mode data for a given application. The predictions have been shown to be more accurate[5] than field warranty return analysis or even typical field failure analysis given that these methods depend on reports that typically do not have sufficient detail information in failure records.[6]
Examples
[edit]Decreasing failure rates
[edit]A decreasing failure rate describes cases where early-life failures are common[7] and corresponds to the situation where is a decreasing function.
This can describe, for example, the period of infant mortality in humans, or the early failure of a transistors due to manufacturing defects.
Decreasing failure rates have been found in the lifetimes of spacecraft - Baker and Baker commenting that "those spacecraft that last, last on and on."[8][9]
The hazard rate of aircraft air conditioning systems was found to have an exponentially decreasing distribution.[10]
Renewal processes
[edit]In special processes called renewal processes, where the time to recover from failure can be neglected, the likelihood of failure remains constant with respect to time.
For a renewal process with DFR renewal function, inter-renewal times are concave.[clarification needed][11][12] Brown conjectured the converse, that DFR is also necessary for the inter-renewal times to be concave,[13] however it has been shown that this conjecture holds neither in the discrete case[12] nor in the continuous case.[14]
Coefficient of variation
[edit]When the failure rate is decreasing the coefficient of variation is ⩾ 1, and when the failure rate is increasing the coefficient of variation is ⩽ 1.[clarification needed][15] Note that this result only holds when the failure rate is defined for all t ⩾ 0[16] and that the converse result (coefficient of variation determining nature of failure rate) does not hold.
Units
[edit]Failure rates can be expressed using any measure of time, but hours is the most common unit in practice. Other units, such as miles, revolutions, etc., can also be used in place of "time" units.
Failure rates are often expressed in engineering notation as failures per million, or 10−6, especially for individual components, since their failure rates are often very low.
The Failures In Time (FIT) rate of a device is the number of failures that can be expected in one billion (109) device-hours of operation[17] (e.g. 1,000 devices for 1,000,000 hours, or 1,000,000 devices for 1,000 hours each, or some other combination). This term is used particularly by the semiconductor industry.
Combinations of failure types
[edit]If a complex system consists of many parts, and the failure of any single part means the failure of the entire system, and the chance of failure for each part is conditionally independent of the failure of any other part, then the total failure rate is simply the sum of the individual failure rates of its parts
however, this assumes that the failure rate is constant, and that the units are consistent (e.g. failures per million hours), and not expressed as a ratio or as probability densities. This is useful to estimate the failure rate of a system when individual components or subsystems have already been tested.[18][19]
When adding "redundant" components to eliminate a single point of failure, the quantity of interest is not the sum of individual failure rates but rather the "mission failure" rate, or the "mean time between critical failures" (MTBCF).[20]
Combining failure or hazard rates that are time-dependent is more complicated. For example, mixtures of Decreasing Failure Rate (DFR) variables are also DFR.[11] Mixtures of exponentially distributed failure rates are hyperexponentially distributed.
Simple example
[edit]Suppose it is desired to estimate the failure rate of a certain component. Ten identical components are each tested until they either fail or reach 1,000 hours, at which time the test is terminated. A total of 7,502 component-hours of testing is performed, and 6 failures are recorded.
The estimated failure rate is:
which could also be expressed as a MTBF of 1,250 hours, or approximately 800 failures for every million hours of operation.
See also
[edit]References
[edit]- ^ * MacDiarmid, Preston; Morris, Seymour; et al. (n.d.). Reliability Toolkit (Commercial Practices ed.). Rome, New York: Reliability Analysis Center and Rome Laboratory. pp. 35–39.
- ^ a b Todinov, MT (2007). "Chapter 2.2 HAZARD RATE AND TIME TO FAILURE DISTRIBUTION". Risk-Based Reliability Analysis and Generic Principles for Risk Reduction.
- ^ Wang, Shaoping (2016). "Chapter 3.3.1.3: Failure Rate λ(t)". Comprehensive Reliability Design of Aircraft Hydraulic System.
- ^ Electrical & Mechanical Component Reliability Handbook. exida. 2006.
- ^ Goble, William M.; Iwan van Beurden (2014). Combining field failure data with new instrument design margins to predict failure rates for SIS Verification. Proceedings of the 2014 International Symposium - BEYOND REGULATORY COMPLIANCE, MAKING SAFETY SECOND NATURE, Hilton College Station-Conference Center, College Station, Texas.
- ^ W. M. Goble, "Field Failure Data – the Good, the Bad and the Ugly," exida, Sellersville, PA [1]
- ^ Finkelstein, Maxim (2008). "Introduction". Failure Rate Modelling for Reliability and Risk. Springer Series in Reliability Engineering. pp. 1–84. doi:10.1007/978-1-84800-986-8_1. ISBN 978-1-84800-985-1.
- ^ Baker, J. C.; Baker, G. A. S. . (1980). "Impact of the space environment on spacecraft lifetimes". Journal of Spacecraft and Rockets. 17 (5): 479. Bibcode:1980JSpRo..17..479B. doi:10.2514/3.28040.
- ^ Saleh, Joseph Homer; Castet, Jean-François (2011). "On Time, Reliability, and Spacecraft". Spacecraft Reliability and Multi-State Failures. p. 1. doi:10.1002/9781119994077.ch1. ISBN 9781119994077.
- ^ Proschan, F. (1963). "Theoretical Explanation of Observed Decreasing Failure Rate". Technometrics. 5 (3): 375–383. doi:10.1080/00401706.1963.10490105. JSTOR 1266340.
- ^ a b Brown, M. (1980). "Bounds, Inequalities, and Monotonicity Properties for Some Specialized Renewal Processes". The Annals of Probability. 8 (2): 227–240. doi:10.1214/aop/1176994773. JSTOR 2243267.
- ^ a b Shanthikumar, J. G. (1988). "DFR Property of First-Passage Times and its Preservation Under Geometric Compounding". The Annals of Probability. 16 (1): 397–406. doi:10.1214/aop/1176991910. JSTOR 2243910.
- ^ Brown, M. (1981). "Further Monotonicity Properties for Specialized Renewal Processes". The Annals of Probability. 9 (5): 891–895. doi:10.1214/aop/1176994317. JSTOR 2243747.
- ^ Yu, Y. (2011). "Concave renewal functions do not imply DFR interrenewal times". Journal of Applied Probability. 48 (2): 583–588. arXiv:1009.2463. doi:10.1239/jap/1308662647. S2CID 26570923.
- ^ Wierman, A.; Bansal, N.; Harchol-Balter, M. (2004). "A note on comparing response times in the M/GI/1/FB and M/GI/1/PS queues" (PDF). Operations Research Letters. 32: 73–76. doi:10.1016/S0167-6377(03)00061-0.
- ^ Gautam, Natarajan (2012). Analysis of Queues: Methods and Applications. CRC Press. p. 703. ISBN 978-1439806586.
- ^ Xin Li; Michael C. Huang; Kai Shen; Lingkun Chu. "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility". 2010. p. 6.
- ^ "Reliability Basics". 2010.
- ^ Vita Faraci. "Calculating Failure Rates of Series/Parallel Networks" Archived 2016-03-03 at the Wayback Machine. 2006.
- ^ "Mission Reliability and Logistics Reliability: A Design Paradox".
Further reading
[edit]- Goble, William M. (2018), Safety Instrumented System Design: Techniques and Design Verification, Research Triangle Park, NC: International Society of Automation
- Blanchard, Benjamin S. (1992). Logistics Engineering and Management (Fourth ed.). Englewood Cliffs, New Jersey: Prentice-Hall. pp. 26–32. ISBN 0135241170.
- Ebeling, Charles E. (1997). An Introduction to Reliability and Maintainability Engineering. Boston: McGraw-Hill. pp. 23–32. ISBN 0070188521.
- Federal Standard 1037C
- Kapur, K. C.; Lamberson, L. R. (1977). Reliability in Engineering Design. New York: John Wiley & Sons. pp. 8–30. ISBN 0471511919.
- Knowles, D. I. (1995). "Should We Move Away From 'Acceptable Failure Rate'?". Communications in Reliability Maintainability and Supportability. 2 (1). International RMS Committee, USA: 23.
- Modarres, M.; Kaminskiy, M.; Krivtsov, V. (2010). Reliability Engineering and Risk Analysis: A Practical Guide (2nd ed.). CRC Press. ISBN 9780849392474.
- Mondro, Mitchell J. (June 2002). "Approximation of Mean Time Between Failure When a System has Periodic Maintenance" (PDF). IEEE Transactions on Reliability. 51 (2): 166–167. doi:10.1109/TR.2002.1011521. Archived from the original (PDF) on 2008-12-17. Retrieved 2008-07-14.
- Rausand, M.; Hoyland, A. (2004). System Reliability Theory; Models, Statistical methods, and Applications. New York: John Wiley & Sons. ISBN 047147133X.
- Turner, T.; Hockley, C.; Burdaky, R. (1997). The Customer Needs A Maintenance-Free Operating Period. Leatherhead, Surrey, UK: ERA Technology Ltd.
{{cite book}}:|work=ignored (help) - U.S. Department of Defense, (1991) Military Handbook, “Reliability Prediction of Electronic Equipment, MIL-HDBK-217F, 2
External links
[edit]- Fault Tolerant Computing in Industrial Automation Archived 2014-03-26 at the Wayback Machine by Hubert Kirrmann, ABB Research Center, Switzerland
Failure rate
View on GrokipediaBasic Concepts
Definition and Interpretation
In reliability engineering, the failure rate refers to the rate at which failures occur within a population of identical items or components under specified conditions, typically expressed as the number of failures per unit of time.[6] For time-dependent scenarios, it is commonly denoted as λ(t), representing how this rate may vary as a function of time or usage.[7] This measure is fundamental to assessing the dependability of systems, from consumer electronics to critical infrastructure. The failure rate is interpreted as the conditional probability of failure occurring in a small time interval immediately following time t, given that the item has survived up to time t.[7] In practical terms, it quantifies the instantaneous risk of failure for surviving units in the population, providing insight into when and how likely breakdowns are to happen next. This is also known as the hazard rate in statistical contexts and directly influences overall system reliability by determining the probability of continued operation.[6] The concept of failure rate originated in the mid-20th century amid the rapid advancement of reliability engineering during World War II, driven by the need to mitigate unacceptable failure rates in military equipment such as radar and communication devices.[8] A key distinction exists between non-repairable (destructive) systems, where the failure rate applies to the time until the first and only failure—after which the item is discarded—and repairable systems, where repeated failures can occur post-maintenance, rendering the traditional failure rate inapplicable and necessitating alternative metrics like the rate of failure occurrences.[9]Units and Terminology
The failure rate is typically expressed in units of failures per unit time, such as failures per hour (h⁻¹) or failures per million hours, reflecting the frequency of failures among a population of items under specified conditions.[10] In high-reliability applications, particularly for electronic components, the standard unit is FIT (failures in time), defined as one failure per 10⁹ device-hours of operation.[11] This unit facilitates comparison across large-scale systems, where rates are often very low; for instance, a component with an MTBF of one million hours corresponds to a failure rate of 1,000 FIT.[11] Terminology for failure rate varies by discipline but often overlaps significantly. In reliability engineering, "failure rate" and "hazard rate" are synonymous, both denoting the instantaneous rate at which surviving items fail, conditional on survival up to that point.[10] In actuarial science, the equivalent concept is the "force of mortality," which measures the instantaneous rate of death at a given age and is mathematically identical to the hazard rate.[12] These terms emphasize the conditional nature of the metric, distinguishing it from unconditional probabilities. Conversions between units ensure consistency in analysis; for example, an annual failure rate can be converted to an hourly rate by dividing by 8,760, the approximate number of hours in a non-leap year.[13] In mechanical systems subject to repetitive loading, failure rates may adopt dimensionless forms, such as failures per cycle or per million cycles, to account for fatigue or wear independent of time.[14] A common pitfall in terminology is conflating failure rate with failure probability, as the former is a rate per unit time (e.g., instantaneous or average) while the latter is a dimensionless probability over a specific interval; substituting one for the other in calculations, such as reliability predictions, can lead to significant errors.[15]Mathematical Foundations
Probability Distributions in Reliability
In reliability engineering, the time to failure of a component or system is modeled as a non-negative continuous random variable . The cumulative distribution function (CDF) quantifies the probability that failure occurs at or before time , providing a foundational measure of failure accumulation over time. The probability density function (PDF) then describes the distribution of failure times, indicating the relative likelihood of failure occurring in a small interval around time . These functions assume continuous time, which aligns with most physical failure processes where exact failure instants are not discrete.[7] The reliability function, denoted and also referred to as the survival function, is defined as . It represents the probability that the component or system survives without failure beyond time , or equivalently, the probability of no failure occurring by time . This function is monotonically decreasing from to , reflecting the inevitable nature of failure in finite-lifetime systems. The survival function is particularly useful for interpreting long-term performance, as it directly complements the CDF by focusing on non-failure events.[16] Reliability analyses commonly assume that failures among independent components occur independently, allowing system-level reliability to be computed as the product of individual component reliabilities. Additionally, real-world data often involves right-censoring, where the failure time for some units is unknown because the observation ends before failure (e.g., during accelerated testing or field studies); this requires statistical methods that account for partial information without biasing estimates. These assumptions enable robust probabilistic modeling while accommodating practical data limitations.[16][17] A key metric derived from these distributions is the expected lifetime, or mean time to failure (MTTF), which quantifies the average operational duration before failure. For a non-repairable system, the MTTF is calculated as the integral of the reliability function over all time: This integral provides a conceptual summary of survival expectancy, emphasizing the role of the reliability function in assessing overall durability without assuming specific failure mechanisms.[18]Hazard Rate and Derivation
The hazard rate, denoted as , represents the instantaneous failure rate at time , conditional on the system or component having survived up to that point. It quantifies the risk of failure in an infinitesimally small interval following time , given no prior failure, and is a fundamental concept in reliability engineering for modeling time-dependent failure behavior.[10] The hazard rate is formally derived from the conditional probability of failure. Consider the time to failure random variable ; the probability of failure in the small interval given survival to time is . The hazard rate is then the limit of this probability divided by the interval length as the interval approaches zero: This limit yields the instantaneous conditional failure rate. Expanding the conditional probability gives , where is the cumulative distribution function and is the survival (reliability) function. Dividing by and taking the limit as results in , where is the probability density function.[7][10] Conceptually, the hazard rate relates to the bathtub curve, a common model in reliability engineering that describes how failure rates evolve over a product's lifecycle: initially high during early defects (infant mortality), stabilizing to a relatively constant level during normal operation, and rising again due to wear-out mechanisms in later stages. This time-varying profile highlights the hazard rate's ability to capture phased reliability behaviors in real-world systems.[2] Key properties of the hazard rate include its non-negativity ( for all ), as it represents a probability density, and its potential to vary with time, allowing flexible modeling of failure processes unlike constant-rate assumptions. The integral of over time intervals represents the accumulated failure risk, providing a measure of total exposure to failure events. Units for are typically failures per unit time, such as per hour or per cycle.[10][7]Cumulative Failure Metrics
The cumulative hazard function, denoted as , integrates the hazard rate from 0 to , providing a measure of the accumulated risk of failure over time: This function quantifies the total exposure to failure risk up to time , where the hazard rate serves as the instantaneous integrand.[19][20] From the cumulative hazard, the reliability function , which is the probability of survival beyond time , is obtained as . Consequently, the cumulative distribution function , representing the probability of failure by time , follows as . These conversions enable the translation of accumulated risk into probabilistic interpretations of survival and failure.[19][20] The mean residual life (MRL) at time , defined as the expected remaining lifetime given survival to , relates to cumulative metrics through the survival function: it equals the integral of from to infinity, normalized by . Since for , the MRL provides insight into aging effects by leveraging the cumulative hazard to assess how past risk accumulation influences future expectations.[21][22] In practical applications, cumulative metrics like predict the total number of failures over a fixed interval for a population of units, approximating the expected failures as , which aids in maintenance planning and resource allocation.[23][19] When the hazard rate is complex and lacks a closed-form antiderivative, numerical approximation methods compute via integration techniques such as the trapezoidal rule, which discretizes the interval into subintervals and sums weighted averages of values, or Simpson's rule for higher accuracy using quadratic interpolations. These methods ensure reliable estimation of cumulative risk in engineering analyses where analytical solutions are infeasible.[24][25]Failure Rate Models
Constant Failure Rate Model
The constant failure rate model in reliability engineering assumes that the hazard rate, denoted as λ, remains invariant over time, implying that the probability of failure per unit time is independent of the system's age. This assumption leads to the exponential distribution as the underlying probability model for time to failure. The probability density function (PDF) is expressed as where λ > 0 is the constant failure rate parameter. The corresponding reliability function, which gives the probability of survival beyond time t, is [26][27] This model is particularly applicable to electronic components and systems during their useful life phase, where failures arise predominantly from random external factors rather than degradation. A key feature is the memoryless property of the exponential distribution, meaning the conditional probability of failure in a future interval is unaffected by prior operation time, effectively modeling components with no aging or wear accumulation.[26][28] In this framework, the mean time to failure (MTTF)—equivalent to mean time between failures (MTBF) for non-repairable systems—is simply the reciprocal of the failure rate, MTTF = 1/λ. This result is obtained by computing the expected value as the integral of the survival function: The simplicity of this derivation underscores the model's utility for quick reliability predictions.[29][30] The constant failure rate model's advantages include its mathematical tractability, allowing closed-form solutions for system reliability and enabling the use of the homogeneous Poisson process to model failure occurrences, where the expected number of failures in time t is λt. This Poisson linkage facilitates efficient counting and prediction of random events in large populations. However, the model has limitations, as it fails to represent increasing failure rates due to wear-out or decreasing rates from infant mortality, restricting its use to stable operational phases.[31][32]Time-Varying Failure Rate Models
Time-varying failure rate models account for scenarios where the instantaneous failure rate λ(t) evolves with time t, reflecting real-world degradation processes such as material fatigue or manufacturing defects that influence reliability over the product lifecycle. Unlike constant rate assumptions, these models capture phases of decreasing, increasing, or non-monotonic hazard rates, enabling more accurate predictions for non-repairable systems subject to aging.[33] The Weibull distribution is a foundational time-varying model, introduced by Waloddi Weibull in 1951, widely adopted for its flexibility in modeling diverse failure behaviors through the shape parameter β and scale parameter η.[34] The failure rate is given by where β determines the rate's monotonicity: β < 1 yields a decreasing rate (e.g., early-life infant mortality), β = 1 reduces to a constant exponential rate, and β > 1 produces an increasing rate (e.g., wear-out failures). This parameterization allows integration to derive the cumulative distribution function for reliability assessment.[35] The log-normal distribution models failure times where the logarithm of time-to-failure follows a normal distribution, suitable for processes driven by multiplicative effects like fatigue or corrosion in mechanical components.[36] Its hazard rate λ(t) lacks a closed-form expression but typically rises sharply to a peak before declining, reflecting initial low risk that accelerates under stress then tapers as survivors endure.[37] Parameters include the mean μ and standard deviation σ of the log-times, with applications in electronic component lifetimes where failure risk decreases over time after an initial surge.[38] The gamma distribution, parameterized by shape α and scale β, provides another versatile option for time-varying rates, often arising in systems with sequential degradation events or as a conjugate prior in Bayesian reliability analysis.[33] The hazard rate is where Γ denotes the gamma function and the incomplete gamma ratio influences the shape; α < 1 leads to decreasing rates, α = 1 to constant (exponential), and α > 1 to increasing rates, making it apt for modeling wear-out in standby redundancies.[33] The Pareto distribution, particularly the Type I form with shape α > 0 and scale x_m > 0, is employed for extreme value failures exhibiting heavy-tailed behavior, such as rare catastrophic events in reliability contexts.[39] Its failure rate decreases monotonically as λ(t) = α / t for t ≥ x_m, capturing scenarios with high initial vulnerability that diminishes, though less common than Weibull for general time-varying applications due to its focus on tail extremes.[40] Selection of a time-varying model hinges on underlying physical mechanisms: decreasing rates suit defect-dominated early failures (e.g., β < 1 in Weibull), while increasing rates align with fatigue or diffusion processes (e.g., β > 1 or α > 1 in gamma), informed by physics-of-failure analysis to match degradation physics like crack propagation. Empirical data trends and goodness-of-fit tests further guide choices, prioritizing models that reflect observed hazard shapes from accelerated life testing.[41] Parameter estimation for these models typically involves maximum likelihood methods applied to failure time data, yielding point estimates for shape and scale parameters that maximize the likelihood function, often supplemented by graphical techniques like probability plotting for initial validation.[42] Confidence intervals are derived via asymptotic approximations or bootstrapping to quantify uncertainty in the fitted failure rate.[33]Estimation and Measurement
Empirical Data Collection
Empirical data collection for failure rate analysis involves systematic gathering of real-world or simulated failure information from products or systems to inform reliability assessments. This process is essential in reliability engineering, as it provides the foundational data needed for estimating failure probabilities under various conditions. Methods emphasize capturing accurate, representative failure events while accounting for practical constraints in testing and observation. Key types of testing for collecting failure data include accelerated life testing (ALT), field data collection, and laboratory simulations. In ALT, components are subjected to elevated stress levels—such as higher temperatures, voltages, or vibrations—to induce failures more rapidly than under normal use, allowing extrapolation of failure rates to operational conditions.[43] Field data collection involves monitoring systems in actual operational environments, capturing failures as they occur during routine use, which provides insights into long-term behavior but requires extensive time and resources.[44] Laboratory simulations replicate controlled environments to test prototypes or batches under standardized stresses, offering repeatable conditions for initial data gathering before field deployment.[45] The primary data types collected are time-to-failure measurements, censored observations, and records of multiple failure modes. Time-to-failure data records the exact duration from activation to breakdown for each unit, forming the basis for distribution fitting.[17] Censored observations arise in suspended tests, where units are removed before failure (right-censoring) or have already failed before testing begins (left-censoring), providing partial information that must be handled carefully to avoid bias.[46] Multiple failure modes, such as electrical shorts or mechanical wear, are documented to distinguish competing risks, enabling mode-specific failure rate analysis.[47] Sampling considerations are critical to ensure data validity, focusing on selecting representative populations and determining adequate sample sizes for statistical power. Representative sampling draws from the target user base, accounting for variations in materials, manufacturing batches, or environmental exposures to mirror real-world diversity.[6] Sample size must balance precision needs with cost; for rare events like low failure rates, larger samples (often hundreds or thousands) are required to achieve sufficient failures for reliable estimates, guided by power calculations based on expected failure distributions. Common sources of failure data include warranty claims, maintenance logs, and established reliability databases. Warranty claims from customer returns offer aggregated field failure records, often including timestamps and usage details for post-sale analysis.[49] Maintenance logs from operational systems track repair events and downtime, providing chronological failure histories in industrial or military contexts.[50] Reliability databases like MIL-HDBK-217 compile historical empirical data from military and commercial sources to predict component failure rates, serving as a benchmark for initial assessments.[51] Challenges in empirical data collection often stem from incomplete records and varying operating conditions. Incomplete data, such as unreported failures or missing timestamps, can introduce bias and reduce dataset utility, necessitating imputation or exclusion strategies.[52] Varying conditions, like fluctuating temperatures or loads in field settings, complicate direct comparability with lab data and require normalization to isolate failure drivers.[53] These issues underscore the need for robust protocols to enhance data quality for subsequent estimation.Statistical Estimation Methods
Statistical estimation methods for failure rates involve applying probabilistic techniques to observed failure data, often incorporating censoring due to incomplete observations in reliability testing. These methods enable the computation of point estimates, uncertainty measures, and model validations from empirical datasets, assuming underlying distributions such as the exponential for constant failure rates. Parametric approaches, like maximum likelihood estimation, assume a specific form for the failure rate function, while non-parametric methods provide distribution-free estimates suitable for exploratory analysis or when model assumptions are uncertain. For the exponential distribution, which models constant failure rates, the maximum likelihood estimator (MLE) of the failure rate is derived from the likelihood function of observed failure times. Given independent observations of failure times , the MLE is , where the denominator represents the total exposure time.[54] This estimator is unbiased and achieves the Cramér-Rao lower bound for variance in large samples, making it efficient for reliability assessments under the constant hazard assumption.[55] Non-parametric methods avoid distributional assumptions and are particularly useful for estimating survival functions and cumulative hazards from censored data. The Kaplan-Meier estimator computes the survival function , from which the failure rate can be inferred as the negative derivative or through related hazard estimates; it is given by the product-limit formula , where is the number of failures at time and is the number at risk.[56] This method, introduced in 1958, handles right-censoring effectively and provides a step-function estimate of reliability.[56] Complementarily, the Nelson-Aalen estimator approximates the cumulative hazard function as , offering a direct non-parametric measure of accumulated risk over time.[57] Developed in the early 1970s, it converges uniformly to the true cumulative hazard under mild conditions and is foundational for comparing failure processes across groups.[57][58] Confidence intervals quantify the uncertainty in these estimates, essential for decision-making in engineering reliability. For the exponential MLE , two-sided intervals are constructed using the chi-squared distribution: , where is the number of failures and is the total exposure time.[59] This approach leverages the fact that follows a chi-squared distribution with degrees of freedom for complete data.[59] For more complex models or non-parametric estimators like Kaplan-Meier or Nelson-Aalen, bootstrap methods generate empirical distributions by resampling the data with replacement; percentile intervals are then the 2.5th and 97.5th quantiles of the bootstrapped statistics, providing robust coverage even with small samples or irregular distributions.[60] Introduced in 1979, bootstrapping approximates the sampling distribution without parametric assumptions, widely applied in reliability for variance estimation of hazard functions.[60] Model validation ensures the assumed distribution fits the data adequately, preventing erroneous failure rate predictions. The Anderson-Darling test assesses goodness-of-fit by measuring deviations between empirical and hypothesized cumulative distribution functions, with the test statistic , where is the fitted distribution; higher values indicate poor fit, compared against critical values from asymptotic theory.[61] Originating in 1952, this test weights tail discrepancies more heavily than alternatives like Kolmogorov-Smirnov, enhancing sensitivity for reliability models such as Weibull or exponential.[61][62] It is particularly effective for validating failure rate assumptions in life-testing data, where deviations in extreme failure times critically impact predictions.[62] Censored data, where failure times are only partially observed (e.g., right-censoring when testing ends before failure), is common in reliability studies and must be incorporated to avoid bias. In maximum likelihood estimation, the likelihood function is modified to include contributions from both failed and censored units: for exponential models, it becomes , where denotes failed observations with times and censored with times .[63] The resulting MLE adjusts the total exposure time to include censored contributions, yielding .[63] This partial likelihood approach, standard in survival analysis, ensures consistent estimation even with high censoring rates, as long as censoring is independent of failure risk.[64]Related Reliability Metrics
Mean Time Between Failures (MTBF)
Mean Time Between Failures (MTBF) is a key reliability metric used specifically for repairable systems, representing the average time elapsed between consecutive failures during normal operation. It quantifies the expected operational lifespan between repairs, providing a measure of system dependability in scenarios where components can be restored to service after failure. This metric is particularly relevant for systems like machinery, electronics, or infrastructure that undergo periodic maintenance to extend their useful life.[65][66] The relationship between MTBF and failure rate is foundational in reliability analysis. For systems with a constant failure rate , MTBF is simply the reciprocal, expressed as where denotes failures per unit time. In the general case for repairable systems modeled as renewal processes, MTBF corresponds to the expected value of the inter-failure (renewal) interval in steady-state operation, allowing for time-varying failure rates beyond the constant assumption.[67][68][69] To calculate MTBF from field data, divide the total operating (uptime) hours across a population of units by the total number of failures observed, excluding downtime associated with repairs or maintenance: For instance, if a fleet of 10 identical devices accumulates 5,000 operating hours with 2 failures, the MTBF is 2,500 hours. This empirical approach relies on real-world usage data to validate predictions and refine maintenance strategies.[65][67] MTBF plays a critical role in maintainability predictions and system design. It informs spares provisioning, life-cycle cost estimates, and overall system performance forecasting for repairable assets. A primary application is in availability modeling, where inherent availability is computed as with MTTR being the mean time to repair; this ratio highlights the proportion of time the system is operational, guiding decisions in high-stakes environments like aerospace or defense.[65][70] Despite its utility, MTBF has notable limitations rooted in its assumptions. It presumes steady-state conditions after initial deployment, where failure and repair rates stabilize, and does not account for early-life infant mortality or end-of-life wearout phases. Additionally, MTBF is inappropriate for non-repairable items, where Mean Time to Failure (MTTF) should be used instead to capture one-way failure progression. These constraints underscore the need for contextual application in reliability assessments.[71][66]Mean Time to Failure (MTTF)
Mean Time to Failure (MTTF) serves as a fundamental reliability metric for non-repairable systems, representing the expected operational lifetime before failure occurs. It is defined mathematically as the integral of the reliability function over all time: where is the probability that the system survives beyond time .[72] This formulation arises from the expected value of the failure time distribution in survival analysis. Equivalently, since with denoting the cumulative hazard function, MTTF can be expressed as: [26] For systems exhibiting a constant failure rate , the lifetime follows an exponential distribution, yielding .[26] Under this assumption, the MTTF value coincides with the Mean Time Between Failures (MTBF) for repairable systems analyzed similarly. MTTF finds primary application in non-repairable contexts, such as consumer products like light bulbs and fuses, or mission-critical items like missiles, where failure necessitates full replacement rather than repair.[73] In these scenarios, it quantifies the average lifespan to inform design, procurement, and warranty decisions. Lifetime distributions in reliability engineering are often right-skewed, as seen in the Weibull model, where the MTTF (mean) exceeds the median life—the time at which 50% of units fail—and both surpass the mode, the most frequent failure time.[74] This ordering underscores how extended survival times inflate the mean, potentially overestimating typical performance. To support risk assessment, higher moments of the lifetime distribution offer deeper insights: the variance measures lifetime dispersion (e.g., for the exponential case), while skewness and kurtosis reveal asymmetry and tail heaviness, aiding in probabilistic safety evaluations.[26]Applications and Examples
Bathtub Curve Analysis
The bathtub curve serves as a graphical representation of the failure rate, denoted as λ(t), over the lifecycle of a product or system, typically exhibiting three distinct phases that reflect evolving reliability characteristics. This model, widely adopted in reliability engineering, illustrates how failure rates decrease initially, remain constant during mid-life, and then increase toward the end, resembling the shape of a bathtub.[75] The first phase, known as infant mortality or early failure, features a decreasing failure rate due to the elimination of inherent defects as weaker components fail early. This period is characterized by high initial λ(t) that rapidly declines as manufacturing and assembly flaws are exposed and removed from the population.[76][75] Following this, the useful life phase displays a relatively constant failure rate, where random failures dominate without significant aging effects. These failures arise from external stresses or unforeseen events during normal operation, maintaining a stable λ(t) over an extended period.[76][75] The final wear-out phase shows an increasing failure rate as components degrade due to material fatigue, corrosion, or thermal/mechanical stresses accumulated over time. This upward trend in λ(t) signals the onset of end-of-life failures, necessitating intervention to extend system usability.[76][75] Causes of failures align with these phases: manufacturing defects and poor assembly drive infant mortality, while random environmental or operational stresses cause useful life incidents, and progressive material degradation leads to wear-out.[75][76] Modeling the bathtub curve often involves piecewise functions that combine different distributions for each phase or a single flexible distribution like the Weibull, where shifts in the shape parameter β capture the transition from decreasing (β < 1) to constant (β = 1) and increasing (β > 1) rates.[76][75] Design implications include implementing burn-in testing during manufacturing to screen out infant mortality failures and scheduling preventive maintenance to address wear-out before critical degradation occurs.[76][75] In real-world applications, the bathtub curve is observed in electronics, where early assembly errors in components like memory units contribute to infant mortality, and in automotive systems, such as engines and pumps, where wear-out from fatigue affects longevity.[75]Renewal Processes in Repairable Systems
In repairable systems, where components or units are restored to operational status after failure rather than discarded, the sequence of failures and subsequent repairs can be modeled using renewal theory. A renewal process describes this as a series of independent and identically distributed inter-renewal times, each representing the duration from the completion of one repair to the next failure, assuming perfect repair that returns the system to its initial "good-as-new" state. The inter-arrival times between failures follow the distribution of the system's time-to-failure, enabling the modeling of recurrent events in systems like machinery or electronics that undergo multiple repair cycles.[77] The renewal function, denoted , quantifies the expected number of renewals (failures) occurring in the interval , serving as a key measure of the system's failure intensity over time. For large , the average renewal rate approaches , where is the random variable for the inter-renewal time, providing the asymptotic failure rate as the long-run average frequency of failures per unit time. This limiting value equals the reciprocal of the mean time between failures (MTBF), which represents the steady-state operational reliability under repeated repair cycles. In mathematical terms, by the elementary renewal theorem, this convergence highlights how the system's failure behavior stabilizes after many cycles, independent of initial conditions.[77][78] Such models find practical application in scenarios where repairs effectively reset the system's failure clock, such as fleet maintenance for vehicles or aircraft, where each overhaul renews the operational timeline and allows prediction of downtime accumulation across multiple units. Similarly, in software systems, patching processes act as renewals by addressing vulnerabilities and restoring baseline reliability, enabling estimation of update frequencies to minimize service interruptions. These applications leverage the renewal framework to optimize maintenance schedules and resource allocation, balancing repair costs against failure risks.[79][80] For cases where the failure rate varies over time due to aging or external factors, even after repairs, a non-homogeneous Poisson process (NHPP) extends the renewal model by incorporating a time-dependent intensity function, capturing non-stationary behavior in repairable systems without assuming identical inter-renewal distributions. This approach is particularly useful when repairs do not fully restore the original condition, leading to trending failure patterns that deviate from the constant asymptotic rate of ordinary renewals.[81]Practical Numerical Examples
Consider a simple case of a device with a constant failure rate λ = 0.001 failures per hour, typical in reliability engineering for components exhibiting random failures.[26] The reliability function for such a system follows the exponential distribution, where the probability of survival up to time t is given by For a mission duration of 1000 hours, this yields R(1000) = e^{-0.001 \times 1000} = e^{-1} \approx 0.368, meaning approximately 36.8% of devices are expected to survive without failure.[26] In scenarios with a decreasing failure rate, such as early-life infant mortality in electronic components, the Weibull distribution provides a suitable model with shape parameter β < 1. For β = 0.5 and scale parameter η = 1000 hours, the failure rate function is This results in a hazard that drops over time; for instance, λ(100) ≈ 0.0016 failures per hour, decreasing to λ(1000) ≈ 0.0005 failures per hour, illustrating the rapid decline in failure probability as the component matures.[82] For a series system of independent components, the total failure rate is the sum of the individual failure rates, assuming constant rates for each. If three components have λ_1 = 0.0002, λ_2 = 0.0003, and λ_3 = 0.0005 failures per hour, the system failure rate is λ_total = 0.001 failures per hour, making the overall reliability R(t) = e^{-0.001 t}.[83] This additive property highlights the vulnerability of series configurations to even low-rate components. The coefficient of variation (CV) for inter-failure times, defined as CV = σ / μ where σ is the standard deviation and μ is the mean inter-failure time, serves as an indicator of variability in failure processes. In constant failure rate models like the exponential distribution, CV = 1, reflecting high variability; values greater than 1 suggest decreasing failure rates with more clustered early failures, while CV < 1 indicates increasing rates and more predictable later failures.[84] A real-world estimation example arises in aircraft engine reliability, where failure data from operational hours informs maintenance planning. Suppose 10 engine failures are observed across a fleet totaling 5000 flight hours; under a constant failure rate assumption and Poisson process, the estimated λ = 10 / 5000 = 0.002 failures per hour, or 2 failures per 1000 hours, which can guide predictive scheduling.[6]References
- https://s3vi.ndc.[nasa](/page/NASA).gov/ssri-kb/static/resources/203618.pdf
