Local outlier factor
In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.[1]
LOF shares some concepts with DBSCAN and OPTICS such as the concepts of "core distance" and "reachability distance", which are used for local density estimation.[2]
Basic idea
The local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.
The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters. The "reachability distance" used by LOF has some subtle details that secondary sources often reproduce incorrectly, e.g., the textbook of Ethem Alpaydin.[3]
Formal definition
Let k-distance(A) be the distance of the object A to its k-th nearest neighbor. Note that the set of the k nearest neighbors includes all objects at this distance, which can in the case of a "tie" be more than k objects. We denote the set of k nearest neighbors as N_k(A).

This distance is used to define what is called reachability distance:

reachability-distance_k(A, B) = max{ k-distance(B), d(A, B) }

In words, the reachability distance of an object A from B is the true distance between the two objects, but at least the k-distance of B. Objects that belong to the k nearest neighbors of B (the "core" of B, see DBSCAN cluster analysis) are considered to be equally distant. The reason for this is to reduce the statistical fluctuations between all points A close to B, where increasing the value of k increases the smoothing effect.[1] Note that this is not a distance in the mathematical definition, since it is not symmetric. (While it is a common mistake[4] to always use the k-distance(A), this yields a slightly different method, referred to as Simplified-LOF.[4])
The local reachability density of an object A is defined by

lrd_k(A) := 1 / ( Σ_{B ∈ N_k(A)} reachability-distance_k(A, B) / |N_k(A)| )

which is the inverse of the average reachability distance of the object A from its neighbors. Note that it is not the average reachability of the neighbors from A (which by definition would be the k-distance(A)), but the distance at which A can be "reached" from its neighbors. With duplicate points, this value can become infinite.
The local reachability densities are then compared with those of the neighbors using

LOF_k(A) := ( Σ_{B ∈ N_k(A)} lrd_k(B) / lrd_k(A) ) / |N_k(A)|

which is the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors (and thus not an outlier). A value below 1 indicates a denser region (which would be an inlier), while values significantly larger than 1 indicate outliers.
- LOF_k(A) ≈ 1 means similar density as the neighbors,
- LOF_k(A) < 1 means higher density than the neighbors (inlier),
- LOF_k(A) > 1 means lower density than the neighbors (outlier).
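In practice these scores need not be computed by hand. A minimal sketch using scikit-learn's LocalOutlierFactor follows; the toy dataset, the choice n_neighbors=20, and the printed check are illustrative assumptions, not part of the original article:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),   # dense cluster
    rng.normal(5.0, 2.0, size=(50, 2)),    # sparser cluster
    [[12.0, 12.0]],                        # an isolated point
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_     # undo sklearn's sign convention

print(labels[-1], scores[-1])              # expect -1 and a score well above 1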
Advantages
[edit]
Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.
While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context in which a dissimilarity function can be defined. It has experimentally been shown to work very well in numerous setups, often outperforming competing methods, for example in network intrusion detection[5] and on processed classification benchmark data.[6]
The LOF family of methods can be easily generalized and then applied to various other problems, such as detecting outliers in geographic data, video streams or authorship networks.[4]
Disadvantages and Extensions
The resulting values are quotient values and hard to interpret. A value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier; in another dataset and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a dataset due to the locality of the method. There exist extensions of LOF that try to improve over LOF in these aspects:
- Feature Bagging for Outlier Detection[7] runs LOF on multiple projections and combines the results for improved detection qualities in high dimensions. This is the first ensemble learning approach to outlier detection, for other variants see ref.[8]
- Local Outlier Probability (LoOP)[9] is a method derived from LOF but using inexpensive local statistics to become less sensitive to the choice of the parameter k. In addition, the resulting values are scaled to a value range of [0:1].
- Interpreting and Unifying Outlier Scores[10] proposes a normalization of the LOF outlier scores to the interval [0:1] using statistical scaling to increase usability and can be seen as an improved version of the LoOP ideas.
- On Evaluation of Outlier Rankings and Outlier Scores[11] proposes methods for measuring similarity and diversity of methods for building advanced outlier detection ensembles using LOF variants and other algorithms and improving on the Feature Bagging approach discussed above.
- Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection[4] discusses the general pattern in various local outlier detection methods (including, e.g., LOF, a simplified version of LOF and LoOP) and abstracts from this into a general framework. This framework is then applied, e.g., to detecting outliers in geographic data, video streams and authorship networks.
References
1. Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). "LOF: Identifying Density-based Local Outliers" (PDF). Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. pp. 93–104. doi:10.1145/335191.335388. ISBN 1-58113-217-4.
2. Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (1999). "OPTICS-OF: Identifying Local Outliers" (PDF). Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. Vol. 1704. pp. 262–270. doi:10.1007/978-3-540-48247-5_28. ISBN 978-3-540-66490-1.
3. Alpaydin, Ethem (2020). Introduction to Machine Learning (Fourth ed.). Cambridge, Massachusetts: MIT Press. ISBN 978-0-262-04379-3. OCLC 1108782604.
4. Schubert, E.; Zimek, A.; Kriegel, H.-P. (2012). "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. 28: 190–237. doi:10.1007/s10618-012-0300-z. S2CID 19036098.
5. Lazarevic, A.; Ozgur, A.; Ertoz, L.; Srivastava, J.; Kumar, V. (2003). "A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection" (PDF). Proceedings of the 2003 SIAM International Conference on Data Mining. pp. 25–36. doi:10.1137/1.9781611972733.3. ISBN 978-0-89871-545-3.
6. Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. 30 (4): 891–927. doi:10.1007/s10618-015-0444-8. S2CID 1952214.
7. Lazarevic, A.; Kumar, V. (2005). "Feature bagging for outlier detection". Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. pp. 157–166. doi:10.1145/1081870.1081891. ISBN 1-59593-135-X. S2CID 2054204.
8. Zimek, A.; Campello, R. J. G. B.; Sander, J. (2014). "Ensembles for unsupervised outlier detection". ACM SIGKDD Explorations Newsletter. 15: 11–22. doi:10.1145/2594473.2594476. S2CID 8065347.
9. Kriegel, H.-P.; Kröger, P.; Schubert, E.; Zimek, A. (2009). "LoOP: Local Outlier Probabilities" (PDF). Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09). pp. 1649–1652. doi:10.1145/1645953.1646195. ISBN 978-1-60558-512-3.
10. Kriegel, H.-P.; Kröger, P.; Schubert, E.; Zimek, A. (2011). "Interpreting and Unifying Outlier Scores". Proceedings of the 2011 SIAM International Conference on Data Mining. pp. 13–24. doi:10.1137/1.9781611972818.2. ISBN 978-0-89871-992-5.
11. Schubert, E.; Wojdanowski, R.; Zimek, A.; Kriegel, H.-P. (2012). "On Evaluation of Outlier Rankings and Outlier Scores". Proceedings of the 2012 SIAM International Conference on Data Mining. pp. 1047–1058. doi:10.1137/1.9781611972825.90. ISBN 978-1-61197-232-0.
Local outlier factor
Introduction
Definition and Overview
The Local Outlier Factor (LOF) is a density-based algorithm for unsupervised outlier detection that quantifies the degree to which a data point deviates from the local density of its neighborhood, enabling the identification of anomalies that may not be apparent under global density assumptions.[4] Introduced by Breunig et al. in 2000, LOF measures outlierness as a continuous score rather than a binary label, allowing for nuanced ranking of potential anomalies based on their relative isolation within varying local contexts.[4] The primary purpose of LOF is to detect outliers in high-dimensional datasets where data distributions are uneven, such as in fraud detection within electronic commerce transactions, by focusing on local deviations rather than assuming uniform density across the entire dataset.[4] This approach is particularly valuable in unsupervised settings, where no labeled examples of outliers are available, and it addresses limitations of global methods that might overlook anomalies in dense clusters or misidentify points in sparse but consistent regions.[4]

At a high level, LOF operates by estimating the local density of each data point using its k-nearest neighbors and then comparing that density to the densities of its neighboring points to compute an outlier score.[4] For instance, in a dataset containing clusters of varying densities, LOF would assign higher outlier scores to points located in sparser regions relative to their immediate surroundings, effectively highlighting local anomalies without requiring prior knowledge of the data's overall structure.[4]

Historical Development
The Local Outlier Factor (LOF) algorithm was introduced in 2000 by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander in their seminal paper presented at the ACM SIGMOD International Conference on Management of Data.[4] This work addressed key limitations in prior outlier detection techniques, which predominantly relied on global distance-based methods, such as DB(pct, dmin)-outliers, that struggled to identify anomalies in datasets exhibiting varying local densities, often resulting in binary classifications that overlooked nuanced degrees of outlierness.[1] By shifting focus to local density comparisons within neighborhoods, LOF provided a more flexible, density-based framework suitable for knowledge discovery in databases (KDD) applications, where rare events like fraud or errors hold greater analytical value than common patterns.[1]

Following its debut, LOF rapidly gained traction in the data mining and database communities, with the original paper "LOF: Identifying Density-Based Local Outliers" becoming a cornerstone reference cited over 10,000 times by 2025.[4] Early adoption highlighted its utility in handling real-world datasets with non-uniform cluster densities, outperforming global methods in benchmarks involving synthetic datasets and real-world data such as the NHL hockey dataset and German soccer player statistics. Its influence extended to subsequent density-based techniques.

A notable milestone in LOF's evolution occurred in 2017 with its incorporation into the scikit-learn machine learning library as the LocalOutlierFactor class in version 0.19, enabling widespread accessibility for practitioners and researchers in Python-based workflows.[3]

Core Concepts
Basic Idea
The local outlier factor (LOF) addresses the limitations of traditional outlier detection methods by identifying anomalies based on deviations in local density rather than assuming a uniform density across the entire dataset. In this approach, an outlier is a data point whose density is significantly lower than the densities of its surrounding points, capturing the intuitive notion that points in sparse regions relative to their neighbors are anomalous. This local perspective allows LOF to detect outliers that may be inconspicuous in a global context but stand out within their immediate vicinity, providing a continuous score of outlier-ness rather than a binary classification.[1]

To illustrate, consider a two-dimensional dataset featuring a dense cluster C1 of 400 closely packed points and a sparser cluster C2 of 100 points nearby. A point o1 located far from C1 would have low local density compared to the high-density neighbors in C1, marking it as an outlier. Similarly, a point o2 positioned near C2 but in an even sparser area would exhibit lower density than the points in C2, also identifying it as anomalous. In contrast, a point on the edge of C1 maintains high local density relative to its neighbors within the cluster, classifying it as an inlier despite its peripheral position. This example highlights how LOF leverages the k-nearest neighbors to define the local context, enabling the detection of context-specific outliers that vary by region.[1]

Global outlier detection methods, such as distance-based techniques built on a z-score or a fixed minimum distance threshold, often fail in datasets with multiple density modes or clusters of varying tightness, as they apply a uniform criterion across all points. For instance, in the example above, a global method might overlook o2 because its distance to C2 appears normal relative to the dataset's overall spread, but LOF succeeds by adapting to the local variations in density around each point. This adaptability makes LOF particularly effective for real-world applications like fraud detection, where anomalies are defined by local patterns rather than global norms. Reachability distance further refines neighbor influence by accounting for the k-distance (distance to the k-th nearest neighbor) of the neighboring points, helping to measure these local densities robustly without direct outlier interference.[1]

Key Components
The Local Outlier Factor (LOF) algorithm relies on several foundational concepts to assess local densities in a dataset, beginning with the identification of nearest neighbors for each data point. The k-nearest neighbors of a point p, denoted as N_k(p), are defined as the set of closest points to p in the dataset D, excluding p itself, based on a chosen distance metric d. Formally, N_k(p) = { q ∈ D \ {p} : d(p, q) ≤ k-distance(p) }, where the size of this set is at least k (greater than k if there are ties at the k-distance).[4]

Central to this neighborhood definition is the k-distance of p, which measures the distance to the k-th nearest neighbor of p. It is the smallest distance d(p, o) to some object o such that at least k points q satisfy d(p, q) ≤ d(p, o), and at most k − 1 points q satisfy d(p, q) < d(p, o). This k-distance effectively sets the radius of the local neighborhood around p, allowing for a consistent scale in density comparisons across varying data regions.[4]

These components underpin local density estimation in LOF by providing a basis for quantifying the density around each point relative to its surroundings; specifically, the local density of p is inversely proportional to the average reachability distance from p to the points in N_k(p), which adjusts the distances from p by the local scales (k-distances) of its neighbors. This approach enables the detection of outliers through deviations in local densities without requiring a global density model.[4] As prerequisites, k-nearest neighbors and k-distance facilitate scalable, localized computations by focusing on small subsets of the data, and they operate with any distance metric, not limited to Euclidean space, making LOF applicable to diverse data types such as graphs or strings.[4]
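As a concrete, non-optimized reading of these definitions, here is a brute-force NumPy sketch; the function name knn_info and its return convention are illustrative choices, not notation from the source:

import numpy as np

def knn_info(X, k):
    # Pairwise Euclidean distances (brute force; adequate for small n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                    # a point is not its own neighbor
    order = np.argsort(D, axis=1)
    kdist = D[np.arange(len(X)), order[:, k - 1]]  # k-distance(p)
    # N_k(p): every q with d(p, q) <= k-distance(p); can exceed k points on ties.
    neighbors = [np.where(D[i] <= kdist[i])[0] for i in range(len(X))]
    return D, kdist, neighbors

Formal Definition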
Reachability Distance
The reachability distance is a key concept in the local outlier factor (LOF) algorithm, defined for two objects p and o in a dataset with respect to a parameter k (a positive integer representing the neighborhood size). Formally, the reachability distance of p with respect to o is given by

reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }

where d(p, o) denotes the pairwise distance between p and o (using any suitable distance metric, such as Euclidean distance), and k-distance(o) is the distance from o to its k-th nearest neighbor.[1] This definition incorporates the k-distance of the reference object o as a threshold, ensuring that the reachability distance is at least as large as the extent of o's local neighborhood.

The primary purpose of the reachability distance is to smooth out statistical fluctuations in pairwise distances when estimating local densities, particularly for objects that are close to the reference object o. By replacing the actual distance d(p, o) with k-distance(o) when p lies within o's k-neighborhood (i.e., when d(p, o) ≤ k-distance(o)), it reduces variability in distance measures among nearby points, making density comparisons more stable across regions of varying densities. This approach mitigates the undue influence of minor distance variations or isolated points (potential outliers) in density estimation, effectively "reaching" neighbors by considering the sparser of the two local neighborhoods, either o's own or the direct path to p. The smoothing effect is tunable via k: larger values of k lead to more uniform reachability distances within a neighborhood, enhancing robustness to local density differences.[1]

For illustration, consider a reference object o with k = 3, whose k-distance is 2 units. If a nearby object p is 1 unit away from o, then reach-dist_3(p, o) = max{2, 1} = 2, using the neighborhood scale instead of the smaller actual distance. In contrast, for a distant object p' that is 5 units away, reach-dist_3(p', o) = max{2, 5} = 5, retaining the actual distance. This mechanism ensures that close neighbors contribute consistently to density calculations without underestimating influences from slightly varying local structures.[1]

The reachability distance exhibits symmetry in its practical effect for density-based comparisons within similar neighborhoods, though the measure itself is not strictly symmetric (reach-dist_k(o, p) would use k-distance(p) instead). It is robust to local density variations by design, preventing overestimation of reachability in dense areas or underestimation in sparse ones. Additionally, it is computable using any dissimilarity measure that supports k-nearest neighbor computations, making it versatile for non-Euclidean spaces.[1]
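Transcribed directly, together with the worked example from this subsection (the helper name reach_dist is illustrative):

def reach_dist(d_po, kdist_o):
    # reach-dist_k(p, o) = max{k-distance(o), d(p, o)}
    return max(kdist_o, d_po)

# Worked example: k-distance(o) = 2 units
print(reach_dist(1.0, 2.0))   # nearby p, 1 unit away:    2.0 (smoothed up)
print(reach_dist(5.0, 2.0))   # distant p', 5 units away: 5.0 (unchanged)

Local Reachability Density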
The local reachability density (LRD) of a point p in a dataset quantifies the density of its local neighborhood by measuring how reachable p is from its k-nearest neighbors, providing a robust estimate that accounts for varying local densities. It is defined as the reciprocal of the average reachability distance from p to the objects in its k-nearest neighborhood N_k(p), where the reachability distance incorporates both the distance to neighbors and their own neighborhood sparsity to avoid overestimating density in sparse regions.[4]

The formal formula for LRD is given by

lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach-dist_k(p, o)

where |N_k(p)| typically equals k, and reach-dist_k(p, o) is the reachability distance between p and each neighbor o. This formulation ensures that LRD is inversely proportional to the typical reachability distance in the neighborhood, yielding a high value for points in dense clusters and a low value for those in sparse areas.[4]

In computation, LRD is derived by first identifying the k-nearest neighbors of p and then aggregating the reachability distances to those neighbors, which serves as a normalized inverse average to estimate local density without being overly sensitive to isolated distant points. This approach provides a more stable density measure compared to simple inverse distance averages, as the reachability distance caps the influence of outliers in the neighborhood.[4]

For edge cases, such as when a point has fewer than k neighbors (e.g., in small datasets or boundary regions), the LRD uses the actual size of the available neighborhood in both the numerator and the summation, adjusting the density estimate accordingly to reflect the limited local structure. Additionally, LRD can become infinite if all reachability distances are zero, which occurs when there are at least k duplicate points sharing the same coordinates as p, though the method assumes no such duplicates for practical datasets.[4]
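Continuing the brute-force sketch from Key Components (the names D, kdist, and neighbors are assumptions carried over from that sketch):

def lrd(i, D, kdist, neighbors):
    # lrd_k(p) = |N_k(p)| / sum over o in N_k(p) of reach-dist_k(p, o)
    N = neighbors[i]
    total = sum(max(kdist[o], D[i, o]) for o in N)
    return len(N) / total if total > 0 else float("inf")   # inf on duplicates

Local Outlier Factor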
The local outlier factor (LOF) quantifies the degree to which a data point deviates from the density of its local neighborhood, serving as a measure of its outlier-ness. It is computed as the average of the ratios of the local reachability densities of the point's k-nearest neighbors to the local reachability density of the point itself. This approach captures relative isolation by comparing local densities rather than relying on global statistics.[4]

Formally, for a point p and parameter k, the LOF is defined as

LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|

where N_k(p) denotes the k-nearest neighbors of p, and lrd_k is the local reachability density, which serves as the basis for these density ratios.[4]

The interpretation of the LOF score is straightforward: values approximately equal to 1 indicate that the point has a similar local density to its neighbors, marking it as an inlier. In contrast, LOF values greater than 1 signal lower density relative to neighbors, with values much larger than 1 denoting strong outliers due to significant isolation. For scoring and outlier identification in unsupervised settings, thresholds are typically set empirically; for instance, points with LOF > 1.5 may be considered outliers in practical applications, as higher scores reflect progressively stronger deviations.[4]

Key properties of the LOF include its locality, which confines the analysis to the immediate neighborhood of each point, and its relativity, which allows for varying degrees of outlier-ness across different density regions without assuming uniform data distribution. These characteristics enable the detection of outliers in datasets with clusters of differing densities.[4]
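The score itself, completing the chain of sketches (a production implementation would cache each point's lrd rather than recompute it):

def lof(i, D, kdist, neighbors):
    # LOF_k(p): average of lrd_k(o) / lrd_k(p) over o in N_k(p)
    N = neighbors[i]
    lrd_i = lrd(i, D, kdist, neighbors)
    return sum(lrd(o, D, kdist, neighbors) for o in N) / (len(N) * lrd_i)

Algorithm and Implementation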
Computational Procedure
The computation of Local Outlier Factor (LOF) scores involves a multi-step process that relies on density-based neighborhood analysis for a dataset of n points in a metric space. The procedure assumes a user-specified parameter MinPts (often denoted as k), which defines the neighborhood size, and uses a distance metric such as Euclidean distance. Preprocessing typically includes building spatial indexing structures to enable efficient nearest-neighbor queries, avoiding the naive exhaustive pairwise distance computation.[1] The algorithm proceeds in the following steps, sketched in the code after this list:

- Compute k-distance for all points: For each point p, determine the k-distance(p), defined as the distance to its k-th nearest neighbor, where k = MinPts. This step identifies the minimum distance threshold enclosing exactly k neighbors for each point. Sorting the distances to all other points achieves this naively, but indexing structures accelerate it.[1]
- Find k-nearest neighbors (k-NN) for each point: Construct the k-NN neighborhood N_k(p) for every point p, consisting of all points q such that the distance d(p, q) ≤ k-distance(p). If multiple points lie at exactly the k-distance, the neighborhood may include more than k points to ensure inclusivity. Efficient range queries via indexing retrieve these neighborhoods.[1]
- Calculate reachability distances for pairs in neighborhoods: For each point p and every neighbor o in N_k(p), compute the reachability distance reach-dist_k(p, o) as the maximum of k-distance(o) and d(p, o). This measure accounts for the sparser of the two neighborhoods when assessing mutual reachability. These values are calculated only within the identified neighborhoods to limit computations.[1]
- Compute local reachability density (LRD) for each point: For each p, calculate its LRD as the inverse of the average reachability distance from p to all points in N_k(p). This yields a density estimate local to p's neighborhood, with higher values indicating denser regions. The average is taken over all |N_k(p)| neighbors.[1]
- Compute LOF for each point: Finally, derive the LOF score for p as the average ratio of the LRD of each neighbor o in N_k(p) to the LRD of p itself. This aggregates the relative density deviation, with values greater than 1 indicating outlierness.[1]
# Assume dataset D with n points, parameter MinPts = k
# Preprocessing: build a KD-tree or similar index on D for fast neighbor queries

kdist = {}   # k-distance(p) for each point p
N_k = {}     # k-nearest-neighbor set N_k(p) for each point p
for p in D:
    kdist[p] = k_distance(p, D, k)   # distance to p's k-th nearest neighbor
    N_k[p] = k_NN(p, D, kdist[p])    # all q != p with distance(p, q) <= kdist[p]

lrd = {}     # local reachability density
for p in D:
    # reach-dist_k(p, o) = max{k-distance(o), d(p, o)}
    sum_reach = sum(max(kdist[o], distance(p, o)) for o in N_k[p])
    lrd[p] = len(N_k[p]) / sum_reach if sum_reach > 0 else float("inf")

lof = {}     # local outlier factor
for p in D:
    # average ratio of each neighbor's lrd to p's own lrd
    sum_ratio = sum(lrd[o] / lrd[p] for o in N_k[p])
    lof[p] = sum_ratio / len(N_k[p]) if N_k[p] else 1.0
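To run these passes end to end, the assumed helpers can be stubbed with brute-force NumPy versions; everything below (the toy data, index-based points, k = 5) is an illustrative harness, not part of the original procedure:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])   # cluster + one far point
dist_matrix = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
D = list(range(len(X)))   # represent points by their indices
k = 5

def distance(p, q):
    return dist_matrix[p, q]

def k_distance(p, D, k):
    others = np.delete(dist_matrix[p], p)   # drop the zero self-distance
    return np.sort(others)[k - 1]

def k_NN(p, D, kd):
    return [q for q in D if q != p and dist_matrix[p, q] <= kd]

# With these stubs in scope, the three loops above yield lof[p] near 1 for
# cluster points and a clearly larger value for the isolated point X[-1].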
