Confusion matrix
In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix,[1] is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one; in unsupervised learning it is usually called a matching matrix.
Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa – both variants are found in the literature.[2] The diagonal of the matrix therefore represents all instances that are correctly predicted.[3] The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabeling one as another).
It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table).
Example
Given a sample of 12 individuals, 8 of whom have been diagnosed with cancer and 4 of whom are cancer-free, where individuals with cancer belong to class 1 (positive) and non-cancer individuals belong to class 0 (negative), we can display the data as follows:
| Individual number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual classification | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
Assume that we have a classifier that distinguishes between individuals with and without cancer. We can run the 12 individuals through this classifier, which then makes 9 accurate predictions and misses 3: 2 individuals with cancer wrongly predicted as being cancer-free (samples 1 and 2), and 1 person without cancer wrongly predicted to have cancer (sample 9).
| Individual number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual classification | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Predicted classification | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
Notice that if we compare the actual classification set to the predicted classification set, there are 4 different outcomes that could result in any particular column. First, if the actual classification is positive and the predicted classification is positive (1,1), this is called a true positive result because the positive sample was correctly identified by the classifier. Second, if the actual classification is positive and the predicted classification is negative (1,0), this is called a false negative result because the positive sample is incorrectly identified by the classifier as being negative. Third, if the actual classification is negative and the predicted classification is positive (0,1), this is called a false positive result because the negative sample is incorrectly identified by the classifier as being positive. Fourth, if the actual classification is negative and the predicted classification is negative (0,0), this is called a true negative result because the negative sample is correctly identified by the classifier.
We can then compare the actual and predicted classifications and add this information to the table, making correct results appear in green so they are more easily identifiable.
| Individual number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual classification | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Predicted classification | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Result | FN | FN | TP | TP | TP | TP | TP | TP | FP | TN | TN | TN |
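The tally in the Result row can be reproduced in a few lines of Python (a minimal sketch using the sample data from the tables above):

```python
actual    = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# count each of the four outcomes by comparing the paired labels
tp = sum(1 for a, p in zip(actual, predicted) if (a, p) == (1, 1))
fn = sum(1 for a, p in zip(actual, predicted) if (a, p) == (1, 0))
fp = sum(1 for a, p in zip(actual, predicted) if (a, p) == (0, 1))
tn = sum(1 for a, p in zip(actual, predicted) if (a, p) == (0, 0))

print(tp, fn, fp, tn)  # 6 2 1 3
```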
The template for any binary confusion matrix uses the four kinds of results discussed above (true positives, false negatives, false positives, and true negatives) along with the positive and negative classifications. The four outcomes can be formulated in a 2×2 confusion matrix, as follows:
Total population = P + N

| Actual \ Predicted | Positive (PP) | Negative (PN) |
|---|---|---|
| Positive (P) | True positive (TP) | False negative (FN) |
| Negative (N) | False positive (FP) | True negative (TN) |

Sources: [4][5][6][7][8][9][10]
The color conventions of the three data tables above were chosen to match this confusion matrix, in order to easily differentiate the data.
Now, we can simply total up each type of result, substitute into the template, and create a confusion matrix that will concisely summarize the results of testing the classifier:
Total population = 8 + 4 = 12

| Actual \ Predicted | Cancer (7) | Non-cancer (5) |
|---|---|---|
| Cancer (8) | 6 | 2 |
| Non-cancer (4) | 1 | 3 |
In this confusion matrix, of the 8 samples with cancer, the system judged that 2 were cancer-free, and of the 4 samples without cancer, it predicted that 1 did have cancer. All correct predictions are located in the diagonal of the table (highlighted in green), so it is easy to visually inspect the table for prediction errors, as values outside the diagonal represent them. By summing the two rows of the confusion matrix, one can also deduce the total number of positive (P) and negative (N) samples in the original dataset, i.e. P = TP + FN = 6 + 2 = 8 and N = FP + TN = 1 + 3 = 4.
Table of confusion
In predictive analytics, a table of confusion (sometimes also called a confusion matrix) is a table with two rows and two columns that reports the number of true positives, false negatives, false positives, and true negatives. This allows more detailed analysis than simply observing the proportion of correct classifications (accuracy). Accuracy will yield misleading results if the data set is unbalanced; that is, when the numbers of observations in different classes vary greatly.
For example, if there were 95 cancer samples and only 5 non-cancer samples in the data, a particular classifier might classify all the observations as having cancer. The overall accuracy would be 95%, but in more detail the classifier would have a 100% recognition rate (sensitivity) for the cancer class but a 0% recognition rate for the non-cancer class. F1 score is even more unreliable in such cases, and here would yield over 97.4%, whereas informedness removes such bias and yields 0 as the probability of an informed decision for any form of guessing (here always guessing cancer).
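The figures in this paragraph are easy to verify directly; the following sketch (plain Python, no libraries assumed) recomputes them for the degenerate always-predict-cancer classifier:

```python
# 95 cancer (positive) and 5 non-cancer (negative) samples;
# the degenerate classifier predicts "cancer" for every sample
tp, fn, fp, tn = 95, 0, 5, 0

accuracy = (tp + tn) / (tp + fn + fp + tn)                    # 0.95
sensitivity = tp / (tp + fn)                                  # 1.0 (cancer class)
specificity = tn / (tn + fp)                                  # 0.0 (non-cancer class)
precision = tp / (tp + fp)                                    # 0.95
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.974
informedness = sensitivity + specificity - 1                  # 0.0
```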
According to Davide Chicco and Giuseppe Jurman, the most informative metric to evaluate a confusion matrix is the Matthews correlation coefficient (MCC).[11]
Other metrics can be included in a confusion matrix, each with its own significance and use.
The 2×2 layout, with the common synonyms for each cell (total population = P + N; sources: [12][13][14][15][16][17][18][19]):

| Actual \ Predicted | Positive (PP) | Negative (PN) |
|---|---|---|
| Real positive (P)[a] | True positive (TP), hit[b] | False negative (FN), miss, underestimation |
| Real negative (N)[d] | False positive (FP), false alarm, overestimation | True negative (TN), correct rejection[e] |

The derived metrics, with their defining formulas:

| Metric | Formula |
|---|---|
| True positive rate (TPR), recall, sensitivity (SEN), probability of detection, hit rate, power | TP/P = 1 − FNR |
| False negative rate (FNR), miss rate, type II error[c] | FN/P = 1 − TPR |
| False positive rate (FPR), probability of false alarm, fall-out, type I error[f] | FP/N = 1 − TNR |
| True negative rate (TNR), specificity (SPC), selectivity | TN/N = 1 − FPR |
| Prevalence | P/(P + N) |
| Positive predictive value (PPV), precision | TP/(TP + FP) = 1 − FDR |
| False discovery rate (FDR) | FP/(TP + FP) = 1 − PPV |
| False omission rate (FOR) | FN/(TN + FN) = 1 − NPV |
| Negative predictive value (NPV) | TN/(TN + FN) = 1 − FOR |
| Positive likelihood ratio (LR+) | TPR/FPR |
| Negative likelihood ratio (LR−) | FNR/TNR |
| Accuracy (ACC) | (TP + TN)/(P + N) |
| Balanced accuracy (BA) | (TPR + TNR)/2 |
| F1 score | 2 PPV × TPR/(PPV + TPR) = 2 TP/(2 TP + FP + FN) |
| Informedness, bookmaker informedness (BM) | TPR + TNR − 1 |
| Prevalence threshold (PT) | (√(TPR × FPR) − FPR)/(TPR − FPR) |
| Markedness (MK), deltaP (Δp) | PPV + NPV − 1 |
| Diagnostic odds ratio (DOR) | LR+/LR− |
| Fowlkes–Mallows index (FM) | √(PPV × TPR) |
| phi or Matthews correlation coefficient (MCC) | √(TPR × TNR × PPV × NPV) − √(FNR × FPR × FOR × FDR) |
| Threat score (TS), critical success index (CSI), Jaccard index | TP/(TP + FN + FP) |
- ^ the number of real positive cases in the data
- ^ A test result that correctly indicates the presence of a condition or characteristic
- ^ Type II error: A test result which wrongly indicates that a particular condition or attribute is absent
- ^ the number of real negative cases in the data
- ^ A test result that correctly indicates the absence of a condition or characteristic
- ^ Type I error: A test result which wrongly indicates that a particular condition or attribute is present
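Several of the derived metrics above can be computed directly from the four raw counts. The sketch below (an illustrative helper, not from any particular library) uses the cancer example from earlier (TP = 6, FN = 2, FP = 1, TN = 3):

```python
import math

def derived_metrics(tp, fn, fp, tn):
    """Compute a few confusion-matrix metrics from the four raw counts."""
    p, n = tp + fn, fp + tn           # actual positives and negatives
    tpr, tnr = tp / p, tn / n         # recall and specificity
    return {
        "accuracy": (tp + tn) / (p + n),
        "balanced_accuracy": (tpr + tnr) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "informedness": tpr + tnr - 1,
        # MCC in its count form:
        # (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

m = derived_metrics(tp=6, fn=2, fp=1, tn=3)
# accuracy 0.75, balanced accuracy 0.75, F1 0.8, informedness 0.5, MCC ~0.478
```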
Some researchers have argued that the confusion matrix, and the metrics derived from it, do not truly reflect a model's knowledge. In particular, the confusion matrix cannot show whether correct predictions were reached through sound reasoning or merely by chance (a problem known in philosophy as epistemic luck). It also does not capture situations where the facts used to make a prediction later change or turn out to be wrong (defeasibility). This means that while the confusion matrix is a useful tool for measuring classification performance, it may give an incomplete picture of a model’s true reliability.[20]

Confusion matrices with more than two categories
The confusion matrix is not limited to binary classification and can be used with multi-class classifiers as well. The confusion matrices discussed above have only two conditions: positive and negative. For example, the table below summarizes communication of a whistled language between two speakers, with zero values omitted for clarity.[21]
| Vowel produced \ Vowel perceived | i | e | a | o | u |
|---|---|---|---|---|---|
| i | 15 | 1 | | | |
| e | 1 | 1 | | | |
| a | | | 79 | 5 | |
| o | | | 4 | 15 | 3 |
| u | | | | 2 | 2 |

Confusion matrices in multi-label and soft-label classification
Confusion matrices are not limited to single-label classification (where only one class is present) or hard-label settings (where classes are either fully present, 1, or absent, 0). They can also be extended to multi-label classification (where multiple classes can be predicted at once) and soft-label classification (where classes can be partially present).
One such extension is the Transport-based Confusion Matrix (TCM),[22] which builds on the theory of optimal transport and the principle of maximum entropy. TCM applies to single-label, multi-label, and soft-label settings. It retains the familiar structure of the standard confusion matrix: a square matrix sized by the number of classes, with diagonal entries indicating correct predictions and off-diagonal entries indicating confusion. In the single-label case, TCM is identical to the standard confusion matrix.
TCM follows the same reasoning as the standard confusion matrix: if class A is overestimated (its predicted value is greater than its label value) and class B is underestimated (its predicted value is less than its label value), A is considered confused with B, and the entry (B, A) is increased. If a class is both predicted and present, it is correctly identified, and the diagonal entry (A, A) increases. Optimal transport and maximum entropy are used to determine the extent to which these entries are updated.[22]
TCM enables clearer comparison between predictions and labels in complex classification tasks, while maintaining a consistent matrix format across settings.[22]
References
[edit]- ^ Stehman, Stephen V. (1997). "Selecting and interpreting measures of thematic classification accuracy". Remote Sensing of Environment. 62 (1): 77–89. Bibcode:1997RSEnv..62...77S. doi:10.1016/S0034-4257(97)00083-7.
- ^ Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63. S2CID 55767944.
- ^ Opitz, Juri (2024). "A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice". Transactions of the Association for Computational Linguistics. 12: 820–836. arXiv:2404.16958. doi:10.1162/tacl_a_00675.
- ^ Provost, Foster; Fawcett, Tom (2013). Data science for business: what you need to know about data mining and data-analytic thinking (1. ed., 2. release ed.). Beijing Köln: O'Reilly. ISBN 978-1-4493-6132-7.
- ^ Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. Bibcode:2006PaReL..27..861F. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.
- ^ Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
- ^ Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN 978-0-387-30164-8.
- ^ Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
- ^ Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1) 6: 6-1–6-13. doi:10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477.
- ^ Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi:10.1016/j.aci.2018.08.003.
- ^ Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1) 6: 6-1–6-13. doi:10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477.
- ^ Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010. S2CID 2027090.
- ^ Provost, Foster; Tom Fawcett (2013-08-01). "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking". O'Reilly Media, Inc.
- ^ Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies. 2 (1): 37–63.
- ^ Ting, Kai Ming (2011). Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of machine learning. Springer. doi:10.1007/978-0-387-30164-8. ISBN 978-0-387-30164-8.
- ^ Brooks, Harold; Brown, Barb; Ebert, Beth; Ferro, Chris; Jolliffe, Ian; Koh, Tieh-Yong; Roebber, Paul; Stephenson, David (2015-01-26). "WWRP/WGNE Joint Working Group on Forecast Verification Research". Collaboration for Australian Weather and Climate Research. World Meteorological Organisation. Retrieved 2019-07-17.
- ^ Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (1): 6-1–6-13. doi:10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477.
- ^ Chicco D, Toetsch N, Jurman G (February 2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation". BioData Mining. 14 (13): 13. doi:10.1186/s13040-021-00244-z. PMC 7863449. PMID 33541410.
- ^ Tharwat A. (August 2018). "Classification assessment methods". Applied Computing and Informatics. 17: 168–192. doi:10.1016/j.aci.2018.08.003.
- ^ van der Linde, Ian (2025). "Why the confusion matrix fails as a model of knowledge". AI & Society. doi:10.1007/s00146-025-02456-x.
- ^ Rialland, Annie (August 2005). "Phonological and phonetic aspects of whistled languages". Phonology. 22 (2): 237–271. CiteSeerX 10.1.1.484.4384. doi:10.1017/S0952675705000552. S2CID 18615779.
- ^ a b c Erbani, Johan; Portier, Pierre-Édouard; Egyed-Zsigmond, Előd; Nurbakova, Diana (2024). "Confusion Matrices: A Unified Theory". IEEE Access. 12. IEEE: 181372–181419. Bibcode:2024IEEEA..12r1372E. doi:10.1109/ACCESS.2024.3507199. ISSN 2169-3536.
Basic Concepts
Definition and Purpose
A confusion matrix is a table that summarizes the performance of a classification algorithm by comparing its predicted labels against the actual labels from a dataset, typically presenting the results as counts or normalized probabilities in a square layout where rows represent actual classes and columns represent predicted classes. This structure provides a detailed breakdown of correct and incorrect predictions, enabling a nuanced evaluation beyond simple overall accuracy.

The confusion matrix is based on the contingency table concept introduced by Karl Pearson in 1904.[4] The term "confusion matrix" and its use in evaluating classification performance emerged in the mid-20th century, particularly in signal detection theory and psychophysics during the 1950s and 1960s, and was adopted in machine learning and pattern recognition from the 1960s onward.[5]

The primary purposes of a confusion matrix are to assess a model's overall accuracy by revealing the distribution of correct predictions, to identify specific types of errors (such as false positives, i.e. incorrectly predicted positive instances, versus false negatives, i.e. missed positive instances), and to serve as the basis for deriving summary statistics like precision (the proportion of true positives among predicted positives) and recall (the proportion of true positives among actual positives). This evaluation assumes basic knowledge of supervised learning, where models are trained on labeled data to predict categorical outcomes, often starting with binary classification setups before extending to more complex cases.

Elements in Binary Classification
In binary classification, the confusion matrix is structured around four fundamental elements that capture the outcomes of predictions against actual labels. A true positive (TP) occurs when the model correctly identifies a positive instance, such as detecting a disease in a patient who truly has it.[6] A true negative (TN) represents a correct prediction of a negative instance, for example, identifying a healthy patient as disease-free.[6] Conversely, a false positive (FP), also known as a Type I error, happens when the model incorrectly predicts a positive outcome for a negative instance, like flagging a healthy individual as ill.[6][7] A false negative (FN), or Type II error, arises when a positive instance is wrongly classified as negative, such as failing to detect a disease in an affected patient.[6][7]

These elements are arranged in a standard 2×2 table, where rows correspond to actual classes and columns to predicted classes, providing a clear visualization of model performance.[8] The layout is as follows:

| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
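The mapping from a single (actual, predicted) pair to one of the four cells can be written out directly (a small illustrative helper):

```python
def outcome(actual, predicted):
    # return the cell of the 2x2 layout that this (actual, predicted) pair falls into
    if actual == 1:
        return "TP" if predicted == 1 else "FN"
    return "FP" if predicted == 1 else "TN"

# e.g. outcome(1, 1) -> "TP", outcome(0, 1) -> "FP"
```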
Construction and Examples
Building the Matrix
To construct a confusion matrix, paired datasets of ground truth labels (often denoted as $ y_{\text{true}} $) and predicted labels (denoted as $ y_{\text{pred}} $) are required, typically derived from a held-out test set to ensure unbiased evaluation of a classification model's performance.[6] These datasets must have matching lengths, with each element representing the true and predicted class for an individual sample, and labels can be numeric, string, or categorical.[6]

The process begins by collecting these actual and predicted labels from the model's output on the test data. Next, each prediction is categorized according to the relevant rules for binary or multi-class settings, assigning instances to true positives (TP), true negatives (TN), false positives (FP), or false negatives (FN) in the binary case, as defined in the elements of binary classification.[8] The matrix is then populated as a square table where rows correspond to actual classes and columns to predicted classes, with cell entries recording the counts of instances falling into each category.[6]

Optionally, the matrix can be normalized to express proportions rather than raw counts: by row (dividing by actual class totals, yielding recall-like values per class), by column (dividing by predicted class totals, yielding precision-like values), or by the overall total (yielding error rates).[6] This normalization aids in comparing performance across datasets of varying sizes.[8]

In practice, libraries such as scikit-learn provide dedicated functions like `confusion_matrix(y_true, y_pred, normalize=None)` to automate this construction, handling label ordering and optional weighting for imbalanced samples, while tools like pandas can further tabulate and visualize the resulting array.[6] For probabilistic model outputs, a decision threshold (commonly 0.5 for binary classifiers) is applied to convert continuous scores (e.g., sigmoid probabilities) into discrete predictions before categorization; thresholds below 0.5 increase TP and FP at the expense of TN and FN.[9]
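The construction and normalization steps described above can also be sketched without any library (the helper names `build_confusion` and `normalize_rows` are illustrative, not scikit-learn's API):

```python
from collections import Counter

def build_confusion(y_true, y_pred, labels):
    """Square matrix of counts: rows are actual classes, columns are predicted."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(a, p)] for p in labels] for a in labels]

def normalize_rows(matrix):
    """Row-wise normalization: each row sums to 1, giving recall-like values."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in matrix]

cm = build_confusion([1, 1, 1, 0, 0], [1, 0, 1, 0, 0], labels=[1, 0])
# cm == [[2, 1], [0, 2]]  (rows/columns ordered as in `labels`: positive first)
```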
Illustrative Example
Consider a hypothetical binary classification task involving the detection of spam emails from a dataset of 100 emails, where 40 are actual spam (positive class) and 60 are non-spam (negative class).[6] This scenario illustrates how a confusion matrix is constructed by comparing the model's predictions against the true labels.

To build the matrix, first identify the true positives (TP): emails correctly classified as spam, which number 35 in this example. Next, the false negatives (FN): actual spam emails incorrectly labeled as non-spam, totaling 5 (since 40 − 35 = 5). For the negative class, true negatives (TN) are non-spam emails correctly identified, amounting to 50, while false positives (FP) are non-spam emails wrongly flagged as spam, totaling 10 (since 60 − 50 = 10). These values are arranged in the standard 2×2 confusion matrix format, with rows representing actual classes and columns representing predicted classes.[6] The resulting confusion matrix is:

| Actual \ Predicted | Spam | Non-Spam |
|---|---|---|
| Spam | 35 (TP) | 5 (FN) |
| Non-Spam | 10 (FP) | 50 (TN) |
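The usual summary statistics follow directly from these counts; a quick check in plain Python:

```python
# counts from the spam-detection confusion matrix above
tp, fn, fp, tn = 35, 5, 10, 50

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 85 / 100 = 0.85
precision = tp / (tp + fp)                   # 35 / 45 ~ 0.778
recall = tp / (tp + fn)                      # 35 / 40 = 0.875
```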
Multi-Class Extensions
Generalizing to Multiple Categories
In multi-class classification problems, the confusion matrix extends from the binary case to form an N × N table, where N represents the number of distinct classes.[6] The diagonal elements capture the true positives for each class, denoting the number of instances correctly predicted as belonging to that class, while the off-diagonal elements record false predictions, indicating misclassifications between classes.[6] The matrix is indexed such that rows correspond to the actual (true) classes and columns to the predicted classes; specifically, the element at position (i, j) counts the number of instances that truly belong to class i but were predicted as class j.[6] This structure generalizes the 2×2 binary confusion matrix by accommodating multiple categories while maintaining the same interpretive logic.[6]

To facilitate analysis, the confusion matrix can be normalized in various ways: row-wise normalization divides each row by its total to yield per-class recall (sensitivity), highlighting how well each actual class is identified; column-wise normalization divides each column by its total to produce per-class precision, showing the reliability of predictions for each class; or matrix-wide normalization scales all elements by the total number of instances to express proportions across the entire dataset.[6]

As the number of classes increases, the matrix grows quadratically in size, introducing greater complexity in interpretation and visualization due to the expanded number of entries that must be examined. In datasets with class imbalance, particularly in multi-class settings, the matrix often becomes sparse, with many cells (especially those involving rare classes) containing low or zero counts, which can lead to unreliable performance estimates and bias toward majority classes.

Multi-Class Example
The Iris dataset, originally collected by Ronald Fisher in 1936, comprises 150 samples of Iris flowers divided equally into three classes (setosa, versicolor, and virginica), with 50 observations per species based on four morphological features. To demonstrate a multi-class confusion matrix, consider the results from a support vector machine classifier (linear kernel, regularization parameter C=0.01) applied to a held-out test subset of 38 samples from this dataset.[10] The resulting 3×3 confusion matrix, with rows denoting actual classes and columns denoting predicted classes (setosa, versicolor, virginica), is presented below:

| Actual \ Predicted | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 10 | 6 |
| virginica | 0 | 0 | 9 |
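Row-normalizing this matrix gives per-class recall, which makes the versicolor-to-virginica confusion explicit (a minimal sketch):

```python
iris_cm = [
    [13, 0, 0],   # actual setosa
    [0, 10, 6],   # actual versicolor (6 samples misclassified as virginica)
    [0, 0, 9],    # actual virginica
]

# per-class recall: diagonal entry divided by its row total
recalls = [row[i] / sum(row) for i, row in enumerate(iris_cm)]
# setosa 1.0, versicolor 0.625, virginica 1.0
```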