Simple matching coefficient
The simple matching coefficient (SMC) or Rand similarity coefficient is a statistic used for comparing the similarity and diversity of sample sets.[1]
|   | A = 0 | A = 1 |
|---|---|---|
| B = 0 | $M_{00}$ | $M_{10}$ |
| B = 1 | $M_{01}$ | $M_{11}$ |
Given two objects, A and B, each with n binary attributes, SMC is defined as:

$$\mathrm{SMC} = \frac{M_{00} + M_{11}}{M_{00} + M_{01} + M_{10} + M_{11}}$$

where
- $M_{00}$ is the total number of attributes where A and B both have a value of 0,
- $M_{11}$ is the total number of attributes where A and B both have a value of 1,
- $M_{01}$ is the total number of attributes where A has value 0 and B has value 1, and
- $M_{10}$ is the total number of attributes where A has value 1 and B has value 0.

The simple matching distance (SMD), which measures dissimilarity between sample sets, is given by $\mathrm{SMD} = 1 - \mathrm{SMC}$.[2]
SMC is linearly related to Hamann similarity: $\mathrm{SMC} = (\mathrm{Hamann} + 1)/2$. Also, $\mathrm{SMC} = 1 - \frac{D^2}{n}$, where $D^2$ is the squared Euclidean distance between the two objects (binary vectors) and n is the number of attributes.
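Both identities can be checked numerically. A minimal sketch (variable names are illustrative, assuming NumPy) counts the four match types for two random binary vectors and confirms the relations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20)
y = rng.integers(0, 2, size=20)
n = len(x)

m11 = int(np.sum((x == 1) & (y == 1)))  # both 1
m00 = int(np.sum((x == 0) & (y == 0)))  # both 0
m10 = int(np.sum((x == 1) & (y == 0)))
m01 = int(np.sum((x == 0) & (y == 1)))

smc = (m00 + m11) / n
hamann = (m00 + m11 - m01 - m10) / n   # Hamann similarity
d_squared = int(np.sum((x - y) ** 2))  # equals m01 + m10 for binary vectors

assert np.isclose(smc, (hamann + 1) / 2)
assert np.isclose(smc, 1 - d_squared / n)
```

Both identities follow algebraically from $m_{00} + m_{11} + m_{01} + m_{10} = n$, since the squared Euclidean distance between binary vectors is just the number of disagreements.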
The SMC is very similar to the more popular Jaccard index. The main difference is that the SMC has the term $M_{00}$ in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC counts both mutual presences (when an attribute is present in both sets) and mutual absences (when an attribute is absent in both sets) as matches and compares them to the total number of attributes in the universe, whereas the Jaccard index only counts mutual presences as matches and compares them to the number of attributes that have been chosen by at least one of the two sets.
In market basket analysis, for example, the baskets of two consumers we wish to compare might contain only a small fraction of all the available products in the store, so the SMC will usually return very high similarity values even when the baskets bear very little resemblance, making the Jaccard index a more appropriate measure of similarity in that context. For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper, and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.
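The supermarket example works out as follows (product counts and basket contents as in the text; the set encoding is one convenient way to derive the match counts):

```python
# Supermarket example: 1000 products, baskets {salt, pepper} and {salt, sugar}.
n_products = 1000
basket1 = {"salt", "pepper"}
basket2 = {"salt", "sugar"}

both = len(basket1 & basket2)      # M11: products in both baskets
either = len(basket1 | basket2)    # M11 + M10 + M01: products in at least one
neither = n_products - either      # M00: products in neither basket

jaccard = both / either
smc = (both + neither) / n_products

print(jaccard)  # 0.3333333333333333
print(smc)      # 0.998
```

The 997 products absent from both baskets dominate the SMC numerator, which is exactly the inflation the text describes.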
In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is a better measure of similarity. For example, vectors of demographic variables stored in dummy variables, such as binary gender, would be better compared with the SMC than with the Jaccard index, since the impact of gender on similarity should be equal regardless of whether male is defined as 0 and female as 1 or the other way around. However, when we have symmetric dummy variables, one could replicate the behaviour of the SMC by splitting each dummy into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes and allowing the use of the Jaccard index without introducing any bias. Viewed this way, the Jaccard index makes the SMC a fully redundant metric. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables, since it does not require adding extra dimensions.
The Jaccard index is also more general than the SMC and can be used to compare other data types than just vectors of binary attributes, such as probability measures.
Fundamentals
Definition
The simple matching coefficient (SMC) is a symmetric similarity metric designed specifically for comparing binary data, where it quantifies the degree of resemblance between two objects by considering both agreements on positive attributes (presence of 1s) and negative attributes (absence of 0s) across their attribute vectors.[2] Binary data in this context consists of vectors composed of 0s and 1s, typically representing the absence or presence of specific attributes, such as species characteristics in taxonomy or feature states in pattern recognition.[2] Originating in the 1950s within the fields of pattern recognition and numerical taxonomy, the SMC was introduced as a foundational tool for evaluating systematic relationships among entities based on shared attributes.[6] It is attributed to the seminal work of Robert R. Sokal and Charles D. Michener, who developed it to support quantitative methods in biological classification and beyond.[6] Unlike many distance metrics that emphasize differences, the SMC functions as a direct similarity coefficient, yielding values bounded between 0 and 1, with a value of 1 signifying complete identity between the two binary vectors and 0 indicating no matches whatsoever.[2] This normalization makes it particularly useful for interpreting resemblance in datasets where absences are informative.[2]
Notation and Interpretation
The simple matching coefficient is defined using standard notation for two binary vectors $x$ and $y$, where each component is either 0 or 1, representing the states of attributes for two objects. Let $a$ denote the number of positions where $x_i = 1$ and $y_i = 1$, $b$ the number where $x_i = 1$ and $y_i = 0$, $c$ the number where $x_i = 0$ and $y_i = 1$, and $d$ the number where $x_i = 0$ and $y_i = 0$.[6] The coefficient is computed as $$\mathrm{SMC} = \frac{a + d}{a + b + c + d},$$ where the denominator equals $n$, the total number of attributes.[6] This expression captures the proportion of attributes on which $x$ and $y$ agree, with $a + d$ representing the total agreements: either both attributes present (positive matches, $a$) or both absent (negative matches, $d$). By including $d$ in the numerator, the coefficient treats 0-matches as informative, which distinguishes it from measures that exclude double absences and makes it suitable for sparse binary data where absences provide meaningful similarity information.[6] The symmetric treatment of presences and absences in the SMC contrasts with other measures, such as the Kulczyński coefficient, which weight positive and negative matches differently.[7]
Properties
Mathematical Properties
The simple matching coefficient (SMC) possesses several fundamental mathematical properties that make it a useful similarity measure for binary data. It is symmetric, meaning that for any two binary vectors $x$ and $y$, $\mathrm{SMC}(x, y) = \mathrm{SMC}(y, x)$. This follows from the definition $\mathrm{SMC} = (a + d)/n$, where $a$ is the number of positions where both vectors have 1, $d$ is the number where both have 0, and $n$ is the vector length; swapping $x$ and $y$ interchanges $b$ (positions where $x_i = 1$, $y_i = 0$) and $c$ (positions where $x_i = 0$, $y_i = 1$), but leaves the numerator and denominator unchanged.[8] The coefficient is also reflexive: $\mathrm{SMC}(x, x) = 1$ for any binary vector $x$, as $b = 0$ and $c = 0$ when comparing a vector to itself.[8] Additionally, SMC is non-negative, with $\mathrm{SMC}(x, y) \geq 0$ for all $x, y$, and equality holds when there are no matching positions ($a + d = 0$), corresponding to complete dissimilarity in both presences and absences.[8][9] Although SMC serves as a similarity measure, it is not itself a distance metric. However, the transformation $D(x, y) = 1 - \mathrm{SMC}(x, y)$ produces a valid distance metric, equivalent to the normalized Hamming distance, which does satisfy the triangle inequality.[8] To demonstrate the bounds $0 \leq \mathrm{SMC} \leq 1$, consider the non-negative integers $a, b, c, d$ satisfying $a + b + c + d = n$. Then $$\mathrm{SMC} = \frac{a + d}{n}.$$ Since $0 \leq a + d \leq n$, it follows that $0 \leq (a + d)/n \leq 1$, so $0 \leq \mathrm{SMC} \leq 1$. The lower bound is achieved when $a + d = 0$ (i.e., $b + c = n$), and the upper bound when $a + d = n$ (i.e., $b + c = 0$).[8][9]
Range and Bounds
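The properties just listed (symmetry, reflexivity, the [0, 1] bounds, and the metric behaviour of 1 − SMC as a normalized Hamming distance) can be verified exhaustively on small random samples; a minimal sketch, with illustrative vector sizes:

```python
import itertools
import random

def smc(x, y):
    """Simple matching coefficient of two equal-length binary vectors."""
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

random.seed(1)
vecs = [[random.randint(0, 1) for _ in range(8)] for _ in range(6)]

for x, y in itertools.combinations(vecs, 2):
    assert smc(x, y) == smc(y, x)       # symmetry
    assert 0.0 <= smc(x, y) <= 1.0      # bounds
for x in vecs:
    assert smc(x, x) == 1.0             # reflexivity

# 1 - SMC is the normalized Hamming distance, which satisfies
# the triangle inequality:
for x, y, z in itertools.permutations(vecs, 3):
    assert 1 - smc(x, z) <= (1 - smc(x, y)) + (1 - smc(y, z)) + 1e-12
```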
The simple matching coefficient (SMC) is bounded between 0 and 1, as the counts $a$, $b$, $c$, and $d$ represent non-negative integers that sum to the total number of attributes $n$. Consequently, the numerator satisfies $0 \leq a + d \leq n$, implying $0 \leq \mathrm{SMC} \leq 1$.[10] The lower bound of 0 is attained when $a + d = 0$ (all attributes mismatch), representing complete dissimilarity, while the upper bound of 1 is reached when $a + d = n$ (all attributes match), indicating perfect similarity.[11] This formulation ensures SMC is inherently normalized to the interval [0, 1], with values approaching 1 denoting high similarity and those near 0 signaling strong dissimilarity, facilitating direct comparability across datasets without additional rescaling.[10] In high-dimensional sparse data, SMC often trends toward 1 because numerous incidental 0-matches (co-absences) inflate the score, potentially biasing assessments by overemphasizing agreement in absent features.[2] For a fixed set of attributes, the coefficient is sensitive to the attribute count, as larger $n$ amplifies the impact of random matches on the overall proportion.[12]
Computation
Step-by-Step Calculation
To compute the simple matching coefficient between two binary vectors $x$ and $y$ of equal length $n$, begin by aligning the vectors so that corresponding positions are compared pairwise.[13] Next, construct a 2×2 contingency table for the pair by counting the occurrences in each category: let $a$ be the number of positions where both $x_i = 1$ and $y_i = 1$ (joint presences), $b$ the number where $x_i = 1$ and $y_i = 0$, $c$ the number where $x_i = 0$ and $y_i = 1$, and $d$ the number where both $x_i = 0$ and $y_i = 0$ (joint absences); note that $a + b + c + d = n$.[13] The coefficient is then calculated as the proportion of matching attributes: $$\mathrm{SMC} = \frac{a + d}{n}.$$ This yields a value between 0 and 1, where 1 indicates perfect similarity.[13] For edge cases, if $n = 0$ (empty vectors), the coefficient is undefined due to division by zero. If the vectors are identical, the value is directly 1, as $b = c = 0$, bypassing explicit counting.[13] The computation requires $O(n)$ time per vector pair, involving a single pass to tally the counts, which scales efficiently for large datasets when implemented with vectorized operations in programming languages like R or Python.[14] When evaluating multiple objects, the SMC is computed for each pair of objects from an m × n binary data matrix (m objects, n attributes), resulting in an m × m similarity matrix where each entry is the proportion of matching attributes between the pair. This can be efficiently implemented using vectorized operations.[14]
Numerical Example
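A minimal sketch of this procedure in Python (function names are illustrative; assumes NumPy), covering both the pairwise computation from the contingency counts and the vectorized m × m similarity matrix:

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient via the 2x2 contingency counts."""
    x, y = np.asarray(x), np.asarray(y)
    if x.size == 0:
        raise ValueError("SMC is undefined for empty vectors")
    a = np.sum((x == 1) & (y == 1))  # joint presences
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))  # joint absences
    return (a + d) / (a + b + c + d)

def smc_matrix(X):
    """Pairwise m x m SMC matrix for an m x n binary data matrix."""
    X = np.asarray(X)
    # agreements[i, j] = positions where rows i and j are both 1 or both 0
    agreements = X @ X.T + (1 - X) @ (1 - X).T
    return agreements / X.shape[1]

x, y = [1, 0, 0, 0], [1, 1, 0, 0]
print(smc(x, y))  # 0.75
```

The matrix form replaces the per-pair pass with two matrix products, which is the vectorized implementation the text alludes to.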
To illustrate the computation of the simple matching coefficient, consider a small binary dataset consisting of two vectors, $x = (1, 0, 0, 0)$ and $y = (1, 1, 0, 0)$, each with $n = 4$ attributes. In this example, there is one position where both vectors have a 1 (position 1, so $a = 1$); no positions where the first vector has a 1 and the second has a 0 (so $b = 0$); one position where the first has a 0 and the second has a 1 (position 2, so $c = 1$); and two positions where both have a 0 (positions 3 and 4, so $d = 2$). The simple matching coefficient is then calculated as $\mathrm{SMC} = (a + d)/n = (1 + 2)/4 = 0.75$.[15] This result indicates a 75% similarity between the vectors, primarily driven by the two matching 0s and one matching 1, with the mismatch arising from the single differing attribute. To demonstrate the sensitivity of the coefficient to individual attributes, consider flipping the value in position 2 of $y$ from 1 to 0, yielding $y' = (1, 0, 0, 0)$. Now, $a = 1$, $b = 0$, $c = 0$, and $d = 3$, so the simple matching coefficient becomes $(1 + 3)/4 = 1$, reflecting perfect similarity after this change.
Applications
In Cluster Analysis
The simple matching coefficient serves as a fundamental similarity measure in cluster analysis for binary data, providing input for agglomerative hierarchical clustering algorithms, including single-linkage and complete-linkage methods, where it facilitates the merging of clusters based on overall resemblance in attribute states.[16] This application is particularly suited to datasets where observations are represented as binary vectors, allowing the coefficient to quantify pairwise similarities that guide the hierarchical grouping process.[2] Historically, the simple matching coefficient was introduced by Sokal and Michener in 1958 as a statistical tool for evaluating systematic relationships in taxonomic data.[17] It gained prominence in the 1950s and 1960s through the work of Sokal and Sneath, who integrated it into numerical taxonomy for classifying organisms using binary phenotypic traits, such as presence or absence of morphological features, thereby enabling objective, quantitative phenetic clustering without prior assumptions about evolutionary relationships.[18] This approach revolutionized biological classification by treating all characters equally and using similarity matrices derived from the coefficient to construct dendrograms.[19] In modern contexts, such as gene expression analysis, the simple matching coefficient is applied to cluster samples where attributes are binarized to indicate the presence or absence of gene features across conditions or tissues.[20] For instance, in microarray or sequencing data with sparse expression patterns, it groups samples with similar profiles of expressed and non-expressed genes, aiding in the identification of co-regulated patterns or disease subtypes. 
A key advantage of the simple matching coefficient in clustering arises from its inclusion of matching absences (shared 0-states) in the similarity calculation, which proves effective for datasets dominated by negative matches, such as ecological or genomic binary data with high sparsity.[2] This treatment enhances cluster cohesion by recognizing non-occurrence of features as informative similarity, avoiding the underestimation of relatedness that can occur with measures ignoring negative matches, and thus improving the stability and interpretability of resulting hierarchies in presence-absence scenarios.[21]
In Categorical Data Analysis
The simple matching coefficient (SMC) serves as a measure of similarity between itemsets in association rule mining, particularly in market basket analysis where transaction data is represented as binary vectors indicating presence or absence of items. In this context, SMC quantifies the proportion of matching attributes (both presences and absences) between two baskets, enabling the identification of co-occurring items that inform rules like frequent itemset generation. For instance, in retail datasets, it helps assess how closely the purchase patterns of two customers align, supporting recommendations by highlighting shared buying behaviors. This application leverages SMC's symmetry in treating matches and non-matches equally, making it suitable for Boolean data typical of transactional records.[22][23] In taxonomy and ecology, SMC quantifies resemblance between species or communities using binary trait matrices, such as the presence or absence of morphological features like organs or habitat indicators. For species classification, it compares binary coding of traits across taxa, where shared presences (e.g., both species having wings) and absences (e.g., neither having gills) contribute to similarity scores, aiding in constructing phylogenetic or phenetic trees. In ecological studies, it evaluates community similarity based on species occurrence data, treating sites as binary vectors to measure overlap in flora or fauna composition. This approach was notably applied in 1970s ecological research for comparing vegetation types, such as analyzing plant community structures in diverse habitats to discern patterns of succession or disturbance.[2][12][24] As a modern extension in machine learning, SMC assesses variable co-occurrence in binary classifiers during feature selection, helping to detect redundant or highly similar features by computing similarity across samples. 
In datasets with binary attributes, it evaluates how often two features match in value (both 1 or both 0), allowing selection of non-redundant subsets that improve model efficiency and interpretability without overfitting. This is particularly useful in high-dimensional binary data, such as genomic or sensor inputs, where SMC's inclusion of negative matches captures complementary information beyond mere overlap.[25][26]
Comparisons
With Jaccard Index
The Jaccard index, a binary similarity measure, is defined as $$J = \frac{a}{a + b + c},$$ where $a$ represents the number of attributes present in both objects, $b$ the attributes present only in the first object, and $c$ those present only in the second; it explicitly ignores $d$, the number of attributes absent in both.[4] In comparison, the simple matching coefficient (SMC) incorporates all four terms as $$\mathrm{SMC} = \frac{a + d}{a + b + c + d}.$$[4] This fundamental distinction arises from the Jaccard index's focus on positive co-occurrences, treating absences as irrelevant to similarity assessment.[27] A key difference lies in how each handles 0-matches ($d$): SMC credits shared absences as contributing to similarity, which can inflate scores in sparse datasets where positive attributes are rare and negative matches dominate.[4] The Jaccard index, by excluding $d$, provides a more conservative measure that emphasizes only overlapping presences, avoiding overestimation from ubiquitous absences.[4] This makes Jaccard particularly robust for scenarios with imbalanced binary data, such as presence-absence matrices in ecology or genetics.[4] Preferences between the two depend on the data's nature and analytical goals. SMC is favored for balanced binary datasets where both presences and absences carry symmetric informational value, as in numerical taxonomy of closely related taxa.[4] Conversely, the Jaccard index is preferred for set-like overlap problems, such as text similarity via shared keywords or document comparison, where only positive intersections matter and absences do not imply relatedness.[27] To illustrate, consider binary vectors with $a = 1$, $b = 1$, $c = 0$, and $d = 2$ across four attributes. The Jaccard index yields $1/(1 + 1 + 0) = 0.5$, while SMC gives $(1 + 2)/4 = 0.75$. This contrast highlights SMC's higher valuation due to the two 0-matches, demonstrating its tendency to elevate similarity when negatives are prevalent.[4]
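The contrast can be reproduced directly from a pair of four-attribute vectors chosen so that there is one positive match, one mismatch, and two joint absences (the vectors below are one such choice):

```python
def jaccard(x, y):
    """Jaccard index from the contingency counts a, b, c (ignores d)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return a / (a + b + c)

def smc(x, y):
    """Simple matching coefficient: proportion of agreeing positions."""
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

x = [1, 1, 0, 0]
y = [1, 0, 0, 0]
print(jaccard(x, y))  # 0.5
print(smc(x, y))      # 0.75
```

The two joint absences count toward SMC but not Jaccard, which is exactly why SMC comes out higher here.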
With Other Binary Similarity Measures
The simple matching coefficient (SMC), defined as $(a + d)/(a + b + c + d)$, differs from the Dice coefficient, which is given by $2a/(2a + b + c)$, primarily in its treatment of negative matches ($d$).[28] The Dice coefficient excludes $d$ entirely and doubles the weight of positive matches ($a$), thereby emphasizing agreements on presence over absence and making it more robust to datasets where absences are not informative.[28] In contrast, SMC's inclusion of $d$ renders it sensitive to the overall prevalence of features, potentially inflating similarity scores in sparse or imbalanced binary data.[29] Similarly, the Rogers-Tanimoto coefficient, formulated as $(a + d)/(a + d + 2(b + c))$, also incorporates $d$ but assigns double weight to disagreements ($b + c$), which tempers the influence of negative matches compared to SMC.[28] This adjustment makes Rogers-Tanimoto less prone to overemphasizing joint absences than SMC, particularly in scenarios with high discordance.[29] Overall, while SMC offers the simplest symmetric measure by treating all matches equally, alternatives like Dice and Rogers-Tanimoto adjust for class imbalance by either ignoring or reweighting components, enhancing their utility in applications such as genetic marker analysis where negative co-occurrences may not indicate true similarity.[29] SMC's reliance on $d$ makes it less suitable for asymmetric datasets, such as those involving rare events, where joint absences dominate and can misleadingly suggest high similarity.[29] For a broader overview, the following table summarizes key binary similarity measures, their formulas (using the 2×2 contingency table notation), ranges, and primary sensitivities:

| Measure | Formula | Range | Sensitivity Notes |
|---|---|---|---|
| Simple Matching (SMC) | $(a + d)/(a + b + c + d)$ | [0, 1] | High to negative matches ($d$); treats presences and absences equally.[28] |
| Jaccard | $a/(a + b + c)$ | [0, 1] | Ignores $d$; focuses on shared presences, sensitive to false positives/negatives.[28] |
| Dice (Sørensen) | $2a/(2a + b + c)$ | [0, 1] | Excludes $d$; doubles weight on positive matches, robust to imbalances.[28] |
| Rogers-Tanimoto | $(a + d)/(a + d + 2(b + c))$ | [0, 1] | Includes $d$ but penalizes disagreements heavily; balances absences with discordance.[28] |
| Ochiai | $a/\sqrt{(a + b)(a + c)}$ | [0, 1] | Ignores $d$; cosine-like, sensitive to marginal totals, undefined for zero vectors.[28] |
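The table's formulas can be computed side by side from the contingency counts; a small sketch (the example vectors are illustrative, and as noted in the table, the Jaccard, Dice, and Ochiai entries are undefined for all-zero vectors):

```python
import math

def contingency(x, y):
    """2x2 contingency counts (a, b, c, d) for two binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def binary_similarities(x, y):
    a, b, c, d = contingency(x, y)
    n = a + b + c + d
    return {
        "smc": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "rogers_tanimoto": (a + d) / (a + d + 2 * (b + c)),
        "ochiai": a / math.sqrt((a + b) * (a + c)),
    }

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
sims = binary_similarities(x, y)  # here a=2, b=1, c=1, d=1
```

Running the comparison on one pair makes the reweighting visible: Dice exceeds Jaccard (doubled positive matches), and Rogers-Tanimoto falls below SMC (doubled disagreements).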
