Discounted cumulative gain
Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving Normalized DCG (nDCG or NDCG). NDCG is often used to measure effectiveness of search engine algorithms and related applications. Using a graded relevance scale of documents in a search-engine result set, DCG sums the usefulness, or gain, of the results discounted by their position in the result list.[1] NDCG is DCG normalized by the maximum possible DCG of the result set when ranked from highest to lowest gain, thus adjusting for the different numbers of relevant results for different queries.
Overview
Two assumptions are made in using DCG and its related measures.
- Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)
- Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents.
Cumulative Gain
DCG is a refinement of a simpler measure, Cumulative Gain (CG).[2] Cumulative Gain is the sum of the graded relevance values of all results in a search result list. CG does not take into account the rank (position) of a result in the result list. The CG at a particular rank position $p$ is defined as:
$$\mathrm{CG}_p = \sum_{i=1}^{p} rel_i$$
where $rel_i$ is the graded relevance of the result at position $i$.
The value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document $d_i$ above a higher ranked, less relevant document $d_j$ does not change the computed value for CG (assuming $i, j \leq p$). Based on the two assumptions made above about the usefulness of search results, (N)DCG is usually preferred over CG. Cumulative Gain is sometimes called Graded Precision.
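The order-invariance of CG can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `cumulative_gain` and the relevance grades are hypothetical and chosen only for illustration.

```python
def cumulative_gain(relevances, p):
    """Return CG_p: the sum of the graded relevance of the first p results."""
    return sum(relevances[:p])


# Hypothetical graded judgments (0 = not relevant, 3 = highly relevant).
ranking = [3, 2, 3, 0, 1, 2]
# Same results with the documents at positions 3 and 4 exchanged.
swapped = [3, 2, 0, 3, 1, 2]

# Both positions lie within the cutoff p = 6, so CG does not change.
cg_original = cumulative_gain(ranking, 6)
cg_swapped = cumulative_gain(swapped, 6)

# With a cutoff of p = 3 the swap moves a document across the cutoff,
# so the two values differ.
cg3_original = cumulative_gain(ranking, 3)
cg3_swapped = cumulative_gain(swapped, 3)

print(cg_original, cg_swapped, cg3_original, cg3_swapped)
```

The second comparison shows why the invariance holds only when both swapped positions are at or above the cutoff.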
Discounted Cumulative Gain
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized: the graded relevance value is reduced by a factor that grows logarithmically with the position of the result.
The usual formula of DCG accumulated at a particular rank position $p$ is defined as:[1]
$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)} = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2(i+1)}$$
Until 2013, there was no theoretically sound justification for using a logarithmic reduction factor[3] other than the fact that it produces a smooth reduction. But Wang et al. (2013)[2] gave a theoretical guarantee for using the logarithmic reduction factor in Normalized DCG (NDCG). The authors show that for every pair of substantially different ranking functions, the NDCG can decide which one is better in a consistent manner.
An alternative formulation of DCG[4] places stronger emphasis on retrieving relevant documents:
$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$
The latter formula is commonly used in industrial applications including major web search companies[5] and data science competition platforms such as Kaggle.[6]
These two formulations of DCG are the same when the relevance values of documents are binary, $rel_i \in \{0, 1\}$.[3]: 320
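The equivalence for binary relevance can be illustrated with a short Python sketch. It assumes the two gain functions discussed in this section, the linear gain $rel_i$ and the exponential gain $2^{rel_i} - 1$, both with a $\log_2(i+1)$ discount. The function names and sample grades are hypothetical.

```python
import math


def dcg_linear(relevances, p):
    """DCG_p with linear gain: sum of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))


def dcg_exponential(relevances, p):
    """DCG_p with exponential gain: sum of (2**rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))


binary = [1, 0, 1, 1, 0]   # hypothetical binary judgments
graded = [3, 2, 3, 0, 1]   # hypothetical graded judgments

# For grades in {0, 1}: 2**0 - 1 == 0 and 2**1 - 1 == 1, so the gains coincide.
print(dcg_linear(binary, 5), dcg_exponential(binary, 5))

# For graded judgments the exponential form weights high grades more heavily.
print(dcg_linear(graded, 5), dcg_exponential(graded, 5))
```

With graded judgments the exponential form gives a grade-3 document a gain of 7 rather than 3, which is the "stronger emphasis" on relevant documents.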
Note that Croft et al. (2010) and Burges et al. (2005) present the second DCG with a log of base e, while both versions of DCG above use a log of base 2. Changing the base of the log multiplies every discount by the same constant, so the base affects the value of DCG in both formulations. In NDCG as defined from the formulas above, that constant appears in both DCG and IDCG and cancels in the ratio, so the base of the log does not change the value of NDCG.
Convex and smooth approximations to DCG have also been developed, for use as an objective function in gradient based learning methods.[7]
Normalized DCG
Search result lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of $p$ should be normalized across queries. This is done by sorting all relevant documents in the corpus by their relative relevance, producing the maximum possible DCG through position $p$, also called Ideal DCG (IDCG) through that position. For a query, the normalized discounted cumulative gain, or nDCG, is computed as:
$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$
where IDCG is the ideal discounted cumulative gain,
$$\mathrm{IDCG}_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$
and $REL_p$ represents the list of relevant documents (ordered by their relevance) in the corpus up to position $p$. The exponential gain is shown here; whichever gain function is chosen, it must be the same in $\mathrm{DCG}_p$ and $\mathrm{IDCG}_p$.
The nDCG values for all queries can be averaged to obtain a measure of the average performance of a search engine's ranking algorithm. Note that in a perfect ranking algorithm, the $\mathrm{DCG}_p$ will be the same as the $\mathrm{IDCG}_p$, producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.
The main difficulty encountered in using nDCG is the unavailability of an ideal ordering of results when only partial relevance feedback is available.
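Averaging nDCG over queries can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the query identifiers, grades, and helper names (`dcg`, `ndcg`) are hypothetical, the linear-gain DCG with a base-2 discount is used, and a query with no relevant judgments is assigned an nDCG of 0 by convention.

```python
import math


def dcg(relevances, p):
    """Linear-gain DCG_p with a log2(i + 1) discount."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))


def ndcg(ranked_relevances, judged_relevances, p):
    """nDCG_p, where the ideal list is built from all judged documents."""
    ideal = sorted(judged_relevances, reverse=True)
    idcg = dcg(ideal, p)
    # Assumption: return 0.0 when no judged document has a positive grade.
    return dcg(ranked_relevances, p) / idcg if idcg > 0 else 0.0


# Hypothetical queries: (grades in ranked order, all known grades in the corpus).
queries = {
    "q1": ([3, 2, 0], [3, 2, 0]),      # ranking already matches the ideal order
    "q2": ([0, 1, 2], [0, 1, 2, 2]),   # poor order, one relevant document missed
}

scores = {qid: ndcg(ranked, judged, p=3)
          for qid, (ranked, judged) in queries.items()}
mean_ndcg = sum(scores.values()) / len(scores)
print(scores, mean_ndcg)
```

Because each per-query score lies between 0 and 1, the mean is not dominated by queries that happen to have many relevant documents.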
Example
Presented with a list of documents in response to a search query, an experiment participant is asked to judge the relevance of each document to the query. Each document is to be judged on a scale of 0-3 with 0 meaning not relevant, 3 meaning highly relevant, and 1 and 2 meaning "somewhere in between". For the documents ordered by the ranking algorithm as
$$D_1, D_2, D_3, D_4, D_5, D_6$$
the user provides the following relevance scores:
$$3, 2, 3, 0, 1, 2$$
That is: document 1 has a relevance of 3, document 2 has a relevance of 2, etc. The Cumulative Gain of this search result listing is:
$$\mathrm{CG}_6 = \sum_{i=1}^{6} rel_i = 3 + 2 + 3 + 0 + 1 + 2 = 11$$
Changing the order of any two documents does not affect the CG measure. If $D_3$ and $D_4$ are switched, the CG remains the same, 11. DCG is used to emphasize highly relevant documents appearing early in the result list. Using the logarithmic scale for reduction, the DCG for each result in order is:
| $i$ | $rel_i$ | $\log_2(i+1)$ | $rel_i / \log_2(i+1)$ |
|---|---|---|---|
| 1 | 3 | 1 | 3 |
| 2 | 2 | 1.585 | 1.262 |
| 3 | 3 | 2 | 1.5 |
| 4 | 0 | 2.322 | 0 |
| 5 | 1 | 2.585 | 0.387 |
| 6 | 2 | 2.807 | 0.712 |
So the $\mathrm{DCG}_6$ of this ranking is:
$$\mathrm{DCG}_6 = \sum_{i=1}^{6} \frac{rel_i}{\log_2(i+1)} = 3 + 1.262 + 1.5 + 0 + 0.387 + 0.712 = 6.861$$
Now a switch of $D_3$ and $D_4$ results in a reduced DCG because a less relevant document is placed higher in the ranking; that is, a more relevant document is discounted more by being placed in a lower rank.
The performance of this query cannot be compared with that of another query in this form, since the other query may have more results, resulting in a larger overall DCG which may not necessarily be better. In order to compare, the DCG values must be normalized.
To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of all known relevance judgments. In addition to the six from this experiment, suppose we also know there is a document with relevance grade 3 to the same query and a document with relevance grade 2 to that query. Then the ideal ordering is:
$$3, 3, 3, 2, 2, 2, 1, 0$$
The ideal ranking is cut again to length 6 to match the depth of analysis of the ranking:
$$3, 3, 3, 2, 2, 2$$
The DCG of this ideal ordering, or IDCG (Ideal DCG), is computed to rank 6:
$$\mathrm{IDCG}_6 = 8.740$$
And so the nDCG for this query is given as:
$$\mathrm{nDCG}_6 = \frac{\mathrm{DCG}_6}{\mathrm{IDCG}_6} = \frac{6.861}{8.740} = 0.785$$
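The worked example can be reproduced with a short Python sketch. It assumes the linear-gain formulation with a base-2 discount, the six judged grades 3, 2, 3, 0, 1, 2 in ranked order, and the two additional judged documents with grades 3 and 2 mentioned above. The helper name `dcg` is hypothetical.

```python
import math


def dcg(relevances, p):
    """Linear-gain DCG_p with a log2(i + 1) discount."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:p], start=1))


ranked = [3, 2, 3, 0, 1, 2]             # grades of D1..D6 in ranked order
judged = ranked + [3, 2]                # plus two judged documents not retrieved
ideal = sorted(judged, reverse=True)    # monotonically decreasing ideal ordering

dcg_6 = dcg(ranked, 6)
idcg_6 = dcg(ideal, 6)                  # the ideal list is cut to the same depth
ndcg_6 = dcg_6 / idcg_6

# Exchanging D3 and D4 moves a grade-0 document above a grade-3 document.
dcg_6_swapped = dcg([3, 2, 0, 3, 1, 2], 6)

print(f"DCG_6={dcg_6:.3f} IDCG_6={idcg_6:.3f} nDCG_6={ndcg_6:.3f}")
print(f"DCG_6 after swapping D3 and D4: {dcg_6_swapped:.3f}")
```

The values agree with the example to three decimal places (about 6.861, 8.740, and 0.785), and the swapped ranking scores lower.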
Limitations
- Normalized DCG does not penalize the presence of bad documents in the result. For example, if a query returns two results with scores 1,1,1 and 1,1,1,0 respectively, both would be considered equally good, even if the latter contains a bad document. For the ranking judgments Excellent, Fair, Bad one might use numerical scores 1,0,-1 instead of 2,1,0. This would cause the score to be lowered if bad results are returned, prioritizing the precision of the results over the recall; however, this approach can result in an overall negative score.
- Normalized DCG does not penalize missing documents in the result. For example, if a query returns two results with scores 1,1,1 and 1,1,1,1,1 respectively, both would be considered equally good, assuming ideal DCG is computed to rank 3 for the former and rank 5 for the latter. One way to take into account this limitation is to enforce a fixed set size for the result set and use minimum scores for the missing documents. In the previous example, we would use the scores 1,1,1,0,0 and 1,1,1,1,1 and quote nDCG as nDCG@5.
- Normalized DCG may not be suitable to measure the performance of queries that may have several equally good results. This is especially true when this metric is limited to only the first few results, as it is often done in practice. For example, for queries such as "restaurants" nDCG@1 accounts for only the first result. If one result set contains only 1 restaurant from the nearby area while the other contains 5, both would end up having the same score even though the latter is more comprehensive.
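The first two limitations can be demonstrated with a small Python sketch. It is illustrative only: the helper names (`dcg`, `ndcg`, `pad`) are hypothetical, linear gain with a base-2 discount is assumed, and the fixed-size variant additionally assumes that five relevant documents are known to exist for the query, so that the ideal list is 1,1,1,1,1.

```python
import math


def dcg(relevances):
    """Linear-gain DCG over the whole list with a log2(i + 1) discount."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))


def ndcg(relevances, ideal_relevances=None):
    """nDCG over the whole list; by default the ideal is the list re-sorted."""
    source = relevances if ideal_relevances is None else ideal_relevances
    ideal = sorted(source, reverse=True)[:len(relevances)]
    idcg = dcg(ideal)
    return dcg(relevances) / idcg if idcg > 0 else 0.0


def pad(relevances, size, fill=0):
    """Truncate or pad a result list to a fixed size with a minimum score."""
    return list(relevances[:size]) + [fill] * max(0, size - len(relevances))


short_list = [1, 1, 1]
with_bad_doc = [1, 1, 1, 0]
longer_list = [1, 1, 1, 1, 1]

# Each list normalized against its own ideal ordering: all three score 1.0.
own_ideal_scores = [ndcg(short_list), ndcg(with_bad_doc), ndcg(longer_list)]

# Fixed set size of 5, assuming five relevant documents are known to exist.
ndcg5_short = ndcg(pad(short_list, 5), ideal_relevances=longer_list)
ndcg5_long = ndcg(pad(longer_list, 5), ideal_relevances=longer_list)

print(own_ideal_scores, ndcg5_short, ndcg5_long)
```

Under the fixed-size convention the shorter list is no longer rated as highly as the complete one, which is the effect described in the second bullet.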
References
- [1] Kalervo Järvelin, Jaana Kekäläinen, "Cumulated gain-based evaluation of IR techniques". ACM Transactions on Information Systems 20(4), 422–446 (2002)
- [2] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, Tie-Yan Liu. 2013. A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG) Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).
- [3] B. Croft; D. Metzler; T. Strohman (2010). Search Engines: Information Retrieval in Practice. Addison Wesley.
- [4] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (ICML '05). ACM, New York, NY, USA, 89-96. DOI=10.1145/1102351.1102363 http://doi.acm.org/10.1145/1102351.1102363
- [5] "Introduction to Information Retrieval - Evaluation" (PDF). Stanford University. 21 April 2013. Retrieved 23 March 2014.
- [6] "Normalized Discounted Cumulative Gain". Archived from the original on 23 March 2014. Retrieved 23 March 2014.
- [7] D. Cossock and T. Zhang, "Statistical Analysis of Bayes Optimal Subset Ranking," in IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 5140-5154, Nov. 2008, doi: 10.1109/TIT.2008.929939.
Discounted cumulative gain
Introduction
Definition and Purpose
Discounted cumulative gain (DCG) is a widely used metric for evaluating the quality of ranked lists in information retrieval and recommendation systems, incorporating both the relevance of individual items and their positions within the ranking.[4] Unlike traditional metrics that treat relevance in binary terms, DCG accounts for graded relevance levels, allowing for a more nuanced assessment of how well a system prioritizes useful content. This approach reflects the practical reality that users typically examine only the top portion of search results or recommendations, making position-sensitive evaluation essential for gauging real-world performance.[4]
The primary purpose of DCG is to reward ranking algorithms that place highly relevant items near the top of the list, thereby simulating user behavior where early exposure to pertinent content enhances satisfaction and utility. By assigning higher weights to top positions, DCG penalizes systems that bury valuable items deeper in the ranking, encouraging optimizations that align with user preferences for concise and effective results.[4] Graded relevance in DCG is typically assessed on scales such as 0 (irrelevant) to 3 (highly relevant), enabling evaluators to capture varying degrees of document or item usefulness without relying solely on binary judgments, though scales can be adapted as needed.
In the context of offline evaluation, DCG serves as a key tool for comparing ranking algorithms against ground-truth relevance judgments, often derived from human assessments or test collections like those in TREC evaluations.[4] It builds on the concept of cumulative gain, which aggregates relevance scores without position weighting, as a simpler baseline for understanding total relevance coverage. Additionally, normalized variants of DCG facilitate comparability across queries with differing relevance distributions by scaling scores to a [0,1] range.[4]
Historical Background
The concept of cumulative gain emerged in information retrieval research as a response to the need for metrics that better account for graded relevance and prioritize highly relevant documents, rather than relying solely on binary judgments. In 2000, Kalervo Järvelin and Jaana Kekäläinen introduced cumulative gain (CG) and discounted cumulative gain (DCG) as position-insensitive and position-sensitive measures, respectively, in their SIGIR paper, "IR Evaluation Methods for Retrieving Highly Relevant Documents." These metrics aggregated relevance scores across ranked results to assess the overall utility gained by users examining retrieval outputs up to a specified depth, addressing shortcomings in traditional precision-at-k metrics that undervalue the placement of top-tier results.[4]
Järvelin and Kekäläinen extended the framework in 2002 with a more detailed formalization and empirical validation in their ACM Transactions on Information Systems paper, "Cumulated Gain-Based Evaluation of IR Techniques," which introduced normalized DCG (nDCG) to scale scores relative to an ideal ranking, facilitating cross-query comparisons regardless of the total number of relevant documents. The work used graded relevance scales (e.g., 0-3) and validated the metrics on TREC-7 ad hoc data with 20 queries, showing superior discrimination of IR system performance compared to earlier measures.[1]
DCG's practical adoption accelerated shortly after its proposal, with integration into Text REtrieval Conference (TREC) evaluations beginning in the Web Track of 2001 and continuing in subsequent years, where it supported assessments of large-scale web retrieval tasks emphasizing navigational and ad hoc search.[1] As noted in the original formulation, this early use in TREC highlighted DCG's robustness for real-world benchmarks involving diverse document collections and user-oriented relevance grading.
Over the subsequent decades, DCG and nDCG solidified as cornerstone metrics in IR, shaping evaluation standards at major venues like the ACM SIGIR Conference; for example, SIGIR 2024 proceedings routinely employ nDCG to quantify ranking effectiveness in neural retrieval models, underscoring its enduring influence as of 2025.[5]
Core Concepts
Relevance Assessment
Relevance in information retrieval (IR) refers to the degree to which a retrieved document or item satisfies a user's information need, serving as the foundational input for evaluation metrics like discounted cumulative gain (DCG).[1] Traditionally, relevance is assessed on a binary scale, classifying items as either relevant (1) or irrelevant (0), but this approach overlooks nuances in usefulness.[6] Graded relevance scales address this limitation by assigning integer scores to reflect varying levels of utility, such as 0 for irrelevant, 1 for partially relevant, 2 for relevant, and 3 for highly relevant, with some schemes extending up to 4 or 5 for even finer distinctions.[1] A prominent example of a graded scale is the one used in the Text REtrieval Conference (TREC) organized by the National Institute of Standards and Technology (NIST), which typically employs a 0-3 scale: 0 (irrelevant), 1 (relevant), 2 (highly relevant), and 3 (perfect).[7] In advanced setups, continuous scores may be applied, allowing for probabilistic or nuanced judgments beyond discrete grades, though integer scales remain standard for practicality.[8] These scales enable IR systems to be evaluated based on the quality of ranked results, prioritizing highly relevant items over merely relevant ones.[1]
Human annotation for relevance assessment involves trained assessors following structured guidelines, such as those provided by NIST for TREC evaluations, where topics (queries) are defined with detailed descriptions of the information need, and assessors judge document relevance against these criteria.[7] To ensure reliability, inter-assessor agreement is measured using Cohen's kappa statistic, which accounts for chance agreement; values above 0.8 indicate good agreement, 0.67-0.8 fair, and below 0.67 poor, with TREC judgments often achieving fair to good levels through assessor training and adjudication of disagreements.[6][9]
Despite these efforts, relevance assessment faces significant challenges, including inherent subjectivity, as judgments can vary based on individual assessor backgrounds, leading to inconsistencies even with guidelines.[10] The process is also costly and time-intensive, requiring manual review of large document pools, which limits scalability for comprehensive evaluations.[11] Additionally, relevance is multi-faceted, encompassing topical alignment (how well the content matches the query) versus user-specific factors (such as context or preferences), complicating uniform assessments across diverse scenarios.[12]
In the context of DCG, the relevance grade assigned to the item at position $i$, denoted $rel_i$, directly feeds into the metric as the core score, which is then aggregated in the cumulative gain summation to reflect overall ranking quality.[1]
Cumulative Gain
Cumulative gain (CG) serves as a foundational metric in information retrieval evaluation, measuring the total relevance accumulated from a ranked list of documents without considering their positions. It sums the relevance grades assigned to documents up to a specified cutoff position $p$, treating all retrieved items equally regardless of rank. This approach provides a straightforward assessment of an information retrieval (IR) system's ability to deliver relevant content overall.[1]
The formula for cumulative gain is given by:
$$\mathrm{CG}_p = \sum_{i=1}^{p} rel_i$$
where $rel_i$ denotes the relevance grade of the document at position $i$, typically on a multi-level scale such as 0 (irrelevant) to 3 (highly relevant). This direct summation extends traditional binary metrics like precision and recall, which distinguish only relevant from non-relevant documents, by accommodating graded assessments that better reflect user perceptions of document utility. As a result, CG rewards systems for retrieving highly relevant documents in aggregate, regardless of where in the list they appear.[1]
In practice, CG functions as a baseline for non-discounted evaluation, particularly useful when assessing complete result sets where position bias is not a primary concern. Its simplicity and intuitiveness make it ideal for handling multi-level relevance scores, enabling more nuanced comparisons across IR techniques. For instance, in laboratory settings like those using TREC datasets, CG facilitates statistical testing of effectiveness differences between systems. However, by ignoring positional effects, CG may not fully capture user effort in examining results, motivating the development of position-sensitive variants.[1]
Mathematical Formulation
Discounted Cumulative Gain
Discounted cumulative gain (DCG) modifies the basic cumulative gain by applying a position-based discount factor, which reduces the contribution of relevant items appearing lower in the ranked list to better reflect user behavior in examining search results. This discounting accounts for the observation that users are less likely to view documents beyond the top few positions, thus emphasizing the importance of accurate ranking in the initial results. The metric was introduced to address limitations in traditional measures like precision and recall, which treat all relevant documents equally regardless of position.[1]
The standard formula for DCG up to position $p$ is:
$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}$$
where $rel_i$ denotes the graded relevance score of the item at rank $i$, typically an integer from 0 to a maximum relevance level (e.g., 0 for irrelevant, 3 for highly relevant). The use of base-2 logarithm provides a smooth, gradually increasing discount that mimics human attention decay.[1]
The logarithmic discount with base 2 is chosen to model diminishing returns in user examination, where the effective weight for position 1 is $1/\log_2(2) = 1$, for position 2 is approximately $1/\log_2(3) \approx 0.63$, and for position 4 is approximately $1/\log_2(5) \approx 0.43$, penalizing lower placements progressively. The discount comes from dividing each relevance score by a denominator that grows logarithmically with position, thereby de-emphasizing contributions from deeper ranks while maintaining additivity.[1]
An alternative formulation, which can also be applied to non-integer relevance scores such as continuous ratings in recommendation systems, replaces the linear relevance term with an exponential gain:
$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$
This exponential mapping ensures that higher relevance values contribute disproportionately more, aligning with the intuition that highly relevant items provide exponentially greater utility.[2]
The cutoff $p$ represents the depth of the ranking considered, typically set to 10 or 20 in top-k evaluations to focus on the most visible portion of results, as seen in benchmarks like TREC where NDCG@10 is standard. DCG values are often normalized against an ideal ranking for query-specific comparability, though the raw form captures absolute gain with discounting.[1]
Normalized Discounted Cumulative Gain
The Normalized Discounted Cumulative Gain (nDCG) addresses a key limitation of the raw DCG by scaling scores relative to an ideal ranking, producing values between 0 and 1 that are comparable across queries regardless of their relevance distributions.[1] This normalization is achieved by dividing the DCG of a given ranking by the DCG of the optimal possible ranking for the same set of documents.[1] A score of 1 denotes a perfect ranking that fully matches the ideal order, while scores closer to 0 indicate poorer performance in prioritizing relevant items.[1]
The formula for nDCG at cutoff position $p$ is given by
$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$
where $\mathrm{DCG}_p$ is the discounted cumulative gain of the evaluated ranking up to position $p$, and $\mathrm{IDCG}_p$ is the discounted cumulative gain of the ideal ranking up to the same position.[1] The ideal DCG ($\mathrm{IDCG}_p$) is computed by first reordering the document relevance scores in descending order and then applying the DCG summation formula to this optimal sequence.[1] This approach ensures that $\mathrm{IDCG}_p$ represents the maximum achievable gain for the query's relevance profile. When relevance scores include ties, the ideal ranking for $\mathrm{IDCG}_p$ sorts documents in descending order of relevance, using a stable sort to maintain consistent ordering among items with identical scores and avoid arbitrary variations in normalization.[13]
One primary benefit of nDCG is its ability to facilitate meaningful averages and comparisons across diverse queries, as varying relevance depths or grades no longer skew absolute scores.[1] This property has made nDCG a standard metric in information retrieval evaluations, such as those in the Text REtrieval Conference (TREC), where it supports robust statistical analysis of ranking systems.[1]
Computation and Examples
Step-by-Step Calculation
To compute Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (nDCG) for a ranked list of items, begin by assigning graded relevance scores to each item in the list, typically using integer values such as 0 for irrelevant, 1 for marginally relevant, 2 for relevant, and 3 for highly relevant, based on assessor judgments.[14] These scores, denoted $rel_i$ for the item at position $i$, form the basis for all subsequent calculations. Next, if cumulative gain (CG) is required as an intermediate step (as outlined in prior sections), compute it by summing the relevance scores in ranked order up to the desired position, though DCG is usually computed directly without it. Then, apply the DCG formula position-by-position from the top of the ranked list to obtain DCG at a cutoff $p$ (e.g., the top 10 results), ignoring positions beyond $p$ to focus on user-visible portions of the list; this yields $\mathrm{DCG}_p = \sum_{i=1}^{p} rel_i / \log_b(i+1)$, where $b$ is the base of the logarithm (commonly 2).[14] To normalize, first determine the ideal DCG (IDCG) by sorting the relevance scores in descending order to simulate a perfect ranking, then computing DCG on this ideal list up to the same cutoff $p$. Finally, calculate nDCG as $\mathrm{nDCG}_p = \mathrm{DCG}_p / \mathrm{IDCG}_p$, which scales the score between 0 and 1 for comparability across queries.[14]
In software implementations, sorting the relevance scores for IDCG requires $O(n \log n)$ time, where $n$ is the list length, making it efficient for typical ranking tasks; this is handled in libraries such as scikit-learn's dcg_score and ndcg_score functions, which support sample weights and ignore scores beyond the cutoff, or RankLib, which integrates DCG evaluation in its ranking algorithms.[15]
Edge cases include empty lists, where DCG is defined as 0 since no items contribute relevance; lists with all irrelevant items (all $rel_i = 0$), where IDCG is also 0 and nDCG is conventionally set to 0; and perfect rankings matching the ideal order, resulting in nDCG of 1.[16][13]
For numerical precision, relevance grades are typically integers to reflect discrete judgment scales, while the logarithmic discounts and their sum are computed in floating-point arithmetic, which is accurate enough in practice even for long lists.[15]
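The steps and edge cases above can be put together in a short Python sketch. This is one possible arrangement rather than the implementation used by any particular library: the function names `dcg_at` and `ndcg_at` are hypothetical, the ideal list is built from the grades of the ranked list itself, and returning 0 when IDCG is 0 is an assumed convention.

```python
import math


def dcg_at(relevances, p, base=2.0):
    """DCG at cutoff p: sum of rel_i / log_base(i + 1) over the top p items."""
    return sum(rel / math.log(i + 1, base)
               for i, rel in enumerate(relevances[:p], start=1))


def ndcg_at(relevances, p, base=2.0):
    """nDCG at cutoff p, normalizing by the DCG of the descending-sorted grades."""
    ideal = sorted(relevances, reverse=True)   # O(n log n) ideal ordering
    idcg = dcg_at(ideal, p, base)
    if idcg == 0:
        # Empty list or all grades zero: assumed convention is nDCG = 0.
        return 0.0
    return dcg_at(relevances, p, base) / idcg


# Hypothetical graded judgments for one ranked list.
grades = [2, 3, 0, 1, 3]

print(ndcg_at(grades, p=5))          # imperfect ranking, strictly between 0 and 1
print(ndcg_at([3, 2, 1], p=10))      # already ideal, so 1.0
print(ndcg_at([0, 0, 0], p=10))      # all irrelevant, so 0.0
print(ndcg_at([], p=10))             # empty list, so 0.0
```

Python's built-in sort is stable, so items with tied grades keep their relative order in the ideal list. Tied grades contribute the same gain whichever way they are ordered, so ties do not change IDCG.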
Illustrative Example
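Consider a hypothetical search query retrieving five documents, ranked in order with assigned relevance grades of 3 (highly relevant), 2 (relevant), 3 (highly relevant), 0 (irrelevant), and 1 (marginally relevant). The cumulative gain (CG) at position 5, which sums the relevance grades without discounting, is calculated as $\mathrm{CG}_5 = 3 + 2 + 3 + 0 + 1 = 9$.
To compute the discounted cumulative gain (DCG) at position 5, apply the logarithmic discount using the base-2 logarithm of (position + 1):
- Position 1: $3 / \log_2(2) = 3 / 1 = 3$
- Position 2: $2 / \log_2(3) \approx 2 / 1.585 \approx 1.26$
- Position 3: $3 / \log_2(4) = 3 / 2 = 1.50$
- Position 4: $0 / \log_2(5) \approx 0 / 2.322 = 0$
- Position 5: $1 / \log_2(6) \approx 1 / 2.585 \approx 0.39$
Summing these contributions gives $\mathrm{DCG}_5 \approx 6.15$. For the ideal ranking, the same grades are sorted in descending order (3, 3, 2, 1, 0) and discounted in the same way:
- Position 1: $3 / 1 = 3$
- Position 2: $3 / 1.585 \approx 1.89$
- Position 3: $2 / 2 = 1$
- Position 4: $1 / 2.322 \approx 0.43$
- Position 5: $0 / 2.585 = 0$
Summing these gives $\mathrm{IDCG}_5 \approx 6.32$, so $\mathrm{nDCG}_5 = 6.15 / 6.32 \approx 0.97$. The two calculations are summarized below, first for the evaluated ranking and then for the ideal ranking.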
| Position | Ranked Relevance | Discount Factor ($\log_2(i+1)$) | Contribution to DCG |
|---|---|---|---|
| 1 | 3 | 1.000 | 3.00 |
| 2 | 2 | 1.585 | 1.26 |
| 3 | 3 | 2.000 | 1.50 |
| 4 | 0 | 2.322 | 0.00 |
| 5 | 1 | 2.585 | 0.39 |
| Total | | | 6.15 |

| Position | Ideal Relevance | Discount Factor ($\log_2(i+1)$) | Contribution to IDCG |
|---|---|---|---|
| 1 | 3 | 1.000 | 3.00 |
| 2 | 3 | 1.585 | 1.89 |
| 3 | 2 | 2.000 | 1.00 |
| 4 | 1 | 2.322 | 0.43 |
| 5 | 0 | 2.585 | 0.00 |
| Total | | | 6.32 |
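The per-position contributions in the two tables can be recomputed with a brief Python sketch. It assumes the linear-gain formula with a base-2 discount and the five grades 3, 2, 3, 0, 1 from the example. The helper name `contributions` is hypothetical.

```python
import math


def contributions(grades):
    """Per-position terms rel_i / log2(i + 1) for a list of grades."""
    return [rel / math.log2(pos + 1) for pos, rel in enumerate(grades, start=1)]


ranked = [3, 2, 3, 0, 1]                        # grades in ranked order
ideal = sorted(ranked, reverse=True)            # 3, 3, 2, 1, 0

dcg_terms = contributions(ranked)
idcg_terms = contributions(ideal)
dcg_5 = sum(dcg_terms)
idcg_5 = sum(idcg_terms)
ndcg_5 = dcg_5 / idcg_5

# Print one row per position: discount factor, DCG term, IDCG term.
for pos, (d, ideal_d) in enumerate(zip(dcg_terms, idcg_terms), start=1):
    print(f"{pos}  {math.log2(pos + 1):.3f}  {d:.2f}  {ideal_d:.2f}")
print(f"DCG@5={dcg_5:.2f}  IDCG@5={idcg_5:.2f}  nDCG@5={ndcg_5:.2f}")
```

The totals agree with the tables to two decimal places, and the resulting ratio is roughly 0.97, so this ranking is close to ideal: its only deviation is the grade-2 document placed ahead of the second grade-3 document, with the irrelevant document ahead of the grade-1 one.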
