Aggregate function
from Wikipedia

In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic.

(Figure 1) Entity relationship diagram representation of aggregation.

Common aggregate functions include:

  • Average (i.e., arithmetic mean)
  • Count
  • Maximum
  • Median
  • Minimum
  • Mode
  • Range
  • Sum

Others include:

  • Nanmean (mean ignoring NaN values, also known as "nil" or "null")
  • Stddev

Formally, an aggregate function takes as input a set, a multiset (bag), or a list from some input domain I and outputs an element of an output domain O.[1] The input and output domains may be the same, such as for SUM, or may be different, such as for COUNT.

Aggregate functions occur commonly in numerous programming languages, in spreadsheets, and in relational algebra.

The listagg function, as defined in the SQL:2016 standard,[2] aggregates data from multiple rows into a single concatenated string.

In the entity relationship diagram, aggregation is represented as seen in Figure 1 with a rectangle around the relationship and its entities to indicate that it is being treated as an aggregate entity.[3]

Decomposable aggregate functions


Aggregate functions present a bottleneck, because they potentially require having all input values at once. In distributed computing, it is desirable to divide such computations into smaller pieces, and distribute the work, usually computing in parallel, via a divide and conquer algorithm.

Some aggregate functions can be computed by computing the aggregate for subsets, and then aggregating these aggregates; examples include COUNT, MAX, MIN, and SUM. In other cases the aggregate can be computed by computing auxiliary numbers for subsets, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In other cases the aggregate cannot be computed without analyzing the entire set at once, though in some cases approximations can be distributed; examples include DISTINCT COUNT (Count-distinct problem), MEDIAN, and MODE.

Such functions are called decomposable aggregation functions[4] or decomposable aggregate functions. The simplest may be referred to as self-decomposable aggregation functions, which are defined as those functions $f$ such that there is a merge operator $\diamond$ such that

$f(X \uplus Y) = f(X) \diamond f(Y),$

where $\uplus$ is the union of multisets (see monoid homomorphism).

For example, SUM:

$\mathrm{SUM}(\{x\}) = x$, for a singleton;
$\mathrm{SUM}(X \uplus Y) = \mathrm{SUM}(X) + \mathrm{SUM}(Y)$, meaning that merge $\diamond$ is simply addition.

COUNT:

$\mathrm{COUNT}(\{x\}) = 1$,
$\mathrm{COUNT}(X \uplus Y) = \mathrm{COUNT}(X) + \mathrm{COUNT}(Y)$.

MAX:

$\mathrm{MAX}(\{x\}) = x$,
$\mathrm{MAX}(X \uplus Y) = \max(\mathrm{MAX}(X), \mathrm{MAX}(Y))$.

MIN:

$\mathrm{MIN}(\{x\}) = x$,
$\mathrm{MIN}(X \uplus Y) = \min(\mathrm{MIN}(X), \mathrm{MIN}(Y))$.

Note that self-decomposable aggregation functions can be combined (formally, taking the product) by applying them separately, so for instance one can compute both the SUM and COUNT at the same time, by tracking two numbers.

More generally, one can define a decomposable aggregation function $f$ as one that can be expressed as the composition of a final function $g$ and a self-decomposable aggregation function $h$, that is, $f = g \circ h$. For example, AVERAGE = SUM/COUNT and RANGE = MAX − MIN.
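To make the composition concrete, here is a minimal Python sketch (function names are illustrative, not from any standard) of AVERAGE as a final function g applied to the self-decomposable pair h = (SUM, COUNT):

from functools import reduce

def h_single(x):
    # Auxiliary value for a singleton: (sum, count).
    return (x, 1)

def merge(a, b):
    # Self-decomposable merge: combine sums and counts componentwise.
    return (a[0] + b[0], a[1] + b[1])

def g(aux):
    # Final function: divide the sum by the count to get the average.
    s, n = aux
    return s / n

parts = [[1, 2, 3], [4, 5]]  # two disjoint subsets of the data
partials = [reduce(merge, map(h_single, p)) for p in parts]
print(g(reduce(merge, partials)))  # 3.0, the average of 1..5

Each partition contributes only a fixed-size pair (sum, count), which is exactly the auxiliary-number strategy described above.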

In the MapReduce framework, these steps are known as InitialReduce (value on individual record/singleton set), Combine (binary merge on two aggregations), and FinalReduce (final function on auxiliary values),[5] and moving decomposable aggregation before the Shuffle phase is known as an InitialReduce step.[6]

Decomposable aggregation functions are important in online analytical processing (OLAP), as they allow aggregation queries to be computed on the pre-computed results in the OLAP cube, rather than on the base data.[7] For example, it is easy to support COUNT, MAX, MIN, and SUM in OLAP, since these can be computed for each cell of the OLAP cube and then summarized ("rolled up"), but it is difficult to support MEDIAN, as that must be computed for every view separately.

Other decomposable aggregate functions


In order to calculate the average and standard deviation from aggregate data, it is necessary to have available for each group: the total of values ($\Sigma x_i = \mathrm{SUM}(x)$), the number of values ($N = \mathrm{COUNT}(x)$), and the total of squares of the values ($\Sigma x_i^2 = \mathrm{SUM}(x^2)$).[8]

AVG:

$\mathrm{AVG}(X) = \mathrm{SUM}(X) / \mathrm{COUNT}(X)$

or

$\mathrm{AVG}(X \uplus Y) = \dfrac{\mathrm{SUM}(X) + \mathrm{SUM}(Y)}{\mathrm{COUNT}(X) + \mathrm{COUNT}(Y)}$

or, only if $\mathrm{COUNT}(X) = \mathrm{COUNT}(Y)$:

$\mathrm{AVG}(X \uplus Y) = \dfrac{\mathrm{AVG}(X) + \mathrm{AVG}(Y)}{2}$

SUM($x^2$): the sum of squares of the values is important in order to calculate the standard deviation of groups.

STDDEV: for a finite population with equal probabilities at all points, we have[9]

$\sigma = \sqrt{\frac{1}{N} \sum_{i} (x_i - \bar{x})^2} = \sqrt{\frac{\mathrm{SUM}(x^2)}{\mathrm{COUNT}(x)} - \left(\frac{\mathrm{SUM}(x)}{\mathrm{COUNT}(x)}\right)^2}$

This means that the standard deviation is equal to the square root of the difference between the average of the squares of the values and the square of the average value.
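As a hedged illustration, the following Python sketch (the helper name is hypothetical) recovers a group's average and standard deviation from only the three aggregates named above:

import math

def group_avg_stddev(sum_x, sum_x2, n):
    # AVG from SUM(x) and COUNT(x).
    avg = sum_x / n
    # Population variance: average of squares minus square of the average.
    var = sum_x2 / n - avg * avg
    return avg, math.sqrt(max(var, 0.0))  # clamp tiny negative rounding error

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
avg, sd = group_avg_stddev(sum(xs), sum(x * x for x in xs), len(xs))
print(avg, sd)  # 5.0 2.0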

from Grokipedia
An aggregate function, also known as an aggregation function, is a mathematical operation that processes a set of input values to produce a single representative output value, effectively summarizing or combining the inputs for analysis or decision-making. These functions are foundational in mathematics, statistics, and computing, where they enable the reduction of complex datasets into meaningful scalars.

In mathematics, aggregate functions are rigorously defined with properties like monotonicity (non-decreasing behavior), boundary conditions (mapping extremes to extremes, such as all zeros to zero), and often idempotence (reproducing inputs when applied to identical values), making them suitable for applications in fuzzy logic, decision-making, and multi-criteria analysis. Common examples include the arithmetic mean, which computes the average of a collection of numbers, the minimum or maximum functions for selecting extremes, and more advanced forms like t-norms and t-conorms used in probabilistic modeling. These properties ensure that aggregate functions preserve order and provide interpretable results, with extensive study in aggregation theory exploring decomposability and continuity.

In databases and data processing, aggregate functions are implemented in query languages like SQL to perform computations over grouped rows, such as COUNT for tallying rows, SUM for totals, AVG for averages, MIN and MAX for bounds, and specialized variants like STRING_AGG for concatenating text. They classify into distributive (e.g., SUM, computable via partitions), algebraic (e.g., AVG, derived from simpler aggregates), and holistic types (e.g., MEDIAN, requiring full data access), supporting efficient processing in databases, big data frameworks like Hadoop and Spark, and stream analytics. This versatility underscores their role in data summarization, reporting, and scalable computation across modern information systems.

Definition and Fundamentals

Formal Definition

In mathematics, an aggregate function, also known as an aggregation function, is formally defined as a mapping $f: \bigcup_{n=1}^{\infty} [0,1]^n \to [0,1]$ that combines multiple input values into a single representative output, where each component function $f^{(n)}: [0,1]^n \to [0,1]$ is nondecreasing in each argument and satisfies the boundary conditions $f^{(n)}(0, \dots, 0) = 0$ and $f^{(n)}(1, \dots, 1) = 1$. This definition captures the essence of summarizing a finite collection of numerical elements, typically from an interval like $[0,1]$ (representing degrees of certainty or utility in applications such as fuzzy logic), into one value that preserves key informational aspects of the inputs. The domain is the union over all finite dimensions $n$, allowing for variable input sizes, and the function is often extended to act symmetrically on unordered collections via multisets.

This formulation assumes familiarity with basic set theory, including the construction of Cartesian products $[0,1]^n$ for finite $n$ and the notion of monotonicity. For handling collections with potential repetitions (multisets), the notation extends naturally, such as evaluating $f(A \cup B)$ where $\cup$ denotes multiset union, ensuring the aggregation respects the combined counts without regard to order.

Unlike general set functions, which map subsets of a universe to real numbers without imposed structural constraints (e.g., capacities used in Choquet integrals), aggregate functions incorporate monotonicity and boundary preservation to ensure interpretable summarization, facilitating efficient computation in hierarchical or recursive applications. These properties distinguish them as specialized tools for data fusion rather than arbitrary valuations. The concept of aggregate functions traces its origins to the classical study of means and averaging techniques, which laid the groundwork for modern formalizations in aggregation theory.

Basic Examples

To illustrate aggregate functions, consider simple numerical sets where these operations reduce multiple values to a single representative result, aligning with their formal definition as mappings from a collection of numbers to a scalar output. The arithmetic mean, or average, measures central tendency by dividing the sum by the number of elements, given by $\frac{1}{n} \sum_{i=1}^n x_i$. For the set {0.1, 0.2, 0.3}, it yields 0.2. The minimum function identifies the smallest value in the set, denoted $\min\{x_i\}$, while the maximum identifies the largest, denoted $\max\{x_i\}$. For {0.1, 0.2, 0.3}, these are 0.1 and 0.3, respectively. These computations are summarized in the following table:

Set              Mean  Min  Max
{0.1, 0.2, 0.3}  0.2   0.1  0.3
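These values can be reproduced with a few lines of Python using only built-in aggregates:

xs = [0.1, 0.2, 0.3]
print(sum(xs) / len(xs))  # arithmetic mean: 0.2 (up to floating-point rounding)
print(min(xs))            # minimum: 0.1
print(max(xs))            # maximum: 0.3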

Key Properties

Decomposability

Decomposability is a fundamental property of aggregate functions that facilitates their efficient evaluation over large datasets by allowing computation on partitions that can later be combined. An aggregate function $f$ is decomposable if there exists an associative and commutative binary operator $\oplus$ such that, for any two disjoint finite sets $A$ and $B$, $f(A \cup B) = f(A) \oplus f(B)$. This structure often forms a commutative monoid, where $\oplus$ has a neutral (identity) element $e$ satisfying $x \oplus e = e \oplus x = x$ for all $x$, and $f(\{e\}) = e$; for instance, $e = 0$ serves as the neutral element for the sum function, as summing a singleton set containing 0 yields 0.

To demonstrate, consider the sum function $\sum$, a classic decomposable aggregate. For disjoint finite sets $A = \{a_1, \dots, a_m\}$ and $B = \{b_1, \dots, b_n\}$, the sum over the union is

$\sum (A \cup B) = \sum_{i=1}^m a_i + \sum_{j=1}^n b_j = \sum A + \sum B,$

where $\oplus$ is addition. This equality holds by the associativity and commutativity of addition, as the total is simply the additive combination of the partial totals without overlap. Such decomposability underpins scalable aggregation in relational systems, as seen in early formulations for multi-dimensional cubes.

The property's importance lies in its support for parallel and distributed processing: data can be partitioned into subsets, aggregated independently on each (e.g., across processors or network nodes), and the results merged via $\oplus$, enabling stepwise reduction while minimizing data transfer and scaling to massive datasets.

Decomposability extends seamlessly to multisets, where elements may have multiplicities greater than one, by incorporating a multiplicity function that accounts for duplicates in the aggregation. Duplicate-sensitive decomposable functions, such as sum, incorporate these multiplicities into the partial results (e.g., repeated values add multiple times), while duplicate-insensitive ones, like minimum, ignore multiplicity and depend only on the support set.
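The partition-and-merge pattern can be sketched in a few lines of Python, assuming only that the merge operator is associative and commutative (names are illustrative):

from functools import reduce

def aggregate_partitioned(partitions, f, merge):
    # Aggregate each disjoint partition independently (e.g., on
    # separate nodes), then combine the partial results with `merge`.
    partials = [f(part) for part in partitions]
    return reduce(merge, partials)

parts = [[1, 2, 3], [4, 5], [6]]
# Sum: merge is addition, with neutral element 0.
print(aggregate_partitioned(parts, sum, lambda a, b: a + b))  # 21
# Min: duplicate-insensitive, so repeated elements would not change it.
print(aggregate_partitioned(parts, min, min))  # 1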

Monotonicity and Idempotence

In aggregate functions, monotonicity refers to the behavior of the function under subset inclusions of the input data. An aggregation function $f$ is monotonically increasing if, for any two multisets $S_1$ and $S_2$ with $S_1 \subseteq S_2$, it holds that $f(S_1) \leq f(S_2)$; conversely, it is monotonically decreasing if $f(S_1) \geq f(S_2)$ under the same condition. Common examples of monotonically increasing aggregates include the sum, which accumulates values and thus grows or stays the same with added elements, and the maximum, where including more elements cannot decrease the largest value. However, not all aggregates are monotonic; for instance, variance is non-monotonic, as adding elements to a set can either increase or decrease the spread depending on their deviation from the mean: adding a value close to the current mean may reduce variance, while adding an outlier increases it.

Idempotence describes the stability of an aggregate under repeated application to identical inputs. An aggregation function $f$ is idempotent if $f(x, x, \dots, x) = x$ for any number of repetitions $n$. This property holds for the minimum and maximum functions: for the maximum, $\max(x, x, \dots, x) = x$, since the operation selects the largest element regardless of repetition. In contrast, the sum is not idempotent; $\sum(x, x, \dots, x) = nx \neq x$ unless $n = 1$. The mean is idempotent, as $\mathrm{mean}(x, x, \dots, x) = x$. For decomposable aggregates, idempotence corresponds to the combining operator $\oplus$ satisfying $x \oplus x = x$.

Certain aggregate functions exhibit Schur-convexity or Schur-concavity, which capture sensitivity to the inequality in data distributions via the majorization order, where a vector $y$ majorizes $x$ if $y$ is more spread out while preserving the sum. Schur-convex functions, such as variance, increase under majorization, reflecting greater dispersion. The Gini coefficient, an inequality measure often treated as an aggregate of income distributions, is Schur-convex, meaning it rises with increased inequality in the majorized direction.
Function  Monotonic         Idempotent
Sum       Yes (increasing)  No
Max       Yes (increasing)  Yes
Mean      No                Yes
These properties are derived from standard analyses in database query optimization and aggregation theory.
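The table's claims can be spot-checked numerically; the short Python sketch below (illustrative, not a proof) verifies idempotence for max and mean and the non-monotonic behavior of variance:

from statistics import mean, pvariance

# Idempotence: f(x, x, ..., x) = x holds for max and mean, fails for sum.
reps = [2.0] * 5
print(max(reps) == 2.0, mean(reps) == 2.0, sum(reps) == 2.0)  # True True False

# Variance is non-monotonic: a value near the mean shrinks the spread,
# while an outlier widens it.
xs = [1.0, 2.0, 3.0]
print(pvariance(xs))           # 0.666...
print(pvariance(xs + [2.0]))   # decreases to 0.5
print(pvariance(xs + [10.0]))  # increases to 12.5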

Types of Aggregate Functions

Decomposable Aggregate Functions

Decomposable aggregate functions are those that permit computation over disjoint subsets of the data, with partial results combinable to yield the overall aggregate, enabling efficient parallel or distributed evaluation. This property, formalized as the existence of a partial aggregation function $F_1$ and a combination function $F_2$ such that $F(S_1 \cup S_2) = F_2(F_1(S_1), F_1(S_2))$ for disjoint $S_1$ and $S_2$, supports optimizations like eager aggregation in query optimization.

Common examples include the product function, defined as $\prod_{i=1}^n x_i$, which decomposes multiplicatively since the product over a union equals the product of partial products. Similarly, logical operations on booleans, such as AND (conjunction) and OR (disjunction), are decomposable: AND over a union is the AND of partial ANDs, and likewise for OR, treating true/false as 1/0 values. These functions allow incremental updates via simple combination.

Extended means, particularly power means, exemplify decomposability through summation-based partial computations. The power mean of order $p$ is given by

$m_p(\mathbf{x}) = \left( \frac{1}{n} \sum_{i=1}^n x_i^p \right)^{1/p},$

for $p \neq 0$ and positive $x_i$, which decomposes because partial power sums can be aggregated additively before normalization and rooting. As quasi-arithmetic means satisfying symmetry, continuity, strict monotonicity, idempotency, and decomposability axioms, power means enable replacement of data subsets with their partial aggregates without altering the result.

Decomposable functions categorize into additive types like sum ($\sum x_i$), where partial sums combine via addition; multiplicative types like product, where partial products multiply; and extremal functions like min and max, which serve as limiting cases of power means (as $p \to -\infty$ for min and $p \to \infty$ for max) and decompose via selection of partial minima or maxima. These categories facilitate partition-tolerant computation in distributed systems.

Incremental aggregation algorithms exploit decomposability for efficiency. For sum, the algorithm initializes a partial accumulator and updates iteratively:

function partial_sum(values):
    partial = 0
    for x in values:
        partial += x
    return partial

function combine_sums(p1, p2):
    return p1 + p2


For product, multiplication replaces addition:

function partial_product(values):
    partial = 1
    for x in values:
        partial *= x
    return partial

function combine_products(p1, p2):
    return p1 * p2


Min decomposes via comparison:

function partial_min(values):
    partial = infinity  # or first value
    for x in values:
        if x < partial:
            partial = x
    return partial

function combine_mins(p1, p2):
    return min(p1, p2)


These patterns extend to power means by maintaining partial power sums and counts for combination. Such algorithms reduce communication in parallel environments by transmitting fixed-size partial states. While many means are decomposable, not all are directly so without adjustment; for instance, the harmonic mean $\mathrm{HM}(\mathbf{x}) = n / \sum_{i=1}^n (1/x_i)$ requires tracking partial sums of reciprocals and subgroup sizes for combination, rather than a single partial value. This adjustment preserves correctness but increases state complexity compared to purely additive or multiplicative cases.
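A sketch of that adjustment for the harmonic mean in Python, tracking the pair (sum of reciprocals, count) as the fixed-size partial state (function names are illustrative):

def partial_hm(values):
    # Partial state for the harmonic mean: (sum of reciprocals, count).
    return (sum(1.0 / x for x in values), len(values))

def combine_hm(p1, p2):
    # Merge two partial states componentwise.
    return (p1[0] + p2[0], p1[1] + p2[1])

def final_hm(state):
    # Final step: HM = n / sum(1/x_i).
    recip_sum, n = state
    return n / recip_sum

state = combine_hm(partial_hm([1.0, 4.0]), partial_hm([4.0]))
print(final_hm(state))  # 3 / (1 + 0.25 + 0.25) = 2.0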

Non-Decomposable Aggregate Functions

Non-decomposable aggregate functions are those for which no combining operator $\oplus$ exists that allows the computation of the aggregate over arbitrary partitions solely from the aggregates computed on each partition, necessitating access to the full dataset or additional auxiliary information. Unlike decomposable functions, these aggregates depend holistically on the entire data distribution, making them challenging for distributed or parallel processing. A classic example is the variance, defined as $\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$, where $\bar{x}$ is the mean; direct merging of subgroup variances requires knowledge of the subgroup means and sizes to avoid inaccuracies. Prominent examples include the median, which selects the middle value after sorting the complete dataset and thus resists partitioning without resorting; the mode, which identifies the most frequent value by requiring a global frequency count; and the standard deviation, $\sigma = \sqrt{\sigma^2}$.
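With that auxiliary information available per subgroup (count, mean, and M2, the sum of squared deviations from the subgroup mean), variances can nevertheless be merged exactly; the Python sketch below follows the standard pairwise combination formula (names are illustrative):

def merge_var_states(n1, mean1, m2_1, n2, mean2, m2_2):
    # Merge two subgroups' (count, mean, M2); variance = M2 / n.
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2

def state(xs):
    # Per-subgroup auxiliary values computed locally.
    m = sum(xs) / len(xs)
    return len(xs), m, sum((x - m) ** 2 for x in xs)

n, mean, m2 = merge_var_states(*state([1.0, 2.0, 3.0]), *state([4.0, 5.0]))
print(m2 / n)  # 2.0, the population variance of [1, 2, 3, 4, 5]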