Kolmogorov complexity

from Wikipedia
This image illustrates part of the Mandelbrot set fractal. Simply storing the 24-bit color of each pixel in this image would require 23 million bytes, but a small computer program can reproduce these 23 MB using the definition of the Mandelbrot set, the corner coordinates of the image and the parameters of the color mapping. Thus, the Kolmogorov complexity of this image is much less than 23 MB in any pragmatic model of computation. PNG's general-purpose image compression only reduces it to 1.6 MB, smaller than the raw data but much larger than the Kolmogorov complexity.

In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963[1][2] and is a generalization of classical information theory.

The notion of Kolmogorov complexity can be used to state and prove impossibility results akin to Cantor's diagonal argument, Gödel's incompleteness theorem, and Turing's halting problem. In particular, no program P computing a lower bound for each text's Kolmogorov complexity can return a value essentially larger than P's own length (see section § Chaitin's incompleteness theorem); hence no single program can compute the exact Kolmogorov complexity for infinitely many texts.

Definition


Intuition


Consider the following two strings of 32 lowercase letters and digits:

abababababababababababababababab, and
4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first string has a short English-language description, namely "write ab 16 times", which consists of 17 characters. The second one has no obvious simple description (using the same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7" which has 38 characters. Hence the operation of writing the first string can be said to have "less complexity" than writing the second.

More formally, the complexity of a string is the length of the shortest possible description of the string in some fixed universal description language (the sensitivity of complexity relative to the choice of description language is discussed below). It can be shown that the Kolmogorov complexity of any string cannot be more than a few bytes larger than the length of the string itself. Strings like the abab example above, whose Kolmogorov complexity is small relative to the string's size, are not considered to be complex.
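Kolmogorov complexity itself is uncomputable, but a general-purpose compressor gives a rough, computable stand-in for this intuition. The sketch below (a heuristic illustration using zlib, not a measurement of K) compares how well the two 32-character strings above compress:

```python
import zlib

# The two 32-character strings from the example above.
repetitive = b"ab" * 16
patternless = b"4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

# Compressed size serves as an upper-bound proxy for description length.
len_rep = len(zlib.compress(repetitive, 9))
len_pat = len(zlib.compress(patternless, 9))

# The repetitive string admits a much shorter description.
assert len_rep < len_pat
```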

The Kolmogorov complexity can be defined for any mathematical object, but for simplicity the scope of this article is restricted to strings. We must first specify a description language for strings. Such a description language can be based on any computer programming language, such as Lisp, Pascal, or Java. If P is a program which outputs a string x, then P is a description of x. The length of the description is just the length of P as a character string, multiplied by the number of bits in a character (e.g., 7 for ASCII).

We could, alternatively, choose an encoding for Turing machines, where an encoding is a function which associates to each Turing Machine M a bitstring <M>. If M is a Turing Machine which, on input w, outputs string x, then the concatenated string <M> w is a description of x. For theoretical analysis, this approach is more suited for constructing detailed formal proofs and is generally preferred in the research literature. In this article, an informal approach is discussed.

Any string s has at least one description. For example, the second string above is output by the pseudo-code:

function GenerateString2()
    return "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

whereas the first string is output by the (much shorter) pseudo-code:

function GenerateString1()
    return "ab" × 16

If a description d(s) of a string s is of minimal length (i.e., using the fewest bits), it is called a minimal description of s, and the length of d(s) (i.e. the number of bits in the minimal description) is the Kolmogorov complexity of s, written K(s). Symbolically,

K(s) = |d(s)|.

The length of the shortest description will depend on the choice of description language; but the effect of changing languages is bounded (a result called the invariance theorem, see below).

Plain Kolmogorov complexity C


There are two definitions of Kolmogorov complexity: plain and prefix-free. The plain complexity is the minimal description length of any program, denoted C(x), while the prefix-free complexity is the minimal description length of any program encoded in a prefix-free code, denoted K(x). The plain complexity is more intuitive, but the prefix-free complexity is easier to study.

By default, all equations hold only up to an additive constant. For example, f(x) = g(x) really means that f(x) = g(x) + O(1), that is, there exists a constant c such that |f(x) − g(x)| ≤ c for all x.

Let U : {0,1}* → {0,1}* be a computable function mapping finite binary strings to binary strings. It is a universal function if, and only if, for any computable f, we can encode the function in a "program" s_f, such that for all x, U(s_f x) = f(x). We can think of U as a program interpreter, which takes in an initial segment describing the program, followed by the data that the program should process. The plain complexity of a string x is then C(x) := min{ |p| : U(p) = x }, the length of a shortest program that makes U output x.

One problem with plain complexity is that C(xy) is not bounded by C(x) + C(y) plus a constant, because intuitively speaking, there is no general way to tell where to divide an output string xy just by looking at the concatenated string. We can divide it by specifying the length of x or y, but that would take O(min(log |x|, log |y|)) extra symbols. Indeed, for any c > 0 there exist x, y such that C(xy) ≥ C(x) + C(y) + c.[3]

Typically, inequalities with plain complexity have a term like O(min(log |x|, log |y|)) on one side, whereas the same inequalities with prefix-free complexity have only O(1).

The main problem with plain complexity is that something extra is sneaked into a program. A program not only represents a string with its code, but also represents its own length. In particular, a program x may represent a binary number up to log2 |x|, simply by its own length. Stated another way, it is as if we were using a termination symbol to denote where a word ends, and so we are not using 2 symbols, but 3. To fix this defect, we introduce the prefix-free Kolmogorov complexity.[4]

Prefix-free Kolmogorov complexity K


A prefix-free code is a subset of {0,1}* such that given any two different words in the set, neither is a prefix of the other. The benefit of a prefix-free code is that we can build a machine that reads words from the code forward in one direction; as soon as it reads the last symbol of a word, it knows that the word is finished, and it needs neither backtracking nor a termination symbol.
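A small check makes the prefix-free property concrete. The sketch below (function names are illustrative) tests whether a finite set of binary codewords is prefix-free, and also evaluates the Kraft sum Σ 2^(−|w|), which every prefix-free code keeps at or below 1:

```python
def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another."""
    words = sorted(codewords)
    # After lexicographic sorting, any prefix relation shows up
    # between adjacent entries, so checking neighbours suffices.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def kraft_sum(codewords):
    """Sum of 2**(-len(w)); at most 1 for any prefix-free binary code."""
    return sum(2.0 ** -len(w) for w in codewords)

# A prefix-free code, and one that is not.
assert is_prefix_free({"0", "10", "110", "111"})
assert not is_prefix_free({"0", "01", "11"})   # "0" is a prefix of "01"
assert kraft_sum({"0", "10", "110", "111"}) <= 1.0
```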

Define a prefix-free Turing machine to be a Turing machine that comes with a prefix-free code, such that the Turing machine can read any string from the code in one direction, and stop reading as soon as it reads the last symbol. Afterwards, it may compute on a work tape and write to a write tape, but it cannot move its read-head anymore.

This gives us the following formal way to describe K.[5]

  • Fix a prefix-free universal Turing machine, with three tapes: a read tape infinite in one direction, a work tape infinite in two directions, and a write tape infinite in one direction.
  • The machine can read from the read tape in one direction only (no backtracking), and write to the write tape in one direction only. It can read and write the work tape in both directions.
  • The work tape and write tape start with all zeros. The read tape starts with an input prefix code, followed by all zeros.
  • Let S ⊆ {0,1}* be the prefix-free code used by the universal Turing machine.

Note that some universal Turing machines may not be programmable with prefix codes. We must pick only a prefix-free universal Turing machine.

The prefix-free complexity of a string x is the length of the shortest prefix code that makes the machine output x:

K(x) = min{ |c| : c ∈ S, U(c) = x }

Invariance theorem


Informal treatment


There are some description languages which are optimal, in the following sense: given any description of an object in a description language, said description may be used in the optimal description language with a constant overhead. The constant depends only on the languages involved, not on the description of the object, nor the object being described.

Here is an example of an optimal description language. A description will have two parts:

  • The first part describes another description language.
  • The second part is a description of the object in that language.

In more technical terms, the first part of a description is a computer program (specifically: a compiler for the object's language, written in the description language), with the second part being the input to that computer program which produces the object as output.

The invariance theorem follows: Given any description language L, the optimal description language is at least as efficient as L, with some constant overhead.

Proof: Any description D in L can be converted into a description in the optimal language by first describing L as a computer program P (part 1), and then using the original description D as input to that program (part 2). The total length of this new description D′ is (approximately):

|D′ | = |P| + |D|

The length of P is a constant that doesn't depend on D. So, there is at most a constant overhead, regardless of the object described. Therefore, the optimal language is universal up to this additive constant.
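The two-part scheme above can be sketched directly: a description is a pair (interpreter, payload), and the "optimal language" simply runs part 1 on part 2. The names below are illustrative only:

```python
def universal(description):
    """Run a two-part description: (program source, input data).

    Part 1 is a 'compiler' for some description language, given here as
    Python source defining a function `run`; part 2 is the input that
    the program turns into the described object.
    """
    program_source, data = description
    namespace = {}
    exec(program_source, namespace)
    return namespace["run"](data)

# A toy description language: run-length pairs like [("a", 3), ("b", 2)].
rle_interpreter = """
def run(pairs):
    return "".join(ch * n for ch, n in pairs)
"""

# Description of "aaabb" = constant-size interpreter + compact payload,
# so switching languages costs only the fixed interpreter length |P|.
assert universal((rle_interpreter, [("a", 3), ("b", 2)])) == "aaabb"
```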

A more formal treatment


Theorem: If K1 and K2 are the complexity functions relative to Turing complete description languages L1 and L2, then there is a constant c – which depends only on the languages L1 and L2 chosen – such that

∀s. −c ≤ K1(s) − K2(s) ≤ c.

Proof: By symmetry, it suffices to prove that there is some constant c such that for all strings s

K1(s) ≤ K2(s) + c.

Now, suppose there is a program in the language L1 which acts as an interpreter for L2:

function InterpretLanguage(string p)

where p is a program in L2. The interpreter is characterized by the following property:

Running InterpretLanguage on input p returns the result of running p.

Thus, if P is a program in L2 which is a minimal description of s, then InterpretLanguage(P) returns the string s. The length of this description of s is the sum of

  1. The length of the program InterpretLanguage, which we can take to be the constant c.
  2. The length of P which by definition is K2(s).

This proves the desired upper bound.

History and context


Algorithmic information theory is the area of computer science that studies Kolmogorov complexity and other complexity measures on strings (or other data structures).

The concept and theory of Kolmogorov complexity is based on a crucial theorem first discovered by Ray Solomonoff, who published it in 1960, describing it in "A Preliminary Report on a General Theory of Inductive Inference"[6] as part of his invention of algorithmic probability. He gave a more complete description in his 1964 publications, "A Formal Theory of Inductive Inference," Part 1 and Part 2 in Information and Control.[7][8]

Andrey Kolmogorov later independently published this theorem in Problems Inform. Transmission in 1965.[9] Gregory Chaitin also presents this theorem in the Journal of the ACM – Chaitin's paper was submitted October 1966 and revised in December 1968, and cites both Solomonoff's and Kolmogorov's papers.[10]

The theorem says that, among algorithms that decode strings from their descriptions (codes), there exists an optimal one. This algorithm, for all strings, allows codes as short as allowed by any other algorithm up to an additive constant that depends on the algorithms, but not on the strings themselves. Solomonoff used this algorithm and the code lengths it allows to define a "universal probability" of a string on which inductive inference of the subsequent digits of the string can be based. Kolmogorov used this theorem to define several functions of strings, including complexity, randomness, and information.

When Kolmogorov became aware of Solomonoff's work, he acknowledged Solomonoff's priority.[11] For several years, Solomonoff's work was better known in the Soviet Union than in the West. The general consensus in the scientific community, however, was to associate this type of complexity with Kolmogorov, who was concerned with randomness of a sequence, while Algorithmic Probability became associated with Solomonoff, who focused on prediction using his invention of the universal prior probability distribution. The broader area encompassing descriptional complexity and probability is often called Kolmogorov complexity. The computer scientist Ming Li considers this an example of the Matthew effect: "...to everyone who has, more will be given..."[12]

There are several other variants of Kolmogorov complexity or algorithmic information. The most widely used one is based on self-delimiting programs, and is mainly due to Leonid Levin (1974).

An axiomatic approach to Kolmogorov complexity based on Blum axioms (Blum 1967) was introduced by Mark Burgin in the paper presented for publication by Andrey Kolmogorov.[13]

Basic results


We write K(x, y) to mean K((x, y)), where (x, y) means some fixed way to code for a tuple of strings x and y.

Inequalities


We omit additive factors of O(1). This section is based on [5].

Theorem. C(x) ≤ K(x) ≤ C(x) + 2 log2 C(x).

Proof. Take any program for the universal Turing machine used to define plain complexity, and convert it to a prefix-free program by first coding the length of the program in binary, then converting the length to prefix-free coding. For example, suppose the program has length 9 (1001 in binary); then we can convert it as follows:

1001 → 11 00 00 11 01

where we double each digit, then add a termination code. The prefix-free universal Turing machine can then read in any program for the other machine as follows:

[simulator] [doubled length code] [program]

The first part programs the machine to simulate the other machine, and is a constant overhead O(1). The second part has length 2 log2 C(x) + O(1). The third part has length C(x).
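The length-prefixing trick in the proof can be sketched as code: double every bit of the binary length, append the terminator 01, and a one-way reader recovers the length unambiguously. This is a minimal illustration, not the only prefix-free length code:

```python
def encode_length(n: int) -> str:
    """Prefix-free code for n: double each bit of bin(n), then '01'."""
    bits = format(n, "b")
    return "".join(b + b for b in bits) + "01"

def decode_length(stream: str) -> tuple:
    """Read one doubled-bit code off the front of `stream`.

    Returns (decoded value, number of characters consumed)."""
    bits, i = [], 0
    while stream[i] == stream[i + 1]:   # doubled digit: part of the length
        bits.append(stream[i])
        i += 2
    assert stream[i:i + 2] == "01"      # terminator reached
    return int("".join(bits), 2), i + 2

# Length 9 = 1001 in binary, as in the example above.
assert encode_length(9) == "1100001101"
# The reader stops at the terminator and never touches the program text.
assert decode_length(encode_length(9) + "someprogram") == (9, 10)
```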

Theorem: There exists c such that for all x, C(x) ≤ |x| + c. More succinctly, C(x) ≤ |x|. Similarly, K(x) ≤ |x| + 2 log2 |x|, and K(x | |x|) ≤ |x|.

Proof. For the plain complexity, just write a program that simply copies the input to the output. For the prefix-free complexity, we need to first describe the length of the string, before writing out the string itself.

Theorem. (extra information bounds, subadditivity)

K(x | y) ≤ K(x) ≤ K(x, y)
K(x, y) ≤ K(x) + K(y | x) ≤ K(x) + K(y)

Note that there is no way to compare K(xy) with K(x, y), or with K(x), K(y), or K(x | y) in general. There are strings such that the whole string xy is easy to describe, but its substrings are very hard to describe.

Theorem. (symmetry of information) K(x, y) = K(x | y, K(y)) + K(y).

Proof. One side is simple: a program for (x, y) can be assembled from a shortest program for y followed by a shortest program for x given y and K(y). For the other side, K(x, y) ≥ K(x | y, K(y)) + K(y), we need to use a counting argument (page 38 of [14]).

Theorem. (information non-increase) For any computable function f, we have K(f(x)) ≤ K(x) + K(f).

Proof. Program the Turing machine to read two subsequent programs, one describing the function f and one describing the string x. Then run both programs on the work tape to produce f(x), and write it out.

Uncomputability of Kolmogorov complexity


A naive attempt at a program to compute K


At first glance it might seem trivial to write a program which can compute K(s) for any s, such as the following:

function KolmogorovComplexity(string s)
    for i = 1 to infinity:
        for each string p of length exactly i
            if isValidProgram(p) and evaluate(p) == s
                return i

This program iterates through all possible programs (by iterating through all possible strings and only considering those which are valid programs), starting with the shortest. Each program is executed to find the result produced by that program, comparing it to the input s. If the result matches then the length of the program is returned.

However this will not work because some of the programs p tested will not terminate, e.g. if they contain infinite loops. There is no way to avoid all of these programs by testing them in some way before executing them due to the non-computability of the halting problem.

What is more, no program at all can compute the function K, be it ever so sophisticated. This is proven in the following.

Formal proof of uncomputability of K


Theorem: There exist strings of arbitrarily large Kolmogorov complexity. Formally: for each natural number n, there is a string s with K(s) ≥ n.[note 1]

Proof: Otherwise all of the infinitely many possible finite strings could be generated by the finitely many[note 2] programs with a complexity below n bits.

Theorem: K is not a computable function. In other words, there is no program which takes any string s as input and produces the integer K(s) as output.

The following proof by contradiction uses a simple Pascal-like language to denote programs; for sake of proof simplicity assume its description (i.e. an interpreter) to have a length of 1400000 bits. Assume for contradiction there is a program

function KolmogorovComplexity(string s)

which takes as input a string s and returns K(s). All programs are of finite length so, for sake of proof simplicity, assume it to be 7000000000 bits. Now, consider the following program of length 1288 bits:

function GenerateComplexString()
    for i = 1 to infinity:
        for each string s of length exactly i
            if KolmogorovComplexity(s) ≥ 8000000000
                return s

Using KolmogorovComplexity as a subroutine, the program tries every string, starting with the shortest, until it returns a string with Kolmogorov complexity at least 8000000000 bits,[note 3] i.e. a string that cannot be produced by any program shorter than 8000000000 bits. However, the overall length of the above program that produced s is only 7001401288 bits,[note 4] which is a contradiction. (If the code of KolmogorovComplexity is shorter, the contradiction remains. If it is longer, the constant used in GenerateComplexString can always be changed appropriately.)[note 5]
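The arithmetic behind the contradiction is easy to check: the whole self-referential program is shorter than the complexity it demands (the numbers are taken from the proof above):

```python
interpreter_bits = 1_400_000                 # assumed interpreter length
kolmogorov_complexity_bits = 7_000_000_000   # assumed subroutine length
generate_complex_string_bits = 1_288         # driver program length

total = interpreter_bits + kolmogorov_complexity_bits + generate_complex_string_bits
assert total == 7_001_401_288    # total length of the program producing s
assert total < 8_000_000_000     # ...yet s would need >= 8e9 bits: contradiction
```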

The above proof uses a contradiction similar to that of the Berry paradox: "The smallest positive integer that cannot be defined in fewer than twenty English words" (a description which itself contains only twelve words). It is also possible to show the non-computability of K by reduction from the non-computability of the halting problem H, since K and H are Turing-equivalent.[15]

There is a corollary, humorously called the "full employment theorem" in the programming language community, stating that there is no perfect size-optimizing compiler.

Chain rule for Kolmogorov complexity


The chain rule[16] for Kolmogorov complexity states that there exists a constant c such that for all X and Y:

K(X, Y) = K(X) + K(Y|X) + O(log K(X, Y)).

It states that the shortest program that reproduces X and Y is no more than a logarithmic term larger than a program to reproduce X and a program to reproduce Y given X. Using this statement, one can define an analogue of mutual information for Kolmogorov complexity.

Compression


It is straightforward to compute upper bounds for K(s) – simply compress the string s with some method, implement the corresponding decompressor in the chosen language, concatenate the decompressor to the compressed string, and measure the length of the resulting string – concretely, the size of a self-extracting archive in the given language.
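This upper-bound recipe is directly computable. The sketch below uses zlib, with an assumed constant for the size of a decompressor stub (any fixed constant works, since K is defined only up to a constant):

```python
import zlib

# Assumed fixed cost, in bytes, of a decompressor stub in the chosen language.
DECOMPRESSOR_OVERHEAD = 64

def upper_bound_K(s: bytes) -> int:
    """Computable upper bound on K(s), in bytes: decompressor + compressed data."""
    return DECOMPRESSOR_OVERHEAD + len(zlib.compress(s, 9))

highly_regular = b"ab" * 10_000          # 20,000 bytes of pattern
assert upper_bound_K(highly_regular) < len(highly_regular)

# Round-trip sanity check: the description really reproduces s.
assert zlib.decompress(zlib.compress(highly_regular, 9)) == highly_regular
```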

A string s is compressible by a number c if it has a description whose length does not exceed |s| − c bits. This is equivalent to saying that K(s) ≤ |s| − c. Otherwise, s is incompressible by c. A string incompressible by 1 is said to be simply incompressible – by the pigeonhole principle, which applies because every compressed string maps to only one uncompressed string, incompressible strings must exist, since there are 2^n bit strings of length n, but only 2^n − 1 shorter strings, that is, strings of length less than n (i.e. with length 0, 1, ..., n − 1).[note 6]

For the same reason, most strings are complex in the sense that they cannot be significantly compressed – their K(s) is not much smaller than |s|, the length of s in bits. To make this precise, fix a value of n. There are 2^n bitstrings of length n. The uniform probability distribution on the space of these bitstrings assigns exactly equal weight 2^(−n) to each string of length n.

Theorem: With the uniform probability distribution on the space of bitstrings of length n, the probability that a string is incompressible by c is at least 1 − 2^(−c+1) + 2^(−n).

To prove the theorem, note that the number of descriptions of length not exceeding nc is given by the geometric series:

1 + 2 + 2^2 + ... + 2^(n−c) = 2^(n−c+1) − 1.

There remain at least

2^n − 2^(n−c+1) + 1

bitstrings of length n that are incompressible by c. To determine the probability, divide by 2^n.
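The counting argument can be verified mechanically for small parameters (a brute-force check of the geometric-series bound; the helper name is hypothetical):

```python
def num_descriptions_up_to(length: int) -> int:
    """Number of binary strings of length 0..length (candidate descriptions)."""
    return sum(2 ** i for i in range(length + 1))

n, c = 20, 3

# Geometric series: 1 + 2 + ... + 2**(n-c) == 2**(n-c+1) - 1.
assert num_descriptions_up_to(n - c) == 2 ** (n - c + 1) - 1

# So at least 2**n - 2**(n-c+1) + 1 strings of length n are incompressible
# by c; dividing by 2**n gives the probability bound from the theorem.
incompressible = 2 ** n - num_descriptions_up_to(n - c)
assert incompressible == 2 ** n - 2 ** (n - c + 1) + 1
assert incompressible / 2 ** n == 1 - 2 ** (-c + 1) + 2 ** (-n)
```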

Chaitin's incompleteness theorem

Kolmogorov complexity K(s), and two computable lower bound functions prog1(s), prog2(s). The horizontal axis (logarithmic scale) enumerates all strings s, ordered by length; the vertical axis (linear scale) measures Kolmogorov complexity in bits. Most strings are incompressible, i.e. their Kolmogorov complexity exceeds their length by a constant amount. 9 compressible strings are shown in the picture, appearing as almost vertical slopes. Due to Chaitin's incompleteness theorem (1974), the output of any program computing a lower bound of the Kolmogorov complexity cannot exceed some fixed limit, which is independent of the input string s.

By the above theorem (§ Compression), most strings are complex in the sense that they cannot be described in any significantly "compressed" way. However, it turns out that the fact that a specific string is complex cannot be formally proven, if the complexity of the string is above a certain threshold. The precise formalization is as follows. First, fix a particular axiomatic system S for the natural numbers. The axiomatic system has to be powerful enough so that, to certain assertions A about complexity of strings, one can associate a formula FA in S. This association must have the following property:

If FA is provable from the axioms of S, then the corresponding assertion A must be true. This "formalization" can be achieved based on a Gödel numbering.

Theorem: There exists a constant L (which only depends on S and on the choice of description language) such that there does not exist a string s for which the statement

K(s) ≥ L       (as formalized in S)

can be proven within S.[17][18]

Proof Idea: The proof of this result is modeled on a self-referential construction used in Berry's paradox. We first obtain a program which enumerates the proofs within S, and we specify a procedure P which takes as input an integer L and prints a string x which occurs within some proof in S of the statement K(x) ≥ L. If we then set L greater than the length of the procedure P (plus the bits needed to specify L), the string x is produced by a program shorter than L, so K(x) < L, contradicting the proved statement K(x) ≥ L. So it is not possible for the proof system S to prove K(x) ≥ L for L arbitrarily large; in particular, for any L larger than the length of the procedure P (which is finite).

Proof:

We can find an effective enumeration of all the formal proofs in S by some procedure

function NthProof(int n)

which takes as input n and outputs some proof. This function enumerates all proofs. Some of these are proofs for formulas we do not care about here, since every possible proof in the language of S is produced for some n. Some of these are complexity formulas of the form K(s) ≥ n where s and n are constants in the language of S. There is a procedure

function NthProofProvesComplexityFormula(int n)

which determines whether the nth proof actually proves a complexity formula K(s) ≥ L. The strings s, and the integer L in turn, are computable by procedure:

function StringNthProof(int n)
function ComplexityLowerBoundNthProof(int n)

Consider the following procedure:

function GenerateProvablyComplexString(int n)
    for i = 1 to infinity:
        if NthProofProvesComplexityFormula(i) and ComplexityLowerBoundNthProof(i) ≥ n
            return StringNthProof(i)

Given an n, this procedure tries every proof until it finds a string and a proof in the formal system S of the formula K(s) ≥ L for some L ≥ n; if no such proof exists, it loops forever.

Finally, consider the program consisting of all these procedure definitions, and a main call:

GenerateProvablyComplexString(n0)

where the constant n0 will be determined later on. The overall program length can be expressed as U+log2(n0), where U is some constant and log2(n0) represents the length of the integer value n0, under the reasonable assumption that it is encoded in binary digits. We will choose n0 to be greater than the program length, that is, such that n0 > U+log2(n0). This is clearly true for n0 sufficiently large, because the left hand side grows linearly in n0 whilst the right hand side grows logarithmically in n0 up to the fixed constant U.
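The choice of n0 is elementary to check numerically; with a hypothetical constant U, any sufficiently large n0 beats U + log2(n0):

```python
import math

U = 1_000_000     # hypothetical constant part of the program length, in bits
n0 = U + 40       # candidate choice, comfortably past the crossover

# n0 exceeds the total program length U + log2(n0), as the proof requires,
# because log2(n0) is only about 20 here.
assert n0 > U + math.log2(n0)
```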

Then no proof of the form "K(s) ≥ L" with L ≥ n0 can be obtained in S, as can be seen by an indirect argument: If ComplexityLowerBoundNthProof(i) could return a value ≥ n0, then the loop inside GenerateProvablyComplexString would eventually terminate, and that procedure would return a string s such that

K(s) ≥ n0              by construction of GenerateProvablyComplexString
     > U + log2(n0)    by the choice of n0
     ≥ K(s)            since s was described by the program with that length

This is a contradiction, Q.E.D.

As a consequence, the above program, with the chosen value of n0, must loop forever.

Similar ideas are used to prove the properties of Chaitin's constant.

Minimum message length


The minimum message length principle of statistical and inductive inference and machine learning was developed by C.S. Wallace and D.M. Boulton in 1968. MML is Bayesian (i.e. it incorporates prior beliefs) and information-theoretic. It has the desirable properties of statistical invariance (i.e. the inference transforms with a re-parametrisation, such as from polar coordinates to Cartesian coordinates), statistical consistency (i.e. even for very hard problems, MML will converge to any underlying model) and efficiency (i.e. the MML model will converge to any true underlying model about as quickly as is possible). C.S. Wallace and D.L. Dowe (1999) showed a formal connection between MML and algorithmic information theory (or Kolmogorov complexity).[19]

Kolmogorov randomness


Kolmogorov randomness defines a string (usually of bits) as being random if and only if every computer program that can produce that string is at least as long as the string itself. To make this precise, a universal computer (or universal Turing machine) must be specified, so that "program" means a program for this universal machine. A random string in this sense is "incompressible" in that it is impossible to "compress" the string into a program that is shorter than the string itself. For every universal computer, there is at least one algorithmically random string of each length.[20] Whether a particular string is random, however, depends on the specific universal computer that is chosen. This is because a universal computer can have a particular string hard-coded in itself, and a program running on this universal computer can then simply refer to this hard-coded string using a short sequence of bits (i.e. much shorter than the string itself).

This definition can be extended to define a notion of randomness for infinite sequences from a finite alphabet. These algorithmically random sequences can be defined in three equivalent ways. One way uses an effective analogue of measure theory; another uses effective martingales. The third way defines an infinite sequence to be random if the prefix-free Kolmogorov complexity of its initial segments grows quickly enough — there must be a constant c such that the complexity of an initial segment of length n is always at least nc. This definition, unlike the definition of randomness for a finite string, is not affected by which universal machine is used to define prefix-free Kolmogorov complexity.[21]

Relation to entropy


For dynamical systems, the entropy rate and the algorithmic complexity of trajectories are related by a theorem of Brudno: the equality K(x; T) = h(T) holds for almost all x.[22]

It can be shown[23] that for the output of Markov information sources, Kolmogorov complexity is related to the entropy of the information source. More precisely, the Kolmogorov complexity of the output of a Markov information source, normalized by the length of the output, converges almost surely (as the length of the output goes to infinity) to the entropy of the source.

Theorem. (Theorem 14.2.5 [24]) The conditional Kolmogorov complexity of a binary string x of length n containing k ones satisfies

K(x | n) ≤ n·H2(k/n) + 2 log2 n + c,

where H2 is the binary entropy function (not to be confused with the entropy rate).

Halting problem


Computing the Kolmogorov complexity function is Turing-equivalent to deciding the halting problem.

If we have a halting oracle, then the Kolmogorov complexity of a string can be computed by simply trying every halting program, in lexicographic order, until one of them outputs the string.
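Given a halting oracle, the first direction is a straightforward search. The sketch below runs it on a toy programming language invented here for illustration: a program "r" + s halts and outputs s, and every other program diverges, so the oracle is trivially decidable and the toy complexity of s is |s| + 1:

```python
from itertools import count, product

# Toy language: "r" + s halts with output s; everything else loops forever.
def halts(program: str) -> bool:
    return program.startswith("r")

def run(program: str) -> str:
    assert halts(program)
    return program[1:]

def K_toy(s: str, alphabet: str = "rab") -> int:
    """Shortest-program search over the toy language, shortest first,
    consulting the halting oracle before running anything."""
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            p = "".join(chars)
            if halts(p) and run(p) == s:   # oracle screens out divergers
                return len(p)

# The shortest program for "a" is "ra", of length 2.
assert K_toy("a") == 2
```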

The other direction is much more involved.[25][26] It shows that given a Kolmogorov complexity function K, we can construct a function f such that f(n) ≥ BB(n) for all large n, where BB is the Busy Beaver shift function (also denoted S(n)). By modifying f at lower values of n we get an upper bound on BB, which solves the halting problem.

Consider this program p, which takes input n, and uses K.

  • List all strings of length ≤ 2n.
  • For each such string x, enumerate all (prefix-free) programs of length K(x) until one of them does output x. Record its runtime T_x.
  • Output the largest T_x.

We prove by contradiction that p(n) ≥ BB(n) for all large n.

Let BB_n be a Busy Beaver of length n. Consider this (prefix-free) program, which takes no input:

  • Run the program BB_n, and record its runtime T = BB(n).
  • Generate all programs with length ≤ 2n. Run every one of them for up to T steps. Note the outputs of those that have halted.
  • Output the string with the lowest lexicographic order that has not been output by any of those.

Let the string output by the program be x.

The program has length n + 2 log2 n + O(1), where n comes from the length of the Busy Beaver BB_n, 2 log2 n comes from using the (prefix-free) Elias delta code for the number n, and O(1) comes from the rest of the program. Therefore, K(x) ≤ n + 2 log2 n + O(1) ≤ 2n for all big n. Further, since there are only so many possible programs with length ≤ 2n, we have |x| ≤ 2n by the pigeonhole principle. By assumption p(n) < BB(n), so every string of length ≤ 2n has a minimal program with runtime < BB(n). Thus, the string x has a minimal program with runtime < T. Further, that program has length ≤ 2n. This contradicts how x was constructed.

Universal probability


Fix a universal Turing machine U, the same one used to define the (prefix-free) Kolmogorov complexity. Define the (prefix-free) universal probability of a string x to be

P(x) = Σ_{p : U(p) = x} 2^(−|p|)

In other words, it is the probability that, given a uniformly random binary stream as input, the universal Turing machine would halt after reading a certain prefix of the stream, and output x.

Note. U(p) = x does not mean that the input stream is p, but that the universal Turing machine halts at some point after reading the initial segment p, without reading any further input, and that, when it halts, it has written x to the output tape.

Theorem. (Theorem 14.11.1[24]) There is a constant c such that, for all x, |−log2 P(x) − K(x)| ≤ c; that is, −log2 P(x) = K(x) + O(1).

Implications in biology


Kolmogorov complexity has been used in the context of biology to argue that the symmetries and modular arrangements observed in multiple species emerge from the tendency of evolution to prefer minimal Kolmogorov complexity.[27] Considering the genome as a program that must solve a task or implement a series of functions, shorter programs would be preferred on the basis that they are easier to find by the mechanisms of evolution.[28] An example of this approach is the eight-fold symmetry of the compass circuit found across insect species, which corresponds to the circuit that is both functional and requires the minimum Kolmogorov complexity to be generated from self-replicating units.[29]

Conditional versions


The conditional Kolmogorov complexity of two strings is, roughly speaking, defined as the Kolmogorov complexity of x given y as an auxiliary input to the procedure.[30][31] So while the (unconditional) Kolmogorov complexity of a sequence x is the length of the shortest binary program that outputs x on a universal computer and can be thought of as the minimal amount of information necessary to produce x, the conditional Kolmogorov complexity K(x|y) is defined as the length of the shortest binary program that computes x when y is given as input, using a universal computer.[32]

There is also a length-conditional complexity K(x | |x|), which is the complexity of x given the length of x as known/input.[33][34]

Time-bounded complexity

[edit]

Time-bounded Kolmogorov complexity is a modified version of Kolmogorov complexity where the space of programs to be searched for a solution is confined to only programs that can run within some pre-defined number of steps.[35] It is hypothesised that the possibility of the existence of an efficient algorithm for determining approximate time-bounded Kolmogorov complexity is related to the question of whether true one-way functions exist.[36][37]
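A minimal sketch of the restricted search, using a hypothetical two-opcode toy program space (an assumption of this example, not a real universal machine) and an explicit step budget: with a small budget only the fast literal program qualifies, while a larger budget admits the shorter looping program, so the time-bounded complexity drops as the budget grows.

```python
# Sketch of time-bounded Kolmogorov complexity over a toy program space.
# Programs are bitstrings:
#   '0' + payload          -> output payload,            cost = len(payload)
#   '1' + 4-bit n + unit   -> output unit repeated n times,
#                             cost = 2 * n * len(unit)   (loop overhead)
# K_t(x) is the length of the shortest program producing x within t steps.
from itertools import product

def run(p: str, budget: int):
    if not p:
        return None
    if p[0] == "0":
        out, cost = p[1:], len(p) - 1
    else:
        if len(p) < 5:
            return None
        n = int(p[1:5], 2)
        unit = p[5:]
        out, cost = unit * n, 2 * n * len(unit)
    return out if cost <= budget else None

def K_t(x: str, budget: int, max_len: int = 18):
    for L in range(1, max_len + 1):                 # shortest first
        for bits in product("01", repeat=L):
            if run("".join(bits), budget) == x:
                return L
    return None

x = "01" * 8
print(K_t(x, budget=16))   # only the literal program is fast enough: 17
print(K_t(x, budget=32))   # the 7-bit "repeat" program now fits: 7
```

In this toy setting the search is exhaustive and terminating; in a real universal language, bounding the step count is precisely what makes the search well defined at all.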

from Grokipedia
Kolmogorov complexity, also known as algorithmic complexity or descriptive complexity, is a measure of the computational resources required to describe an individual finite object, such as a binary string, defined as the length of the shortest program that outputs the object on a universal Turing machine. This concept quantifies the intrinsic information content or randomness of the object, independent of any probability distribution, by focusing on the minimal algorithmic description rather than statistical properties. Introduced by Soviet mathematician Andrey Kolmogorov in his 1965 paper "Three approaches to the quantitative definition of information," it provides a foundational tool in algorithmic information theory for understanding complexity without relying on ensemble averages like Shannon entropy. The theory emerged from efforts to formalize a notion of randomness for individual sequences, building on earlier ideas in cybernetics and probability, such as Richard von Mises' concept of Kollektivs and Claude Shannon's information theory. Independently, Ray Solomonoff developed related ideas in 1960 for inductive inference, and Gregory Chaitin proposed an equivalent formulation in 1966, leading to what is sometimes called Solomonoff–Kolmogorov–Chaitin complexity. Kolmogorov's approach specifically addressed three quantitative definitions of information: combinatorial (based on the number of objects), probabilistic (Shannon-like), and algorithmic (program-length based), with the latter becoming dominant due to its precision in capturing absolute randomness. Several variants of Kolmogorov complexity exist to address limitations in the basic definition, such as sensitivity to program delimiters or input ordering. 
The plain complexity C(x) is the length of the shortest program outputting string x, while the prefix complexity K(x) uses self-delimiting (prefix-free) programs to ensure domain additivity, approximating K(x) ≈ −log m(x), where m(x) is the universal a priori probability. The monotone complexity KM(x) employs monotone Turing machines, allowing extensions beyond exact output, which is useful for analyzing infinite sequences. These measures are invariant up to an additive constant across equivalent universal machines, but all are uncomputable due to the undecidability of the halting problem, rendering exact values impossible to determine algorithmically. A key property is incompressibility: a string x of length n is algorithmically random if K(x) ≥ n − O(1), meaning it lacks compressible regularities. Despite its uncomputability, Kolmogorov complexity has profound applications across computer science and mathematics, serving as a cornerstone for algorithmic randomness tests, data compression heuristics (e.g., via universal coding), and inductive reasoning in machine learning. It underpins concepts like Martin-Löf randomness, where sequences pass all effective statistical tests, and connects to cryptography through pseudorandomness, as well as to logic via Gödel's incompleteness theorems and Chaitin's Omega number, an uncomputable real encoding the halting probabilities of Turing machines. In mathematical proofs, it has been used to establish results like the infinitude of primes by showing the complexity of prime enumerations grows linearly. Practical approximations appear in fields like bioinformatics for phylogeny reconstruction and pattern recognition, though resource-bounded versions address computability constraints.

Definitions and Intuition

Conceptual Intuition

The Kolmogorov complexity of an object measures its intrinsic information content by determining the length of the shortest computer program capable of generating that object as output. This approach shifts the focus from traditional probabilistic or combinatorial definitions of information to an algorithmic one, emphasizing the minimal descriptive effort required to specify the object precisely. Introduced by Andrey Kolmogorov in 1965, it serves as a foundational tool for understanding complexity in terms of computational reproducibility rather than mere statistical properties. To illustrate, consider a binary string of 1,000 zeros: its Kolmogorov complexity is low, as a concise program, such as one that loops to print zeros a fixed number of times, suffices to produce it. In contrast, a random 1,000-bit string lacks exploitable patterns, so the shortest program effectively embeds the entire string verbatim, resulting in a description nearly as long as the string itself. This highlights how Kolmogorov complexity rewards structured simplicity while penalizing apparent disorder. This measure differs fundamentally from statistical notions of randomness, like Shannon entropy, which quantify average uncertainty over probability distributions for ensembles of objects. Kolmogorov complexity, however, assesses individual objects based on their algorithmic compressibility, providing an absolute, machine-independent gauge of simplicity that aligns with intuitive ideas of pattern recognition. The core motivation is that the shortest such program distills the essential "structure" embedded in the object, akin to ideal data compression that strips away redundancy. Yet, for truly random strings, no such compression is possible, underscoring incompressibility as a hallmark of algorithmic randomness. For a given string x, the complexity corresponds to the bit-length of the smallest program p that a universal Turing machine can execute to output exactly x.
This framework is robust, as the invariance theorem guarantees the measure varies by at most a fixed constant across equivalent universal machines.
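The zeros-versus-random contrast above can be checked with an ordinary compressor. A compressed length is only a computable upper bound on the true complexity, so this is an illustration rather than a measurement of K, but the separation is sharp:

```python
# Compressed length as a crude, computable proxy for description length.
import zlib, os

structured = b"0" * 1000          # "print 1000 zeros" is a short program
random_ish = os.urandom(1000)     # no exploitable pattern (w.h.p.)

print(len(zlib.compress(structured, 9)))   # a few dozen bytes
print(len(zlib.compress(random_ish, 9)))   # close to 1000 bytes
```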

Plain Kolmogorov Complexity C

The plain Kolmogorov complexity, denoted C(x), of a binary string x is formally defined as the length of the shortest program p (measured in bits) such that a fixed universal Turing machine U outputs x upon input p and halts: C(x) = min{ |p| : U(p) = x }.
Here, U is a universal Turing machine capable of simulating any other Turing machine given its description as part of the input program p.
In this formulation, programs p are not required to be self-delimiting, meaning they rely on external markers or fixed-length delimiters to indicate their end, rather than being prefix-free codes that can be unambiguously concatenated. This non-self-delimiting nature allows for more concise individual descriptions but introduces issues, such as the lack of additivity when combining descriptions of multiple objects. For example, the string consisting of n zeros, denoted 0^n, has plain Kolmogorov complexity C(0^n) ≈ log₂ n + O(1), as a short program can specify n in binary and then output the repeated zeros. In contrast, a typical random string of length n has C(x) ≈ n, since no significantly shorter program exists to generate it. A key limitation arises in joint descriptions: C(x,y) ≤ C(x) + C(y) + O(log(C(x) + C(y))) holds, but equality generally fails due to ambiguities in encoding multiple non-self-delimiting programs without clear separation, potentially allowing shared or more efficient combined representations. Furthermore, plain complexity lacks monotonicity: for some strings x and bit b, C(xb) < C(x) − ω(1), meaning extending the string can significantly shorten the description length. The specific choice of universal Turing machine U affects the value of C(x) by at most an additive constant, as differences stem from the fixed overhead of simulating one machine on another.
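The definition C(x) = min{ |p| : U(p) = x } can be computed exactly when the machine is a hypothetical toy small enough to search exhaustively. The two-opcode language below (a literal opcode and a repeat opcode) is an assumption of this sketch, not a real universal machine, but it reproduces the contrast between 0^n and a patternless string:

```python
# Exhaustive search for the shortest program under a toy "machine" U:
#   '0' + payload        -> output the payload literally
#   '1' + 4-bit n + unit -> output unit repeated n times
from itertools import product

def U(p: str):
    if not p:
        return None
    if p[0] == "0":
        return p[1:]
    if len(p) >= 5:
        return p[5:] * int(p[1:5], 2)
    return None

def C(x: str, max_len: int = 14):
    """Brute-force min{|p| : U(p) = x}, shortest programs first."""
    for L in range(1, max_len + 1):
        for bits in product("01", repeat=L):
            if U("".join(bits)) == x:
                return L
    return None

print(C("0" * 12))        # '1' + '1100' + '0' describes it in 6 bits
print(C("010011010110"))  # no short pattern: literal program, 13 bits
```

A repetitive string gets a short "repeat" description, while a string with no period of length dividing it falls back to the literal program, one bit longer than the string itself.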

Prefix-Free Kolmogorov Complexity K

The prefix-free Kolmogorov complexity, often denoted K(x), provides a refined measure of the intrinsic information content of a finite binary string x by restricting programs to a prefix-free domain. Formally, K(x) = min{ |p| : U(p) = x }, where U is a universal prefix Turing machine, p is a binary string serving as input program, and the domain of U (the set of all halting inputs) forms a prefix-free set, meaning no halting program is a proper prefix of another. This setup ensures one-way reading of the input tape without backtracking, making programs self-delimiting and avoiding the need for explicit length indicators or delimiters in the output. In some formulations, the machine outputs x followed by a fixed delimiter to halt unambiguously, further emphasizing the instantaneous decodability. A crucial consequence of the prefix-free condition is the applicability of the Kraft inequality, which states that for any prefix-free set of strings {p_i}, ∑_i 2^{−|p_i|} ≤ 1. This bounds the total "measure" of halting programs and enables the definition of a universal semimeasure m(x) = ∑{ 2^{−|p|} : U(p) = x }, which dominates all other lower semicomputable semimeasures up to a multiplicative constant. Consequently, K(x) = −log₂ m(x) + O(1), forging a direct link between complexity and probabilistic notions, where shorter programs contribute higher probabilities to their outputs. This probabilistic interpretation positions K as a foundational tool for algorithmic information theory, facilitating analyses of randomness and universal priors akin to those in Bayesian inference. Unlike the plain Kolmogorov complexity C(x), which suffers from non-additive joint descriptions where extensions can be encoded cheaply, K exhibits near-additivity via the chain rule: K(xy) = K(x) + K(y|x) + O(log K(xy)).
Here, the conditional complexity K(y|x) measures the additional information needed for y given x, approximating K(y) when x and y are independent. This property resolves additivity flaws in C, making K more suitable for information-theoretic applications like data compression and entropy estimation. For instance, K(x) = C(x) + O(log |x|), capturing the overhead of self-delimitation, which grows logarithmically with the string length. Prefix-free complexity also possesses weak monotonicity: if y is a proper extension of x, then K(y) ≥ K(x) − O(1), reflecting that extending a string cannot drastically reduce its minimal description length. Its role in universal semimeasures extends to defining algorithmic randomness tests, where strings with K(x) ≥ |x| − O(1) are deemed incompressible and thus random. Overall, these features elevate K beyond mere description length, embedding it deeply in the study of computable probability distributions and inductive inference.
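The Kraft inequality is easy to verify numerically for a concrete prefix-free code. This sketch uses Elias gamma codewords, whose lengths 2⌊log₂ n⌋ + 1 make the Kraft sum telescope toward 1:

```python
# Kraft inequality check: for a prefix-free code, sum of 2^{-|p|} <= 1.

def elias_gamma(n: int) -> str:
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def is_prefix_free(codes) -> bool:
    # After lexicographic sorting, any codeword that is a proper prefix
    # of another is necessarily a prefix of its immediate successor,
    # so checking adjacent pairs suffices.
    codes = sorted(codes)
    return all(not codes[i + 1].startswith(codes[i])
               for i in range(len(codes) - 1))

codes = [elias_gamma(n) for n in range(1, 1025)]
kraft = sum(2.0 ** -len(c) for c in codes)

print(is_prefix_free(codes))   # True
print(kraft)                   # below 1, approaching 1 as n grows
```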

Theoretical Foundations

Invariance Theorem

The invariance theorem establishes that the Kolmogorov complexity of an object is independent of the choice of universal Turing machine, up to an additive constant. Formally, for any two universal prefix Turing machines U and V, there exists a constant c_{U,V} (depending only on U and V, but independent of the input x) such that |K_U(x) − K_V(x)| ≤ c_{U,V} for all x. This result ensures that the measure K(x) can be regarded as well-defined without specifying a particular universal machine, providing a canonical notion of complexity. The proof relies on the simulation capabilities of universal Turing machines. Given a shortest prefix program p for V that outputs x, U can execute p by first running a fixed simulator program q_{U,V} that translates and emulates V's computations on U; the length of this combined program is |q_{U,V}| + |p|, adding at most the fixed overhead |q_{U,V}| to the description length. Symmetrically, a simulator q_{V,U} allows V to run U's programs, yielding the bounds K_U(x) ≤ K_V(x) + |q_{U,V}| and K_V(x) ≤ K_U(x) + |q_{V,U}|, so c_{U,V} = max(|q_{U,V}|, |q_{V,U}|). A similar invariance holds for the plain Kolmogorov complexity C(x), where for universal plain Turing machines U and V, |C_U(x) − C_V(x)| ≤ c′_{U,V} for some constant c′ independent of x, though the constants are typically larger due to the lack of prefix-free constraints. This machine-independence underpins the stability of Kolmogorov complexity as a foundational measure in algorithmic information theory, allowing consistent theoretical analysis across different computational models.

Historical Development

Andrey Kolmogorov introduced the notion of algorithmic complexity in his 1965 paper "Three approaches to the quantitative definition of information," where he proposed three distinct measures, with the complexity of a string defined as the length of the shortest program that outputs it on a universal Turing machine. This work built on earlier probabilistic notions but emphasized deterministic, machine-based descriptions to capture the intrinsic information content of data. Independently, Ray Solomonoff developed precursor concepts in 1960 through his paper "A preliminary report on a general theory of inductive inference," introducing algorithmic probability as a universal prior for prediction and inference based on program lengths. In parallel, Gregory Chaitin advanced the field in 1966 with "On the length of programs for computing finite binary sequences," establishing key bounds on program complexity and linking it to limits in formal systems. These contributions emerged from the foundations of computability theory laid by Alan Turing in his 1936 paper "On computable numbers, with an application to the Entscheidungsproblem," which demonstrated the undecidability of the halting problem and highlighted inherent limits in algorithmic processes. The 1970s saw the consolidation of these ideas into algorithmic information theory, with pivotal work by A. K. Zvonkin and L. A. Levin in their 1970 paper "The complexity of finite objects and the algorithmic concepts of information and randomness," which refined notions of randomness and complexity using prefix-free codes. A landmark development was Chaitin's 1975 introduction of the halting probability Ω, or Chaitin's constant, representing the probability that a random program halts, which encapsulated deep connections between complexity, randomness, and uncomputability. 
This era shifted focus from ensemble-based measures like Shannon's 1948 entropy, which quantifies average uncertainty in probabilistic sources, to individualized algorithmic measures that assess the minimal descriptive resources for specific objects. While core theoretical advances have remained stable since the 1970s, post-2000 integrations with machine learning and artificial intelligence have revitalized the field, notably through Marcus Hutter's AIXI model in 2005, which employs algorithmic probability for optimal universal reinforcement learning agents. These extensions leverage Kolmogorov complexity to formalize principles like Occam's razor in predictive modeling, bridging classical computability with modern AI paradigms.

Fundamental Properties

Basic Inequalities

One of the fundamental properties of prefix-free Kolmogorov complexity is subadditivity, which asserts that for any binary strings x and y, K(xy) ≤ K(x) + K(y) + O(1). This inequality arises because a shortest program for xy can be constructed by concatenating a shortest program for x with a self-delimiting program for y, incurring only a constant overhead for the universal prefix-free Turing machine to execute the sequence. Approximations to equality hold when x and y share little common structure, as the concatenated program then nearly achieves the sum of the individual complexities. A basic upper bound relates Kolmogorov complexity to string length: for any string x, K(x) ≤ |x| + O(log |x|). This follows from the existence of a simple program that outputs x literally, encoding its bits directly without compression (the logarithmic term covers the self-delimiting encoding of |x|), which is essentially optimal for incompressible strings where no shorter description exists. Complementing this, a lower bound demonstrates near-incompressibility for typical strings: for a random string x of length n, K(x) ≥ |x| − O(log |x|). Such strings require descriptions almost as long as themselves, reflecting their lack of discernible patterns, with the logarithmic term accounting for minor encoding efficiencies in the universal machine. The Kolmogorov mutual information between strings x and y is defined as I_K(x:y) = K(x) + K(y) − K(xy), up to an additive constant, quantifying the shared algorithmic information between them. This measure is symmetric up to O(1) and non-negative, bounding the reduction in complexity when describing one string given the other. These inequalities, underpinned by the invariance theorem, hold universally across equivalent choices of prefix-free machine.
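These inequalities can be eyeballed with a real compressor as a computable stand-in for K. Compressed lengths are only upper bounds, so this is an illustration rather than a proof, and zlib is an arbitrary choice of proxy:

```python
# Subadditivity and mutual information, approximated by zlib lengths:
# concatenating a copy of x is nearly free, while concatenating
# unrelated material costs roughly the sum of the parts.
import zlib

def clen(b: bytes) -> int:
    return len(zlib.compress(b, 9))

x = bytes(range(256)) * 8          # structured
y = x                              # maximally related to x
z = b"the quick brown fox " * 100  # unrelated to x

print(clen(x + y), clen(x) + clen(y))   # big saving: high shared information
print(clen(x + z), clen(x) + clen(z))   # roughly additive
```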

Chain Rule

The chain rule for Kolmogorov complexity provides a decomposition of the complexity of a joint object into the complexity of its components, analogous to the chain rule in information entropy. For prefix-free Kolmogorov complexity K, it states that K(xy) = K(x) + K(y|x) + O(log K(xy)), where K(y|x) denotes the conditional complexity of y given x, representing the length of the shortest program that outputs y when provided with x as input. This relation holds up to an additive logarithmic term that accounts for the overhead in encoding the description of x and the interaction between the components. The original formulation of this rule for prefix-free complexity appears in the work of Zvonkin and Levin, who established the foundational properties of algorithmic complexity measures. Intuitively, the derivation follows from constructing a program for the pair xy: first, output x using a shortest program of length K(x), then append a program that, given this output, generates y, incurring a cost of K(y|x). The logarithmic overhead arises from the need to specify the length of the program for x or handle self-delimiting codes in the prefix-free setting, ensuring no ambiguity in parsing. This construction yields the upper bound, while the lower bound follows from the subadditivity of complexity and the invariance properties of universal Turing machines. The chain rule enables recursive decompositions, particularly for sequences. For a sequence x₁x₂…x_n, it implies K(x₁…x_n) ≈ ∑_{i=1}^{n} K(x_i | x₁…x_{i−1}), up to accumulated logarithmic terms, allowing the total complexity to be broken down into incremental contributions conditioned on prior elements. This decomposition is crucial for analyzing structured data, such as time series or hierarchical objects, where each new component's complexity depends on the context provided by predecessors.
For plain Kolmogorov complexity C, which lacks the prefix-free restriction, the chain rule is less tight. Specifically, C(xy) ≤ C(x) + C(y|x) + O(log |x| + log |y|), with the overhead growing due to the need to encode lengths explicitly without self-delimitation, leading to potential ambiguities in program interpretation. This makes the prefix-free version preferable for precise information-theoretic applications. Overall, the chain rule bridges algorithmic information theory and classical information theory by mirroring Shannon's chain rule for entropy, H(X,Y) = H(X) + H(Y|X), but with the additive logarithmic slack inherent to uncomputable measures.
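Compression programs give a rough, upper-bound proxy for the conditional term, in the spirit of the normalized compression distance literature: estimate the conditional complexity of y given x as clen(x + y) − clen(x). zlib is a stand-in estimator here, an assumption of the sketch, not K itself:

```python
# Chain-rule-style decomposition with a compressor as proxy:
# K(xy) ~ K(x) + K(y|x)  =>  K(y|x) ~ clen(x + y) - clen(x).
import zlib

def clen(b: bytes) -> int:
    return len(zlib.compress(b, 9))

x = b"abracadabra " * 200
y_copy = x                                                  # maximal overlap
y_fresh = bytes((i * 37 + 11) % 251 for i in range(2400))   # different structure

cond_copy = clen(x + y_copy) - clen(x)
cond_fresh = clen(x + y_fresh) - clen(x)
print(cond_copy, clen(y_copy))    # conditional estimate is tiny for a copy
print(cond_fresh, clen(y_fresh))  # fresh content still costs about its own size
```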

Uncomputability

The Kolmogorov complexity function K(x) is uncomputable, meaning there does not exist a Turing machine that, on input x, outputs K(x) for every binary string x. This result establishes that K is not a recursive function, placing it beyond the reach of algorithmic computation. The uncomputability of K follows from its relation to the halting problem. Specifically, K(x) is computable relative to an oracle for the halting problem: to compute K(x), enumerate all self-delimiting programs in order of increasing length; for each, query the oracle whether it halts (on the empty input); simulate only those that do until finding the shortest one that outputs x, and output its length. Since the halting problem is undecidable, no such oracle exists, and thus K is uncomputable. A naive approach to computing K(x) would involve enumerating all possible programs in order of increasing length, simulating each until one outputs x, and taking the length of the shortest such program. However, this fails because some programs may loop indefinitely without halting, preventing the enumeration from ever confirming the minimal length. This dovetailing simulation cannot distinguish halting from non-halting behaviors in finite time, underscoring the inherent non-recursiveness of K. The uncomputability of K is intimately connected to the halting problem: for any string x produced as the output of a halting computation of a program p on input y, it holds that K(x) ≥ |x| − O(1) only if the computation is "simple," but in general, outputs from halting computations satisfy K(x) ≤ |p| + O(1), while non-halting cases lead to higher complexity bounds that cannot be algorithmically verified. If x arises from a non-halting simulation or random process, K(x) tends to be close to |x|, reflecting incompressibility.
The implications of this uncomputability are profound: while exact values of K(x) are undecidable for arbitrary x, computable approximations exist, such as resource-bounded variants that trade precision for feasibility. Moreover, the non-recursive nature of K provides a foundational tool for proving incompleteness results in formal systems, linking algorithmic information theory to Gödel's incompleteness theorems by quantifying the limits of provability in terms of program-size complexity.
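The shape of the dovetailed search can be sketched: at every stage the running minimum is a valid upper bound on the complexity, but no stage certifies minimality. The toy language below is hypothetical, and its halting behavior is actually decidable (unlike a real universal language), so this only illustrates the procedure, not the undecidability itself:

```python
# Dovetailing sketch: run every program with ever-larger step budgets
# and keep the shortest one seen producing x.  In this toy language,
# '0' + payload halts (after "len(p)" steps) and prints the payload,
# while any program starting with '1' spins forever.

def step(p: str, budget: int):
    """Simulate program p for at most `budget` steps."""
    if p.startswith("0"):
        return p[1:] if budget >= len(p) else None   # halts within budget
    return None                                      # never halts

def upper_bounds(x: str, stages: int):
    """Running minimum after each stage: a nonincreasing upper bound."""
    best, history = None, []
    for t in range(1, stages + 1):          # stage t: budget t, lengths <= t
        for L in range(1, t + 1):
            for i in range(2 ** L):
                p = format(i, "b").zfill(L)
                if step(p, t) == x and (best is None or L < best):
                    best = L
        history.append(best)
    return history

print(upper_bounds("10", 6))   # [None, None, 3, 3, 3, 3]
```

The sequence of bounds stabilizes, but in a genuine universal language one can never rule out that a shorter, still-running program will eventually halt and output x.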

Information and Randomness Measures

Relation to Entropy

Kolmogorov complexity K(x) measures the minimal length of a program that outputs a binary string x, providing an absolute notion of the information content intrinsic to x itself, without reference to any probabilistic model. In contrast, Shannon entropy H(X) quantifies the average number of bits needed to encode outcomes from a random variable X distributed according to a known probability mass function, focusing on ensemble uncertainty rather than individual instances. Despite these foundational differences, Kolmogorov complexity and Shannon entropy converge in key theoretical respects, particularly when analyzing typical sequences from computable sources, where K(x) approximates the self-information −log₂ P(x) up to bounded terms. For sequences generated from a stationary ergodic source with entropy rate H, the Kolmogorov complexity of a typical long string x of length n satisfies K(x) ≈ nH + O(log n), demonstrating asymptotic equivalence between the two measures. More precisely, the expected value of K(x) under a computable distribution P equals the Shannon entropy H_P(X) up to the complexity of describing P itself: ∑_x P(x) K(x) = H_P(X) + K(P) + O(1). Similarly, for the empirical entropy Ĥ(x) of x, which estimates the entropy from the frequency counts in x, it holds that K(x) ≈ Ĥ(x) + O(log n) for typical x, highlighting how K captures the intrinsic randomness of individual objects in a manner parallel to ensemble averages. This equivalence underscores Kolmogorov complexity's role as a universal benchmark for information content, aligning with Shannon's limits in the large-n regime. The connection deepens through the universal prior m(x) = ∑_{y : U(y) = x} 2^{−|y|}, where U is a universal Turing machine, which assigns algorithmic probability to x based on the shortest programs producing it.
The associated algorithmic entropy −log₂ m(x) equals K(x) up to an additive constant: K(x) = −log₂ m(x) + O(1), establishing exact equivalence under this prior. A fundamental theorem reinforces this by stating that for any computable probability distribution P, K(x) ≤ −log₂ P(x) + K(P) + O(1), meaning the shortest program for x is never significantly longer than the self-information under P; near-equality holds for P-typical x and for the universal m. This inequality arises because m dominates all computable priors up to a multiplicative constant, ensuring K provides a minimax interpretation of entropy that is distribution-independent. Key distinctions persist, however: K(x) requires no probabilistic assumptions and applies directly to individual strings, rendering it absolute and non-extensive; its chain rule decomposition K(xy) ≤ K(x) + K(y|x) + O(log K(xy)) incurs logarithmic overhead for dependencies, unlike Shannon entropy's additivity for independent sources. Shannon entropy, being ensemble-oriented, demands a specified distribution and remains computable when that distribution is known, whereas K(x) is uncomputable due to the undecidability of the halting problem, limiting practical approximations but affirming its theoretical purity as an information measure. Standard information-theoretic estimation libraries, such as the ITE toolbox in Matlab/Octave, NPEET and infomeasure in Python, and InfoTheory.jl in Julia, focus on estimating Shannon information measures like entropy, mutual information, and divergence using non-parametric methods such as k-nearest neighbors or kernel density estimation, which are computable approximations suitable for data analysis.
In contrast, universal search, which operates within the framework of algorithmic (Kolmogorov) complexity, is uncomputable in general due to its reliance on exhaustive program search tied to the halting problem, explaining its absence from these standard libraries.
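The asymptotic agreement K(x) ≈ nH can be eyeballed with a compressor as the (upper-bound) stand-in for K. The Bernoulli source, the 8-bits-per-byte packing, and the zlib choice are all assumptions of this sketch:

```python
# Compressed size versus n*H(p) for typical outputs of a Bernoulli(p)
# bit source.  Bits are packed into bytes so byte-level coding can
# approach the bit-level entropy rate.
import math, random, zlib

def bernoulli_bits(n: int, p: float, seed: int = 0) -> bytes:
    rng = random.Random(seed)
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    return bytes(sum(b << k for k, b in enumerate(bits[i:i + 8]))
                 for i in range(0, n, 8))

def H(p: float) -> float:
    """Binary entropy in bits per symbol."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n = 80_000
for p in (0.5, 0.1):
    x = bernoulli_bits(n, p)
    bits_used = 8 * len(zlib.compress(x, 9))
    print(p, round(n * H(p)), bits_used)   # compressed size tracks n*H(p)
```

For p = 0.5 the source is incompressible and the compressed size sits near n bits; for p = 0.1 it drops toward n·H(0.1) ≈ 0.47n, with some slack because zlib is not an entropy-optimal coder.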

Kolmogorov Randomness

In Kolmogorov complexity, a binary string x is considered c-random, or incompressible, if its prefix-free Kolmogorov complexity satisfies K(x) ≥ |x| − c, where |x| denotes the length of x and c is a fixed constant independent of x. This definition captures the intuition that random strings lack short algorithmic descriptions and cannot be significantly compressed by any Turing machine. Introduced as part of an algorithmic approach to quantifying information, this criterion provides an absolute measure of individual randomness for finite objects, distinct from probabilistic notions. Random strings exhibit key properties: they have no concise descriptions beyond their literal enumeration, and the proportion of such strings of length n approaches 1 as n grows, with at most a 2^{−c} fraction being compressible. Specifically, the number of c-incompressible strings of length n is at least 2^n − 2^{n−c} + 1, ensuring that almost all strings in the space {0,1}^n (in the measure-theoretic sense) are random. This incompressibility implies that random strings resist pattern extraction by any computable process. The notion of Kolmogorov randomness for finite strings relates to Martin-Löf randomness for infinite sequences, where a string with high K(x) passes effective statistical tests and implies Martin-Löf randomness in the limit, but the converse does not hold for finite cases due to potential low-complexity extensions. For example, a sequence of fair coin flips is typically Kolmogorov random if its realization is incompressible, as it exhibits no exploitable regularities; conversely, strings with patterns, such as alternating 010101..., have reduced K(x) due to short generating programs. This framework offers an absolute test for randomness via incompressibility, which is more robust than traditional statistical tests that can be fooled by adversaries crafting strings to pass specific checks while remaining compressible.
Due to the uncomputability of K, exact verification is impossible, but approximations guide practical assessments.

Universal Probability

The universal probability, often denoted as the universal semimeasure m(x), assigns to each finite string x a probability based on the algorithmic descriptions of x. It is formally defined as m(x) = ∑_{p : U(p) = x} 2^{−|p|}, where the sum is over all halting programs p for a universal prefix Turing machine U that output x, and |p| is the length of p. This definition relies on prefix-free programs to ensure the sum converges. A key property of m(x) is that it forms a semimeasure, meaning ∑_x m(x) ≤ 1, rather than a full probability measure, due to the possibility of non-halting computations. It is universal in the sense that for any computable semimeasure μ, there exists a constant c > 0 such that m(x) ≥ c·μ(x) for all x, making it a dominant prior over all effective probability distributions. Furthermore, m(x) is dominated by 2^{−K(x)}, where K(x) is the Kolmogorov complexity of x, ensuring that the measure favors simpler descriptions. The universal probability links directly to algorithmic information theory through the relation −log₂ m(x) ≈ K(x), with the approximation holding up to an additive constant, thus interpreting K(x) as an algorithmic entropy measure. In applications, m(x) serves as a prior for predictive coding and Solomonoff induction, where it enables optimal prediction of future sequence elements by weighting hypotheses according to their descriptive complexity, achieving minimal total prediction error bounded by the complexity of the true data-generating process. Although m(x) is incomputable due to the halting problem, it can be approximated from below by monotone semimeasures, forming the foundation of algorithmic probability theory for universal induction.

Applications

Data Compression

Kolmogorov complexity provides a theoretical foundation for lossless data compression by defining the shortest possible description length of a string x, denoted K(x), in bits. This length represents the minimal program size required to generate x on a universal Turing machine. The optimal compression ratio for x is thus K(x)/|x|, where |x| is the length of x in bits; strings are incompressible if this ratio approaches 1, meaning no shorter description exists beyond the string itself. A key measure of redundancy in a string is K(x) − |x|, which quantifies the excess bits in the original representation relative to the shortest description. For structured strings with patterns, this value is negative, indicating compressibility, while for random strings it is approximately zero, reflecting no exploitable redundancy. Practical compression algorithms approximate this ideal through universal coding schemes, such as arithmetic coding, which encodes strings using the universal semimeasure m(x) = ∑_{p : U(p) = x*} 2^{−|p|}, where x* is a self-delimiting encoding of x and the sum is over programs p that output x* on the universal machine U. The code length −log₂ m(x) approximates K(x) up to an additive constant, achieving rates asymptotically close to Kolmogorov complexity for typical sources. However, no computable compression algorithm can achieve K(x) for all strings x, as K(x) is uncomputable; any Turing machine attempting to compute it would require solving the halting problem for arbitrary programs. The Lempel-Ziv algorithm exemplifies a practical universal compressor that approaches Kolmogorov complexity for strings with repetitions, such as those generated by finite-state sources, by building a dictionary of substrings during sequential processing.
In contrast, truly random strings remain incompressible under this or any lossless scheme, in line with the notion of Kolmogorov randomness: no pattern allows reduction below the original length.
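A practical compressor gives a computable upper bound on K(x), so its compression ratio illustrates the contrast between structured and random strings. A minimal sketch using Python's standard `zlib` (an LZ77-family compressor, in the Lempel-Ziv spirit discussed above):

```python
import os
import zlib

def compressed_ratio(data):
    # Compressed size over original size; the compressed length is a
    # crude computable upper bound on the Kolmogorov complexity K(data).
    return len(zlib.compress(data, 9)) / len(data)

structured = b"01" * 5000        # highly patterned: ratio far below 1
random_ish = os.urandom(10000)   # incompressible with high probability
```

The patterned input compresses to a small fraction of its length, while the random bytes stay near (or even slightly above) ratio 1 because of format overhead, mirroring the theoretical incompressibility of random strings.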

Minimum Message Length

The minimum message length (MML) principle is an information-theoretic approach to inductive inference that selects the statistical model minimizing the total length of a two-part message: the length required to encode the model itself plus the length needed to encode the observed data given that model. This formulation approximates the sum K(model) + K(data | model) of the Kolmogorov complexity of the model and of the data conditional on the model, providing a formal basis for model selection. Developed by Chris Wallace and colleagues from the late 1960s, beginning with foundational work on classification, MML evolved in the 1990s into practical, computable approximations for statistics and machine learning. These approximations use two-stage coding schemes in which the first stage encodes the model with a prefix-free code and the second stage encodes the data with a model-specific code, enabling efficient inference without requiring full Turing completeness. In Bayesian terms, MML aligns with using the universal semimeasure m as a prior, which penalizes overly complex models by assigning shorter codes to simpler hypotheses, implementing Occam's razor as a preference for models that balance explanatory power and brevity. Unlike pure Kolmogorov complexity, which is uncomputable because of the halting problem, MML relaxes the demand for universality, retaining properties such as additivity of code lengths that ensure practical applicability while preserving asymptotic optimality. A key theoretical result is that the MML estimate for a dataset x satisfies MML(x) ≈ K(x) + O(log |x|), where the additive term accounts for the approximation overhead, making MML a viable surrogate for Kolmogorov complexity in practice.
This principle has found applications in diverse fields, including phylogenetics, where it aids in reconstructing evolutionary trees by selecting models that concisely explain genetic data variations.
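The two-part message can be made concrete for a simple model family. The sketch below is a hypothetical, simplified illustration (not Wallace's actual MML87 construction): it encodes a Bernoulli parameter on a grid of 2^q values (part one, q bits) and then the data with the Shannon code for that parameter (part two), and picks the precision q minimizing the total.

```python
import math

def two_part_length(data, q_bits):
    # Part 1: a Bernoulli parameter theta quantized to q_bits bits.
    # Part 2: Shannon code length of the data under that theta.
    n, k = len(data), data.count("1")
    grid = 2 ** q_bits
    theta = max(1, min(grid - 1, round(k / n * grid))) / grid
    part1 = q_bits
    part2 = -k * math.log2(theta) - (n - k) * math.log2(1 - theta)
    return part1 + part2

data = "1" * 90 + "0" * 10   # strongly biased sample of 100 bits
best = min(range(1, 12), key=lambda q: two_part_length(data, q))
```

Too little precision (q = 1 forces theta = 1/2, costing a full bit per symbol) and too much precision (a long part one) both lose; the minimizing q balances model cost against fit, which is the MML trade-off in miniature.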

Biological Implications

In biology, Kolmogorov complexity serves as a measure of the algorithmic information content of DNA sequences, distinguishing structured functional elements from less organized regions. Functional genes and regulatory sequences often exhibit relatively low Kolmogorov complexity due to repetitive patterns, codon biases, and evolutionary constraints that allow compression into shorter descriptive programs, whereas much of what was once termed "junk DNA", including non-coding regions without apparent repetitive structure, tends to display higher complexity, approximating randomness and resisting efficient compression. This distinction aids in separating potential coding from non-functional genomic segments, as demonstrated in analyses of human chromosome 22, where low-complexity windows (e.g., \bar{C}(X) \approx 0.13 for 1000 bp segments) correlate with gene-rich areas exhibiting periodicity such as 3 bp codon repeats, while higher-complexity regions align with presumed non-functional zones. Evolutionary models cast natural selection as an information compressor, bounding the rate at which genomic complexity can increase under mutation pressure. Seminal work by Adami and colleagues illustrates how selection transforms environmental entropy into heritable information in digital-organism simulations, with complexity increasing under fixed environmental pressures. In fitness landscapes, random mutations typically raise Kolmogorov complexity by disrupting compressible patterns, but mutations that preserve functional structure, such as synonymous substitutions or conserved motifs, are retained, maintaining low-complexity regions that correspond to fitness peaks and enabling gradual adaptation without excessive informational bloat.
Illustrative applications highlight these principles: protein folding exemplifies decompression, where a compact DNA sequence (the "program") generates an intricate three-dimensional structure (the "output") through biophysical rules, underscoring how low-complexity genetic code yields high-phenotypic specificity. Similarly, biodiversity can be viewed as an evolutionary drive toward complexity maximization, with diverse ecosystems aggregating incompressible variations that enhance resilience and exploratory capacity across species. Foundational contributions from Schneider in the 1990s, via sequence logos, provide a visual proxy for information content in aligned biological sequences, revealing conserved patterns in binding sites that align with low-entropy (and thus potentially low Kolmogorov) functional motifs. Post-2020 advancements integrate these ideas with AI in synthetic biology, employing Kolmogorov-Arnold networks to model genomic tasks like sequence prediction and design, thereby approximating complexity for engineering novel biological systems. Approximations of Kolmogorov complexity, such as those based on compression algorithms, are commonly used in practice due to its uncomputability, though debates persist on its direct relevance to evolutionary fitness.
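The windowed complexity analysis described above can be approximated with an off-the-shelf compressor standing in for the uncomputable K. This is a hedged sketch, not the published chromosome-22 method: window size, step, and the compressor are illustrative choices, and the sequences are synthetic.

```python
import random
import zlib

def window_complexity(seq, w=1000, step=500):
    # Per-window compression ratio: a computable stand-in for the
    # normalized Kolmogorov complexity of each genomic segment.
    out = []
    for i in range(0, len(seq) - w + 1, step):
        chunk = seq[i:i + w].encode()
        out.append((i, len(zlib.compress(chunk, 9)) / len(chunk)))
    return out

# A repetitive, codon-periodic (gene-like) region versus a shuffled one.
repetitive = "ATGGCC" * 200                     # period-6 pattern, 1200 bp
random.seed(0)
shuffled = "".join(random.choice("ACGT") for _ in range(1200))
```

The repetitive region compresses to a small fraction of its length while the shuffled sequence approaches the 2-bits-per-base entropy floor, reproducing in miniature the low-complexity/high-complexity contrast used to flag candidate functional segments.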

Extensions

Conditional Complexity

The conditional Kolmogorov complexity of a binary string y given another binary string x, denoted K(y|x), is the length of a shortest self-delimiting program p such that a universal prefix Turing machine U outputs y when given p as input and x as auxiliary input. Formally, K(y|x) = \min \{ |p| : U(p, x) = y \}, where |p| denotes the length of p in bits. This definition captures the minimal additional description length required to specify y when x is freely available, and it extends the unconditional complexity via K(y) = K(y|\epsilon), where \epsilon is the empty string. A basic property is non-negativity: K(y|x) \geq 0 for all binary strings x and y, since program lengths are non-negative. Additionally, K(y|y) = O(1), since y can be produced by a constant-length program that simply copies its auxiliary input. The symmetry of information states that K(x) - K(x|y) = K(y) - K(y|x) + O(\log K(xy)): up to a logarithmic additive term, x provides as much information about y as y provides about x. Equivalently, the chain rule decomposes the joint complexity as K(xy) = K(x) + K(y|x) + O(\log \min\{K(x), K(y|x)\}); for prefix complexity the logarithmic term can be removed by conditioning on x^*, a shortest program for x, giving K(xy) = K(x) + K(y|x^*) + O(1). This integration allows conditional complexity to build on unconditional measures, facilitating the analysis of composite objects.
Conditional complexity underpins the definition of algorithmic mutual information I(x:y) = K(y) - K(y|x), which quantifies the dependence between x and y as the reduction in the complexity of y afforded by knowledge of x; equivalently, up to additive constants, I(x:y) = K(x) + K(y) - K(xy). High mutual information indicates strong dependence, while I(x:y) = O(1) implies near-independence. This measure is central to applications in pattern recognition and data analysis, as it provides a universal, distribution-free notion of shared information. A related concept is conditional randomness: a string y is algorithmically random given x if K(y|x) \geq |y| - O(1), meaning y remains incompressible even with x supplied as input, analogous to unconditional randomness but relative to the conditioning information.
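The chain-rule decomposition suggests a practical, compression-based proxy: approximate K(y|x) by the extra compressed size y costs once x is already available, C(xy) - C(x), where C is a real compressor's output length. This is the idea behind normalized compression distance; the sketch below uses `zlib` as an admittedly rough stand-in for K.

```python
import os
import zlib

def C(s):
    # Compressed size in bytes: a computable upper-bound proxy for K(s).
    return len(zlib.compress(s, 9))

def cond_C(y, x):
    # Chain-rule-style proxy for the conditional complexity K(y|x):
    # the extra compressed bytes y costs once x is already available.
    return C(x + y) - C(x)

x = b"the quick brown fox jumps over the lazy dog " * 20
y = x                    # identical data: almost nothing extra to describe
r = os.urandom(500)      # unrelated random data: knowing x barely helps
```

Here `cond_C(y, x)` is tiny compared with C(y), while `cond_C(r, x)` stays close to C(r); the gap C(y) - cond_C(y, x) approximates the algorithmic mutual information I(x:y) defined above.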

Time-Bounded Complexity

Time-bounded Kolmogorov complexity, also known as resource-bounded complexity, addresses the uncomputability of plain Kolmogorov complexity by restricting the runtime of the generating program. Formally, for a string x and time bound t(n) with n = |x|, it is defined as K^t(x) = \min \{ |p| : U(p) = x \text{ within at most } t(|x|) \text{ steps} \}, where U is a universal Turing machine and |p| is the length of the program p. The idea goes back to Leonid Levin's work on universal sequential search problems; Levin's own variant, Kt(x) = \min \{ |p| + \log t : U(p) \text{ outputs } x \text{ in } t \text{ steps} \}, folds the running time into the measure itself. Unlike the unbounded version, K^t considers only programs that halt within the specified time, making it suitable for analyzing computational resources in practical settings. A key property is convergence to the classical complexity: as t \to \infty, K^t(x) decreases monotonically to K(x), so the time-bounded measure approaches the full complexity from above, and any feasible bound such as t(n) = n^k yields a computable upper bound on K(x). Indeed, for a fixed computable t, K^t(x) can be computed exactly by exhaustive search, simulating every program of length at most |x| + O(1) for t(|x|) steps, although this search itself takes exponential time; the unbounded K, by contrast, remains uncomputable because of the halting problem. In applications, time-bounded Kolmogorov complexity enables resource-limited tests for algorithmic randomness, where a string is deemed random if its K^t-complexity exceeds a threshold relative to its length, adapted to feasible computation times.
The Schnorr-Levin theorem establishes that Martin-Löf randomness is equivalent to high prefix complexity for initial segments, and its resource-bounded variants create hierarchies of randomness notions based on time limits, separating degrees of computational unpredictability. These hierarchies have been pivotal in complexity theory, demonstrating strict inclusions between classes like polynomial-time versus exponential-time bounded randomness. Recent post-2010 developments have bridged time-bounded Kolmogorov complexity to practical measures in artificial intelligence and machine learning, particularly through connections to descriptive complexity and circuit minimization problems. For instance, it informs approximations of minimal circuit sizes for boolean functions, linking uncomputable information measures to verifiable hardness assumptions in learning algorithms. In AI, variants like space-time bounded complexity have been used to guide neural network discovery by favoring low-complexity models under cognitive resource constraints, enhancing efficiency in pattern recognition tasks.
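The exhaustive-search computability of K^t can be demonstrated end to end on a toy machine. The sketch below is hypothetical: `run_with_budget` stands in for a clocked universal machine, with an artificial cost model (literals take one step, each repetition takes one step) chosen so that tightening the budget visibly raises the complexity.

```python
from itertools import product

def run_with_budget(program, max_steps):
    # Hypothetical toy machine standing in for a clocked universal U:
    #   '0' + payload  -> prints the payload literally (1 step)
    #   '1'*k + '0'    -> prints "01" k times (k steps)
    if program.startswith("0"):
        return program[1:] if max_steps >= 1 else None
    if program.endswith("0") and set(program[:-1]) <= {"1"}:
        k = len(program) - 1
        return "01" * k if max_steps >= k else None
    return None

def K_t(x, t, max_len=16):
    # Exhaustive search: length of the shortest program that
    # produces x within t steps, exactly as in the definition of K^t.
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            if run_with_budget("".join(bits), t) == x:
                return length
    return None
```

For the string "0101", only the five-bit literal program fits in one step, so K_t("0101", 1) = 5, while with two steps the three-bit repeat program halts in time and K_t("0101", 2) = 3, illustrating how K^t(x) decreases toward K(x) as the time budget grows.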

Chaitin's Incompleteness Theorem

Chaitin's incompleteness theorem states that for any consistent formal axiomatic system capable of expressing basic properties of Turing machines, such as Peano arithmetic, there exists a constant L, depending on the system, such that the system cannot prove any statement of the form K(x) > n with n > L, where K is the Kolmogorov complexity; L is approximately the size of the shortest program encoding the system's axioms. The proof employs a self-referential argument akin to Berry's paradox and Gödel's diagonalization. A short program is designed to enumerate all theorems of the system in order of increasing proof length until it finds a proof asserting K(x) > n for some string x and some large n; it then outputs x and halts. This yields a description of x whose length is at most the program's size plus an encoding of n (roughly log n bits) plus a small additive constant, contradicting the proved lower bound K(x) > n once n is sufficiently large. In Chaitin's development, the self-referential construction can even be encoded arithmetically via an exponential Diophantine equation. The result bridges Kolmogorov complexity with Gödel's incompleteness theorems, revealing that formal systems cannot certify high complexity (randomness) beyond a fixed threshold, thereby extending Turing's uncomputability results to the logical limits of proof. Chaitin's halting probability Ω, defined as \Omega = \sum_{p \text{ halts}} 2^{-|p|} with the sum ranging over all halting prefix-free programs p, is uncomputable, by reduction from the halting problem, and largely unprovable in formal systems: any given consistent theory can establish at most finitely many of its bits. A key quantitative implication is that for any consistent theory T there exists ε(T) > 0 such that the total universal probability mass of strings whose low Kolmogorov complexity is provable in T is less than 1 - ε(T). This demonstrates that T can certify only a limited portion of the universal probability distribution for low-complexity strings, constraining the formalization of algorithmic randomness.
Chaitin introduced the theorem in 1971, building directly on the foundations laid by Turing and Gödel.
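The definition of Ω can be made tangible by enumerating halting programs on a toy machine. The sketch is hypothetical and deliberately simple: because this machine's halting set is decidable, its Ω is computable (the lower bounds here converge to 1); for a genuinely universal machine the same enumeration yields only ever-improving lower bounds on an uncomputable limit.

```python
from fractions import Fraction
from itertools import product

def halts(program):
    # Hypothetical prefix-free toy machine: exactly the programs of the
    # form '1'*k + '0' halt; every other bitstring diverges.
    return program.endswith("0") and set(program[:-1]) <= {"1"}

def omega_lower_bound(max_len):
    # Enumerate halting programs up to max_len bits and add their
    # 2^{-|p|} mass: the standard way Omega is approximated from below.
    total = Fraction(0)
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            if halts("".join(bits)):
                total += Fraction(1, 2 ** length)
    return total
```

Each program length contributes exactly one halting program here, so `omega_lower_bound(10)` returns 1023/1024; the monotonically increasing partial sums mirror how Ω is lower-semicomputed in general, even though for a universal machine no algorithm can tell how close a partial sum is to the true value.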
