Gene expression programming
Gene expression programming (GEP) in computer programming is an evolutionary algorithm that creates computer programs or models. These computer programs are complex tree structures that learn and adapt by changing their sizes, shapes, and composition, much like a living organism. And like living organisms, the computer programs of GEP are also encoded in simple linear chromosomes of fixed length. Thus, GEP is a genotype–phenotype system, benefiting from a simple genome to keep and transmit the genetic information and a complex phenotype to explore the environment and adapt to it.
Background
Evolutionary algorithms use populations of individuals, select individuals according to fitness, and introduce genetic variation using one or more genetic operators. Their use in artificial computational systems dates back to the 1950s, when they were used to solve optimization problems (e.g. Box 1957[1] and Friedman 1959[2]). But it was with the introduction of evolution strategies by Rechenberg in 1965[3] that evolutionary algorithms gained popularity. A good overview text on evolutionary algorithms is the book "An Introduction to Genetic Algorithms" by Mitchell (1996).[4]
Gene expression programming[5] belongs to the family of evolutionary algorithms and is closely related to genetic algorithms and genetic programming. From genetic algorithms it inherited the linear chromosomes of fixed length; and from genetic programming it inherited the expressive parse trees of varied sizes and shapes.
In gene expression programming the linear chromosomes work as the genotype and the parse trees as the phenotype, creating a genotype/phenotype system. This genotype/phenotype system is multigenic, thus encoding multiple parse trees in each chromosome. This means that the computer programs created by GEP are composed of multiple parse trees. Because these parse trees are the result of gene expression, in GEP they are called expression trees. Masood Nekoei et al. used this style of expression programming in ABC optimization to develop ABCEP, a method reported to outperform other evolutionary algorithms.
Encoding: the genotype
The genome of gene expression programming consists of a linear, symbolic string or chromosome of fixed length composed of one or more genes of equal size. These genes, despite their fixed length, code for expression trees of different sizes and shapes. An example of a chromosome with two genes, each of size 9, is the string (position zero indicates the start of each gene):
012345678012345678
L+a-baccd**cLabacd
where “L” represents the natural logarithm function and “a”, “b”, “c”, and “d” represent the variables and constants used in a problem.
Expression trees: the phenotype
As shown above, the genes of gene expression programming all have the same size. However, these fixed-length strings code for expression trees of different sizes. This means that the size of the coding regions varies from gene to gene, allowing for adaptation and evolution to occur smoothly.
For example, the mathematical expression:
sqrt((a − b) × (c + d))
can also be represented as an expression tree:
[expression tree with the square root function at the root, a multiplication below it, and a subtraction and an addition whose arguments are the terminals a, b, c, and d]
where "Q” represents the square root function.
This kind of expression tree consists of the phenotypic expression of GEP genes, whereas the genes are linear strings encoding these complex structures. For this particular example, the linear string corresponds to:
01234567
Q*-+abcd
which is the straightforward reading of the expression tree from top to bottom and from left to right. These linear strings are called k-expressions (from Karva notation).
Going from k-expressions to expression trees is also very simple. For example, the following k-expression:
01234567890
Q*b**+baQba
is composed of two different terminals (the variables “a” and “b”), two different functions of two arguments (“*” and “+”), and a function of one argument (“Q”). Its expression gives:
[expression tree corresponding to sqrt(b × ((b × a) × (sqrt(a) + b)))]
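The reading just described is easy to mechanize. Below is a minimal Python sketch of a Karva decoder that builds the expression tree level by level, left to right, and prints it in infix form; the arity table and helper names (decode, to_infix) are illustrative rather than part of any particular GEP implementation.

```python
ARITY = {"Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}   # terminals have arity 0

def decode(k_expression):
    """Return the expression tree as nested [symbol, children] lists."""
    nodes = [[s, []] for s in k_expression]
    queue, i = [nodes[0]], 1          # nodes waiting for children; next unread symbol
    while queue:
        node = queue.pop(0)
        for _ in range(ARITY.get(node[0], 0)):
            node[1].append(nodes[i])
            queue.append(nodes[i])
            i += 1
    return nodes[0]

def to_infix(node):
    symbol, children = node
    if not children:
        return symbol
    if symbol == "Q":
        return "sqrt(" + to_infix(children[0]) + ")"
    return "(" + to_infix(children[0]) + " " + symbol + " " + to_infix(children[1]) + ")"

print(to_infix(decode("Q*-+abcd")))     # sqrt(((a - b) * (c + d)))
print(to_infix(decode("Q*b**+baQba")))  # sqrt((b * ((b * a) * (sqrt(a) + b))))
```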
K-expressions and genes
The k-expressions of gene expression programming correspond to the region of genes that gets expressed. This means that there might be sequences in the genes that are not expressed, which is indeed true for most genes. The reason for these noncoding regions is to provide a buffer of terminals so that all k-expressions encoded in GEP genes always correspond to valid programs or expressions.
The genes of gene expression programming are therefore composed of two different domains – a head and a tail – each with different properties and functions. The head is used mainly to encode the functions and variables chosen to solve the problem at hand, whereas the tail, while also used to encode the variables, provides essentially a reservoir of terminals to ensure that all programs are error-free.
For GEP genes the length of the tail is given by the formula:
t = h (nmax − 1) + 1
where h is the head's length and nmax is the maximum arity. For example, for a gene created using the set of functions F = {Q, +, −, ∗, /} and the set of terminals T = {a, b}, nmax = 2. And if we choose a head length of 15, then t = 15 × (2 − 1) + 1 = 16, which gives a gene length g of 15 + 16 = 31. The randomly generated string below is an example of one such gene:
0123456789012345678901234567890
*b+a-aQab+//+b+babbabbbababbaaa
It encodes the expression tree:
[expression tree corresponding to b × (a + (a − sqrt(a)))]
which, in this case, only uses 8 of the 31 elements that constitute the gene.
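A short sketch of how such genes can be generated and inspected, assuming the head/tail rule described above (head: functions or terminals; tail: terminals only; t = h(nmax − 1) + 1). The function and terminal sets and the helper names are illustrative.

```python
import random

FUNCTIONS = {"Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}   # symbol -> arity
TERMINALS = ["a", "b"]

def random_gene(head_length):
    """One GEP gene: a head of functions/terminals followed by a tail of terminals only."""
    n_max = max(FUNCTIONS.values())
    tail_length = head_length * (n_max - 1) + 1        # t = h(nmax - 1) + 1
    head = [random.choice(list(FUNCTIONS) + TERMINALS) for _ in range(head_length)]
    tail = [random.choice(TERMINALS) for _ in range(tail_length)]
    return "".join(head + tail)

def coding_length(gene):
    """Number of elements actually expressed (the length of the K-expression)."""
    needed, i = 1, 0
    while i < needed:
        needed += FUNCTIONS.get(gene[i], 0)
        i += 1
    return needed

gene = random_gene(15)                                   # gene length 15 + 16 = 31
print(gene, coding_length(gene))
print(coding_length("*b+a-aQab+//+b+babbabbbababbaaa"))  # 8, as in the example above
```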
It's not hard to see that, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, with the simplest composed of only one node (when the first element of a gene is a terminal) and the largest composed of as many nodes as there are elements in the gene (when all the elements in the head are functions with maximum arity).
It's also not hard to see that it is trivial to implement all kinds of genetic modification (mutation, inversion, insertion, recombination, and so on) with the guarantee that all resulting offspring encode correct, error-free programs.
Multigenic chromosomes
The chromosomes of gene expression programming are usually composed of more than one gene of equal length. Each gene codes for a sub-expression tree (sub-ET) or sub-program. Then the sub-ETs can interact with one another in different ways, forming a more complex program. The figure shows an example of a program composed of three sub-ETs.

In the final program the sub-ETs could be linked by addition or some other function, as there are no restrictions to the kind of linking function one might choose. Some examples of more complex linkers include taking the average, the median, the midrange, thresholding their sum to make a binomial classification, applying the sigmoid function to compute a probability, and so on. These linking functions are usually chosen a priori for each problem, but they can also be evolved elegantly and efficiently by the cellular system[6][7] of gene expression programming.
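A minimal sketch of multigenic expression with addition as the linking function, using the two-gene chromosome shown in the encoding example above (with L as the natural logarithm). The numeric evaluation and helper names are illustrative, not a reference implementation.

```python
import math

ARITY = {"L": 1, "Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}
APPLY = {"L": math.log, "Q": math.sqrt,
         "+": lambda x, y: x + y, "-": lambda x, y: x - y,
         "*": lambda x, y: x * y, "/": lambda x, y: x / y}

def evaluate_gene(gene, inputs):
    """Evaluate the K-expression at the start of one gene for a dict of terminal values."""
    arities = [ARITY.get(s, 0) for s in gene]
    needed, i = 1, 0
    while i < needed:                       # length of the expressed region (the ORF)
        needed += arities[i]
        i += 1
    starts, child = [], 1                   # index of each node's first child
    for node in range(needed):
        starts.append(child)
        child += arities[node]
    values = [None] * needed
    for node in reversed(range(needed)):    # evaluate bottom-up
        if arities[node] == 0:
            values[node] = inputs[gene[node]]
        else:
            args = values[starts[node]:starts[node] + arities[node]]
            values[node] = APPLY[gene[node]](*args)
    return values[0]

def evaluate_chromosome(chromosome, gene_length, inputs):
    genes = [chromosome[k:k + gene_length] for k in range(0, len(chromosome), gene_length)]
    return sum(evaluate_gene(g, inputs) for g in genes)   # linking function: addition

# the two-gene chromosome from the encoding example above (gene length 9):
print(evaluate_chromosome("L+a-baccd**cLabacd", 9, {"a": 2.0, "b": 1.0, "c": 3.0, "d": 4.0}))
```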
Cells and code reuse
In gene expression programming, homeotic genes control the interactions of the different sub-ETs or modules of the main program. The expression of such genes results in different main programs or cells, that is, they determine which genes are expressed in each cell and how the sub-ETs of each cell interact with one another. In other words, homeotic genes determine which sub-ETs are called upon and how often in which main program or cell and what kind of connections they establish with one another.
Homeotic genes and the cellular system
Homeotic genes have exactly the same kind of structural organization as normal genes and they are built using an identical process. They also contain a head domain and a tail domain, with the difference that the heads now contain linking functions and a special kind of terminals – genic terminals – that represent the normal genes. The expression of the normal genes results as usual in different sub-ETs, which in the cellular system are called ADFs (automatically defined functions). As for the tails, they contain only genic terminals, that is, derived features generated on the fly by the algorithm.
For example, the chromosome in the figure has three normal genes and one homeotic gene and encodes a main program that invokes three different functions a total of four times, linking them in a particular way.

From this example it is clear that the cellular system not only allows the unconstrained evolution of linking functions but also code reuse. And it shouldn't be hard to implement recursion in this system.
Multiple main programs and multicellular systems
Multicellular systems are composed of more than one homeotic gene. Each homeotic gene in this system puts together a different combination of sub-expression trees or ADFs, creating multiple cells or main programs.
For example, the program shown in the figure was created using a cellular system with two cells and three normal genes.

The applications of these multicellular systems are multiple and varied and, like the multigenic systems, they can be used both in problems with just one output and in problems with multiple outputs.
Other levels of complexity
The head/tail domain of GEP genes (both normal and homeotic) is the basic building block of all GEP algorithms. However, gene expression programming also explores other chromosomal organizations that are more complex than the head/tail structure. Essentially these complex structures consist of functional units or genes with a basic head/tail domain plus one or more extra domains. These extra domains usually encode random numerical constants that the algorithm relentlessly fine-tunes in order to find a good solution. For instance, these numerical constants may be the weights or factors in a function approximation problem (see the GEP-RNC algorithm below); they may be the weights and thresholds of a neural network (see the GEP-NN algorithm below); the numerical constants needed for the design of decision trees (see the GEP-DT algorithm below); the weights needed for polynomial induction; or the random numerical constants used to discover the parameter values in a parameter optimization task.
The basic gene expression algorithm
The fundamental steps of the basic gene expression algorithm are listed below in pseudocode:
- Select function set;
- Select terminal set;
- Load dataset for fitness evaluation;
- Create chromosomes of initial population randomly;
- For each program in population:
- Express chromosome;
- Execute program;
- Evaluate fitness;
- Verify stop condition;
- Select programs;
- Replicate selected programs to form the next population;
- Modify chromosomes using genetic operators;
- Go to step 5.
The first four steps prepare all the ingredients that are needed for the iterative loop of the algorithm (steps 5 through 10). Of these preparative steps, the crucial one is the creation of the initial population, which is generated randomly using the elements of the function and terminal sets.
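A schematic Python sketch of this loop is given below. The fitness function is only a stand-in for expressing the chromosome, executing the program, and scoring it on training data, and the modification step is reduced to a single point mutation in the head; all names and parameters are illustrative.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "Q": 1}
TERMINALS = ["a", "b"]
HEAD = 7
TAIL = HEAD * (max(FUNCTIONS.values()) - 1) + 1   # t = h(nmax - 1) + 1

def random_chromosome():
    head = [random.choice(list(FUNCTIONS) + TERMINALS) for _ in range(HEAD)]
    tail = [random.choice(TERMINALS) for _ in range(TAIL)]
    return "".join(head + tail)

def fitness(chromosome):
    # stand-in fitness: a real run would express and execute the program here
    return sum(symbol == "a" for symbol in chromosome[:HEAD])

def modify(chromosome):
    # single point mutation in the head (see the genetic operators described below)
    i = random.randrange(HEAD)
    symbols = list(chromosome)
    symbols[i] = random.choice(list(FUNCTIONS) + TERMINALS)
    return "".join(symbols)

population = [random_chromosome() for _ in range(30)]
best = max(population, key=fitness)
for generation in range(100):
    if fitness(best) == HEAD:                     # stop condition for the toy fitness
        break
    parents = sorted(population, key=fitness, reverse=True)[:len(population) // 2]
    population = [best] + [modify(random.choice(parents)) for _ in range(len(population) - 1)]
    best = max(population, key=fitness)
print(generation, best, fitness(best))
```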
Populations of programs
Like all evolutionary algorithms, gene expression programming works with populations of individuals, which in this case are computer programs. Therefore, some kind of initial population must be created to get things started. Subsequent populations are descendants, via selection and genetic modification, of the initial population.
In the genotype/phenotype system of gene expression programming, it is only necessary to create the simple linear chromosomes of the individuals without worrying about the structural soundness of the programs they code for, as their expression always results in syntactically correct programs.
Fitness functions and the selection environment
Fitness functions and selection environments (called training datasets in machine learning) are the two facets of fitness and are therefore intricately connected. Indeed, the fitness of a program depends not only on the cost function used to measure its performance but also on the training data chosen to evaluate fitness.
The selection environment or training data
The selection environment consists of the set of training records, which are also called fitness cases. These fitness cases could be a set of observations or measurements concerning some problem, and they form what is called the training dataset.
The quality of the training data is essential for the evolution of good solutions. A good training set should be representative of the problem at hand and also well-balanced, otherwise the algorithm might get stuck at some local optimum. In addition, it is also important to avoid using unnecessarily large datasets for training as this will slow things down unnecessarily. A good rule of thumb is to choose enough records for training to enable a good generalization in the validation data and leave the remaining records for validation and testing.
Fitness functions
Broadly speaking, there are essentially three different kinds of problems based on the kind of prediction being made:
- Problems involving numeric (continuous) predictions;
- Problems involving categorical or nominal predictions, both binomial and multinomial;
- Problems involving binary or Boolean predictions.
The first type of problem goes by the name of regression; the second is known as classification, with logistic regression as a special case where, besides the crisp classifications like "Yes" or "No", a probability is also attached to each outcome; and the last one is related to Boolean algebra and logic synthesis.
Fitness functions for regression
In regression, the response or dependent variable is numeric (usually continuous) and therefore the output of a regression model is also continuous. So it's quite straightforward to evaluate the fitness of the evolving models by comparing the output of the model to the value of the response in the training data.
There are several basic fitness functions for evaluating model performance, with the most common being based on the error or residual between the model output and the actual value. Such functions include the mean squared error, root mean squared error, mean absolute error, relative squared error, root relative squared error, relative absolute error, and others.
All these standard measures offer a fine granularity or smoothness to the solution space and therefore work very well for most applications. But some problems might require a coarser evolution, such as determining if a prediction is within a certain interval, for instance less than 10% of the actual value. However, even if one is only interested in counting the hits (that is, a prediction that is within the chosen interval), making populations of models evolve based on just the number of hits each program scores is usually not very efficient due to the coarse granularity of the fitness landscape. Thus the solution usually involves combining these coarse measures with some kind of smooth function such as the standard error measures listed above.
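The following sketch combines a coarse hits count (predictions within 10% of the target) with a smooth RMSE-based term, as suggested above; the particular weighting and the function name are illustrative.

```python
import math

def regression_fitness(predicted, target, tolerance=0.10):
    """Coarse hits (within 10% of the target) smoothed by an RMSE-based term."""
    n = len(target)
    hits = sum(abs(p - t) <= tolerance * abs(t) for p, t in zip(predicted, target))
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)
    return hits + 1.0 / (1.0 + rmse)        # higher is better; the RMSE term breaks ties

print(regression_fitness([1.02, 2.4, 2.9], [1.0, 2.5, 3.0]))
```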
Fitness functions based on the correlation coefficient and R-square are also very smooth. For regression problems, these functions work best by combining them with other measures because, by themselves, they only tend to measure correlation, not caring for the range of values of the model output. So by combining them with functions that work at approximating the range of the target values, they form very efficient fitness functions for finding models with good correlation and good fit between predicted and actual values.
Fitness functions for classification and logistic regression
The design of fitness functions for classification and logistic regression takes advantage of three different characteristics of classification models. The most obvious is just counting the hits, that is, if a record is classified correctly it is counted as a hit. This fitness function is very simple and works well for simple problems, but for more complex problems or highly unbalanced datasets it gives poor results.
One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications. In a binary classification task, correct classifications can be 00 or 11. The "00" representation means that a negative case (represented by "0”) was correctly classified, whereas the "11" means that a positive case (represented by "1”) was correctly classified. Classifications of the type "00" are called true negatives (TN) and "11" true positives (TP).
There are also two types of incorrect classifications and they are represented by 01 and 10. They are called false positives (FP) when the actual value is 0 and the model predicts a 1; and false negatives (FN) when the target is 1 and the model predicts a 0. The counts of TP, TN, FP, and FN are usually kept on a table known as the confusion matrix.
|  | Predicted: Yes | Predicted: No |
|---|---|---|
| Actual: Yes | TP | FN |
| Actual: No | FP | TN |
So by counting the TP, TN, FP, and FN and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions. Some popular fitness functions based on the confusion matrix include sensitivity/specificity, recall/precision, F-measure, Jaccard similarity, Matthews correlation coefficient, and cost/gain matrix which combines the costs and gains assigned to the 4 different types of classifications.
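A minimal sketch of computing the confusion-matrix counts and two of the measures mentioned above, the F-measure and the Matthews correlation coefficient; the encoding of classes as 0/1 and the function names are illustrative.

```python
import math

def confusion_counts(predicted, actual):
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    return tp, tn, fp, fn

def f_measure(tp, tn, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def matthews_cc(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

counts = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(counts, f_measure(*counts), matthews_cc(*counts))
```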
These functions based on the confusion matrix are quite sophisticated and are adequate to solve most problems efficiently. But there is another dimension to classification models which is key to exploring more efficiently the solution space and therefore results in the discovery of better classifiers. This new dimension involves exploring the structure of the model itself, which includes not only the domain and range, but also the distribution of the model output and the classifier margin.
By exploring this other dimension of classification models and then combining the information about the model with the confusion matrix, it is possible to design very sophisticated fitness functions that allow the smooth exploration of the solution space. For instance, one can combine some measure based on the confusion matrix with the mean squared error evaluated between the raw model outputs and the actual values. Or combine the F-measure with the R-square evaluated for the raw model output and the target; or the cost/gain matrix with the correlation coefficient, and so on. More exotic fitness functions that explore model granularity include the area under the ROC curve and rank measure.
Also related to this new dimension of classification models is the idea of assigning probabilities to the model output, which is what is done in logistic regression. Then it is also possible to use these probabilities and evaluate the mean squared error (or some other similar measure) between the probabilities and the actual values, then combine this with the confusion matrix to create very efficient fitness functions for logistic regression. Popular examples of fitness functions based on the probabilities include maximum likelihood estimation and hinge loss.
Fitness functions for Boolean problems
In logic there is no model structure (as defined above for classification and logistic regression) to explore: the domain and range of logical functions comprise only 0's and 1's or false and true. So, the fitness functions available for Boolean algebra can only be based on the hits or on the confusion matrix as explained in the section above.
Selection and elitism
Roulette-wheel selection is perhaps the most popular selection scheme used in evolutionary computation. It involves mapping the fitness of each program to a slice of the roulette wheel proportional to its fitness. Then the roulette is spun as many times as there are programs in the population in order to keep the population size constant. So, with roulette-wheel selection programs are selected both according to fitness and the luck of the draw, which means that sometimes the best traits might be lost. However, by combining roulette-wheel selection with the cloning of the best program of each generation, one guarantees that at least the very best traits are not lost. This technique of cloning the best-of-generation program is known as simple elitism and is used by most stochastic selection schemes.
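A minimal sketch of roulette-wheel selection combined with simple elitism, assuming non-negative fitness values; the function name and population encoding are illustrative.

```python
import random

def roulette_with_elitism(population, fitnesses):
    """Select a new population of the same size; the best individual is always cloned."""
    best = max(range(len(population)), key=lambda i: fitnesses[i])
    total = sum(fitnesses)
    new_population = [population[best]]                 # simple elitism
    for _ in range(len(population) - 1):
        spin = random.uniform(0, total)
        cumulative = 0.0
        for individual, fitness in zip(population, fitnesses):
            cumulative += fitness
            if cumulative >= spin:
                new_population.append(individual)
                break
    return new_population

print(roulette_with_elitism(["p1", "p2", "p3"], [1.0, 5.0, 2.0]))
```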
Reproduction with modification
The reproduction of programs involves first the selection and then the reproduction of their genomes. Genome modification is not required for reproduction, but without it adaptation and evolution won't take place.
Replication and selection
The selection operator selects the programs for the replication operator to copy. Depending on the selection scheme, the number of copies one program originates may vary, with some programs getting copied more than once while others are copied just once or not at all. In addition, selection is usually set up so that the population size remains constant from one generation to another.
The replication of genomes in nature is very complex and it took scientists a long time to discover the DNA double helix and propose a mechanism for its replication. But the replication of strings is trivial in artificial evolutionary systems, where only an instruction to copy strings is required to pass all the information in the genome from generation to generation.
The replication of the selected programs is a fundamental piece of all artificial evolutionary systems, but for evolution to occur it needs to be implemented not with the usual precision of a copy instruction, but rather with a few errors thrown in. Indeed, genetic diversity is created with genetic operators such as mutation, recombination, transposition, inversion, and many others.
Mutation
In gene expression programming mutation is the most important genetic operator.[8] It changes genomes by replacing one element with another. The accumulation of many small changes over time can create great diversity.
In gene expression programming mutation is totally unconstrained, which means that in each gene domain any domain symbol can be replaced by another. For example, in the heads of genes any function can be replaced by a terminal or another function, regardless of the number of arguments in this new function; and a terminal can be replaced by a function or another terminal.
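A minimal sketch of such domain-aware point mutation: head positions may receive any function or terminal, while tail positions only receive terminals. The symbol sets and mutation rate are illustrative.

```python
import random

FUNCTIONS = ["Q", "+", "-", "*", "/"]
TERMINALS = ["a", "b"]

def mutate(gene, head_length, rate=0.05):
    """Point mutation: head positions may take any symbol, tail positions only terminals."""
    symbols = list(gene)
    for i in range(len(symbols)):
        if random.random() < rate:
            pool = FUNCTIONS + TERMINALS if i < head_length else TERMINALS
            symbols[i] = random.choice(pool)
    return "".join(symbols)

print(mutate("*b+a-aQab+//+b+babbabbbababbaaa", head_length=15))
```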
Recombination
Recombination usually involves two parent chromosomes to create two new chromosomes by combining different parts from the parent chromosomes. And as long as the parent chromosomes are aligned and the exchanged fragments are homologous (that is, occupy the same position in the chromosome), the new chromosomes created by recombination will always encode syntactically correct programs.
Different kinds of crossover are easily implemented either by changing the number of parents involved (there's no reason for choosing only two); the number of split points; or the way one chooses to exchange the fragments, for example, either randomly or in some orderly fashion. For example, gene recombination, which is a special case of recombination, can be done by exchanging homologous genes (genes that occupy the same position in the chromosome) or by exchanging genes chosen at random from any position in the chromosome.
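A minimal sketch of one-point recombination between two aligned chromosomes of equal length; because the split point is the same in both parents, the offspring keep a valid head/tail structure. The parent strings are illustrative.

```python
import random

def one_point_recombination(parent1, parent2):
    """Exchange everything after a common split point between two equal-length chromosomes."""
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

# two illustrative two-gene chromosomes (gene length 9, head 4, tail 5):
print(one_point_recombination("L+a-baccd**cLabacd", "+L*baabcdQa-cdbbaa"))
```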
Transposition
Transposition involves the introduction of an insertion sequence somewhere in a chromosome. In gene expression programming insertion sequences might appear anywhere in the chromosome, but they are only inserted in the heads of genes. This method guarantees that even insertion sequences from the tails result in error-free programs.
For transposition to work properly, it must preserve chromosome length and gene structure. So, in gene expression programming transposition can be implemented using two different methods: the first creates a shift at the insertion site, followed by a deletion at the end of the head; the second overwrites the local sequence at the target site and therefore is easier to implement. Both methods can be implemented to operate between chromosomes or within a chromosome or even within a single gene.
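A minimal sketch of the first method (shift at the insertion site, deletion at the end of the head), copying a short sequence from anywhere in the chromosome into the head; sequence length, insertion sites, and names are illustrative.

```python
import random

def is_transposition(gene, head_length, max_length=3):
    """Copy a short sequence from anywhere in the gene into the head, then trim the head."""
    length = random.randint(1, max_length)
    start = random.randrange(0, len(gene) - length + 1)        # source site: anywhere
    sequence = gene[start:start + length]
    target = random.randrange(1, head_length)                  # insertion site inside the head
    head, tail = gene[:head_length], gene[head_length:]
    new_head = (head[:target] + sequence + head[target:])[:head_length]  # shift, then delete overflow
    return new_head + tail

print(is_transposition("*b+a-aQab+//+b+babbabbbababbaaa", head_length=15))
```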
Inversion
Inversion is an interesting operator, especially powerful for combinatorial optimization.[9] It consists of inverting a small sequence within a chromosome.
In gene expression programming it can be easily implemented in all gene domains and, in all cases, the offspring produced is always syntactically correct. For any gene domain, a sequence (ranging from at least two elements to as big as the domain itself) is chosen at random within that domain and then inverted.
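A minimal sketch of inversion restricted to the head domain: a random sequence of at least two elements is reversed in place; the gene and parameter choices are illustrative.

```python
import random

def head_inversion(gene, head_length):
    """Reverse a random sequence of at least two elements inside the head."""
    start = random.randrange(0, head_length - 1)
    end = random.randrange(start + 2, head_length + 1)          # exclusive end of the sequence
    head = gene[:head_length]
    head = head[:start] + head[start:end][::-1] + head[end:]
    return head + gene[head_length:]

print(head_inversion("*b+a-aQab+//+b+babbabbbababbaaa", head_length=15))
```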
Other genetic operators
Several other genetic operators exist and in gene expression programming, with its different genes and gene domains, the possibilities are endless. For example, genetic operators such as one-point recombination, two-point recombination, gene recombination, uniform recombination, gene transposition, root transposition, domain-specific mutation, domain-specific inversion, domain-specific transposition, and so on, are easily implemented and widely used.
The GEP-RNC algorithm
Numerical constants are essential elements of mathematical and statistical models and therefore it is important to allow their integration in the models designed by evolutionary algorithms.
Gene expression programming solves this problem very elegantly through the use of an extra gene domain – the Dc – for handling random numerical constants (RNC). By combining this domain with a special terminal placeholder for the RNCs, a richly expressive system can be created.
Structurally, the Dc comes after the tail, has a length equal to the size of the tail t, and is composed of the symbols used to represent the RNCs.
For example, below is shown a simple chromosome composed of only one gene with a head size of 7 (the Dc stretches over positions 15–22):
01234567890123456789012
+?*+?**aaa??aaa68083295
where the terminal "?” represents the placeholder for the RNCs. This kind of chromosome is expressed exactly as shown above, giving:
[expression tree corresponding to ? + ((a × a + a × ?) × ?)]
Then the ?'s in the expression tree are replaced from left to right and from top to bottom by the symbols (for simplicity represented by numerals) in the Dc, giving:
[the same expression tree with the ?'s replaced, in order, by the Dc symbols 6, 8, and 0]
The values corresponding to these symbols are kept in an array. (For simplicity, the number represented by the numeral indicates the order in the array.) For instance, for the following 10 element array of RNCs:
- C = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}
the expression tree above gives:
[expression tree corresponding to 0.93 + ((a × a + 0.611 × a) × 2.737)]
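A minimal sketch of this Dc expression step for the gene above (head 7, tail 8, Dc 8): the "?" placeholders in the expressed region are collected in Karva reading order and bound, one by one, to constants addressed by the Dc numerals. Helper names are illustrative.

```python
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}

def expressed_length(code):
    needed, i = 1, 0
    while i < needed:
        needed += ARITY.get(code[i], 0)
        i += 1
    return needed

def bind_constants(gene, head, tail, constants):
    code = gene[:head + tail]                       # head/tail part (the ORF lives here)
    dc = gene[head + tail:]
    orf = code[:expressed_length(code)]
    values = {}
    for k, position in enumerate(i for i, s in enumerate(orf) if s == "?"):
        values[position] = constants[int(dc[k])]    # k-th "?" gets the k-th Dc symbol
    return orf, values

C = [0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755]
orf, bound = bind_constants("+?*+?**aaa??aaa68083295", head=7, tail=8, constants=C)
print(orf, bound)     # "+?*+?**aaa?" with the three ?'s bound to C[6], C[8], C[0]
```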
This elegant structure for handling random numerical constants is at the heart of different GEP systems, such as GEP neural networks and GEP decision trees.
Like the basic gene expression algorithm, the GEP-RNC algorithm is also multigenic and its chromosomes are decoded as usual by expressing one gene after another and then linking them all together by the same kind of linking process.
The genetic operators used in the GEP-RNC system are an extension of the genetic operators of the basic GEP algorithm (see above), and they can all be straightforwardly implemented in these new chromosomes. The basic operators of mutation, inversion, transposition, and recombination are also used in the GEP-RNC algorithm. Furthermore, special Dc-specific operators such as mutation, inversion, and transposition are also used to aid in a more efficient circulation of the RNCs among individual programs. In addition, there is also a special mutation operator that allows the permanent introduction of variation in the set of RNCs. The initial set of RNCs is randomly created at the beginning of a run, which means that, for each gene in the initial population, a specified number of numerical constants, chosen from a certain range, are randomly generated. Their circulation and mutation are then enabled by the genetic operators.
Neural networks
An artificial neural network (ANN or NN) is a computational device that consists of many simple connected units or neurons. The connections between the units are usually weighted by real-valued weights. These weights are the primary means of learning in neural networks and a learning algorithm is usually used to adjust them.
Structurally, a neural network has three different classes of units: input units, hidden units, and output units. An activation pattern is presented at the input units and then spreads in a forward direction from the input units through one or more layers of hidden units to the output units. The activation coming into one unit from another unit is multiplied by the weights on the links over which it spreads. All incoming activation is then added together and the unit becomes activated only if the incoming result is above the unit's threshold.
In summary, the basic components of a neural network are the units, the connections between the units, the weights, and the thresholds. So, in order to fully simulate an artificial neural network one must somehow encode these components in a linear chromosome and then be able to express them in a meaningful way.
In GEP neural networks (GEP-NN or GEP nets), the network architecture is encoded in the usual structure of a head/tail domain.[10] The head contains special functions/neurons that activate the hidden and output units (in the GEP context, all these units are more appropriately called functional units) and terminals that represent the input units. The tail, as usual, contains only terminals/input units.
Besides the head and the tail, these neural network genes contain two additional domains, Dw and Dt, for encoding the weights and thresholds of the neural network. Structurally, the Dw comes after the tail and its length dw depends on the head size h and maximum arity nmax and is evaluated by the formula:
dw = h × nmax
The Dt comes after Dw and has a length dt equal to t. Both domains are composed of symbols representing the weights and thresholds of the neural network.
For each NN-gene, the weights and thresholds are created at the beginning of each run, but their circulation and adaptation are guaranteed by the usual genetic operators of mutation, transposition, inversion, and recombination. In addition, special operators are also used to allow a constant flow of genetic variation in the set of weights and thresholds.
For example, below is shown a neural network with two input units (i1 and i2), two hidden units (h1 and h2), and one output unit (o1). It has a total of six connections with six corresponding weights represented by the numerals 1–6 (for simplicity, the thresholds are all equal to 1 and are omitted):
[diagram of a neural network with input units i1 and i2, hidden units h1 and h2, output unit o1, and the weights 1–6 on its six connections]
This representation is the canonical neural network representation, but neural networks can also be represented by a tree, which, in this case, corresponds to:
[tree representation: a D node at the root whose two arguments are the sub-trees D(a, b) and D(a, b)]
where "a” and "b” represent the two inputs i1 and i2 and "D” represents a function with connectivity two. This function adds all its weighted arguments and then thresholds this activation in order to determine the forwarded output. This output (zero or one in this simple case) depends on the threshold of each unit, that is, if the total incoming activation is equal to or greater than the threshold, then the output is one, zero otherwise.
The above NN-tree can be linearized as follows:
0123456789012
DDDabab654321
where the structure in positions 7–12 (Dw) encodes the weights. The values of each weight are kept in an array and retrieved as necessary for expression.
As a more concrete example, below is shown a neural net gene for the exclusive-or problem. It has a head size of 3 and Dw size of 6:
0123456789012
DDDabab393257
Its expression results in the following neural network:
[the neural network encoded by the gene, with its weights addressed by the Dw symbols 3, 9, 3, 2, 5, and 7]
which, for the set of weights:
- W = {−1.978, 0.514, −0.465, 1.22, −1.686, −1.797, 0.197, 1.606, 0, 1.753}
gives:
[the network above with the numeric weights substituted, computing the exclusive-or of a and b]
which is a perfect solution to the exclusive-or function.
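A sketch that checks this solution numerically. It decodes the gene above (head "DDD", tail "abab", Dw "393257"), assumes the Dw weights are assigned to the tree's connections in the same top-to-bottom, left-to-right order used elsewhere in GEP, applies the stated threshold of 1 to every D unit, and prints the truth table; the weight-assignment convention and helper names are assumptions of this example.

```python
W = [-1.978, 0.514, -0.465, 1.22, -1.686, -1.797, 0.197, 1.606, 0, 1.753]

def evaluate(gene, dw_length, weights, a, b):
    code = gene[:len(gene) - dw_length]
    dw = [weights[int(s)] for s in gene[len(gene) - dw_length:]]
    arities = [2 if s == "D" else 0 for s in code]
    needed, i = 1, 0
    while i < needed:                 # number of expressed nodes
        needed += arities[i]
        i += 1
    starts, child = [], 1             # index of each node's first child
    for node in range(needed):
        starts.append(child)
        child += arities[node]
    weight_of, link = {}, 0           # assumed: weights assigned to links top-down, left-right
    for node in range(needed):
        for c in range(arities[node]):
            weight_of[starts[node] + c] = dw[link]
            link += 1
    values = [None] * needed
    for node in reversed(range(needed)):
        if arities[node] == 0:
            values[node] = a if code[node] == "a" else b
        else:
            total = sum(weight_of[starts[node] + c] * values[starts[node] + c]
                        for c in range(arities[node]))
            values[node] = 1 if total >= 1 else 0    # threshold of 1 for every D unit
    return values[0]

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, evaluate("DDDabab393257", dw_length=6, weights=W, a=x, b=y))
# expected output: the truth table of exclusive-or (0, 1, 1, 0)
```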
Besides simple Boolean functions with binary inputs and binary outputs, the GEP-nets algorithm can handle all kinds of functions or neurons (linear neuron, tanh neuron, atan neuron, logistic neuron, limit neuron, radial basis and triangular basis neurons, all kinds of step neurons, and so on). Also interesting is that the GEP-nets algorithm can use all these neurons together and let evolution decide which ones work best to solve the problem at hand. So, GEP-nets can be used not only in Boolean problems but also in logistic regression, classification, and regression. In all cases, GEP-nets can be implemented not only with multigenic systems but also cellular systems, both unicellular and multicellular. Furthermore, multinomial classification problems can also be tackled in one go by GEP-nets both with multigenic systems and multicellular systems.
Decision trees
Decision trees (DT) are classification models where a series of questions and answers are mapped using nodes and directed edges.
Decision trees have three types of nodes: a root node, internal nodes, and leaf or terminal nodes. The root node and all internal nodes represent test conditions for different attributes or variables in a dataset. Leaf nodes specify the class label for all different paths in the tree.
Most decision tree induction algorithms involve selecting an attribute for the root node and then making the same kind of informed decision for all the other nodes in the tree.
Decision trees can also be created by gene expression programming,[11] with the advantage that all the decisions concerning the growth of the tree are made by the algorithm itself without any kind of human input.
There are basically two different types of DT algorithms: one for inducing decision trees with only nominal attributes and another for inducing decision trees with both numeric and nominal attributes. This aspect of decision tree induction also carries to gene expression programming and there are two GEP algorithms for decision tree induction: the evolvable decision trees (EDT) algorithm for dealing exclusively with nominal attributes and the EDT-RNC (EDT with random numerical constants) for handling both nominal and numeric attributes.
In the decision trees induced by gene expression programming, the attributes behave as function nodes in the basic gene expression algorithm, whereas the class labels behave as terminals. This means that attribute nodes have also associated with them a specific arity or number of branches that will determine their growth and, ultimately, the growth of the tree. Class labels behave like terminals, which means that for a k-class classification task, a terminal set with k terminals is used, representing the k different classes.
The rules for encoding a decision tree in a linear genome are very similar to the rules used to encode mathematical expressions (see above). So, for decision tree induction the genes also have a head and a tail, with the head containing attributes and terminals and the tail containing only terminals. This again ensures that all decision trees designed by GEP are always valid programs. Furthermore, the size of the tail t is also dictated by the head size h and the number of branches of the attribute with the most branches, nmax, and is evaluated by the same equation:
t = h (nmax − 1) + 1
For example, consider the decision tree below to decide whether to play outside:
[decision tree with Humidity at the root and the attributes Outlook and Windy below it, leading to the class labels Yes and No]
It can be linearly encoded as:
01234567
HOWbaaba
where “H” represents the attribute Humidity, “O” the attribute Outlook, “W” represents Windy, and “a” and “b” the class labels "Yes" and "No" respectively. Note that the edges connecting the nodes are properties of the data, specifying the type and number of branches of each attribute, and therefore don't have to be encoded.
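A minimal sketch of decoding this gene, assuming the branch counts Humidity = 2, Outlook = 3, and Windy = 2 are supplied by the data as described above; the dictionaries and helper names are illustrative.

```python
BRANCHES = {"H": 2, "O": 3, "W": 2}   # branch counts assumed for this example
LABELS = {"a": "Yes", "b": "No"}

def decode_tree(gene):
    """Build the decision tree level by level, left to right, from the linear gene."""
    nodes = [{"symbol": s, "children": []} for s in gene]
    queue, i = [nodes[0]], 1
    while queue:
        node = queue.pop(0)
        for _ in range(BRANCHES.get(node["symbol"], 0)):
            node["children"].append(nodes[i])
            queue.append(nodes[i])
            i += 1
    return nodes[0]

def show(node, depth=0):
    print("  " * depth + LABELS.get(node["symbol"], node["symbol"]))
    for child in node["children"]:
        show(child, depth + 1)

show(decode_tree("HOWbaaba"))   # Humidity at the root, Outlook and Windy below it
```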
The process of decision tree induction with gene expression programming starts, as usual, with an initial population of randomly created chromosomes. Then the chromosomes are expressed as decision trees and their fitness evaluated against a training dataset. According to fitness they are then selected to reproduce with modification. The genetic operators are exactly the same that are used in a conventional unigenic system, for example, mutation, inversion, transposition, and recombination.
Decision trees with both nominal and numeric attributes are also easily induced with gene expression programming using the framework described above for dealing with random numerical constants. The chromosomal architecture includes an extra domain for encoding random numerical constants, which are used as thresholds for splitting the data at each branching node. For example, the gene below with a head size of 5 (the Dc starts at position 16):
012345678901234567890
WOTHabababbbabba46336
encodes the decision tree shown below:
[decision tree with Windy at the root and the attributes Outlook, Temperature, and Humidity below it, each node tagged with a numeral referring to a random numerical constant in the Dc]
In this system, every node in the head, irrespective of its type (numeric attribute, nominal attribute, or terminal), has associated with it a random numerical constant, which for simplicity in the example above is represented by a numeral 0–9. These random numerical constants are encoded in the Dc domain and their expression follows a very simple scheme: from top to bottom and from left to right, the elements in Dc are assigned one-by-one to the elements in the decision tree. So, for the following array of RNCs:
- C = {62, 51, 68, 83, 86, 41, 43, 44, 9, 67}
the decision tree above results in:
[the same decision tree with the numerals replaced by the corresponding constants from the array C]
which can also be represented more colorfully as a conventional decision tree:
[conventional rendering of the same decision tree]
Criticism
GEP has been criticized for not being a major improvement over other genetic programming techniques. In many experiments, it did not perform better than existing methods.[12]
Software
Commercial applications
- GeneXproTools
- GeneXproTools is a predictive analytics suite developed by Gepsoft. GeneXproTools modeling frameworks include logistic regression, classification, regression, time series prediction, and logic synthesis. GeneXproTools implements the basic gene expression algorithm and the GEP-RNC algorithm, both used in all the modeling frameworks of GeneXproTools.
Open-source libraries
- GEP4J – GEP for Java Project
- Created by Jason Thomas, GEP4J is an open-source implementation of gene expression programming in Java. It implements different GEP algorithms, including evolving decision trees (with nominal, numeric, or mixed attributes) and automatically defined functions. GEP4J is hosted at Google Code.
- PyGEP – Gene Expression Programming for Python
- Created by Ryan O'Neil as a simple library for the academic study of gene expression programming in Python, aiming for ease of use and rapid implementation. It implements standard multigenic chromosomes and the genetic operators mutation, crossover, and transposition. PyGEP is hosted at Google Code.
- jGEP – Java GEP toolkit
- Created by Matthew Sottile to rapidly build Java prototype codes that use GEP, which can then be rewritten in a language such as C or Fortran for speed. jGEP is hosted at SourceForge.
Further reading
[edit]- Ferreira, C. (2006). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer-Verlag. ISBN 3-540-32796-7.
- Ferreira, C. (2002). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Portugal: Angra do Heroismo. ISBN 972-95890-5-4.
References
- ^ Box, G. E. P., 1957. Evolutionary operation: A method for increasing industrial productivity. Applied Statistics, 6, 81–101.
- ^ Friedman, G. J., 1959. Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171–184.
- ^ Rechenberg, Ingo (1973). Evolutionsstrategie. Stuttgart: Holzmann-Froboog. ISBN 3-7728-0373-3.
- ^ Mitchell, Melanie (1996). An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press. ISBN 978-0-262-13316-6.
- ^ Ferreira, C. (2001). "Gene Expression Programming: A New Adaptive Algorithm for Solving Problems" (PDF). Complex Systems, Vol. 13, issue 2: 87–129.
- ^ Ferreira, C. (2002). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Portugal: Angra do Heroismo. ISBN 972-95890-5-4.
- ^ Ferreira, C. (2006). "Automatically Defined Functions in Gene Expression Programming" (PDF). In N. Nedjah, L. de M. Mourelle, A. Abraham, eds., Genetic Systems Programming: Theory and Experiences, Studies in Computational Intelligence, Vol. 13, pp. 21–56, Springer-Verlag.
- ^ Ferreira, C. (2002). "Mutation, Transposition, and Recombination: An Analysis of the Evolutionary Dynamics" (PDF). In H. J. Caulfield, S.-H. Chen, H.-D. Cheng, R. Duro, V. Honavar, E. E. Kerre, M. Lu, M. G. Romay, T. K. Shih, D. Ventura, P. P. Wang, Y. Yang, eds., Proceedings of the 6th Joint Conference on Information Sciences, 4th International Workshop on Frontiers in Evolutionary Algorithms, pages 614–617, Research Triangle Park, North Carolina, USA.
- ^ Ferreira, C. (2002). "Combinatorial Optimization by Gene Expression Programming: Inversion Revisited" (PDF). In J. M. Santos and A. Zapico, eds., Proceedings of the Argentine Symposium on Artificial Intelligence, pages 160–174, Santa Fe, Argentina.
- ^ Ferreira, C. (2006). "Designing Neural Networks Using Gene Expression Programming" (PDF). In A. Abraham, B. de Baets, M. Köppen, and B. Nickolay, eds., Applied Soft Computing Technologies: The Challenge of Complexity, pages 517–536, Springer-Verlag.
- ^ Ferreira, C. (2006). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer-Verlag. ISBN 3-540-32796-7.
- ^ Oltean, M.; Grosan, C. (2003), "A comparison of several linear genetic programming techniques", Complex Systems, 14 (4): 285–314, doi:10.25088/ComplexSystems.14.4.285
External links
- GEP home page, maintained by the inventor of gene expression programming.
- GeneXproTools, commercial GEP software.
Gene expression programming
Introduction
Definition and Overview
Gene expression programming (GEP) is an evolutionary computation technique that evolves computer programs through a genotype-phenotype mapping system, where the genotype consists of linear chromosomes and the phenotype comprises nonlinear expression trees (ETs).[2] In GEP, populations of individuals—each represented by a chromosome—are subjected to selection based on fitness evaluation of their corresponding ETs, followed by the application of genetic operators to introduce variation, akin to processes in genetic algorithms (GAs) and genetic programming (GP).[2] This approach, developed by Candida Ferreira in 1999, enables the automatic discovery of mathematical expressions or decision trees for problem-solving in domains such as symbolic regression and classification.[2]

A key feature of GEP is the clear separation between the genotype, which is a compact linear string easy to manipulate and modify, and the phenotype, which is a functional ET executed for fitness assessment.[2] This separation enhances evolvability by allowing efficient genetic operations on the genotype while evaluating complex, tree-structured phenotypes, addressing limitations in tree-based GP where direct manipulation of nonlinear structures leads to inefficiency and bloat.[2] Compared to traditional GP, GEP demonstrates superior computational efficiency, often solving complex problems orders of magnitude faster due to the streamlined representation and reduced structural complexity.[2]

The basic workflow in GEP begins with the generation of initial linear chromosomes, each encoding a K-expression—a condensed notation that directly translates into an ET through a reading process from left to right and top to bottom.[2] The K-expression is then decoded into the corresponding ET, which is evaluated against a fitness function to measure performance on the target problem.[2] Selected high-fitness individuals reproduce, incorporating modifications via genetic operators, and the cycle repeats until convergence or a stopping criterion is met.[2]

| Aspect | GEP | GP | GA |
|---|---|---|---|
| Representation | Linear chromosomes encoding nonlinear ETs | Nonlinear parse trees | Linear strings (bit or fixed-length) |
| Evolvability | High, due to genotype-phenotype separation | Moderate, prone to bloat and inefficiency | Low for complex functions |
| Computational Efficiency | High, linear manipulation with nonlinear execution | Low, direct tree operations | Moderate, but limited expressiveness |
History and Development
Gene expression programming (GEP) was invented by Cândida Ferreira in 1999 as a novel evolutionary computation technique inspired by biological gene expression processes.[7] This development occurred during her tenure as an assistant professor of biochemistry at the University of the Azores, where her research intersected molecular biology, evolution, and machine learning.[7] Ferreira's motivation stemmed from the need to overcome key limitations in traditional genetic programming (GP), particularly its tree-based representation, which often led to code bloat—increased program size without fitness improvements—and computational inefficiency due to complex structural modifications.[2] By introducing a linear chromosomal encoding that separates genotype from phenotype, GEP aimed to enable more efficient evolution of computer programs while maintaining the expressive power of GP.[2]

The paradigm was first formally described in Ferreira's seminal 2001 paper, "Gene Expression Programming: A New Adaptive Algorithm for Solving Problems," published in Complex Systems, which outlined the core algorithm and demonstrated its application to problems like symbolic regression.[8] GEP was established as a distinct computational paradigm by 2002, coinciding with the release of Ferreira's first book on the topic, which provided detailed theoretical foundations and implementation guidance.[7] This was further solidified in her 2006 book, Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence, a comprehensive Springer-Verlag publication that expanded on modifications, variants, and practical uses of the algorithm.[9]

In the early 2000s, GEP's initial applications focused on areas such as symbolic regression, where it evolved mathematical expressions to fit data patterns, and time-series prediction, including modeling phenomena like sunspot activity.[2] These demonstrations highlighted GEP's advantages in generating compact, interpretable models compared to GP.[2] Concurrently, Ferreira co-founded Gepsoft in 2000 to commercialize GEP through software tools like GeneXproTools, facilitating its adoption in mathematical modeling and beyond.[10] The technique was also patented in Portugal that year (Patent No. 102508), underscoring its innovative status.[7]

Fundamental Principles
Genotype and Encoding
In gene expression programming (GEP), the genotype is represented as a fixed-length multigenic chromosome, which consists of multiple genes concatenated together to form a linear string of symbols. Each gene is a sequence of elements drawn from a predefined function set and terminal set, where the function set includes arithmetic or logical operators such as addition (+), subtraction (-), multiplication (*), and division (/), and the terminal set comprises variables (e.g., x, y) or constants (e.g., numerical values). This encoding allows the genotype to directly represent computer programs in a compact, symbolic form suitable for genetic operations.[2]

The encoding rules specify that chromosomes are constructed by simply concatenating the genes without any delimiters, ensuring a seamless linear structure that facilitates storage and manipulation. For instance, a simple chromosome might be encoded as "+ab*cd", where the first gene "+ab" (with appropriate tail) indicates addition of terminals a and b, and the second gene "*cd" indicates multiplication of c and d, though the full interpretation depends on the mapping process to expression trees. This approach uses a discrete alphabet of symbols, enabling the genotype to evolve through standard genetic operators like mutation and recombination without disrupting the overall chromosomal integrity.[2]

The linear structure of the GEP genotype offers several advantages, including efficient storage in memory due to its fixed length and compactness, as well as ease of genetic modification, which decouples the simplicity of the encoding from the potential complexity of the resulting programs. This representation contrasts with tree-based methods by providing a more streamlined substrate for evolution, while still allowing mapping to phenotypic expression trees for evaluation.[2]

Phenotype and Expression Trees
In gene expression programming (GEP), the phenotype is manifested as an expression tree (ET), a nonlinear hierarchical structure composed of function nodes and terminal leaves, which directly encodes the executable program or mathematical expression derived from the linear genotype.[2] This tree representation allows for the evaluation of solutions in diverse problem domains, such as symbolic regression or logical inference, by traversing the structure to compute outputs from input values.[2] Unlike fixed architectures, ETs in GEP can vary in depth and shape, enabling flexible adaptation during evolution while maintaining syntactic validity.[2]

The translation process from the genotype—a linear string of symbols representing functions and terminals—to the phenotype involves reading the gene string from left to right in prefix notation to construct the ET, recursively allocating subtrees for each function's arguments. The first symbol serves as the root node; if it is a terminal, the tree consists solely of that leaf, yielding a constant output. If it is a function, the process allocates child nodes equal to the function's arity (e.g., two for binary operators like addition or multiplication) and continues reading subsequent symbols to populate those subtrees recursively, filling each with either functions or terminals until all branches are complete.[2] This parsing ensures that the resulting ET adheres to the arity requirements of each node, producing a fully formed, executable structure without the need for intermediate transcription steps. The genotype's design, with sufficient terminal symbols appended, guarantees that the reading process always terminates with a valid tree, preventing incomplete or malformed phenotypes.[2]

For illustration, consider a simplified genotype string "+ * x y", where "+" and "*" denote binary arithmetic functions, and "x" and "y" are terminals (variables or constants). The ET construction begins with "+" as the root, followed by "*" as the left child (requiring two arguments), with "x" and "y" as its terminal leaves; the right child of the root "+" would then draw from additional symbols if available, or default to a terminal in a complete gene context, resulting in an expression equivalent to (x * y) + terminal.[2] In practice, full GEP genes incorporate extra terminals to fill any remaining branches, ensuring the ET evaluates to a numerical or logical output for fitness assessment. If structural modifications during evolution were to produce an invalid configuration (rare due to the encoding), such phenotypes are either repaired by substituting valid symbols or excluded from selection, though the system's architecture minimizes these occurrences.[2] Once built, the ET is executed by post-order traversal, applying functions to terminal inputs to generate the phenotype's functional behavior.[2]

K-Expressions and Gene Structure
In gene expression programming (GEP), the K-expression serves as the core unit representing the open reading frame (ORF) within each gene, which is the valid linear encoding that directly translates into a phenotypic expression tree.[11] This structure, derived from the Karva notation—a prefix-based language for representing mathematical expressions—ensures that the genotype, encoded as a fixed-length string, produces syntactically correct and executable phenotypes without invalid constructions.[12] The K-expression is the shortest complete subexpression starting from the beginning of the gene that satisfies the arity requirements of all functions encountered, thereby defining the effective computational content of the gene.[11]

A GEP gene consists of two distinct regions: the head and the tail, forming a total length g = h + t, where h is the head length and t is the tail length. The head, spanning positions 1 to h, may contain a mix of function symbols (e.g., +, *, or the square root denoted as Q) and terminal symbols (e.g., variables like x and y, or constants). In contrast, the tail, from positions h + 1 to h + t, includes only terminal symbols, which acts as a buffer to accommodate varying tree depths during genetic operations. This architecture guarantees that every possible modification of the gene—through mutation or recombination—yields a valid K-expression, as the tail provides sufficient terminals to fulfill any arity needs arising from the head.[11] The tail length is calculated using the formula t = h(nmax − 1) + 1, where nmax is the maximum arity among all functions in the function set; for instance, if nmax = 2 (as with binary operators like addition and multiplication), the tail ensures compatibility with the deepest possible tree structure.[11]

To illustrate, consider a gene with head length h = 3 and maximum arity nmax = 2, yielding t = 4 and total length g = 7. A sample gene string might be "+x*yabc", where the head comprises "+x*" (positions 1-3) and the tail "yabc" (positions 4-7). Positions: 1:+, 2:x, 3:*, 4:y, 5:a, 6:b, 7:c. The K-expression is extracted by parsing from the start: the root "+" (arity 2) takes first argument "x" (terminal), second argument "*" (arity 2) taking "y" and "a" (both terminals), completing at position 5 with "+x*ya". This K-expression corresponds to the expression tree for x + (y * a), demonstrating how the structure limits the tree to a valid, compact form while unused tail elements (b, c) remain available for potential extensions in multigenic setups.[12]

Chromosomal Organization
Multigenic Chromosomes
In gene expression programming (GEP), a multigenic chromosome is formed by concatenating multiple genes of fixed equal length into a single linear structure, where each gene encodes a distinct sub-expression tree (sub-ET).[11] This organization allows the chromosome to represent more intricate computational models than a single-gene (unigenic) setup, with the number of genes typically selected in advance based on the problem's complexity.[13] Each gene, akin to a K-expression in basic GEP, translates independently into its corresponding sub-ET during the expression phase.[14]

The sub-ETs derived from the genes are linked post-translationally to form the complete phenotype, an expression tree (ET) that constitutes the program's output.[11] This linking is achieved through predefined linking functions, which integrate the sub-ETs in a hierarchical manner; for instance, the output of one gene serves as an argument (effectively a terminal) to the linking function applied with the sub-ET of the next gene, proceeding sequentially across all genes.[13] Common linking functions include addition for algebraic expressions, where the overall ET computes the sum of sub-ET evaluations, or conditional functions like IF for logical domains, which nest sub-ETs based on precedence.[14] The final ET from the last gene's integration yields the full phenotypic expression, ensuring a modular assembly without altering the linear genotype.[11]

Consider a simplified example with two genes in an algebraic context, each encoding its own sub-ET. Using multiplication as the linking function, the two outputs integrate hierarchically into a single composite ET, demonstrating how sequential gene outputs build nested complexity.[13]

Multigenic chromosomes enable the evolution of hierarchically complex programs by treating genes as reusable building blocks, fostering modularity and efficiency in genetic operations.[11] Unlike traditional genetic programming, which suffers from expression tree bloat due to variable-length growth, GEP's fixed-length chromosomes prevent uncontrolled expansion while allowing scalable complexity through the fixed or problem-specific number of genes.[14] This structure has proven effective in applications like symbolic regression, achieving 100% success rates with three genes compared to lower performance in unigenic variants.[11]

Head-Tail Architecture and Open Reading Frames
Head-Tail Architecture and Open Reading Frames
In gene expression programming (GEP), the head-tail architecture divides each gene into two distinct regions to ensure the production of valid expression trees while maximizing expressive potential. The head region, of fixed length h chosen by the user, contains symbols representing both functions (such as arithmetic operators or nonlinear functions) and terminals (such as constants or input variables), allowing for flexible construction of computational structures.[15] In contrast, the tail region consists exclusively of terminals and serves as a padding mechanism, with its length given by t = h(n_max − 1) + 1, where h is the head length and n_max is the maximum arity of the functions in the function set; this ensures sufficient terminals to complete all branches induced by functions in the head, guaranteeing syntactic validity regardless of genetic modifications.[15] This division promotes evolvability by separating the coding (head) from non-coding (tail) parts, mimicking biological gene structure and enabling unconstrained application of evolutionary operators like mutation and recombination without risking invalid programs.[13] The tail's role as a buffer allows the head to evolve complex hierarchies while the overall gene translates into a well-formed Karva expression (K-expression), the linear notation from which expression trees are derived.[15]
Open reading frames (ORFs) in GEP represent the coding subsequences within a gene that directly contribute to the expression tree, providing modularity by defining viable subexpressions. An ORF begins at the start of the gene or at a function within the head and extends to the point where the subsequence forms a complete, valid substructure, typically ending before any extraneous symbols that play no part in tree formation.[16] Unlike biological ORFs, GEP ORFs are not terminated by stop codons but by the natural completion of the expression tree's branches, with the primary ORF being the complete, valid prefix read from the gene's beginning; additional ORFs can emerge as overlapping or nested subsequences rooted at subsequent functions in the head, enabling subroutine-like modularity.[13] For instance, consider a gene with head "Q*-+" (h = 4) and tail "abcde" (t = 5), where the functions are Q (square root, arity 1), * (arity 2), - (arity 2), and + (arity 2). The primary ORF is the prefix "Q*-+abcd", which forms the expression tree for sqrt((a − b)·(c + d)), leaving the final tail terminal "e" as non-coding. A nested sub-ORF corresponds, for example, to the branch rooted at "-", which evaluates to a − b within the full tree. The primary ORF thus encompasses the full valid prefix up to the point where the head and the necessary tail terminals complete the tree.[13] These ORFs allow genes to encode hierarchical, reusable components, as sub-ORFs can evolve independently yet integrate into larger expressions. The presence of multiple ORFs per gene facilitates modular evolution in GEP by promoting the reuse of subexpressions across the population, enhancing search efficiency and adaptability in solving complex problems.[15] This modularity supports the emergence of sophisticated solutions through genetic operators that preserve ORF integrity, contributing to GEP's advantage over traditional genetic programming in exploring vast solution spaces.[16]
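A level-order (breadth-first) decoder makes the ORF example above concrete. The sketch below is plain Python written for this article; the arity table, operator dictionary, and tree representation are illustrative rather than any library's API. It translates the primary ORF "Q*-+abcd" into a tree and evaluates it.
# Illustrative Karva decoder: build the tree level by level, then evaluate it.
import math
from collections import deque
ARITY = {'Q': 1, '*': 2, '-': 2, '+': 2}
OPS = {'Q': math.sqrt,
       '*': lambda a, b: a * b,
       '-': lambda a, b: a - b,
       '+': lambda a, b: a + b}
def decode(k_expression):
    """Build a (symbol, children) tree from a K-expression in breadth-first order."""
    symbols = iter(k_expression)
    root = [next(symbols), []]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):   # terminals open no child slots
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root
def evaluate(node, env):
    symbol, children = node
    if symbol in OPS:
        return OPS[symbol](*(evaluate(c, env) for c in children))
    return env[symbol]                            # terminal: look up its value
tree = decode("Q*-+abcd")
print(evaluate(tree, {'a': 5.0, 'b': 1.0, 'c': 2.0, 'd': 2.0}))  # sqrt((5-1)*(2+2)) = 4.0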
Advanced Architectures
Cellular Systems and Code Reuse
In gene expression programming (GEP), the cellular system organizes multiple genes within a chromosome into discrete "cells," where each gene encodes a sub-expression tree (sub-ET) that functions as a modular computational unit. This mapping ensures that the linear genotype of each gene translates into a ramified phenotype via the Karva notation, producing valid, executable sub-programs without the need for repair mechanisms common in other genetic programming variants.[17] Cells interact through a linking process that integrates their outputs hierarchically, allowing the result of an earlier cell to propagate as an input to later ones, thereby facilitating efficient code reuse. Shared terminals derived from prior cells' computations eliminate redundant calculations, as the same sub-ET output can be referenced multiple times across the overall expression tree. This design draws an analogy to biological cellular differentiation, where specialized cells repurpose shared molecular outputs to build complex tissues without duplicating genetic material.[17]
A representative example is a unicellular system with three genes: the sub-ET from the first gene, say computing a basic operation like addition of inputs, becomes a terminal symbol in the second and third genes' heads, enabling its result to feed into subsequent divisions or multiplications without re-encoding the operation. This reuse extends naturally to multicellular configurations, where independent cells can link outputs across boundaries for even greater efficiency.[17] The cellular architecture promotes modularity by treating each cell as a self-contained yet interconnectable module, which simplifies the evolution of diverse program structures. It also enhances scalability, as adding cells increases expressive power for tackling larger problems, such as multi-output function approximation, while maintaining the integrity of genetic operators like mutation and recombination.[17]
Homeotic Genes and Multicellular Models
In gene expression programming (GEP), homeotic genes represent a specialized extension designed to regulate the expression and interaction of conventional genes within multigenic chromosomes, drawing inspiration from biological homeotic genes such as Hox genes that control body patterning in organisms.[17] These regulatory elements function as a "main program" that dynamically links sub-expression trees (sub-ETs) derived from standard genes, enabling modular and hierarchical program construction without the need for formal parameters, unlike automatically defined functions in traditional genetic programming.[17] Structurally, homeotic genes mirror conventional GEP genes, featuring a head domain with linking functions and genic terminals (e.g., indices 0, 1, 2 denoting specific sub-ETs) and a tail domain filled solely with genic terminals to ensure valid expression tree formation.[17] Homeotic genes output indices that determine which sub-expression trees (sub-ETs) derived from conventional genes are called and how many times they are invoked in the main program, enabling modular assembly.[17] For instance, in a chromosome encoding three conventional genes and one homeotic gene (e.g.,/-b/abbaa*a-/abbab-*+abbbaa **Q2+010102), the homeotic gene might resolve to call sub-ET 0 twice, sub-ET 1 once, and sub-ET 2 once, orchestrating the overall phenotype through selective gene expression.[17] This mechanism enhances evolvability by promoting code reuse and adaptability, building briefly on basic cellular reuse where sub-ETs are statically linked.[17]
Multicellular models in GEP extend this regulatory framework by dividing chromosomes into discrete "cells," each governed by its own homeotic gene to simulate tissue-like or organ-like modules in organismal systems.[17] In these architectures, multiple main programs—each comprising a homeotic gene and associated conventional genes—interact hierarchically, with homeotic oversight ensuring coordinated behavior across cells, such as in multi-output prediction tasks where fitness is evaluated based on the best-performing cell.[17] For example, a multicellular chromosome with three cells, each supporting one automatically defined function via a homeotic gene, achieves a 98% success rate in evolving solutions to complex polynomial problems, demonstrating improved scalability over unicellular setups.[17]
These models support varying complexity levels, progressing from unicellular configurations—where a single homeotic gene reuses a limited set of sub-ETs for straightforward modularity—to fully multicellular systems featuring hierarchical regulation among multiple cells, akin to developmental biology's control of organ differentiation.[17] In a four-automatically-defined-function setup with three cells, success rates reach 90%, highlighting the regulatory power of homeotic genes in fostering emergent, organism-like behaviors in evolved programs.[17]
Evolutionary Algorithm
Population and Fitness Evaluation
In gene expression programming, the initial population consists of a fixed number N of individuals, where each individual is represented by a random multigenic chromosome constructed from symbols in the predefined function set and terminal set appropriate to the problem domain.[2] The population size N is typically set between 20 and 100, depending on problem complexity, to balance computational efficiency and evolutionary diversity.[18] Fitness evaluation measures how well each individual's expression tree solves the target problem, using problem-dependent functions that quantify performance on a set of fitness cases. For symbolic regression tasks, a common approach employs a relative-error metric to assess prediction accuracy across the training data, defined as
f_i = \sum_{j=1}^{C_t} \left( M - \left| \frac{P_{ij} - T_j}{T_j} \right| \times 100 \right),
where P_{ij} denotes the value predicted by individual i for fitness case j, T_j denotes the target value, C_t is the number of fitness cases, and M (e.g., 100) is a constant setting the selection range. This assumes T_j ≠ 0; adjustments may be needed otherwise. Higher fitness values indicate better alignment between predictions and observations, with a maximum of f_max = C_t · M signifying a perfect fit.[19] The selection environment comprises training data structured as input-output pairs, against which each individual's fitness is computed by evaluating the expressed model on these cases.[2] Elitism preserves solution quality by directly copying the top k fittest individuals unchanged into the next generation, where k is often 1 to 10 percent of the population size, preventing loss of superior solutions during evolution.[18]
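A direct transcription of this fitness measure into Python is shown below; it is a minimal sketch whose variable names mirror the symbols above and are not drawn from any GEP library.
# Relative-error fitness with selection range M, as defined above.
# Assumes all target values are non-zero.
def relative_error_fitness(predicted, targets, M=100.0):
    """f_i = sum over fitness cases of (M - |(P_ij - T_j) / T_j| * 100)."""
    assert len(predicted) == len(targets)
    return sum(M - abs((p - t) / t) * 100.0 for p, t in zip(predicted, targets))
targets = [2.0, 4.0, 8.0]
perfect = relative_error_fitness(targets, targets)        # = M * C_t = 300.0
worse = relative_error_fitness([2.2, 3.0, 9.0], targets)  # penalised by relative error
print(perfect, worse)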
Selection and Elitism
In gene expression programming (GEP), selection is the process by which individuals from the current population are chosen as parents for the next generation based on their fitness values, ensuring that higher-fitness solutions have a greater probability of contributing to offspring.[20] This mechanism mimics natural selection and is crucial for directing the evolutionary search toward optimal solutions while maintaining population diversity.[21]
Common selection methods in GEP include roulette-wheel selection and tournament selection. In roulette-wheel selection, individuals are selected proportionally to their fitness, where the probability of selection for an individual is its fitness divided by the total population fitness, allowing fitter individuals a larger "slice" of the wheel.[20] This method is favored in GEP for its simplicity, low computational overhead, and ability to preserve genetic diversity without requiring sorting, making it suitable for complex, real-world problems.[21] Tournament selection, on the other hand, involves randomly selecting a small subset of individuals (typically k=2) and choosing the fittest among them to reproduce; this is repeated until the mating pool is filled.[21] For example, in a tournament of size 2, two individuals are picked at random, and the one with higher fitness is selected, with ties resolved randomly.[21]
Elitism in GEP is implemented by directly cloning the best individual(s)—often the top one or a small percentage (e.g., 1-10%)—into the next generation without modification, preventing the loss of superior solutions due to genetic operators.[20] This strategy, typically combined with roulette-wheel or tournament selection, guarantees monotonic improvement in the best fitness while the population evolves.[21] To balance elitism's tendency toward premature convergence, GEP employs diversity-preserving aspects of the selection methods themselves, such as the stochastic nature of roulette-wheel sampling, ensuring exploration alongside exploitation.[22]
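The roulette-wheel scheme described above can be sketched in a few lines of Python; this is illustrative only, and real implementations typically also handle zero or negative fitness values.
# Fitness-proportionate (roulette-wheel) selection: each individual's chance of
# being drawn is its fitness divided by the population's total fitness.
import random
def roulette_wheel_select(population, fitnesses, n):
    """Sample n parents (with replacement) proportionally to fitness."""
    return random.choices(population, weights=fitnesses, k=n)
population = ['ind_A', 'ind_B', 'ind_C', 'ind_D']
fitnesses = [10.0, 30.0, 50.0, 10.0]     # ind_C occupies half the wheel
parents = roulette_wheel_select(population, fitnesses, n=4)
print(parents)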
Reproduction and Genetic Operators
In gene expression programming (GEP), the reproduction phase generates the next population by selecting fitter individuals and applying genetic operators to their chromosomes, with offspring replacing all but the elite members to maintain the best solutions. This process operates on the linear chromosomal representations rather than the derived expression trees, ensuring that modifications always yield valid expressions. The selected parents, chosen via fitness-proportional selection like roulette-wheel sampling, undergo these operators to introduce variation while preserving the fixed-length gene structure.[11]
Replication is the fundamental operator, simply copying selected chromosomes unchanged into the offspring population to transmit high-fitness genotypes directly. It introduces no diversity but supports elitism by guaranteeing survival of superior individuals.[11]
Mutation provides point-wise variation by randomly replacing symbols within the chromosome; in the head region, the replacement can be any function or terminal symbol, whereas in the tail, only terminals are permitted to avoid invalid expressions. Typically, the mutation rate is calibrated to induce approximately two point mutations per chromosome, promoting gradual exploration of the search space without excessive disruption. For example, with a head length of 10 and alphabet size around 20-30, this equates to a low per-locus probability (roughly 0.01-0.05) focused primarily on the head for functional diversity (see the sketch below).[11]
Recombination facilitates the exchange of genetic material between two mating chromosomes, producing hybrid offspring that combine advantageous subsequences. Two-point recombination, a primary variant, selects two random crossover points and swaps the intervening segments, enabling larger structural rearrangements than single-point methods. Gene-specific recombination further refines this by swapping entire genes between chromosomes, preserving modularity in multigenic structures. The overall recombination rate is commonly set to 0.7, with sub-rates distributed among types (e.g., 0.5 for two-point and 0.1 for gene recombination) to balance exploration and exploitation.[11]
Transposition operators mimic biological transposons to relocate and duplicate subsequences, enhancing evolvability. Insertion sequence (IS) transposition selects short motifs (typically lengths 1-3) from within a gene, duplicates them, and inserts the copy into the head region (excluding the root position) of a gene, allowing reuse of effective subexpressions. Root-IS transposition extends this by inserting function-headed fragments directly at the gene root, potentially altering the overall tree topology. Gene transposition moves an entire gene to the chromosome's starting position, deleting it from its original place and promoting gene-level innovation. These are usually applied at a rate of 0.1 each, with multiple IS lengths used to vary transposition scale.[11]
As an extension, inversion operators can reverse segments within a gene to further diversify structures, though they are less central than the core operators described.[23]
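The head/tail constraint on point mutation is easy to express in code. The sketch below was written for this article; the symbol sets and rates are illustrative and do not come from a specific GEP implementation. Each position mutates with a small probability, drawing replacements from the full alphabet in the head but only from the terminals in the tail.
# Illustrative head/tail-aware point mutation for a single GEP gene.
import random
FUNCTIONS = ['+', '-', '*', '/', 'Q']
TERMINALS = ['x', 'y', 'a', 'b', 'c']
def mutate_gene(gene, head_length, per_locus_rate=0.05):
    """Return a mutated copy of the gene string.
    Head positions may receive any function or terminal; tail positions may
    only receive terminals, so the gene always remains a valid K-expression.
    """
    symbols = list(gene)
    for i in range(len(symbols)):
        if random.random() < per_locus_rate:
            if i < head_length:
                symbols[i] = random.choice(FUNCTIONS + TERMINALS)
            else:
                symbols[i] = random.choice(TERMINALS)
    return ''.join(symbols)
random.seed(0)
print(mutate_gene("+x*yabc", head_length=3))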
Variants and Extensions
GEP-RNC
Gene expression programming with random numerical constants (GEP-RNC) extends the standard GEP framework by incorporating evolvable numerical constants directly into the genetic representation, enabling more precise mathematical modeling without relying on a predefined set of constant terminals. In this variant, each gene appends a dedicated domain, denoted Dc, to the conventional head and tail sections; the length of Dc typically equals the tail length t, providing t numerical constants per gene. These constants are represented in the gene sequence by a special terminal symbol, such as "?", which serves as a placeholder during the mapping to expression trees (ETs). Introduced by Cândida Ferreira in her 2006 monograph, GEP-RNC addresses the limitation of fixed constants in early GEP implementations by evolving these values dynamically through the genetic operators.[24]
The Dc constants are evolved independently from the structural elements of the gene, primarily through specialized mutation operators that maintain their numerical integrity and diversity. Mutation in the Dc domain employs techniques such as creep mutation, where a constant is adjusted by adding or subtracting a small random value drawn from a Gaussian distribution (e.g., with zero mean and a small standard deviation), or random mutation, which replaces the value with a new random number within predefined bounds. This separate evolution prevents disruptions to the gene's syntactic validity while allowing fine-tuning of numerical parameters. In multigenic chromosomes, linking functions connect subsequent genes to the ET of the first gene, enabling reuse of constants from earlier Dc domains as terminals, which promotes efficiency in complex models.[24][25]
For instance, consider a gene with Dc constants [2.5, -1.3]; during creep mutation, the first constant might be shifted by a small increment such as +0.07, yielding a value of 2.57, while the structural symbols remain unchanged. These constants integrate seamlessly as leaf nodes in the ETs, where the "?" symbols are substituted with their corresponding Dc values in sequential order. This mechanism was demonstrated in early applications, such as symbolic regression tasks, where GEP-RNC achieved higher accuracy compared to basic GEP by adapting constants to fit target functions.[24][17]
The primary advantage of GEP-RNC lies in its ability to enhance numerical precision in regression problems without expanding the function or terminal sets, thereby keeping the search space manageable while allowing the algorithm to discover data-driven constants. This approach improves model performance in scenarios requiring fine-grained adjustments, such as polynomial fitting, and has been widely adopted in subsequent GEP implementations for its balance of expressiveness and computational efficiency. However, the added Dc domain slightly increases chromosome length, which can impact evaluation times in large populations.[24][25]
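How the "?" placeholders are resolved against the Dc domain can be illustrated with a short sketch that follows the simplified description in this section; the data and helper name are invented for the example, and actual implementations such as GeneXproTools or Geppy handle this internally.
# Illustrative GEP-RNC decoding: the '?' placeholders in the coding region are
# replaced, in order, by the numerical constants held in the gene's Dc domain.
def substitute_rncs(k_expression, dc_constants):
    """Return a copy of the K-expression with '?' terminals bound to constants."""
    constants = iter(dc_constants)
    return [next(constants) if s == '?' else s for s in k_expression]
k_expr = ['+', 'x', '*', '?', '?']   # Karva for x + (c1 * c2)
dc = [2.5, -1.3]                     # this gene's Dc domain
print(substitute_rncs(k_expr, dc))   # -> ['+', 'x', '*', 2.5, -1.3]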
Recent Developments (Post-2020 Variants)
Since 2020, Gene Expression Programming (GEP) has seen several innovations aimed at overcoming limitations in the original model's static genetic operators and lack of adaptability, particularly in maintaining population diversity and scaling to complex problems. One prominent advancement is Dynamic Gene Expression Programming (DGEP), introduced in 2025, which dynamically adjusts mutation and regeneration rates based on population fitness trends and diversity metrics to prevent premature convergence.[26] DGEP incorporates two key operators: the Adaptive Regeneration Operator (DGEP-R), which introduces new individuals during evolutionary stagnation to boost diversity, and the Dynamically Adjusted Mutation Operator (DGEP-M), which modulates mutation probabilities using a fitness coefficient to balance exploration and exploitation.[26] These mechanisms address the original GEP's fixed parameters by enabling real-time adaptation, resulting in up to 15.7% higher R² scores and 2.3 times greater population diversity compared to standard GEP across benchmark functions.[26]
In 2024, GEP with Structured Reusability (SR-GEP) was proposed to enhance modularity and scalability through explicit code reuse in chromosomes, allowing functional blocks (gene fragments) to be referenced multiple times without redundant evaluations.[27] This variant preserves GEP's linear head-tail encoding while integrating semantic reuse inspired by automatically defined functions, reducing expression tree complexity and improving efficiency for large-scale symbolic regression tasks.[27] SR-GEP demonstrates superior performance in predictive accuracy, outperforming baseline GEP and decision tree methods like C5.0 in applications such as cervical and breast cancer recurrence modeling, with notable reductions in computational overhead for problems involving repetitive substructures.[27]
Another significant post-2020 extension is the Dual-Strategy GEP (DNSGEP), developed in 2023, which hybridizes global evolutionary search with local optimization techniques like hill-climbing on expression trees to accelerate convergence while escaping local optima.[28] DNSGEP employs a multi-parent crossover with neighborhood search and switches strategies based on population entropy—prioritizing balanced exploration early and intensified global search later—thereby minimizing the need for manual parameter tuning.[28] Evaluated on symbolic regression benchmarks, it achieves higher success rates and solution quality than traditional GEP variants, with faster convergence observed in parity and regression problems due to the integrated local refinements.[28]
These variants have found practical application in engineering domains through symbolic regression. For instance, in 2024, GEP-based models were used to optimize concrete mix designs incorporating sustainable materials like quarry dust and superplasticizers, yielding predictive equations for compressive strength that support eco-friendly construction practices.[29] Similarly, in aerospace engineering, dimensional analysis-enhanced GEP has been applied to derive transition models for fluid dynamics, improving predictions of boundary layer behaviors in high-speed flows and aiding design transitions in aircraft components.[30] Such integrations highlight GEP's evolving role in addressing real-world adaptability gaps, fostering more robust evolutionary computations.
Applications and Integrations
Symbolic Regression and Function Discovery
Gene expression programming (GEP) performs symbolic regression by evolving expression trees (ETs) encoded in linear chromosomes to discover mathematical expressions that fit input-output data, simultaneously identifying both the functional form and the numerical parameters. In this process, the genome consists of fixed-length genes that map to ramified ETs through a parsing mechanism, allowing efficient genetic operations while separating the genotype from the phenotype. The evolution aims to minimize the discrepancy between predicted and observed values across a set of fitness cases, typically using algebraic, trigonometric, or logical operators depending on the problem domain.[2]
The standard setup for symbolic regression in GEP involves defining a function set (e.g., {+, -, *, /}) and a terminal set (e.g., {x}), initializing a population of chromosomes, and evaluating fitness via error minimization, such as the sum of absolute errors. For trigonometric targets, operators like sin and cos are included in the function set to evolve approximations of the target expressions. This approach leverages multigenic chromosomes for complex expressions, linking sub-ETs additively or otherwise to build hierarchical models.[2]
A notable case study from Ferreira's work demonstrates GEP's capability in Boolean function discovery, evolving an expression for the XOR gate from training data using logical operators like {AND, OR, NOT} in the function set and binary inputs as terminals. The evolved ET, such as (A AND NOT B) OR (NOT A AND B), achieves perfect classification on the four input combinations (00→0, 01→1, 10→1, 11→0), with fitness measured by the number of correct predictions (hits), sometimes rewarded only above a threshold such as 75% accuracy. This example highlights GEP's efficacy in logical synthesis, solving the non-linearly separable XOR problem in fewer generations than traditional methods (see the sketch below).[31]
GEP often outperforms genetic programming (GP) in symbolic regression speed due to its linear chromosomal representation, which enables efficient mutation and recombination without producing invalid structures; in tasks such as cellular automata rule discovery, the reported advantage over GP amounts to several orders of magnitude in computational requirements. This efficiency stems from the decoupled genotype-phenotype system, allowing larger populations and faster convergence while maintaining solution diversity.[2]
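The XOR case study reduces to counting hits over the four input combinations. The sketch below is plain Python written for illustration; the candidate function simply hard-codes the evolved expression quoted above and is not output from any GEP run.
# Hits-based fitness for the XOR problem: count correct predictions over the
# four fitness cases; a perfect solution scores 4.
from itertools import product
def candidate(a, b):
    """Example evolved expression: (A AND NOT B) OR (NOT A AND B)."""
    return (a and not b) or (not a and b)
def hits(model):
    return sum(int(model(a, b)) == (a ^ b) for a, b in product([0, 1], repeat=2))
print(hits(candidate))   # 4 -> perfect classification of the XOR truth table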
Machine Learning Integrations (Neural Networks and Decision Trees)
Gene expression programming (GEP) has been integrated with neural networks to evolve complete architectures, including topology, connection weights, and activation functions, by modifying the standard linear chromosomes to include dedicated domains for these elements. In the GEP-NN framework, the head of the chromosome encodes the network structure using function symbols that act as neurons of differing connectivity (e.g., D, T, and Q), while additional domains (Dw for weights and Dt for thresholds) handle numerical parameters through evolutionary operators like mutation and recombination. This approach allows for an unconstrained search space where valid neural network phenotypes are guaranteed, as the expression trees derived from chromosomes directly map to functional networks. For instance, in evolving solutions for the exclusive-OR problem, GEP-NN achieved success rates of up to 77% with parsimonious architectures using head lengths of 4, demonstrating its ability to discover compact topologies without manual design.[32]
Extensions incorporating GEP with random numerical constants (RNC) further enhance the evolution of weights and activation functions in neural networks, enabling finer-grained optimization of continuous parameters within hybrid models. GEP-RNC variants introduce floating-point terminals in the chromosomes, which evolve alongside the discrete structure, improving the precision of neural network predictions in classification tasks. A notable example is the GEP-NN hybrid applied to pattern recognition, where evolved networks outperform traditional fixed-architecture models by adapting both form and parameters evolutionarily, reducing the need for backpropagation. These integrations address bloat issues inherent in broader genetic programming by leveraging GEP's fixed-length genotypes, resulting in more scalable and interpretable neural architectures.
For decision trees, GEP generates symbolic tree structures by evolving expression trees (ETs) that represent if-then rules, differing from classical methods like CART by employing a global evolutionary search rather than greedy local splits. In the GEPDT approach, chromosomes encode decision nodes and branches via function sets (e.g., conditional operators) and terminals (features), with the evolutionary process selecting for trees that minimize classification error while controlling depth to avoid overfitting. This one-against-all strategy allows handling multi-class problems in a single evolution run, making it robust to noisy data compared to recursive partitioning. GEP-evolved decision trees have been applied to benchmark classification problems, achieving performance comparable to established inducers like ID3 (Iterative Dichotomiser 3) while providing transparent, symbolic outputs.[33]
The primary benefits of these GEP integrations with neural networks and decision trees lie in their production of inherently interpretable models, as the evolved ETs offer explicit rules or topologies that can be inspected, contrasting black-box machine learning alternatives. By combining GEP's symbolic evolution with structured ML paradigms, these hybrids mitigate overfitting through parsimony pressure and elitism, enhancing generalization in domains like classification without exhaustive hyperparameter tuning. These methods further improve robustness and scalability in real-world applications.[34]
Recent Applications (2023-2025)
In civil engineering, gene expression programming (GEP) has been applied to predict the mechanical properties of sustainable concrete mixes incorporating mineral fillers like quarry dust as a partial sand replacement. A 2024 study utilized GEP to develop explicit mathematical expressions for compressive strength and split tensile strength based on mix design parameters including cement content, aggregates, water, superplasticizer, and curing age. The models achieved high accuracy, with coefficients of determination (R²) of 0.96 for compressive strength and 0.92 for tensile strength on training data, outperforming traditional multilinear regression by approximately 5-6%. This data-driven approach facilitates optimized, eco-friendly concrete formulations by quantifying the influence of sustainable materials, such as cement's dominant 49.5% contribution to compressive strength.[29] In aerospace engineering, GEP integrated with dimensional analysis has advanced the discovery of transition models for fluid dynamics in transitional flows, such as those in turbomachinery. A 2024 investigation introduced dimensional analysis GEP (DA-GEP), which enforces physical dimension consistency during expression evolution to generate valid turbulence closure terms for Reynolds-averaged Navier-Stokes simulations. Evaluated on benchmark turbomachinery cases, DA-GEP reduced computational overhead compared to unconstrained GEP variants while improving model fidelity in predicting laminar-to-turbulent transitions. This method supports more accurate aerodynamic designs by evolving dimensionally homogeneous equations directly from simulation data.[30] In bioinformatics, enhanced GEP variants have enabled pattern discovery in protein-protein interaction networks through predictive scoring of biological interactions. A 2025 study proposed dynamic factor GEP (DF-GEP), incorporating Spearman correlation-based feature weighting and adaptive genetic operators, to forecast combined scores for proteins like PTEN and ALK using data from the STRING database. DF-GEP demonstrated superior performance over baselines such as support vector machines, achieving mean absolute percentage errors as low as 26.5% and correctly classifying 191 interactions in the ALK dataset, thereby revealing complex regulatory patterns in cellular processes. This application highlights GEP's role in interpretable modeling of genomic and proteomic data for disease mechanism elucidation.[35] Recent advancements in GEP, particularly dynamic variants, have extended to financial time-series forecasting by enabling adaptive evolution of predictive expressions for volatile markets like stock prices. The 2025 introduction of dynamic GEP (DGEP) dynamically adjusts mutation and regeneration rates based on fitness trends, yielding 15.7% higher R² scores in symbolic regression tasks suitable for time-series modeling. While primarily benchmarked on standard functions, DGEP's enhanced global search capabilities address non-stationarity in financial data, supporting evolved models for stock trend prediction with improved escape from local optima by 35%.[36] In renewable energy optimization, hybrid GEP approaches have improved solar photovoltaic performance under varying environmental conditions. A 2025 study combined GEP with adaptive neuro-fuzzy inference systems for maximum power point tracking in boost converters, using irradiance and temperature as inputs to evolve optimal control expressions. 
The hybrid model attained 99.84% efficiency in high-irradiation scenarios via MATLAB simulations, outperforming conventional perturb-and-observe methods by stabilizing output under dynamic weather. This integration promotes scalable, data-driven enhancements in solar energy harvesting.[37]
Emerging trends from 2023 to 2025 emphasize GEP's integration with big data pipelines and dynamic optimization frameworks, as seen in DGEP's application to high-dimensional problems requiring real-time adaptation. These developments, including adaptive operators for population diversity (up to 2.3 times higher than standard GEP), position the algorithm for broader use in complex, data-intensive domains like renewable energy scheduling and financial volatility modeling. Such extensions bridge evolutionary computation with machine learning, fostering interpretable solutions for large-scale optimization challenges.[36]
Implementations and Criticism
Software and Libraries
Gene expression programming (GEP) implementations are available in both open-source libraries and commercial software, facilitating practical applications in symbolic regression and evolutionary computation. Open-source options emphasize flexibility and integration with broader machine learning ecosystems, while commercial tools provide user-friendly interfaces for non-programmers. The primary open-source Python library for GEP is Geppy, which supports core GEP variants including multigenic chromosomes and provides tools for symbolic regression.[38] Geppy is built on the DEAP (Distributed Evolutionary Algorithms in Python) framework, leveraging its evolutionary operators and population management for efficient GEP prototyping and testing.[39] Key features include a primitive set for defining functions and terminals, genetic operators like one-point crossover and uniform mutation, and an API for custom fitness functions, enabling users to tailor evolution to specific problems such as function discovery.[39] Other open-source implementations include gepR, an R package for data mining with GEP to find functional relationships in datasets,[40] and LibGEP, a C++ library dedicated to GEP for system modeling.[41]
For basic usage, Geppy allows straightforward population initialization. The following Python code snippet demonstrates creating a primitive set, registering toolbox components, and generating an initial population of 100 individuals for a simple symbolic regression task:
import geppy as gep
import operator
from deap import creator, base, tools
# Define primitive set with inputs, functions, and constants
pset = gep.PrimitiveSet('main', input_names=['x', 'y'])
pset.add_function(operator.add, 2)
pset.add_function(operator.mul, 2)
pset.add_constant_terminal(3)
# Create fitness and individual classes
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create('Individual', gep.Chromosome, fitness=creator.FitnessMax)
# Setup toolbox for gene generation and population
toolbox = gep.Toolbox()
toolbox.register('gene_gen', gep.Gene, pset=pset, head_length=7)
toolbox.register('individual', creator.Individual, gene_gen=toolbox.gene_gen, n_genes=2, linker=operator.add)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
# Initialize population
pop = toolbox.population(n=100)
Evolution can then be carried out by registering a fitness function and genetic operators and invoking the library's gep_simple algorithm.[39]
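The snippet below continues the example above as a rough sketch of that workflow, following the usage pattern described in Geppy's documentation; the operator names (mutate_uniform, crossover_one_point, gep_simple), their parameters, and the toy fitness function are illustrative and should be verified against the installed Geppy version.
# Continuation sketch: register evaluation and genetic operators, then evolve.
def evaluate(individual):
    """Toy fitness (invented for this example): reward expressions close to x*y + 3."""
    func = toolbox.compile(individual)
    cases = [(1, 2), (2, 3), (4, 5)]
    error = sum(abs(func(x, y) - (x * y + 3)) for x, y in cases)
    return 1.0 / (1.0 + error),            # DEAP-style fitness tuple
toolbox.register('compile', gep.compile_, pset=pset)
toolbox.register('evaluate', evaluate)
toolbox.register('select', tools.selTournament, tournsize=3)
toolbox.register('mut_uniform', gep.mutate_uniform, pset=pset, ind_pb=0.05)
toolbox.pbs['mut_uniform'] = 1             # apply mutation to every individual
toolbox.register('cx_1p', gep.crossover_one_point)
toolbox.pbs['cx_1p'] = 0.4                 # crossover probability
pop, log = gep.gep_simple(pop, toolbox, n_generations=50, n_elites=1, verbose=False)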
On the commercial side, GeneXproTools, developed by Cândida Ferreira—the originator of GEP—offers a comprehensive platform for data modeling, including GEP-based symbolic regression, classification, and time series prediction.[42] It supports advanced features like multigenic chromosomes and random numerical constants (RNC), allowing complex model evolution from datasets.[43] The software includes a graphical user interface (GUI) for visual model building and analysis, such as curve-fitting charts and expression tree visualization, making it accessible for symbolic regression tasks without coding.[42] Additionally, GeneXproServer provides an API for programmatic model creation and integration into larger workflows.[43]
Another commercial tool applicable to GEP-like evolutionary tasks is Evolver, an Excel add-in that employs genetic algorithms for optimization problems.[44] While not exclusively for GEP, it supports evolutionary computation through mutation and selection mechanisms, suitable for prototyping GEP-inspired models in spreadsheet environments.[44]
Recent community efforts on GitHub include forks and implementations exploring GEP variants, such as those adapting to dynamic genetic operators in Dynamic GEP (DGEP) for improved global search capabilities, as detailed in 2025 research.[36] These extensions build on libraries like Geppy to address evolving computational needs.