Hubbry Logo
search
logo

Hopfield network

logo
Community Hub0 Subscribers
Read side by side
from Wikipedia

A Hopfield network (or associative memory) is a form of recurrent neural network, or a spin glass system, that can serve as a content-addressable memory. The Hopfield network, named for John Hopfield, consists of a single layer of neurons, where each neuron is connected to every other neuron except itself. These connections are bidirectional and symmetric, meaning the weight of the connection from neuron i to neuron j is the same as the weight from neuron j to neuron i. Patterns are associatively recalled by fixing certain inputs, and dynamically evolve the network to minimize an energy function, towards local energy minimum states that correspond to stored patterns. Patterns are associatively learned (or "stored") by a Hebbian learning algorithm.

One of the key features of Hopfield networks is their ability to recover complete patterns from partial or noisy inputs, making them robust in the face of incomplete or corrupted data. Their connection to statistical mechanics, recurrent networks, and human cognitive psychology has led to their application in various fields, including physics, psychology, neuroscience, and machine learning theory and practice.

History

[edit]

One origin of associative memory is human cognitive psychology, specifically the associative memory. Frank Rosenblatt studied "close-loop cross-coupled perceptrons", which are 3-layered perceptron networks whose middle layer contains recurrent connections that change by a Hebbian learning rule.[1]: 73–75 [2]: Chapter 19, 21 

Another model of associative memory is where the output does not loop back to the input. W. K. Taylor proposed such a model trained by Hebbian learning in 1956.[3] Karl Steinbuch, who wanted to understand learning, and was inspired by watching his children learn,[4] published the Lernmatrix in 1961.[5][6] It was translated to English in 1963.[7] Similar research was done with the correlogram of D. J. Willshaw et al. in 1969.[8] Teuvo Kohonen trained an associative memory by gradient descent in 1974.[9]

A close-loop cross-coupled perceptron network. Principles of Neurodynamics (1961): 403, Fig. 47 .

Another origin of associative memory was statistical mechanics. The Ising model was published in 1920s as a model of magnetism, however it studied the thermal equilibrium, which does not change with time. Roy J. Glauber in 1963 studied the Ising model evolving in time, as a process towards thermal equilibrium (Glauber dynamics), adding in the component of time.[10][11]

The second component to be added was adaptation to stimulus. Described independently by Kaoru Nakano in 1971[12][13] and Shun'ichi Amari in 1972,[14] they proposed to modify the weights of an Ising model by Hebbian learning rule as a model of associative memory. The same idea was published by William A. Little [de] in 1974,[15] who was acknowledged by Hopfield in his 1982 paper.

See Carpenter (1989)[16] and Cowan (1990)[17] for a technical description of some of these early works in associative memory.

The Sherrington–Kirkpatrick model of spin glass, published in 1975,[18] is the Hopfield network with random initialization. Sherrington and Kirkpatrick found that it is highly likely for the energy function of the SK model to have many local minima. In the 1982 paper, Hopfield applied this recently developed theory to study the Hopfield network with binary activation functions.[19] In a 1984 paper he extended this to continuous activation functions.[20] It became a standard model for the study of neural networks through statistical mechanics.[21][22]

A major advance in memory storage capacity was developed by Dimitry Krotov and Hopfield in 2016[23] through a change in network dynamics and energy function. This idea was further extended by Demircigil and collaborators in 2017.[24] The continuous dynamics of large memory capacity models was developed in a series of papers between 2016 and 2020.[23][25][26] Large memory storage capacity Hopfield Networks are now called Dense Associative Memories or modern Hopfield networks.

In 2024, John J. Hopfield and Geoffrey E. Hinton were awarded the Nobel Prize in Physics for their foundational contributions to machine learning, such as the Hopfield network.

Structure

[edit]
A Hopfield net with four units

The units in Hopfield nets are binary threshold units, i.e. the units only take on two different values for their states, and the value is determined by whether or not the unit's input exceeds its threshold . Discrete Hopfield nets describe relationships between binary (firing or not-firing) neurons .[19] At a certain time, the state of the neural net is described by a vector , which records which neurons are firing in a binary word of bits.

The interactions between neurons have units that usually take on values of 1 or −1, and this convention will be used throughout this article. However, other literature might use units that take values of 0 and 1. These interactions are "learned" via Hebb's law of association, such that, for a certain state and distinct nodes

but .

(Note that the Hebbian learning rule takes the form when the units assume values in .)

Once the network is trained, no longer evolve. If a new state of neurons is introduced to the neural network, the net acts on neurons such that

  • if
  • if

where is the threshold value of the i'th neuron (often taken to be 0).[27] In this way, Hopfield networks have the ability to "remember" states stored in the interaction matrix, because if a new state is subjected to the interaction matrix, each neuron will change until it matches the original state (see the Updates section below).

The connections in a Hopfield net typically have the following restrictions:

  • (no unit has a connection with itself)
  • (connections are symmetric)

The constraint that weights are symmetric guarantees that the energy function decreases monotonically while following the activation rules.[28] A network with asymmetric weights may exhibit some periodic or chaotic behaviour; however, Hopfield found that this behavior is confined to relatively small parts of the phase space and does not impair the network's ability to act as a content-addressable associative memory system.

Hopfield also modeled neural nets for continuous values, in which the electric output of each neuron is not binary but some value between 0 and 1.[20] He found that this type of network was also able to store and reproduce memorized states.

Notice that every pair of units i and j in a Hopfield network has a connection that is described by the connectivity weight . In this sense, the Hopfield network can be formally described as a complete undirected graph , where is a set of McCulloch–Pitts neurons and is a function that links pairs of units to a real value, the connectivity weight.

Updating

[edit]

Updating one unit (node in the graph simulating the artificial neuron) in the Hopfield network is performed using the following rule:

where:

  • is the strength of the connection weight from unit j to unit i (the weight of the connection).
  • is the state of unit i.
  • is the threshold of unit i.

Updates in the Hopfield network can be performed in two different ways:

  • Asynchronous: Only one unit is updated at a time. This unit can be picked at random, or a pre-defined order can be imposed from the very beginning.
  • Synchronous: All units are updated at the same time. This requires a central clock to the system in order to maintain synchronization. This method is viewed by some as less realistic, based on an absence of observed global clock influencing analogous biological or physical systems of interest.

Neurons "attract or repel each other" in state space

[edit]

The weight between two units has a powerful impact upon the values of the neurons. Consider the connection weight between two neurons i and j. If , the updating rule implies that:

  • when , the contribution of j in the weighted sum is positive. Thus, is pulled by j towards its value
  • when , the contribution of j in the weighted sum is negative. Then again, is pushed by j towards its value

Thus, the values of neurons i and j will converge if the weight between them is positive. Similarly, they will diverge if the weight is negative.

Convergence properties of discrete and continuous Hopfield networks

[edit]

Bruck in his paper in 1990[29]  studied discrete Hopfield networks and proved a generalized convergence theorem that is based on the connection between the network's dynamics and cuts in the associated graph. This generalization covered both asynchronous as well as synchronous dynamics and presented elementary proofs based on greedy algorithms for max-cut in graphs. A subsequent paper[30] further investigated the behavior of any neuron in both discrete-time and continuous-time Hopfield networks when the corresponding energy function is minimized during an optimization process. Bruck showed[29] that neuron j changes its state if and only if it further decreases the following biased pseudo-cut. The discrete Hopfield network minimizes the following biased pseudo-cut[30] for the synaptic weight matrix of the Hopfield net.

where and represents the set of neurons which are −1 and +1, respectively, at time . For further details, see the recent paper.[30]

The discrete-time Hopfield Network always minimizes exactly the following pseudo-cut[29][30]

The continuous-time Hopfield network always minimizes an upper bound to the following weighted cut[30]

where is a zero-centered sigmoid function.

The complex Hopfield network, on the other hand, generally tends to minimize the so-called shadow-cut of the complex weight matrix of the net.[31]

Energy

[edit]
Energy Landscape of a Hopfield Network, highlighting the current state of the network (up the hill), an attractor state to which it will eventually converge, a minimum energy level and a basin of attraction shaded in green. Note how the update of the Hopfield Network is always going down in Energy.

Hopfield nets have a scalar value associated with each state of the network, referred to as the "energy", E, of the network, where:

This quantity is called "energy" because it either decreases or stays the same upon network units being updated. Furthermore, under repeated updating the network will eventually converge to a state which is a local minimum in the energy function (which is considered to be a Lyapunov function).[19] Thus, if a state is a local minimum in the energy function it is a stable state for the network. Note that this energy function belongs to a general class of models in physics under the name of Ising models; these in turn are a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.

Hopfield network in optimization

[edit]

Hopfield and Tank presented the Hopfield network application in solving the classical traveling-salesman problem in 1985.[32] Since then, the Hopfield network has been widely used for optimization. The idea of using the Hopfield network in optimization problems is straightforward: If a constrained/unconstrained cost function can be written in the form of the Hopfield energy function E, then there exists a Hopfield network whose equilibrium points represent solutions to the constrained/unconstrained optimization problem.  Minimizing the Hopfield energy function both minimizes the objective function and satisfies the constraints also as the constraints are "embedded" into the synaptic weights of the network. Although including the optimization constraints into the synaptic weights in the best possible way is a challenging task, many difficult optimization problems with constraints in different disciplines have been converted to the Hopfield energy function: Associative memory systems, Analog-to-Digital conversion, job-shop scheduling problem, quadratic assignment and other related NP-complete problems, channel allocation problem in wireless networks, mobile ad-hoc network routing problem, image restoration, system identification, combinatorial optimization, etc, just to name a few. However, while it is possible to convert hard optimization problems to Hopfield energy functions, it does not guarantee convergence to a solution (even in exponential time).[33]

Initialization and running

[edit]

Initialization of the Hopfield networks is done by setting the values of the units to the desired start pattern. Repeated updates are then performed until the network converges to an attractor pattern. Convergence is generally assured, as Hopfield proved that the attractors of this nonlinear dynamical system are stable, not periodic or chaotic as in some other systems.[citation needed] Therefore, in the context of Hopfield networks, an attractor pattern is a final stable state, a pattern that cannot change any value within it under updating.[citation needed]

Training

[edit]

Training a Hopfield net involves lowering the energy of states that the net should "remember". This allows the net to serve as a content addressable memory system, that is to say, the network will converge to a "remembered" state if it is given only part of the state. The net can be used to recover from a distorted input to the trained state that is most similar to that input. This is called associative memory because it recovers memories on the basis of similarity. For example, if we train a Hopfield net with five units so that the state (1, −1, 1, −1, 1) is an energy minimum, and we give the network the state (1, −1, −1, −1, 1) it will converge to (1, −1, 1, −1, 1). Thus, the network is properly trained when the energy of states which the network should remember are local minima. Note that, in contrast to Perceptron training, the thresholds of the neurons are never updated.

Learning rules

[edit]

There are various different learning rules that can be used to store information in the memory of the Hopfield network. It is desirable for a learning rule to have both of the following two properties:

  • Local: A learning rule is local if each weight is updated using information available to neurons on either side of the connection that is associated with that particular weight.
  • Incremental: New patterns can be learned without using information from the old patterns that have been also used for training. That is, when a new pattern is used for training, the new values for the weights only depend on the old values and on the new pattern.[34]

These properties are desirable, since a learning rule satisfying them is more biologically plausible. For example, since the human brain is always learning new concepts, one can reason that human learning is incremental. A learning system that was not incremental would generally be trained only once, with a huge batch of training data.

Hebbian learning rule for Hopfield networks

[edit]

Hebbian theory was introduced by Donald Hebb in 1949 in order to explain "associative learning", in which simultaneous activation of neuron cells leads to pronounced increases in synaptic strength between those cells.[35] It is often summarized as "Neurons that fire together wire together. Neurons that fire out of sync fail to link".

The Hebbian rule is both local and incremental. For the Hopfield networks, it is implemented in the following manner when learning binary patterns:

where represents bit i from pattern .

If the bits corresponding to neurons i and j are equal in pattern , then the product will be positive. This would, in turn, have a positive effect on the weight and the values of i and j will tend to become equal. The opposite happens if the bits corresponding to neurons i and j are different.

Storkey learning rule

[edit]

This rule was introduced by Amos Storkey in 1997 and is both local and incremental. Storkey also showed that a Hopfield network trained using this rule has a greater capacity than a corresponding network trained using the Hebbian rule.[36] The weight matrix of an attractor neural network[clarification needed] is said to follow the Storkey learning rule if it obeys:

where is a form of local field[34] at neuron i.

This learning rule is local, since the synapses take into account only neurons at their sides. The rule makes use of more information from the patterns and weights than the generalized Hebbian rule, due to the effect of the local field.

Spurious patterns

[edit]
10 random bits initially flipped
15 random bits initially flipped (converging to the spurious pattern)
Hopfield dynamics of a network of 5x5=25 neurons trained to store a single pattern "E" in black

Patterns that the network uses for training (called retrieval states) become attractors of the system. Repeated updates would eventually lead to convergence to one of the retrieval states. However, sometimes the network will converge to spurious patterns (different from the training patterns).[37] In fact, the number of spurious patterns can be exponential in the number of stored patterns, even if the stored patterns are orthogonal.[38] The energy in these spurious patterns is also a local minimum. For each stored pattern x, the negation -x is also a spurious pattern.

A spurious state can also be a linear combination of an odd number of retrieval states. For example, when using 3 patterns , one can get the following spurious state:

Spurious patterns that have an even number of states cannot exist, since they might sum up to zero[37]

Capacity

[edit]

The Network capacity of the Hopfield network model is determined by neuron amounts and connections within a given network. Therefore, the number of memories that are able to be stored is dependent on neurons and connections. Furthermore, it was shown that the recall accuracy between vectors and nodes was 0.138 (approximately 138 vectors can be recalled from storage for every 1000 nodes) (Hertz et al., 1991). Therefore, it is evident that many mistakes will occur if one tries to store a large number of vectors. When the Hopfield model does not recall the right pattern, it is possible that an intrusion has taken place, since semantically related items tend to confuse the individual, and recollection of the wrong pattern occurs. Therefore, the Hopfield network model is shown to confuse one stored item with that of another upon retrieval. Perfect recalls and high capacity, >0.14, can be loaded in the network by Storkey learning method; ETAM,[39][40] ETAM experiments also in.[41] Ulterior models inspired by the Hopfield network were later devised to raise the storage limit and reduce the retrieval error rate, with some being capable of one-shot learning.[42]

The storage capacity can be given as where is the number of neurons in the net.

Human memory

[edit]

The Hopfield network is a model for human associative learning and recall.[43][44] It accounts for associative memory through the incorporation of memory vectors. Memory vectors can be slightly used, and this would spark the retrieval of the most similar vector in the network. However, we will find out that due to this process, intrusions can occur. In associative memory for the Hopfield network, there are two types of operations: auto-association and hetero-association. The first being when a vector is associated with itself, and the latter being when two different vectors are associated in storage. Furthermore, both types of operations are possible to store within a single memory matrix, but only if that given representation matrix is not one or the other of the operations, but rather the combination (auto-associative and hetero-associative) of the two.

Hopfield's network model utilizes the same learning rule as Hebb's (1949) learning rule, which characterised learning as being a result of the strengthening of the weights in cases of neuronal activity.

Rizzuto and Kahana (2001) were able to show that the neural network model can account for repetition on recall accuracy by incorporating a probabilistic-learning algorithm. During the retrieval process, no learning occurs. As a result, the weights of the network remain fixed, showing that the model is able to switch from a learning stage to a recall stage. By adding contextual drift they were able to show the rapid forgetting that occurs in a Hopfield model during a cued-recall task. The entire network contributes to the change in the activation of any single node.

McCulloch and Pitts' (1943) dynamical rule, which describes the behavior of neurons, does so in a way that shows how the activations of multiple neurons map onto the activation of a new neuron's firing rate, and how the weights of the neurons strengthen the synaptic connections between the new activated neuron (and those that activated it). Hopfield would use McCulloch–Pitts's dynamical rule in order to show how retrieval is possible in the Hopfield network. However, Hopfield would do so in a repetitious fashion. Hopfield would use a nonlinear activation function, instead of using a linear function. This would therefore create the Hopfield dynamical rule and with this, Hopfield was able to show that with the nonlinear activation function, the dynamical rule will always modify the values of the state vector in the direction of one of the stored patterns.

Dense associative memory or modern Hopfield network

[edit]

Hopfield networks[19][20] are recurrent neural networks with dynamical trajectories converging to fixed point attractor states and described by an energy function. The state of each model neuron is defined by a time-dependent variable , which can be chosen to be either discrete or continuous. A complete model describes the mathematics of how the future state of activity of each neuron depends on the known present or previous activity of all the neurons.

In the original Hopfield model of associative memory,[19] the variables were binary, and the dynamics were described by a one-at-a-time update of the state of the neurons. An energy function quadratic in the was defined, and the dynamics consisted of changing the activity of each single neuron only if doing so would lower the total energy of the system. This same idea was extended to the case of being a continuous variable representing the output of neuron , and being a monotonic function of an input current. The dynamics became expressed as a set of first-order differential equations for which the "energy" of the system always decreased.[20]  The energy in the continuous case has one term which is quadratic in the (as in the binary model), and a second term which depends on the gain function (neuron's activation function). While having many desirable properties of associative memory, both of these classical systems suffer from a small memory storage capacity, which scales linearly with the number of input features.[19] In contrast, by increasing the number of parameters in the model so that there are not just pair-wise but also higher-order interactions between the neurons, one can increase the memory storage capacity.[45][46]

Dense Associative Memories[23] (also known as the modern Hopfield networks[25]) are generalizations of the classical Hopfield Networks that break the linear scaling relationship between the number of input features and the number of stored memories. This is achieved by introducing stronger non-linearities (either in the energy function or neurons' activation functions) leading to super-linear[23] (even an exponential[24]) memory storage capacity as a function of the number of feature neurons, in effect increasing the order of interactions between the neurons.[45][46] The network still requires a sufficient number of hidden neurons.[26]

The key theoretical idea behind dense associative memory networks is to use an energy function and an update rule that is more sharply peaked around the stored memories in the space of neuron's configurations compared to the classical model,[23] as demonstrated when the higher-order interactions and subsequent energy landscapes are explicitly modelled.[46]

Discrete variables

[edit]

A simple example[23] of the modern Hopfield network can be written in terms of binary variables that represent the active and inactive state of the model neuron .In this formula the weights represent the matrix of memory vectors (index enumerates different memories, and index enumerates the content of each memory corresponding to the -th feature neuron), and the function is a rapidly growing non-linear function. The update rule for individual neurons (in the asynchronous case) can be written in the following form which states that in order to calculate the updated state of the -th neuron the network compares two energies: the energy of the network with the -th neuron in the ON state and the energy of the network with the -th neuron in the OFF state, given the states of the remaining neuron. The updated state of the -th neuron selects the state that has the lowest of the two energies.[23]

In the limiting case when the non-linear energy function is quadratic these equations reduce to the familiar energy function and the update rule for the classical binary Hopfield Network.[19]

The memory storage capacity of these networks can be calculated for random binary patterns. For the power energy function the maximal number of memories that can be stored and retrieved from this network without errors is given by[23]For an exponential energy function the memory storage capacity is exponential in the number of feature neurons[24]

Fig. 1: An example of a continuous modern Hopfield network with feature neurons and memory (hidden) neurons with symmetric synaptic connections between them.

Continuous variables

[edit]

Modern Hopfield networks or dense associative memories can be best understood in continuous variables and continuous time.[25][26] Consider the network architecture, shown in Fig.1, and the equations for neuron's states evolution[26]

where the currents of the feature neurons are denoted by , and the currents of the memory neurons are denoted by ( stands for hidden neurons). There are no synaptic connections among the feature neurons or the memory neurons. A matrix denotes the strength of synapses from a feature neuron to the memory neuron . The synapses are assumed to be symmetric, so that the same value characterizes a different physical synapse from the memory neuron to the feature neuron . The outputs of the memory neurons and the feature neurons are denoted by and , which are non-linear functions of the corresponding currents. In general these outputs can depend on the currents of all the neurons in that layer so that and . It is convenient to define these activation functions as derivatives of the Lagrangian functions for the two groups of neurons

This way the specific form of the equations for neuron's states is completely defined once the Lagrangian functions are specified. Finally, the time constants for the two groups of neurons are denoted by and , is the input current to the network that can be driven by the presented data. 

Fig. 2: Effective theory on the feature neurons for various common choices of the Lagrangian functions. Model A reduces to the models studied in[23][24] depending on the choice of the activation function, model B reduces to the model studied in,[25] model C reduces to the model of.[26] F is a "smooth enough" function.[23]

General systems of non-linear differential equations can have many complicated behaviors that can depend on the choice of the non-linearities and the initial conditions. For Hopfield Networks, however, this is not the case - the dynamical trajectories always converge to a fixed point attractor state. This property is achieved because these equations are specifically engineered so that they have an underlying energy function[26]

The terms grouped into square brackets represent a Legendre transform of the Lagrangian function with respect to the states of the neurons. If the Hessian matrices of the Lagrangian functions are positive semi-definite, the energy function is guaranteed to decrease on the dynamical trajectory[26]

This property makes it possible to prove that the system of dynamical equations describing temporal evolution of neurons' activities will eventually reach a fixed point attractor state.

In certain situations one can assume that the dynamics of hidden neurons equilibrates at a much faster time scale compared to the feature neurons, . In this case the steady state solution of the second equation in the system (1) can be used to express the currents of the hidden units through the outputs of the feature neurons. This makes it possible to reduce the general theory (1) to an effective theory for feature neurons only. The resulting effective update rules and the energies for various common choices of the Lagrangian functions are shown in Fig.2. In the case of log-sum-exponential Lagrangian function the update rule (if applied once) for the states of the feature neurons is the attention mechanism[25] commonly used in many modern AI systems (see Ref.[26] for the derivation of this result from the continuous time formulation).

Relationship to classical Hopfield network with continuous variables

[edit]

Classical formulation of continuous Hopfield Networks[20] can be understood[26] as a special limiting case of the modern Hopfield networks with one hidden layer. Continuous Hopfield Networks for neurons with graded response are typically described[20] by the dynamical equations

and the energy function

where , and is the inverse of the activation function . This model is a special limit of the class of models that is called models A,[26] with the following choice of the Lagrangian functions

that, according to the definition (2), leads to the activation functions

If we integrate out the hidden neurons the system of equations (1) reduces to the equations on the feature neurons (5) with , and the general expression for the energy (3) reduces to the effective energy

While the first two terms in equation (6) are the same as those in equation (9), the third terms look superficially different. In equation (9) it is a Legendre transform of the Lagrangian for the feature neurons, while in (6) the third term is an integral of the inverse activation function. Nevertheless, these two expressions are in fact equivalent, since the derivatives of a function and its Legendre transform are inverse functions of each other. The easiest way to see that these two terms are equal explicitly is to differentiate each one with respect to . The results of these differentiations for both expressions are equal to . Thus, the two expressions are equal up to an additive constant. This completes the proof[26] that the classical Hopfield Network with continuous states[20] is a special limiting case of the modern Hopfield network (1) with energy (3).

General formulation of the modern Hopfield network

[edit]
Fig. 3: The connectivity diagram of the fully-connected modern Hopfield network consisting of five neurons. The synaptic weights are described by a symmetric matrix .

Biological neural networks have a large degree of heterogeneity in terms of different cell types. This section describes a mathematical model of a fully connected modern Hopfield network assuming the extreme degree of heterogeneity: every single neuron is different.[47] Specifically, an energy function and the corresponding dynamical equations are described assuming that each neuron has its own activation function and kinetic time scale.  The network is assumed to be fully connected, so that every neuron is connected to every other neuron using a symmetric matrix of weights , indices and enumerate different neurons in the network, see Fig.3. The easiest way to mathematically formulate this problem is to define the architecture through a Lagrangian function that depends on the activities of all the neurons in the network. The activation function for each neuron is defined as a partial derivative of the Lagrangian  with respect to that neuron's activity

From the biological perspective one can think about as an axonal output of the neuron . In the simplest case, when the Lagrangian is additive for different neurons, this definition results in the activation that is a non-linear function of that neuron's activity. For non-additive Lagrangians this activation function can depend on the activities of a group of neurons. For instance, it can contain contrastive (softmax) or divisive normalization. The dynamical equations describing temporal evolution of a given neuron are given by[47]

This equation belongs to the class of models called firing rate models in neuroscience. Each neuron collects the axonal outputs from all the neurons, weights them with the synaptic coefficients and produces its own time-dependent activity . The temporal evolution has a time constant , which in general can be different for every neuron. This network has a global energy function[47]

where the first two terms represent the Legendre transform of the Lagrangian function with respect to the neurons' currents . The temporal derivative of this energy function can be computed on the dynamical trajectories leading to (see [47] for details)

The last inequality sign holds provided that the matrix (or its symmetric part) is positive semi-definite. If, in addition to this, the energy function is bounded from below the non-linear dynamical equations are guaranteed to converge to a fixed point attractor state. The advantage of formulating this network in terms of the Lagrangian functions is that it makes it possible to easily experiment with different choices of the activation functions and different architectural arrangements of neurons. For all those flexible choices the conditions of convergence are determined by the properties of the matrix and the existence of the lower bound on the energy function.

Fig. 4: The connectivity diagram of the layered Hierarchical Associative Memory network.[47] Each layer can have different number of neurons, different activation function, and different time scales. The feedforward weights and feedback weights are equal.

Hierarchical associative memory network

[edit]

The neurons can be organized in layers so that every neuron in a given layer has the same activation function and the same dynamic time scale. If we assume that there are no horizontal connections between the neurons within the layer (lateral connections) and there are no skip-layer connections, the general fully connected network (11), (12) reduces to the architecture shown in Fig.4. It has layers of recurrently connected neurons with the states described by continuous variables and the activation functions , index enumerates the layers of the network, and index enumerates individual neurons in that layer. The activation functions can depend on the activities of all the neurons in the layer. Every layer can have a different number of neurons . These neurons are recurrently connected with the neurons in the preceding and the subsequent layers. The matrices of weights that connect neurons in layers and are denoted by (the order of the upper indices for weights is the same as the order of the lower indices, in the example above this means that the index enumerates neurons in the layer , and index enumerates neurons in the layer ). The feedforward weights and the feedback weights are equal. The dynamical equations for the neurons' states can be written as[47]

with boundary conditions

The main difference between these equations and those from the conventional feedforward networks is the presence of the second term, which is responsible for the feedback from higher layers. These top-down signals help neurons in lower layers to decide on their response to the presented stimuli. Following the general recipe it is convenient to introduce a Lagrangian function for the -th hidden layer, which depends on the activities of all the neurons in that layer.[47] The activation functions in that layer can be defined as partial derivatives of the Lagrangian

With these definitions the energy (Lyapunov) function is given by[47]

If the Lagrangian functions, or equivalently the activation functions, are chosen in such a way that the Hessians for each layer are positive semi-definite and the overall energy is bounded from below, this system is guaranteed to converge to a fixed point attractor state. The temporal derivative of this energy function is given by[47]

Thus, the hierarchical layered network is indeed an attractor network with the global energy function. This network is described by a hierarchical set of synaptic weights that can be learned for each specific problem.

See also

[edit]

References

[edit]
[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
A Hopfield network is a type of recurrent artificial neural network designed as a content-addressable memory system, where patterns can be stored and retrieved from partial or noisy inputs through collective dynamics of interconnected binary neurons.[1] Introduced by physicist John J. Hopfield in 1982, it models how simple neuron-like units can exhibit emergent computational abilities, drawing inspiration from neurobiology while being adaptable to physical systems like integrated circuits.[1] The network's core innovation lies in its ability to converge to stable states representing stored memories, treating retrieval as minimization of an energy function akin to the Ising model in statistical physics.[2] The structure of a classical Hopfield network comprises N fully connected neurons, each with a binary state $ s_i = \pm 1 $ (representing "on" or "off"), and symmetric synaptic weights $ w_{ij} = w_{ji} $ that encode memories via Hebbian learning: $ w_{ij} = \frac{1}{N} \sum_{\mu=1}^M p_i^\mu p_j^\mu $ for $ i \neq j $, where $ p^\mu $ are the M patterns to store (with $ w_{ii} = 0 $).[3] Updates occur asynchronously or synchronously, with each neuron's state evolving based on its local field $ h_i = \sum_j w_{ij} s_j $, typically via a deterministic threshold $ s_i = \operatorname{sgn}(h_i) $ or a probabilistic rule incorporating temperature-like noise for exploration.[3] This dynamics ensures convergence to local minima of a Lyapunov energy function $ E = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j $, where memorized patterns act as attractors, allowing robust recall even from corrupted cues.[2] Hopfield networks have a storage capacity limit of approximately $ 0.138N $ random binary patterns before spurious states degrade performance, a result derived from statistical mechanics analysis treating the system as a spin glass.[4] Early extensions generalized neurons to continuous states or added stochasticity, enhancing applicability to graded responses in biological systems.[5] Beyond memory, the model has influenced optimization problems, such as the traveling salesman problem, by mapping constraints to energy minimization, and pattern recognition tasks like image reconstruction from partial data.[6] In modern contexts, Hopfield networks underpin advancements in machine learning, including dense associative memories integrated into transformer architectures for improved attention mechanisms and sequence modeling, reviving interest in their scalable, energy-based formulations. Their interdisciplinary impact spans neuroscience, where they model attractor-based memory in the brain, and physics, linking neural computation to phase transitions in disordered systems.[7]

Overview

Definition and Purpose

A Hopfield network is a recurrent artificial neural network featuring symmetric, bidirectional connections among its neurons, designed primarily as a content-addressable memory system. This architecture allows the network to store multiple patterns and retrieve complete versions of them from partial, corrupted, or noisy inputs by iteratively updating neuron states until reaching a stable configuration.[1] Introduced by physicist John Hopfield in 1982, the model draws inspiration from physical systems like spin glasses to bridge concepts from statistical mechanics and computation, enabling tasks such as associative recall and optimization through collective dynamics that minimize an underlying energy function. For this foundational work on machine learning, Hopfield shared the 2024 Nobel Prize in Physics with Geoffrey Hinton.[1][8] In this framework, neurons operate as binary units—taking on/off states akin to spins in the Ising model—which evolve via feedback to settle into attractor states that encode memories. Unlike feedforward networks that process inputs in a single pass, Hopfield networks rely on recurrent loops for progressive refinement of representations.[1]

Basic Components

A Hopfield network is composed of NN binary neurons, each of which can adopt one of two states, conventionally represented as +1+1 or 1-1. These neurons are interconnected in a fully connected graph structure, where every neuron links to all others except itself, ensuring no self-loops are present. This architecture models a recurrent neural network without directional biases in connectivity, facilitating collective behavior among the units.[1] The interactions between neurons are governed by synaptic weights organized into an N×NN \times N symmetric matrix WW, satisfying Wij=WjiW_{ij} = W_{ji} for all iji \neq j. The diagonal elements are initialized to zero (Wii=0W_{ii} = 0), preventing self-reinforcement. The symmetry of this weight matrix eliminates directed cycles, which inherently promotes the network's tendency toward stable configurations by avoiding asymmetric influences that could lead to perpetual motion.[1] Each neuron incorporates a threshold value, commonly set to zero to simplify the model and provide a neutral bias for activation decisions. This threshold determines the net input required for a neuron to switch states, influencing the overall responsiveness of the network.[1] The network processes information through its state vector s{+1,1}N\mathbf{s} \in \{+1, -1\}^N, where each element sis_i denotes the state of the ii-th neuron. An initial state vector serves as input, and the network internally evolves this vector based on the weights and thresholds. These core elements collectively enable the network to function as an associative memory device, where partial or noisy inputs can retrieve complete stored patterns.[1]

Mathematical Formulation

Network Structure

The Hopfield network consists of NN fully connected binary neurons, each representing a state that can take one of two values, typically +1+1 or 1-1, analogous to spins in a physical system.[1] The state of the network at any time is described by a state vector s=(s1,,sN)T\mathbf{s} = (s_1, \dots, s_N)^T, where each component si{+1,1}s_i \in \{+1, -1\}.[1] This binary representation enables the network to model discrete attractors, facilitating associative memory functions.[1] The interactions between neurons are governed by a weight matrix W\mathbf{W}, an N×NN \times N symmetric matrix with zero diagonal elements (Wii=0W_{ii} = 0 for all ii), ensuring no self-connections or self-loops.[1] The symmetry of W\mathbf{W} (Wij=WjiW_{ij} = W_{ji}) reflects the bidirectional nature of synaptic connections in the model, promoting collective dynamics.[1] This full connectivity, excluding self-loops, allows every neuron to receive inputs from all others, forming the basis for the network's recurrent structure.[1] The network is designed to store MM patterns, denoted as ξμ=(ξ1μ,,ξNμ)\xi^\mu = (\xi_1^\mu, \dots, \xi_N^\mu) for μ=1,,M\mu = 1, \dots, M, where each ξiμ{+1,1}\xi_i^\mu \in \{+1, -1\}.[1] These patterns serve as stable attractors in the network's state space, enabling the retrieval of complete memories from partial or noisy inputs.[1] The input to each neuron ii, known as the local field hih_i, is computed as the weighted sum of the states of all other neurons:
hi=jiWijsj. h_i = \sum_{j \neq i} W_{ij} s_j.

This pre-activation value determines the potential update of sis_i based on the current network configuration.[1]
The structure of the Hopfield network draws inspiration from the Ising model in statistical physics, where neurons correspond to spins and weights to exchange interactions, allowing analysis through thermodynamic analogies.[1]

Energy Function

The energy function of a Hopfield network provides a Lyapunov function that governs the dynamics, ensuring that state transitions lead to stable configurations by monotonically decreasing the system's "energy." For a network with neuron states $ s_i = \pm 1 $ and symmetric weight matrix $ W $ (with zero diagonal elements), the energy is defined as the quadratic form
E(s)=12i,jWijsisj, E(\mathbf{s}) = -\frac{1}{2} \sum_{i,j} W_{ij} s_i s_j,
where the factor of $ \frac{1}{2} $ accounts for the symmetry $ W_{ij} = W_{ji} $. This formulation, inspired by the Hamiltonian of spin glass systems in statistical physics, treats the network states as configurations that evolve to minimize $ E $. The energy function exhibits key properties that underpin the network's computational reliability. It is bounded below, as the quadratic form with symmetric $ W $ yields real eigenvalues, preventing unbounded decreases and guaranteeing convergence to a finite minimum. Stored patterns, encoded via Hebbian rules, correspond to local minima of $ E $, making them attractors for nearby states, while the symmetry of $ W $ ensures $ E $ is well-defined and scalar-valued. To see how updates reduce energy, consider flipping the state of neuron $ k $ from $ s_k $ to $ -s_k $. The change in energy is
ΔE=2skhk, \Delta E = 2 s_k h_k,
where $ h_k = \sum_j W_{kj} s_j $ is the local field at neuron $ k $ (as defined in the network structure). In stable asynchronous updates, a flip occurs only if $ s_k \neq \operatorname{sign}(h_k) $, implying $ s_k h_k < 0 $ and thus $ \Delta E < 0 $, aligning the state with the weights to decrease $ E $. This derivation follows from substituting the flipped state into the quadratic form and simplifying using symmetry. The energy landscape metaphor visualizes the $ 2^N $ possible states as points in a high-dimensional space, with $ E(\mathbf{s}) $ as the height; dynamics resemble particles rolling downhill toward local minima, each surrounded by a basin of attraction that captures corrupted or partial inputs to retrieve stored memories.

Update Rules

The update rules in a Hopfield network define how the states of the neurons evolve over time to retrieve stored patterns, operating on binary neuron states $ s_i \in {-1, +1} $ for $ i = 1, \dots, N $, where $ N $ is the number of neurons.[9] The core mechanism uses an activation function that thresholds the local field $ h_i = \sum_{j=1}^N W_{ij} s_j ,typicallywithnoselfconnections(, typically with no self-connections ( W_{ii} = 0 $) and zero biases in the standard formulation.[1] The activation function is the sign function, $ s_i \leftarrow \operatorname{sign}(h_i) $, which sets $ s_i = +1 $ if $ h_i > 0 $ and $ s_i = -1 $ if $ h_i \leq 0 $, ensuring binary outputs.[9] In asynchronous updates, a single neuron $ i $ is selected randomly (or in a fixed order) and updated while keeping all other states fixed: $ s_i \leftarrow \operatorname{sign}\left( \sum_{j \neq i} W_{ij} s_j \right) $.[1] This mode processes one neuron at a time, mimicking sequential computation, and guarantees a non-increase in the network's energy function with each update.[9] Asynchronous updates are preferred for reliable convergence to stable fixed points, as they avoid potential cycles.[1] Synchronous updates, in contrast, revise all neurons simultaneously based on the current state: $ s_i(t+1) \leftarrow \operatorname{sign}\left( \sum_{j \neq i} W_{ij} s_j(t) \right) $ for all $ i $.[9] This parallel mode enables faster computation in hardware implementations but can lead to oscillations or limit cycles rather than fixed-point convergence.[9]

Dynamics and Convergence

Synchronous and Asynchronous Updates

In Hopfield networks, synchronous updates involve computing the new state of all neurons simultaneously based on the states from the previous time step, enabling parallel processing that is computationally efficient in vectorized implementations.[9] This mode can lead to limit cycles, such as 2-cycles where the network oscillates between two states without converging to a fixed point, particularly when the weight matrix is symmetric.[10] For instance, in simulations of small networks with positive weights between connected neurons, synchronous updates may trap the system in periodic flipping between patterns, preventing stable retrieval. Asynchronous updates, in contrast, select and update one neuron at a time, either randomly or in a fixed sequence, ensuring that the network's energy function strictly decreases with each update until reaching a fixed point attractor.[1] This sequential approach guarantees convergence to stable states without oscillations, making it ideal for theoretical analyses of dynamics and optimization problems where avoiding cycles is critical, such as in constraint satisfaction tasks.[9][11] The trade-offs between these modes highlight their complementary uses: synchronous updates offer speed advantages in hardware or parallel simulations but risk spurious oscillations like 2-cycles that degrade performance in memory retrieval, while asynchronous updates provide reliable convergence at the cost of slower execution due to serial processing. In optimization applications, asynchronous modes are preferred to prevent trapping in cycles, ensuring progression toward global or local minima.[12] Modern implementations often employ relaxed synchronous variants, such as damped or probabilistic parallel updates, to balance efficiency with stability in large-scale computations.[13]

Convergence Properties

In asynchronous updates, where only one neuron is updated at a time based on the sign of its local field $ h_k = \sum_{j \neq k} w_{kj} s_j $, the change in the network's energy function satisfies $ \Delta E = 2 s_k h_k \leq 0 $, with equality only if the neuron is already at a fixed point.[3] This decrease occurs because the update rule flips the state $ s_k $ only if $ s_k h_k < 0 $, ensuring the energy does not increase. Given the symmetric weight matrix $ W $ (where $ w_{ij} = w_{ji} )andnoselfloops() and no self-loops ( w_{ii} = 0 $), the energy function serves as a Lyapunov function, guaranteeing that the dynamics are stable.[14] Since the state space is finite (with $ 2^N $ possible binary configurations for $ N $ neurons), the strictly decreasing energy implies convergence to a local minimum—a fixed point—in a finite number of steps.[14] For synchronous updates, where all neurons are updated simultaneously, convergence is not guaranteed in the same manner; the network may enter limit cycles, particularly of length 2, due to potential oscillations between two states.[14] Convergence to a fixed point occurs only if no such synchronous oscillations arise, which depends on the specific initial state and weight configuration, but the absence of cycles cannot be assured without additional constraints on $ W $.[14] The conditions of symmetric weights and no self-loops are essential for Lyapunov stability in both update modes, as they ensure the energy function is well-defined and non-increasing.[1] In discrete Hopfield networks, this leads to halting at a fixed point in finite time, unlike continuous variants where dynamics follow differential equations and approach equilibria asymptotically without necessarily reaching them in finite time.[5] From random initial states, the probability of converging to a desired attractor (rather than a spurious state) is high when the number of stored patterns is below the critical capacity $ \alpha_c \approx 0.14 $ and initial noise is low, as the basins of attraction encompass a significant portion of the state space under these conditions.[1]

Attractors and State Space

The state space of a Hopfield network consists of all possible configurations of its N binary neurons, forming a discrete N-dimensional hypercube with 2^N vertices, where each vertex represents a unique binary state vector. This hypercube structure captures the instantaneous condition of the network, with dynamics evolving along its edges as neurons update their states based on interactions with others. Attractors in the Hopfield network are the stable fixed points or short cycles to which the system's trajectories converge from a wide range of initial conditions, serving as the stored memory patterns. The stored patterns are designed to be stable attractors, ensuring that the network settles into these configurations when dynamics are initiated near them, thereby enabling reliable recall of information. While fixed points predominate, simulations occasionally reveal simple cycles of length two, though these are less common in properly configured networks. Basins of attraction define the regions within the state space from which initial states flow toward a particular attractor, facilitating pattern completion even from partial or noisy inputs. This geometric partitioning of the hypercube allows the network to associate corrupted or incomplete patterns with the nearest stored memory, demonstrating the robustness of these basins to perturbations. In the state space, the dynamics exhibit an "attract or repel" behavior, where configurations similar to a stored pattern are drawn toward its attractor basin, while dissimilar ones are repelled away, guided by the underlying energy landscape.[15] Stable attractors pull nearby states inward, reinforcing the memory, whereas unstable regions push trajectories outward to prevent trapping in non-memory states.[16] For small networks, such as N=3 with 8 possible states forming a cube, trajectories can be visualized as directed paths converging to attractor vertices, illustrating how initial points midway along edges flow toward the nearest minimum, with basin boundaries separating regions of influence for each attractor. This low-dimensional view highlights the funnel-like structure of basins, where diverse starting configurations collapse into stable memories.[15]

Learning Algorithms

Hebbian Learning Rule

The Hebbian learning rule forms the foundation for storing patterns in a Hopfield network, drawing from Donald Hebb's principle that synaptic connections between neurons strengthen when those neurons exhibit correlated activity, often summarized as "neurons that fire together wire together."[1] In the Hopfield model, this principle is applied to encode a set of M binary patterns {ξμ}\{\xi^\mu\}, typically represented in ±1\pm 1 states as σiμ=±1\sigma_i^\mu = \pm 1 for neuron i in pattern μ\mu, treating simultaneous activation across neurons in a pattern as the correlated firing to be reinforced.[1] The rule computes the synaptic weights TijT_{ij} as the normalized sum of outer products over the stored patterns:
Tij=1Nμ=1Mσiμσjμ,ij T_{ij} = \frac{1}{N} \sum_{\mu=1}^M \sigma_i^\mu \sigma_j^\mu, \quad i \neq j
with diagonal elements set to zero (Tii=0T_{ii} = 0) to prevent self-connections.[1] This outer-product summation directly implements Hebbian strengthening, where weights reflect the co-occurrence of neuron activations in the patterns. In the original formulation with {0,1} states, stored patterns should be unbiased (approximately half 1s) to ensure stability without explicit thresholds; the ±1\pm 1 reformulation avoids this issue.[1] The derivation of this rule ensures that each stored pattern σμ\sigma^\mu becomes a stable fixed point of the network dynamics. The local field at neuron i when the network is in state σμ\sigma^\mu is hi=jiTijσjμh_i = \sum_{j \neq i} T_{ij} \sigma_j^\mu. Substituting the weight formula yields $h_i = \sum_{\nu=1}^M \sigma_i^\nu \frac{1}{N} \sum_{j \neq i} \sigma_j^\nu \sigma_j^\mu $. Under the assumption of orthogonal patterns—where 1Njσjνσjμ=δνμ\frac{1}{N} \sum_j \sigma_j^\nu \sigma_j^\mu = \delta_{\nu\mu}—the cross terms (νμ\nu \neq \mu) average to zero, simplifying to hiσiμ+h_i \approx \sigma_i^\mu + small noise from the jij \neq i exclusion and finite-size effects. Thus, the update rule si=sgn(hi)s_i = \operatorname{sgn}(h_i) leaves σμ\sigma^\mu unchanged, confirming it as a fixed point.[1] This verification holds because the Hebbian weights align the local fields with the stored pattern directions when orthogonality is satisfied.[1] While perfect storage requires orthogonal patterns (up to MN), for random uncorrelated patterns, reliable retrieval occurs up to the critical capacity of approximately 0.138N, beyond which spurious states degrade performance.[1]

Storkey Learning Rule

The Storkey learning rule, introduced as an enhancement to the foundational Hebbian learning rule for Hopfield networks, addresses pattern interference by incorporating local field corrections during weight updates.[17] Unlike the simple Hebbian approach, which directly correlates pattern components, the Storkey rule iteratively adjusts weights to minimize crosstalk between stored patterns, enabling better storage of non-orthogonal patterns.[17] The core update for the synaptic weight $ W_{ij} $ when incorporating a new pattern $ \xi^\nu $ (with components $ \xi_i^\nu = \pm 1 $) is given by:
WijWij+1Nξiν[ξjνkWjkξkν] W_{ij} \leftarrow W_{ij} + \frac{1}{N} \xi_i^\nu \left[ \xi_j^\nu - \sum_k W_{jk} \xi_k^\nu \right]
for $ i \neq j $, where $ N $ is the number of neurons and the sum represents the local field at neuron $ j $ induced by the current weights and the new pattern.[17] This formulation subtracts the projection of the new pattern onto the subspace spanned by previously stored patterns, effectively orthogonalizing the patterns in the network's state space and reducing interference that leads to spurious attractors.[17] The Storkey rule approximates the pseudoinverse learning method, which computes optimal weights via the Moore-Penrose pseudoinverse of the pattern matrix to maximize capacity under orthogonal constraints, but does so in a local, incremental manner without requiring global recomputation of all weights.[17] This approximation maintains desirable properties like locality (updates depend only on pre- and post-synaptic activities and local fields) and online implementability, allowing patterns to be added sequentially without storing the full set of prior patterns.[17] In terms of performance, the Storkey rule achieves a storage capacity of approximately 0.14N with reliable retrieval, comparable to the Hebbian rule for random binary patterns but with improved performance for correlated patterns, as demonstrated through theoretical analysis and simulations.[17] For example, when storing a set of non-orthogonal patterns such as shifted versions of a base image, the rule avoids the formation of spurious minima that plague Hebbian-trained networks, leading to cleaner pattern completion during recall.[17]

Initialization and Training Process

The initialization and training of a Hopfield network constitute a straightforward, one-shot offline procedure that sets the synaptic weights based on selected patterns, after which the weights remain fixed for all subsequent recall operations, without requiring iterative optimization or backpropagation as in modern neural networks. This process leverages unsupervised Hebbian-style learning to encode memories as stable attractors in the network's state space.[18] The first step involves selecting a set of $ p $ patterns to store, denoted as $ \xi^\mu $ for $ \mu = 1 $ to $ p $, where each pattern is a binary vector of length $ N $ (the number of neurons) with components $ \xi_i^\mu = \pm 1 $. These patterns should ideally be orthogonal or exhibit low correlation to optimize storage capacity and retrieval reliability; in practice, random or preprocessed patterns (e.g., from data compression) are often used.[18] The weight matrix $ \mathbf{W} $ is then initialized as an $ N \times N $ zero matrix. A learning rule is applied to compute the weights, such as the Hebbian rule, which accumulates outer products of the patterns: $ W_{ij} = \frac{1}{N} \sum_{\mu=1}^p \xi_i^\mu \xi_j^\mu $ for $ i \neq j $, with diagonal elements $ W_{ii} = 0 $ to avoid self-reinforcement. For enhanced capacity, the Storkey rule can be employed instead, which adjusts weights incrementally while preserving locality. In the standard model, biases are omitted, but they can be added if patterns require shifting (e.g., $ b_i = \frac{1}{N} \sum_\mu \xi_i^\mu $) to account for non-zero means. Patterns are typically already normalized to $ \pm 1 $, but for non-binary inputs, normalization ensures unit variance across components. Training concludes once weights are computed, yielding a fixed connectivity for associative recall. To evaluate performance, the network can be tested by initializing with random states or noisy versions of stored patterns and observing convergence behavior.[18] The recall phase operates by evolving an initial state vector $ \mathbf{s}^{(0)} $ (often a corrupted input) through iterative updates until reaching a fixed point, where no further state changes occur or the energy function $ E = -\frac{1}{2} \sum_{i,j} W_{ij} s_i s_j $ achieves a local minimum. Asynchronous updates—randomly selecting one neuron at a time—are common to guarantee convergence via energy decrease, though synchronous parallel updates (all neurons simultaneously) may be used for speed at the risk of oscillations. Updates follow $ s_i \leftarrow \operatorname{sgn}\left( \sum_j W_{ij} s_j \right) $, iterated until stability.[18] The following high-level pseudocode outlines the process (using the Hebbian rule for training): Training Phase
# Inputs: patterns ξ^μ (p vectors of length N, each ±1)
W ← zeros(N, N)  # Initialize weight matrix

for μ ← 1 to p do
    for i ← 1 to N do
        for j ← 1 to N do
            if i ≠ j then
                W[i, j] ← W[i, j] + (ξ^μ_i * ξ^μ_j) / N

# Diagonal already zero; weights now fixed
Recall Phase
# Input: initial state s (vector of length N, ±1 or partial)
s ← initial_state
changed ← true

while changed do
    changed ← false
    for each neuron i (in random order) do
        h_i ← sum_{j=1 to N} W[i, j] * s_j
        new_s_i ← sgn(h_i)
        if new_s_i ≠ s_i then
            s_i ← new_s_i
            changed ← true

return s  # Converged state
This algorithm ensures efficient one-pass training and iterative retrieval, with convergence typically reached in $ O(N) $ to $ O(N \log N) $ steps depending on the input.[18]

Properties and Limitations

Storage Capacity

The storage capacity of a Hopfield network, defined as the maximum number of random binary patterns MmaxM_{\max} that can be stored for reliable retrieval with low error probability, is a key limitation of the model. For the standard Hebbian learning rule, theoretical analysis yields a critical loading parameter αc=Mmax/N0.138\alpha_c = M_{\max}/N \approx 0.138, where NN is the number of neurons; beyond this threshold, the retrieval overlap drops sharply, and errors accumulate during asynchronous updates.[19] This bound emerges from signal-to-noise analysis of the local field during retrieval of a stored pattern ξμ\boldsymbol{\xi}^\mu. The field at neuron ii is given by
hi=ξiμ+zi, h_i = \xi_i^\mu + z_i,
where the signal term is ξiμ=±1\xi_i^\mu = \pm 1 and ziz_i is the crosstalk noise from the other M1M-1 patterns. With Hebbian weights
Jij=1Nν=1Mξiνξjν(ij), J_{ij} = \frac{1}{N} \sum_{\nu=1}^M \xi_i^\nu \xi_j^\nu \quad (i \neq j),
the noise ziz_i arises as
zi=1Nνμξiνjiξjνξjμ. z_i = \frac{1}{N} \sum_{\nu \neq \mu} \xi_i^\nu \sum_{j \neq i} \xi_j^\nu \xi_j^\mu.
For uncorrelated random patterns, the inner sum over jj is a random walk with zero mean and variance approximately NN, making each term O(1/N)\sim \mathcal{O}(1/\sqrt{N}) and the total noise Gaussian with zero mean and variance σ2α\sigma^2 \approx \alpha. Retrieval fails when the signal falls below the noise level, i.e., when 1σ1 \lesssim \sigma; solving the self-consistent equation for the macroscopic overlap mm (using the error function for bit-error probability) determines the instability point at αc0.138\alpha_c \approx 0.138. Several factors influence this capacity. The Hebbian rule's weights exhibit a bias-variance tradeoff: bias from incomplete orthogonalization of patterns increases crosstalk, while variance from random overlaps limits scalability. For random, uncorrelated patterns, αc0.138\alpha_c \approx 0.138 holds, but correlated patterns reduce capacity due to heightened interference; theoretical bounds show αcN/(γlogN)\alpha_c \sim N/(\gamma \log N) for correlation parameter γ>0\gamma > 0, and simulations reveal empirical capacity curves dropping monotonically with correlation strength (e.g., halving at correlation 0.5).[20] Alternative learning rules mitigate these limits while referencing Hebbian weights as a baseline. The pseudoinverse rule, computing weights via the Moore-Penrose pseudoinverse of the pattern matrix to minimize reconstruction error, achieves a capacity of up to αc=1\alpha_c = 1 (storing M=NM = N patterns exactly as fixed points).[21] The Storkey rule, an incremental and local extension that subtracts higher-order interference terms from Hebbian updates, improves capacity to approximately αc0.97\alpha_c \approx 0.97 for random patterns.[17] For sparse patterns (low fraction of +1's), capacity exceeds 0.138N, scaling inversely with sparsity (e.g., up to 0.6N0.6N at 10% activity) due to reduced crosstalk variance.

Spurious States

In Hopfield networks, spurious states refer to stable fixed points or attractors that emerge in the dynamics but do not correspond to any of the originally stored patterns. These unwanted states include reversed patterns, where the negation of a stored pattern (−ξ) serves as a stable configuration due to the even symmetry of the network's energy function under the Hebbian learning rule. Other types encompass superpositions, such as the bitwise XOR of two stored memories, and mixtures that combine elements from multiple patterns into a novel stable state.[22] The primary cause of spurious states is the non-orthogonality of the stored patterns, which introduces interference or cross-talk in the synaptic weight matrix formed by the outer-product Hebbian rule; this interference creates additional local minima in the energy landscape beyond the intended attractors for the stored patterns.[22] Such minima represent pathological attractors that can trap the network's state during retrieval, leading to erroneous pattern completion. Spurious states are detectable even at low storage loads but proliferate significantly when the number of patterns M exceeds roughly 0.1 times the number of neurons N, with their occurrence probability rising sharply under capacity overload due to intensified pattern interference. For instance, consider two similar stored patterns ξ¹ and ξ² that differ in only a few components; a common spurious state arises as the pointwise average of ξ¹ and ξ², rounded to the nearest ±1 values, which stabilizes as an unintended attractor.[22] Basic mitigation strategies focus on pattern selection and learning modifications to curb interference without fully eradicating spurious states. Using unbiased patterns—those with zero mean, featuring an equal number of +1 and -1 values—helps reduce bias-induced reversed and mixture states by promoting balanced representations.[23] The Storkey learning rule addresses this further by incorporating a second-order correction term that locally adjusts weights to minimize correlations between patterns, thereby suppressing many spurious attractors while maintaining incremental training.

Error Correction Capabilities

Hopfield networks demonstrate robust error correction capabilities through their attractor dynamics, where stored patterns serve as stable fixed points in the energy landscape. When presented with a noisy or partial input, the network iteratively updates its state via synchronous or asynchronous rules, converging toward the nearest attractor if the input lies within its basin of attraction. This basin represents the set of initial states that flow to a particular stored pattern under the network's dynamics. Retrieval is successful when the Hamming distance between the input and the closest stored pattern is smaller than the basin's radius, enabling the correction of corrupted inputs without external supervision.[1] The size of these basins quantifies the network's robustness, allowing correction of up to dd errors where d/N0.1d/N \approx 0.1 at the optimal storage capacity of approximately 0.14N0.14N patterns for random binary inputs of length NN. For lower loading factors or uncorrelated patterns, the basins expand significantly, enhancing error tolerance as the effective separation between attractors increases. This self-correcting behavior arises from the iterative nature of updates, where each step reduces the energy and pulls the state deeper into the basin, progressively repairing discrepancies from the target pattern.[1] As the number of stored patterns approaches the critical load αc0.14\alpha_c \approx 0.14, the basins of attraction contract due to crosstalk interference, resulting in retrieval error rates exceeding 50% for inputs at moderate Hamming distances from valid patterns. This shrinkage reflects a phase transition in the network's state space, where the signal-to-noise ratio degrades, compromising reliable convergence even for mildly perturbed probes. Brief reference to attractors underscores that these fixed points define the boundaries of viable correction regions. A representative example is the recovery of binary image patterns, where the network can reconstruct a full NN-pixel image from an input with approximately 10% of bits flipped, provided the corruption keeps the probe within the basin of the target pattern; simulations confirm high-fidelity retrieval under such conditions for loads below capacity limits.[1]

Applications

Combinatorial Optimization

Hopfield networks address combinatorial optimization problems by mapping decision variables to neuron states and encoding the objective function and constraints into the network's energy function, such that its minima correspond to feasible optimal solutions. The states $ V_i \in {0, 1} $ represent binary choices in the problem (e.g., presence or absence of an edge or assignment), while the weights $ T_{ij} $ are derived to penalize violations and costs. The network's dynamics then minimize the energy $ E = -\frac{1}{2} \sum_{i,j} T_{ij} V_i V_j - \sum_i I_i V_i $, effectively solving the optimization through collective relaxation to a stable state. This approach transforms NP-hard problems into energy minimization tasks amenable to parallel computation.[24] A canonical illustration is the Traveling Salesman Problem (TSP), which seeks the shortest Hamiltonian cycle through $ N $ cities with distances $ d_{ij} $. Here, the network uses $ N^2 $ neurons $ V_{Xi} $, where $ V_{Xi} = 1 $ if city $ i $ occupies tour position $ X $, and 0 otherwise. The energy function is tailored as a sum of constraint and cost terms:
E=A2XijVXiVXj+B2iXYVXiVYi+C2(XiVXiN)2+D2Xi,jdijVXi(VX+1,j+VX1,j) E = \frac{A}{2} \sum_X \sum_{i \neq j} V_{Xi} V_{Xj} + \frac{B}{2} \sum_i \sum_{X \neq Y} V_{Xi} V_{Yi} + \frac{C}{2} \left( \sum_X \sum_i V_{Xi} - N \right)^2 + \frac{D}{2} \sum_X \sum_{i,j} d_{ij} V_{Xi} (V_{X+1,j} + V_{X-1,j})
The first two terms enforce exactly one city per position and one position per city, the third ensures all cities are visited, and the fourth minimizes total tour length (with constants $ A, B, C, D $ balancing priorities). Weights $ T_{ij} $ are set from the quadratic expansions of these terms, and external inputs $ I_i $ may adjust biases. Simulations demonstrate convergence to near-optimal tours within a few time constants, typically within a few percent of the optimum for $ N = 10 $.[24] Despite these strengths, the deterministic dynamics frequently converge to local minima, yielding spurious or suboptimal solutions such as disconnected tours or excess length. To overcome this, hybrid methods integrate simulated annealing, introducing stochastic noise (modeled as temperature-dependent fluctuations in neuron updates) to explore the state space broadly before cooling to refine valid minima, improving solution quality for larger instances.[24] Hopfield and Tank's 1985 formulation extended this paradigm to hardware, proposing analog VLSI implementations where neurons and synapses are realized as continuous-time circuits, enabling sub-millisecond solutions for optimization via massively parallel analog computation.[24]

Associative Memory and Pattern Completion

Hopfield networks serve as autoassociative memories, enabling the storage and retrieval of patterns through a process where a partial or noisy input is iteratively updated via synchronous or asynchronous dynamics until convergence to a stable state representing the closest stored pattern.[1] This pattern completion occurs as the network's state evolves according to the update rule $ s_i(t+1) = \text{sgn}\left( \sum_{j \neq i} w_{ij} s_j(t) \right) $, where $ \mathbf{s} $ is the state vector and $ \mathbf{W} $ is the weight matrix derived from Hebbian learning, effectively minimizing an energy function to settle into an attractor.[1] A defining feature of this mechanism is its content-addressable nature, allowing queries based on partial pattern content rather than explicit addresses, which facilitates robust retrieval even from degraded inputs within the basin of attraction of a stored memory.[1] In practical applications, Hopfield networks excel at image denoising by restoring corrupted pixel patterns to intact originals through iterative convergence, as demonstrated in reconstruction tasks for damaged grayscale images. They also support spell-checking by storing dictionary words as binary patterns and completing misspelled inputs to the nearest valid entry via error-correcting dynamics.[25] Additionally, in sensor data recovery, the networks reconstruct missing or noisy readings from partial observations, leveraging associative recall to infer complete datasets from predefined pattern libraries.[26] Performance in pattern completion is highly dependent on loading factor and noise level; at low loads (e.g., number of patterns much below 0.14N, where N is neuron count), success rates are high for inputs with low levels of noise, such as a small fraction of random bit flips, but degrade sharply near capacity due to spurious attractors and overlapping basins. For instance, in storing binary representations of human faces, the network can retrieve a full facial pattern from an occluded or noisy input by aligning it to the nearest memorized prototype, achieving reliable completion when noise is minimal and patterns are orthogonal.[27]

Modeling Human Memory

Hopfield networks serve as computational models of biological memory by representing stored patterns as stable attractor states, which function analogously to engrams—sparse assemblies of neurons that encode specific memories. In this framework, the basins of attraction surrounding these attractors enable associative recall, where partial activation of an engram, such as a cue from one memory, drives the network dynamics toward the complete stable state, mimicking how one recollection triggers related ones in the brain.[28][1] The model's biological ties stem from its use of symmetric synaptic weights, updated via a Hebbian learning rule that strengthens connections between co-active neurons, paralleling Hebbian plasticity observed in neural circuits. This rule, where synaptic efficacy increases when pre- and post-synaptic neurons fire together, aligns with mechanisms like long-term potentiation (LTP), a persistent strengthening of synapses fundamental to memory formation. Additionally, the network's energy function, which decreases during state transitions to reach local minima at attractors, models neural potential in a way that reflects the brain's tendency to settle into low-energy configurations representing stored information.[1][29][1] Developed in the early 1980s, the Hopfield network drew inspiration from associative memory processes in biological systems. It provides insights into phenomena like catastrophic interference, where overlearning—storing excessive patterns—leads to abrupt degradation of prior memories due to crosstalk in synaptic weights, echoing challenges in biological systems where new learning can disrupt established engrams.[1][30] Extensions to multi-layer architectures allow Hopfield networks to model hierarchical memory, where lower layers encode basic features and higher layers assemble them into complex representations, supported by bidirectional connections that facilitate context-dependent recall akin to cortical-hippocampal interactions. Despite these advances, the original binary-state formulation oversimplifies biological neurons, which exhibit graded responses rather than all-or-nothing firing; subsequent work extended the model to continuous activations to enhance realism, though it remains a foundational abstraction in computational neuroscience.[31][32][5]

Modern Extensions

Continuous Hopfield Networks

The continuous Hopfield network generalizes the discrete model by permitting neuron states to vary continuously within a bounded range, such as [1,1][-1, 1], rather than restricting them to binary values. This extension, introduced by Hopfield in 1984,[5] models neurons with graded responses akin to biological firing rates, using a smooth sigmoidal activation function to produce outputs that approximate the step function of the discrete case under high-gain conditions. The state of each neuron ii is denoted siRs_i \in \mathbb{R}, with the activation function typically defined as g(hi)=tanh(βhi)g(h_i) = \tanh(\beta h_i), where hi=jiWijsjh_i = \sum_{j \neq i} W_{ij} s_j is the local field, WW is the symmetric weight matrix with zero diagonal, and β>0\beta > 0 is a gain parameter that steepens the sigmoid as β\beta \to \infty, approaching the sign function. The network dynamics follow the continuous-time ordinary differential equation
dsidt=si+g(jiWijsj), \frac{ds_i}{dt} = -s_i + g\left( \sum_{j \neq i} W_{ij} s_j \right),
which can be interpreted as a relaxation process where each neuron's state evolves toward an equilibrium balancing decay and synaptic input. This dynamics minimizes a Lyapunov energy function
E=12sTWs+i0sig1(u)du, E = -\frac{1}{2} \mathbf{s}^T W \mathbf{s} + \sum_i \int_0^{s_i} g^{-1}(u) \, du,
analogous to the discrete energy but with an additional integral term accounting for the inverse activation, ensuring dEdt0\frac{dE}{dt} \leq 0 and asymptotic convergence to stable equilibria that are local minima of EE. Unlike the discrete model, convergence here occurs continuously without discrete updates, enabling smoother trajectories in state space. The continuous formulation supports graded memories, where patterns exhibit intermediate state values rather than strict binaries, while maintaining a storage capacity comparable to the discrete case, roughly on the order of 0.14N0.14N random patterns for NN neurons before significant error rates emerge. It proves advantageous for optimization tasks, as the analog nature facilitates efficient VLSI implementations, such as those proposed by Tank and Hopfield for solving combinatorial problems like the traveling salesman.[24]

Dense Associative Memories

Dense associative memories represent a significant advancement in Hopfield networks, enabling the storage and retrieval of exponentially many patterns relative to the network size, far surpassing the linear capacity of classical models. Introduced by Krotov and Hopfield in 2016, this formulation revives the Hopfield paradigm by addressing the classical capacity bottleneck—limited to approximately 0.14N patterns for N neurons—through dense interconnections that leverage higher-order or exponential interactions among patterns.[33] The key insight is the use of pattern separability, where random patterns in high dimensions are nearly orthogonal, allowing the network to distinguish and retrieve vastly more memories without significant crosstalk.[34] In the modern dense formulation, particularly for continuous states, the network's dynamics are governed by an energy function incorporating exponential terms to enable dense retrieval. The state vector $ \mathbf{x} \in \mathbb{R}^N $ evolves to minimize the energy
E(x)=1λlogμ=1Mexp(λxξμ)+12x2, E(\mathbf{x}) = -\frac{1}{\lambda} \log \sum_{\mu=1}^M \exp\left( \lambda \, \mathbf{x} \cdot \boldsymbol{\xi}^\mu \right) + \frac{1}{2} \|\mathbf{x}\|^2,
where $ {\boldsymbol{\xi}^\mu}_{\mu=1}^M $ are the stored pattern vectors in $ \mathbb{R}^N $, $ M $ is the number of patterns, and $ \lambda > 0 $ controls the sharpness of retrieval (analogous to inverse temperature).[35] This log-sum-exp structure ensures that the energy landscape features attractors sharply peaked at the stored patterns, facilitating reliable completion from partial or noisy inputs. The hidden states can be viewed as low-rank projections onto the subspace spanned by the patterns, as the retrieval process reconstructs $ \mathbf{x} $ as a weighted sum of $ \boldsymbol{\xi}^\mu $, emphasizing the dominant matching pattern in the limit of large $ \lambda $.[34] The iterative update rule follows gradient descent on this energy, yielding
xt+1=μ=1Maμtξμ,aμt=exp(λxtξμ)ν=1Mexp(λxtξν), \mathbf{x}_{t+1} = \sum_{\mu=1}^M a_\mu^t \, \boldsymbol{\xi}^\mu, \quad a_\mu^t = \frac{\exp\left( \lambda \, \mathbf{x}_t \cdot \boldsymbol{\xi}^\mu \right)}{\sum_{\nu=1}^M \exp\left( \lambda \, \mathbf{x}_t \cdot \boldsymbol{\xi}^\nu \right)},
which is a softmax over the dot-product similarities, converging in a single step for separable patterns.[35] In the high-temperature limit ($ \lambda \to 0 ),updatesaverageoverpatterns,whileatlow[temperature](/page/Temperature)(), updates average over patterns, while at low [temperature](/page/Temperature) ( \lambda \to \infty $), they approximate an argmax retrieval of the closest pattern. This approach achieves exponential storage capacity $ M \sim 2^{N/2} $, or more precisely $ M \sim e^{\alpha N} $ with $ \alpha \approx 0.1 $ for reliable retrieval of all patterns, relying on the exponential separability of random patterns in high dimensions.[34] These networks inherently handle continuous variables through the differentiable softmax update and quadratic regularization term, enabling smooth dynamics and integration into gradient-based learning.[35] Furthermore, the update rule mirrors the attention mechanism in transformer models, where patterns act as keys and values, and the state as the query, allowing dense associative memories to serve as a theoretical foundation for efficient long-range dependencies in sequence processing.[35] Unlike classical Hopfield attractors, this dense variant prioritizes one-step convergence over multi-step relaxation, making it suitable for rapid pattern completion tasks.[33]

Connections to Deep Learning

The attention mechanism in transformer architectures bears a direct mathematical resemblance to the retrieval dynamics of modern Hopfield networks. Specifically, the scaled dot-product attention operation retrieves relevant stored patterns in a manner equivalent to a single asynchronous update step in a continuous-state Hopfield network, where stored memories serve as keys and values.[35] This equivalence arises because the modern Hopfield update rule minimizes an energy function that aligns with the softmax-based weighting in attention, enabling transformers to function as associative memory systems for pattern completion during inference.[35] The core mapping can be expressed through the attention formula and its Hopfield counterpart. The standard scaled dot-product attention is given by
Attention(Q,K,V)=\softmax(QKdk)V, \text{Attention}(Q, K, V) = \softmax\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V,
where QQ, KK, and VV are query, key, and value matrices, and dkd_k is the key dimension.[35] In a modern Hopfield network, for a query state q\mathbf{q} and stored patterns Ξ=[ξ1,,ξN]\Xi = [\boldsymbol{\xi}_1, \dots, \boldsymbol{\xi}_N] (serving as both keys K=ΞK = \Xi and values V=ΞV = \Xi), the retrieved state is
x=\softmax(βΞqd)Ξ, \mathbf{x}' = \softmax\left( \beta \frac{\Xi^\top \mathbf{q}}{\sqrt{d}} \right)^\top \Xi,
with temperature parameter β=1/d\beta = 1/\sqrt{d} matching the scaling factor, thus identifying QK/dkQ K^\top / \sqrt{d_k} as the Hopfield similarity matrix.[35] The associated energy function for the modern Hopfield network, E(x)=1βlogiexp(βxξi)+12x2E(\mathbf{x}) = -\frac{1}{\beta} \log \sum_i \exp(\beta \mathbf{x}^\top \boldsymbol{\xi}_i) + \frac{1}{2} \|\mathbf{x}\|^2, further underscores this link, as attention implicitly minimizes a similar free-energy landscape to retrieve low-energy attractors.[35] Building on this foundation, researchers in the 2020s have interpreted multi-layer transformers as stacked or iterated Hopfield layers, where each attention block augments the network's capacity for dynamic memory storage and retrieval across sequences.[35] For instance, Hopfield layers have been integrated into hybrid deep learning architectures to bolster associative recall in tasks requiring long-context processing, such as sequence modeling, by leveraging the networks' exponential storage capacity relative to sequence length.[10] In generative modeling, diffusion models trained on discrete patterns exhibit energy landscapes asymptotically identical to those of modern Hopfield networks, framing denoising steps as energy-based retrieval of memorized distributions.[36] Recent advances from 2024 to 2025 have extended these connections to vision transformers, where Hopfield-inspired analysis of layer-wise energy landscapes—via metrics like the Layer Instability Index—reveals attractor dynamics that enable adaptive training and specialization in visual tasks, such as classification on datasets like CIFAR-100.[37] To mitigate the quadratic computational complexity of attention in large-scale transformers, sparse quantized Hopfield networks employ tree-structured architectures and local updates, achieving efficient one-shot recall and continual learning without backpropagation, outperforming dense variants on associative memory benchmarks.[38] As of November 2025, further extensions include self-learning magnetic Hopfield networks using spintronic systems for intrinsic unsupervised learning,[39] and modern complex-valued Hopfield networks enhancing associative memory for phase-sensitive applications.[40] These developments highlight Hopfield networks' role in scaling memory-augmented AI systems while preserving energy minimization principles.[38]

References

User Avatar
No comments yet.