Recent from talks
Nothing was collected or created yet.
Butterfly network
View on WikipediaThis article may be too technical for most readers to understand. (May 2017) |
A butterfly network is a technique to link multiple computers into a high-speed network. This form of multistage interconnection network topology can be used to connect different nodes in a multiprocessor system.
Technical requirements
[edit]The interconnect network for a shared memory multiprocessor system must have low latency and high bandwidth unlike other network systems, like local area networks (LANs) or internet[citation needed] for three reasons:
- Messages are relatively short as most messages are coherence protocol requests and responses without data.
- Messages are generated frequently because each read-miss or write-miss generates messages to every node in the system to ensure coherence. Read/write misses occur when the requested data is not in the processor's cache and must be fetched either from memory or from another processor's cache.
- Messages are generated frequently, therefore rendering it difficult for the processors to hide the communication delay.
Components
[edit]The major components of an interconnect network are:[citation needed]

- Processor nodes, which consist of one or more processors along with their caches, memories and communication assist.
- Switching nodes (Router), which connect communication assist of different processor nodes in a system. In multistage topologies, higher level switching nodes connect to lower level switching nodes as shown in figure 1, where switching nodes in rank 0 connect to processor nodes directly while switching nodes in rank 1 connect to switching nodes in rank 0.
- Links, which are physical wires between two switching nodes. They can be uni-directional or bi-directional.
These multistage networks have lower cost than a cross bar, but obtain lower contention than a bus. The ratio of switching nodes to processor nodes is greater than one in a butterfly network. Such topology, where the ratio of switching nodes to processor nodes is greater than one, is called an indirect topology.[1]
The network derives its name from connections between nodes in two adjacent ranks (as shown in figure 1), which resembles a butterfly. Merging top and bottom ranks into a single rank, creates a Wrapped Butterfly Network.[1] In figure 1, if rank 3 nodes are connected back to respective rank 0 nodes, then it becomes a wrapped butterfly network.
BBN Butterfly, a massive parallel computer built by Bolt, Beranek and Newman in the 1980s, used a butterfly interconnect network.[2] Later in 1990, Cray Research's machine Cray C90, used a butterfly network to communicate between its 16 processors and 1024 memory banks.[3]
Butterfly network building
[edit]For a butterfly network with p processor nodes, there need to be p(log2 p + 1) switching nodes. Figure 1 shows a network with 8 processor nodes, which implies 32 switching nodes. It represents each node as N(rank, column number). For example, the node at column 6 in rank 1 is represented as (1,6) and node at column 2 in rank 0 is represented as (0,2).[1]
For any 'i' greater than zero, a switching node N(i,j) gets connected to N(i-1, j) and N(i-1, m), where, m is inverted bit on ith location of j. For example, consider the node N(1,6): i equals 1 and j equals 6, therefore m is obtained by inverting the ith bit of 6.
| Variable | Binary representation | Decimal Representation |
|---|---|---|
| j | 110 | 6 |
| m | 010 | 2 |
As a result, the nodes connected to N(1,6) are :
| N(i,j) | N(i-1,j) | N(i-1,m) |
| (1,6) | (0,6) | (0,2) |
Thus, N(0,6), N(1,6), N(0,2), N(1,2) form a butterfly pattern. Several butterfly patterns exist in the figure and therefore, this network is called a Butterfly Network.
Butterfly network routing
[edit]
In a wrapped butterfly network (which means rank 0 gets merged with rank 3), a message is sent from processor 5 to processor 2.[1] In figure 2, this is shown by replicating the processor nodes below rank 3. The packet transmitted over the link follows the form:
| Header | Payload | Trailer |
The header contains the destination of the message, which is processor 2 (010 in binary). The payload is the message, M and trailer contains checksum. Therefore, the actual message transmitted from processor 5 is:
| 010 | M | checksum |
Upon reaching a switching node, one of the two output links is selected based on the most significant bit of the destination address. If that bit is zero, the left link is selected. If that bit is one, the right link is selected. Subsequently, this bit is removed from the destination address in the packet transmitted through the selected link. This is shown in figure 2.
- The above packet reaches N(0,5). From the header of the packet it removes the leftmost bit to decide the direction. Since it is a zero, left link of N(0,5) (which connects to N(1,1)) gets selected. The new header is '10'.
- The new packet reaches N(1,1). From the header of the packet it removes the leftmost bit to decide the direction. Since it is a one, right link of N(1,1) (which connects to N(2,3)) gets selected. The new header is '0'.
- The new packet reaches N(2,3). From the header of the packet it removes the leftmost bit to decide the direction. Since it is a zero, left link of N(2,3) (which connects to N(3,2)) gets selected. The header field is empty.
- Processor 2 receives the packet, which now contains only the payload 'M' and the checksum.
Butterfly network parameters
[edit]Several parameters help evaluate a network topology. The prominent ones relevant in designing large-scale multi-processor systems are summarized below and an explanation of how they are calculated for a butterfly network with 8 processor nodes as shown in figure 1 is provided.[citation needed]
- Bisection Bandwidth: The maximum bandwidth required to sustain communication between all nodes in the network. This can be interpreted as the minimum number of links that need to be severed to split the system into two equal portions. For example, the 8 node butterfly network can be split into two by cutting 4 links that crisscross across the middle. Thus bisection bandwidth of this particular system is 4. It is a representative measure of the bandwidth bottleneck which restricts overall communication.
- Diameter: The worst case latency (between two nodes) possible in the system. It can be calculated in terms of network hops, which is the number of links a message must travel in order to reach the destination node. In the 8 node butterfly network, it appears that N(0,0) and N(3,7) are farthest away, but upon inspection, it is apparent that due to the symmetric nature of the network, traversing from any rank 0 node to any rank 3 node requires only 3 hops. Therefore, the diameter of this system is 3.
- Links: Total number of links required to construct the entire network structure. This is an indicator of overall cost and complexity of implementation. The example network shown in figure 1 requires a total of 48 links (16 links each between rank 0 and 1, rank 1 and 2, rank 2 and 3).
- Degree: The complexity of each router in the network. This is equal to the number of in/out links connected to each switching node. The butterfly network switching nodes have 2 input links and 2 output links, hence it is a 4-degree network.
Comparison with other network topologies
[edit]This section compares the butterfly network with linear array, ring, 2-D mesh and hypercube networks.[4] Linear array can be considered as a 1-D mesh topology. Relevant parameters are compiled in the table[citation needed] (‘p’ represents the number of processor nodes).
| Topology | Diameter | Bisection Bandwidth | Links | Degree |
|---|---|---|---|---|
| Linear array | | 1 | | 2 |
| Ring | | 2 | p | 2 |
| 2-D mesh | 2(√p - 1) | √p | 2√p(√p - 1) | 4 |
| Hypercube | log2(p) | | log2(p) × (p/2) | log2(p) |
| Butterfly | log2(p) | , h = number of nodes in a single layer | log2(p) × 2p | 4 |
Advantages
[edit]- Butterfly networks have lower diameter than other topologies like a linear array, ring and 2-D mesh. This implies that in butterfly network, a message sent from one processor would reach its destination in a lower number of network hops.
- Butterfly networks have higher bisection bandwidth than other topologies. This implies that in butterfly network, a higher number of links need to be broken in order to prevent global communication.
- It has a bigger computer range.
Disadvantages
[edit]- Butterfly networks are more complex and costlier than other topologies due to the higher number of links required to sustain the network.
The difference between hypercube and butterfly lies within their implementation. Butterfly network has a symmetric structure where all processor nodes between two ranks are equidistant to each other, whereas hypercube is more suitable for a multi-processor system which demands unequal distances between its nodes. By looking at the number of links required, it may appear that hypercube is cheaper and simpler compared to a butterfly network, but as the number of processor nodes go beyond 16, the router cost and complexity (represented by degree) of butterfly network becomes lower than hypercube because its degree is independent of the number of nodes.
In conclusion, no single network topology is best for all scenarios. The decision is made based on factors like the number of processor nodes in the system, bandwidth-latency requirements, cost and scalability.
See also
[edit]References
[edit]- ^ a b c d Leighton, F.Thomson (1992). Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers. ISBN 1-55860-117-1.
- ^ T., LeBlanc; M., Scott; C., Brown (1988-01-01). Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor (Report). Butterfly Project. hdl:1802/15082.
- ^ Jadhav, Sunitha S (2009). Advanced Computer Architecture and Computing. Technical Publications. pp. Section 3–22. ISBN 9788184315721.
- ^ M. Arjomand, H. Sarbazi-Azad, "Performance Evaluation of Butterfly on-Chip Network for MPSoCs", International SoC Design Conference, pp. 1–296-1-299, 2008
Butterfly network
View on GrokipediaFundamentals
Definition and Purpose
The butterfly network is a multistage interconnection network (MIN) used in parallel computing to connect N inputs to N outputs through log₂ N stages of switching elements. Each stage consists of N/2 basic 2×2 switches, arranged such that the overall connectivity pattern visually resembles the wings of a butterfly due to the symmetric fanning of links between levels. This topology provides a unique path of length log₂ N between any input-output pair, enabling deterministic routing with low depth.[4] The primary purpose of the butterfly network is to facilitate efficient communication in multiprocessor systems, particularly for all-to-all data exchange patterns required in parallel algorithms. It excels in supporting permutation routing, where any arbitrary mapping of inputs to outputs can be realized simultaneously without internal conflicts, making it suitable for applications like fast Fourier transforms (FFT) and sorting operations that rely on structured data dependencies and collective communications.[5][4] At its core, the butterfly network employs a regular, fixed interconnection scheme across stages, often visualized as a two-dimensional grid where horizontal lines represent stages and diagonal links denote switch connections—straight for one output and crossed for the other. This design ensures scalability and predictability, with each switch in stage i connecting to specific nodes in stages i-1 and i+1 based on bit positions in the input address.[4] A representative example is an 8-input butterfly network (N=8, log₂ 8=3 stages), featuring 4 switches per stage. Inputs enter at stage 0 (e.g., labeled 000 to 111 in binary), connect via straight and cross links to stage 1 switches, which then route to stage 2, and finally to outputs at stage 3; for instance, input 000 might follow a straight path through all stages to output 000, while input 001 crosses in stage 1 to reach output 100, demonstrating the ports and interconnections that support full permutations.[4]Historical Development
The butterfly network topology originated in the late 1960s and 1970s amid growing interest in multistage interconnection networks (MINs) for efficient parallel processing and switching. Foundational concepts drew from permutation networks, including the perfect shuffle permutation proposed by Harold Stone in 1971, which enabled data redistribution across processors in a single stage and laid groundwork for more complex topologies. Earlier, V. E. Beneš introduced rearrangeable non-blocking networks in 1964, providing a theoretical basis for scalable switching fabrics that minimized crosspoint requirements while supporting arbitrary permutations. Complementing these, D. H. Lawrie described the omega network in 1975 as a self-routing MIN derived from Beneš structures, emphasizing fixed permutations for cost-effective interconnection in array processors. The butterfly network's distinctive structure also reflected influences from signal processing and sorting algorithms. The "butterfly" nomenclature and diagram echoed the computational graphs in the Cooley-Tukey fast Fourier transform (FFT) algorithm, published in 1965, where pairwise operations formed butterfly-like motifs for efficient polynomial evaluation. Similarly, Kenneth Batcher's 1968 odd-even mergesort network utilized comparator stages resembling butterfly connections to achieve parallel sorting with logarithmic depth, inspiring interconnection designs for permutation-heavy tasks in parallel computers. A pivotal practical advancement occurred in the late 1970s when Bolt, Beranek and Newman (BBN) adapted butterfly switching—rooted in perfect shuffles and omega networks—for multiprocessor systems, initially building on their 1972 Pluribus ARPANET switches.[6] BBN's design efforts began in 1978, culminating in the first Butterfly parallel processor delivery in 1981, featuring up to 256 Motorola 68000 processors interconnected via a multistage butterfly fabric for shared-memory emulation.[7] This system marked a key milestone in deploying MINs commercially, with the Butterfly series evolving through the 1980s; the enhanced Butterfly 1000 series was announced in 1987, incorporating improved processors and software for broader adoption in scientific computing.[7] From these theoretical MIN foundations, the butterfly topology transitioned to diverse implementations, including emulations on field-programmable gate arrays (FPGAs) for prototyping and acceleration in the 2000s and beyond. For instance, multi-FPGA setups have realized scalable 64-node butterfly networks compliant with standards like OCP-IP, enabling on-chip testing of parallel algorithms.Architecture
Core Components
The core components of a butterfly network consist primarily of switching elements, nodes arranged in stages, and interconnecting links that facilitate data routing in parallel computing systems. The fundamental switching elements are 2x2 crossbar switches, which operate at each stage to control permutations by selectively connecting inputs to outputs in either a straight (upper to upper, lower to lower) or cross (upper to lower, lower to upper) configuration.[8][9] These switches enable basic rerouting decisions, allowing the network to rearrange data paths dynamically. Each switch handles two inputs and two outputs, forming the building block for scalable interconnection. The node arrangement in a butterfly network supports N = 2^k inputs and outputs, organized across k stages, with each stage containing N/2 such 2x2 switches.[9][10] Inputs connect to the first stage of switches, and outputs emanate from the final stage, creating a multi-level hierarchy that resembles the wings of a butterfly when visualized. For example, in an 8x8 network (k=3), there are 3 stages, each with 4 switches, totaling 12 switches that interconnect the 8 input nodes to 8 output nodes. This radix-2 structure ensures a balanced distribution of computational load across stages. Links between consecutive stages employ permutation-based connections, specifically a perfect shuffle pattern, where outputs from one stage are redistributed to inputs of the next by interleaving indices. The perfect shuffle permutes the N outputs of one stage to the N inputs of the next stage, for example, by connecting the output with index i to the input with index obtained by left-rotating the binary representation of i by one bit (modulo adjustment for power of 2), ensuring each output links to exactly one input.[8][10] This shuffling promotes even traffic distribution and supports efficient path diversity. Variants of the butterfly network include the baseline configuration, which uses linear indexing for connections without wrapping, and the wrapped-around (or circular shift) version, where the shuffle incorporates modular arithmetic to connect the ends, potentially reducing path lengths in cyclic applications.[8] The logical layout of a butterfly network illustrates its radix-2 structure through a series of horizontal stages, with vertical and diagonal links forming non-blocking paths from any input to any output via a unique route, akin to edges in a hypercube projection.[9] This design allows the components to collectively enable permutation routing, as explored in subsequent operational analyses.Network Construction
The butterfly network is constructed as a multistage interconnection network using a series of switching stages connected via specific permutation patterns. It begins with an input stage consisting of N processor nodes, where N is a power of 2, typically N = 2^k for some integer k. Each subsequent stage is formed by applying a perfect shuffle permutation to the outputs of the previous stage, linking them to 2x2 crossbar switches that enable either straight-through or crossover connections. This process repeats for log₂ N stages, culminating in an output stage mirroring the input, resulting in a total of log₂ N + 1 levels or ranks.[11][8] For larger networks, the construction follows a recursive approach. A butterfly network of order k (with 2^k nodes per rank) is built by combining two networks of order k-1: one for the lower half and one for the upper half, shifted by 2^{k-1}. The input rank of the full network connects to the first stage of both subnetworks via straight and shuffled links, while the outputs merge similarly. This recursive method allows scalable assembly, where each level adds a new dimension of connectivity without redesigning the base topology.[8] Wiring patterns are designed to minimize crossovers, often employing multilayer VLSI layouts with L routing layers (L ≥ 2) to separate horizontal and vertical tracks, achieving area complexity O(N log N) and bounded wire lengths. For instance, odd layers handle vertical connections, and even layers manage horizontal ones, reducing interference in dense chip integrations.[12] A concrete example is the construction of a 16-node butterfly network (k=4, 5 ranks with 16 nodes each). The input nodes are labeled in binary from 0000 to 1111 and connect pairwise to the 8 switches in stage 1 (e.g., inputs 0000 and 0001 to switch 0, 0010 and 0011 to switch 1). Outputs from stage 1 are then permuted via perfect shuffle to inputs of stage 2 switches, such as by left-rotating the binary labels by one bit. Subsequent shuffles propagate similarly through the stages, yielding full connectivity through four switching stages.[11] Nodes are addressed using binary labels of the form [i, j], where i denotes the rank (0 to k) and j is the k-bit binary position within the rank (0 to 2^k - 1). Connections between [i, j] and [i+1, m] follow the rule: straight link to m = j, and shuffle link to m = (j + 2^{k-i-1}) mod 2^k, ensuring systematic wiring.[8]Operation
Routing Algorithms
The butterfly network possesses a self-routing property, in which the destination address bits directly guide path selection at each switch without requiring centralized control or routing tables.[13] This property enables packets to navigate autonomously through the network stages, leveraging the binary labeling of nodes and links.[14] The core routing algorithm operates as follows: for a destination address (where and is the network size), at stage (with stages numbered from 0 to ), the switch examines bit to select the output port—typically 0 for the straight (upper) path and 1 for the cross (lower) path—before stripping that bit and forwarding the remaining address to the next stage.[13] This deterministic process ensures a unique shortest path of length from source to destination.[14] In the presence of contention, where multiple packets compete for the same output port, conflict resolution strategies include randomized routing (e.g., routing packets via a randomly chosen intermediate destination to balance load) or priority-based arbitration (using fixed priorities or nonpredictive selection to resolve bids without inspecting full destinations).[15] These methods, combined with bounded queuing at switches, prevent indefinite blocking while maintaining network throughput.[15] The self-routing mechanism supports any permutation of inputs to outputs in depth, as each packet follows an independent path determined by its destination tag; this includes the bit-reversal permutation essential for efficient fast Fourier transform (FFT) implementations, where data reordering aligns with the network's butterfly computation structure.[15][16] For basic unicast routing, the algorithm can be expressed in pseudocode as follows:function route_packet(source, destination):
current_address = destination # Binary representation: d_{k-1} ... d_0
current_node = source
for i from 0 to k-1:
bit = extract_bit(current_address, i) # d_i
if bit == 0:
next_node = upper_neighbor(current_node, stage=i)
else:
next_node = lower_neighbor(current_node, stage=i)
forward_packet_to(next_node, remaining_address) # Strip d_i
current_node = next_node
deliver_to(destination)
function route_packet(source, destination):
current_address = destination # Binary representation: d_{k-1} ... d_0
current_node = source
for i from 0 to k-1:
bit = extract_bit(current_address, i) # d_i
if bit == 0:
next_node = upper_neighbor(current_node, stage=i)
else:
next_node = lower_neighbor(current_node, stage=i)
forward_packet_to(next_node, remaining_address) # Strip d_i
current_node = next_node
deliver_to(destination)
Key Parameters
The butterfly network, as a multistage interconnection topology connecting inputs to outputs, exhibits several key parameters that determine its structural efficiency and performance bounds. The diameter represents the longest shortest path between any pair of nodes, measured in hops, and is equal to , corresponding to the number of stages traversed in the worst case.[2] This low diameter enables low-latency communication for distant nodes, with the maximum path length strictly stages.[17] The degree of each switch in the network is 4, consisting of two input ports and two output ports in the standard 2x2 configuration.[2] This fixed degree balances connectivity and hardware simplicity, as intermediate switches connect bidirectionally across stages. The bisection bandwidth, defined as the minimum number of links crossing a balanced partition of the network into two equal-sized sets of nodes each, is .[17] This arises from the balanced cuts between consecutive stages, where exactly channels span the midpoint, providing linear scalability in aggregate capacity.[18] In terms of cost, the network requires switches and links overall, with stages each containing switches and links per stage.[17] This quadratic-logarithmic scaling reflects the total hardware investment for full connectivity. The path length between source and destination is precisely the number of stages , as routing follows a unique deterministic path through the topology.[17] Regarding scalability limits, the butterfly network demonstrates sensitivity to faults in individual stages, where a single switch or link failure can partition the network and disrupt connectivity for N node pairs due to the unique-path routing. This vulnerability necessitates additional redundancy, such as extra stages, for enhanced fault tolerance in practical deployments.Analysis and Applications
Performance Metrics
The butterfly network exhibits an average latency of under uniform traffic loads, leveraging its logarithmic diameter for efficient routing in permutations and balanced workloads. However, hotspots arising from uneven traffic distribution can lead to worst-case contention, escalating latency to in adversarial scenarios where multiple packets converge on the same link or switch.[19][20] Historical throughput benchmarks for the BBN Butterfly multiprocessor, a seminal implementation from the 1980s, demonstrated scaling to higher aggregate rates in multi-node configurations. Modern emulations of butterfly topologies on GPUs have extended these capabilities, achieving significantly higher throughput in parallel graph algorithms; for instance, multi-GPU implementations of butterfly-based breadth-first search (BFS) traversals report over 10x speedup compared to CPU baselines for large-scale network analysis.[21][22] In applications, the butterfly network's structure aligns naturally with the Cooley-Tukey FFT algorithm, enabling efficient mapping of butterfly operations for computation of discrete Fourier transforms in parallel environments. Similarly, for parallel sorting, the network supports implementations of the AKS sorting network and parallel prefix computations, facilitating deterministic depth sorting with constant fan-out.[23] Fault tolerance in butterfly networks stems from redundancy in variants like multibutterfly or extra-stage designs, which provide multiple disjoint paths between nodes, enhancing reliability in multistage interconnection setups.[24] Simulation studies highlight the network's efficiency for random permutations under light to moderate loads, but dropping in adversarial traffic patterns that induce congestion. Post-2000 research has revitalized butterfly topologies in optical networks and network-on-chip (NoC) designs, where flattened or Clos-based variants demonstrate reduced power consumption and latency in photonic interconnects for chip-scale computing.[25][26][27]Comparisons with Other Topologies
The butterfly network shares a diameter of with the hypercube network, facilitating efficient short-path routing in both. Unlike the hypercube, which requires a node degree of that grows with system size, the butterfly maintains a constant degree of 4, reducing wiring complexity and improving scalability for large-scale implementations.[8] The hypercube's direct interconnection among processors enables symmetric, reconfigurable routing, whereas the butterfly's indirect multistage design relies on fixed switch stages for connectivity.[11] Both the butterfly and omega networks are multistage interconnection networks (MINs) with a diameter of and bisection width of . The omega network is inherently blocking, supporting only permutations without path conflicts due to its self-routing shuffle-exchange pattern.[9] In contrast, the butterfly is rearrangeably non-blocking, allowing any permutation to be realized by reconfiguring switch settings, thus providing superior support for arbitrary communication patterns.[9] Relative to 2D mesh and torus topologies, the butterfly offers substantially lower latency with its diameter compared to the diameter of meshes and tori, both of which maintain a constant degree of 4.[8] Meshes and tori excel in scalability for applications involving 2D spatial embeddings, such as grid-based simulations, where the butterfly's non-planar structure leads to inefficient local routing.[11] The butterfly network contrasts with fat-tree topologies by employing a simpler, single-level scaling structure with diameter , versus the fat-tree's hierarchical design yielding a diameter of .[8] While both achieve comparable bisection widths around , fat-trees distribute bandwidth more evenly across levels through increasing link capacities toward the root, whereas the butterfly's uniform staging results in less balanced throughput under uneven loads.[11]| Topology | Diameter | Degree | Bisection Width |
|---|---|---|---|
| Butterfly | 10 | 4 | 512 |
| Hypercube | 10 | 10 | 512 |
| Omega | 10 | 4 | 512 |
| 2D Mesh | 62 | 4 | 32 |
| Fat-Tree | 20 | 3 | 512 |
