Parallel breadth-first search

from Wikipedia

The breadth-first search (BFS) algorithm explores the vertices of a graph layer by layer. It is a basic algorithm in graph theory and a building block of many other graph algorithms; for instance, Dinic's algorithm uses BFS to find a maximum flow in a graph. BFS is also one of the kernel algorithms in the Graph500 benchmark for data-intensive supercomputing problems.[1] This article discusses the possibility of speeding up BFS through the use of parallel computing.

Serial breadth-first search

In the conventional sequential BFS algorithm, two data structures are created to store the frontier and the next frontier. The frontier contains all vertices that have the same distance (also called "level") from the source vertex; these are the vertices to be explored in the current step of BFS. Every neighbor of these vertices is checked, and those neighbors that have not yet been explored are discovered and put into the next frontier. At the beginning of the BFS algorithm, a given source vertex s is the only vertex in the frontier. All direct neighbors of s are visited in the first step, and they form the next frontier. After each layer-traversal, the "next frontier" becomes the frontier and newly discovered vertices are stored in the new next frontier. The following pseudo-code outlines the idea, in which the data structures for the frontier and next frontier are called FS and NS respectively.

1    define bfs_sequential(graph(V,E), source s):
2        for all v in V do
3            d[v] = -1;
4        d[s] = 0; level = 1; FS = {}; NS = {};
5        push(s, FS);
6        while FS !empty do
7            for u in FS do 
8                for each neighbour v of u do 
9                    if d[v] = -1 then
10                       push(v, NS);
11                       d[v] = level;
12           FS = NS, NS = {}, level = level + 1;
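For reference, a compact runnable version of this sequential algorithm might look as follows (sketched here in C++ with an adjacency-list graph; the function name and container choices are illustrative rather than canonical).

#include <vector>

// Sequential BFS: returns the level (distance) of every vertex from the source.
// adj[u] holds the neighbors of vertex u; -1 marks vertices not yet discovered.
std::vector<int> bfs_sequential(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> d(adj.size(), -1);
    std::vector<int> frontier = {source};    // FS in the pseudo-code above
    d[source] = 0;
    int level = 1;
    while (!frontier.empty()) {
        std::vector<int> next;                // NS in the pseudo-code above
        for (int u : frontier)
            for (int v : adj[u])
                if (d[v] == -1) {             // undiscovered neighbor
                    d[v] = level;
                    next.push_back(v);
                }
        frontier = std::move(next);
        ++level;
    }
    return d;
}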

First step of parallelization

As a simple and intuitive solution, the classic Parallel Random Access Machine (PRAM) approach is just an extension of the sequential algorithm shown above. The two for-loops (line 7 and line 8) can be executed in parallel. The update of the next frontier (line 10) and the increase of distance (line 11) need to be atomic. Atomic operations are operations that either run entirely without interruption or not at all (i.e. "all or nothing").

In the Parallel Random Access Machine (PRAM) model, multiple processors share a common memory.
A PRAM model.

However, there are two problems with this simple parallelization. Firstly, the distance-checking (line 9) and distance-updating operations (line 11) introduce two benign races. The reason for the race is that a neighbor of one vertex can also be the neighbor of another vertex in the frontier. As a result, the distance of this neighbor may be examined and updated more than once. Although these races waste resources and lead to unnecessary overhead, with the help of synchronization they do not influence the correctness of BFS, so these races are benign. Secondly, in spite of the speedup of each layer-traversal due to parallel processing, a barrier synchronization is needed after every layer in order to completely discover all neighbor vertices in the frontier. This layer-by-layer synchronization means that the number of communication steps needed equals the longest distance between two vertices, O(d), where O is the big O notation and d is the graph diameter.
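A minimal shared-memory sketch of this level-synchronous parallelization is given below (in C++ with OpenMP; the function name and container choices are illustrative). Here the race on the distance check is resolved with an atomic compare-and-swap, and the end of the parallel region plays the role of the per-layer barrier.

#include <atomic>
#include <vector>
#include <omp.h>

// Level-synchronous parallel BFS sketch: the frontier is expanded in parallel,
// the compare-and-swap on d[v] ensures every vertex is claimed exactly once,
// and the end of the parallel region acts as the per-layer barrier.
std::vector<int> bfs_pram_style(const std::vector<std::vector<int>>& adj, int source) {
    const int n = (int)adj.size();
    std::vector<std::atomic<int>> d(n);
    for (auto& x : d) x.store(-1);
    d[source] = 0;

    std::vector<int> frontier = {source};    // FS
    int level = 1;
    while (!frontier.empty()) {
        std::vector<int> next;               // NS
        #pragma omp parallel
        {
            std::vector<int> local;          // per-thread partial next frontier
            #pragma omp for nowait
            for (int i = 0; i < (int)frontier.size(); ++i)
                for (int v : adj[frontier[i]]) {
                    int expected = -1;
                    // only the thread that wins the compare-and-swap claims v
                    if (d[v].compare_exchange_strong(expected, level))
                        local.push_back(v);
                }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        } // implicit barrier: all threads finish this layer before the swap
        frontier = std::move(next);
        ++level;
    }

    std::vector<int> result(n);
    for (int i = 0; i < n; ++i) result[i] = d[i].load();
    return result;
}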

This simple parallelization's asymptotic complexity is the same as that of the sequential algorithm in the worst case. Better BFS parallelization can be achieved with optimizations, such as:

  1. Mitigating barrier synchronization. Barrier synchronization is necessary after each layer-traversal to ensure correctness. Reducing the cost of barrier synchronization is an effective way to speed up parallel BFS.
  2. Load-balancing for neighbor discovery. Because there is a barrier synchronization after each layer-traversal, every processing unit must wait for the last one to finish its work. Therefore, the processing unit with the most neighbors decides the time consumption of this layer. With the optimization of load-balancing, the time of layer-traversal can be reduced.
  3. Improving the locality of memory references. In parallel systems with distributed memory, remote memory references are accessing data from other processing units, which usually incurs extra communication cost compared to local memory access. A more efficient data structure design or organization of data can reduce the need for remote memory access, hence reducing the total communication cost.

Parallel BFS with shared memory

Compared to parallel BFS with distributed memory, shared memory provides higher memory-bandwidth and lower latency due to direct access by all processors. Therefore, the overhead of message passing, which is necessary for distributed memory access, is avoided.[2]

A graphic example of the shared-memory model. Each processor has a local cache and is connected to the network. Through this network every processor has access to shared memory blocks.
A shared memory model.

However, the number of vertices in each layer and the number of neighbors of each vertex are known to be highly irregular, which leads to highly irregular memory accesses and work distribution in BFS. In parallel BFS, this feature reduces the benefit of parallelization because of the unbalanced load. As a result, it is very important to make parallel BFS on shared memory load-balanced. Moreover, exploiting data locality can also speed up the parallel execution.

Parallel BFS algorithms with shared memory can be divided into two types: container centric approaches and vertex centric approaches.[3] In the container centric approach, two data structures are created to store the current frontier and the next vertex frontier. The next vertex frontier is switched to the current frontier at the end of each iteration. There is a trade-off between the cost of synchronization and data locality according to the location of data. These two data structures can be held in every processing unit (such as a thread) which supports data locality but needs extra load balancing mechanisms. Alternatively, they can be stored globally to provide implicit load balancing, where special data structures are used for concurrent access, which requires additional effort for synchronization.

Besides, data organization of containers can be optimized. The typical data structure in serial BFS and some parallel BFS is a FIFO Queue, as it is simple and fast (insertion and delete operations have constant time cost).

Another alternative is the bag-structure.[4] The insertion operation in a bag takes O(log n) time in the worst case, but only constant amortized time, which is as fast as a FIFO queue. Furthermore, the union of two bags takes Θ(lg n) time, where n is the number of elements in the smaller bag. The bag-split operation also takes Θ(lg n) time. With the help of the bag-structure, a certain number of vertices (according to a granularity parameter) are stored in one bag, and the bag-structure becomes the basic parallel entity. Moreover, a reducer can be combined with the bag-structure to write vertices in parallel and traverse them efficiently.

The vertex-centric approach treats each vertex as a parallel entity, which enables parallel iteration. Each vertex is assigned to a parallel entity. This vertex-centric approach might only work well if the graph depth is very low; graph depth in BFS is defined as the maximum distance from any vertex in the graph to the source vertex. Therefore, the vertex-centric approach is well-suited for GPUs if every thread is mapped to exactly one vertex.[3]
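As a rough illustration, the following sketch (C++ with OpenMP; on a GPU the loop body would correspond to one thread per vertex) performs one vertex-centric, pull-style level iteration in which every unvisited vertex looks for a neighbor in the current frontier. The function name and the assumption of an adjacency-list graph are illustrative.

#include <vector>

// One vertex-centric level iteration: each vertex is its own parallel entity.
// An unvisited vertex v joins level 'level' if some neighbor lies in the
// current frontier (distance level - 1). Returns whether any vertex was found.
bool relax_level_vertex_centric(const std::vector<std::vector<int>>& adj,
                                std::vector<int>& d, int level) {
    bool discovered = false;
    #pragma omp parallel for reduction(||:discovered)
    for (int v = 0; v < (int)adj.size(); ++v) {
        if (d[v] != -1) continue;            // already visited
        for (int u : adj[v])
            if (d[u] == level - 1) {         // neighbor is in the frontier
                d[v] = level;                // each thread writes only its own v
                discovered = true;
                break;
            }
    }
    return discovered;
}

A full traversal would call this for level = 1, 2, ... until no new vertex is discovered.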

Parallel BFS with distributed memory

In the distributed memory model, each processing unit has its own memory. Because of this, processing units must communicate via message passing to share their local data and access remote data.

In the distributed memory model, each processor has its own cache and memory. They communicate with each other using the network and message passing.
A distributed memory model.

1-D partitioning

1D partitioning is the simplest way to combine parallel BFS with distributed memory. It is based on vertex partitioning. Load balancing is still an important issue for data partitioning, which determines how much we can benefit from parallelization. In other words, each processing unit with distributed memory should be in charge of approximately the same number of vertices and their outgoing edges. For data storage, each processor can store the adjacency information of its local vertices, in which each vertex's row of outgoing edges is represented by destination vertex indices.
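A minimal sketch of such an ownership function, assuming vertices numbered 0..n−1 are dealt out in contiguous blocks of roughly n/p, is shown below; the name find_owner matches the helper used in the multithreaded pseudo-code further down, but the block layout itself is only one possible convention.

// 1-D block partitioning: vertex v belongs to the processor that stores its
// block of ceil(n/p) consecutive vertex identifiers.
int find_owner(long long v, long long n, int p) {
    long long block = (n + p - 1) / p;   // ceiling of n/p vertices per processor
    return (int)(v / block);
}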

As opposed to shared-memory BFS, a neighbor of a vertex owned by one processing unit may be stored in another processing unit. As a result, each processing unit is responsible for telling those processing units about the traversal status by sending them messages. Moreover, each processing unit should also deal with the messages from all other processing units to construct its local next vertex frontier. Obviously, one all-to-all communication (in which each unit has different messages for all others) is necessary in each step when exchanging the current frontier and the next vertex frontier.

The following pseudo-code of a 1-D distributed-memory BFS[5] was originally designed for IBM BlueGene/L systems, which have a 3D torus network architecture. Because synchronization is the main extra cost of parallelized BFS, the authors of this paper also developed a scalable all-to-all communication scheme based on point-to-point communication. After that, they also reduced the number of point-to-point communications, taking advantage of the high-bandwidth torus network.

The main steps of BFS traversal in the following algorithm are:

  1. processor view (line 8): construct the frontier FS with vertices from local storage
  2. global view (line 10–11): terminate the traversal if FS from all processors are empty
  3. processor view (line 13): construct the next frontier based on the neighbor vertices of its FS; some of these neighbors may be stored in other processors
  4. global view (line 15–18): run an all-to-all communication to let each processor know which local vertices should be put into its local next frontier NS
  5. processor view (line 20–22): receive messages from all other processors, update the distance values of the newly discovered local vertices, and advance to the next level
1    define 1_D_distributed_memory_BFS( graph(V,E), source s):
2        //normal initialization
3        for all v in V do
4            d[v] = -1;
5        d[s] = 0; level = 0; FS = {}; NS = {};
6        //begin BFS traversal
7        while True do:
8            FS = {the set of local vertices v with d[v] == level}
9            //all vertices traversed
10           if FS = {} for all processors then:
11               terminate the while loop
12           //construct the NS based on local vertices in current frontier
13           NS = {neighbors of vertices in FS, both local and non-local vertices}
14           //synchronization: all-to-all communication
15           for 0 <= j < p do:
16               N_j = {vertices in NS owned by processor j}
17               send N_j to processor j
18               receive N_j_rcv from processor j
19           //combine the received message to form local next vertex frontier then update the level for them
20           NS_rcv = Union(N_j_rcv)
21           for v in NS_rcv and d[v] == -1 do
22               d[v] = level + 1
23           level = level + 1
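The all-to-all exchange in lines 15–18 maps naturally onto MPI collectives. The following is a minimal sketch (C++ with MPI), assuming each rank has already grouped the neighbors it discovered by owner rank; buffer and function names are illustrative, not taken from the original implementation.

#include <mpi.h>
#include <vector>

// Per-level frontier exchange: outgoing[j] holds the vertex ids this rank
// wants to deliver to rank j. Returns the concatenation of everything the
// other ranks sent to us (the candidates for our local next frontier).
std::vector<int> exchange_frontier(const std::vector<std::vector<int>>& outgoing,
                                   MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);

    std::vector<int> sendcounts(p), senddispls(p), sendbuf;
    for (int j = 0; j < p; ++j) {
        senddispls[j] = (int)sendbuf.size();
        sendcounts[j] = (int)outgoing[j].size();
        sendbuf.insert(sendbuf.end(), outgoing[j].begin(), outgoing[j].end());
    }

    // First learn how much data every rank will send us ...
    std::vector<int> recvcounts(p), recvdispls(p);
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);
    int total = 0;
    for (int j = 0; j < p; ++j) { recvdispls[j] = total; total += recvcounts[j]; }

    // ... then exchange the vertex lists themselves in one collective call.
    std::vector<int> recvbuf(total);
    MPI_Alltoallv(sendbuf.data(), sendcounts.data(), senddispls.data(), MPI_INT,
                  recvbuf.data(), recvcounts.data(), recvdispls.data(), MPI_INT, comm);
    return recvbuf;
}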

Combined with multi-threading, the following pseudo-code of 1D distributed-memory BFS, which comes from the paper,[6] also specifies the thread stacks and thread barriers.

With multi-threading, the local vertices in the frontier FS can be divided and assigned to different threads inside one processor, which further parallelizes the BFS traversal. In contrast with the methods above, more data structures are needed for each individual thread, for instance a thread stack prepared for saving the neighbor vertices discovered from the vertices assigned to that thread. Each thread has p−1 local storages, where p is the number of processors, because each thread must separate the messages destined for each of the other processors. For example, a thread puts neighbor vertices into its j-th stack to form the message to send to processor j, if processor j is the owner of those vertices. Moreover, a thread barrier is also necessary for synchronization. As a result, although distributed memory with multi-threading might benefit from this finer-grained parallelization, it also introduces extra synchronization cost for the threads.

The main steps of BFS traversal in the following algorithm are:

  1. thread view (line 19–22): based on the vertices assigned to it, find the owner processor of each neighbor vertex and push the neighbors onto the thread stacks according to their owners.
  2. processor view (line 23): run a thread barrier; wait until all threads (of the same processor) finish their job.
  3. processor view (line 25–26): merge the thread stacks of all threads that have the same owner (i.e. the same destination processor for the next step).
  4. global view (line 28–30): run an all-to-all communication with the master thread to let each processor know which local vertices should be put into the next frontier.
  5. processor view (line 31): run a thread barrier; wait until the master thread has finished the communication.
  6. processor view (line 33): assign the vertices from the received buffer to the threads.
  7. thread view (line 34–36): if a vertex is not yet visited, update its distance value and push it onto the thread stack for the next frontier NS.
  8. processor view (line 37): run a thread barrier; wait until all threads (of the same processor) finish their job.
  9. processor view (line 39): aggregate the thread stacks for the next frontier from every thread.
  10. processor view (line 40): run a thread barrier; wait until all threads have contributed the vertices in their stacks.
1    define 1_D_distributed_memory_BFS_with_threads(graph(V,E), source s):
2        // normal initialization
3        for all v in V do
4            d[v] = -1;
5        level = 1; FS = {}; NS = {};
6        // find the index of the owner processor of source vertex s
7        pu_s = find_owner(s);
8        if pu_s = index_pu then
9            push(s,FS);
10           d[s] = 0;
11       // message initialization
12       for 0 <= j < p do
13           sendBuffer_j = {}   // p shared message buffers
14           recvBuffer_j = {}   // for MPI communication
15           thrdBuffer_i_j = {} //thread-local stack for thread i
16       // begin BFS traversal
17       while FS != {} do
18           //traverse vertices and find owners of neighbor vertices
19           for each u in FS in parallel do
20               for each neighbor v of u do
21                   pu_v = find_owner(v)
22                   push(v, thrdBuffer_i_(pu_v))
23           Thread Barrier
24           // combine thread stack to form sendBuffer
25           for 0 <= j < p do
26               sendBuffer_j = merge thrdBuffer_i_j over all threads i, in parallel
27           // all-to-all communication 
28           All-to-all collective step with master thread: 
29               1. send data in sendBuffer
30               2. receive and aggregate newly visited vertices into recvBuffer
31           Thread Barrier
32           // update level for new-visited vertices 
33           for each u in recvBuffer in parallel do
34               if d[u] == -1 then
35                   d[u] = level
36                   push(u, NS_i)
37           Thread Barrier
38           // aggregate NS and form new FS 
39           FS = Union(NS_i)
40           Thread Barrier
41           level = level + 1

2-D partitioning

Because the graph can be represented by its adjacency matrix, a natural 2D decomposition of the matrix is also an option to consider. In 2D partitioning, each processor has a 2D index (i,j). The edges and vertices are assigned to all processors with a 2D block decomposition, in which each processor stores a sub-adjacency matrix.

If there are P = R·C processors in total, then the adjacency matrix is divided as follows:

The adjacency matrix is divided into C columns and R·C rows.
2D-partition of the adjacency matrix.

There are C block columns and R·C block rows after this division. Each processor is in charge of C blocks; namely, processor (i,j) stores the blocks Ai,j(1) to Ai,j(C). The conventional 1D partitioning is equivalent to 2D partitioning with R = 1 or C = 1.

In general, the parallel edge processing based on 2D partitioning can be organized in 2 communication phases, which are "expand" phase and "fold" phase.[6]

In the "expand" phase, if the edge list for a given vertex is the column of the adjacency matrix, then for each vertex v in the frontier, the owner of v is responsible to tell other processors in its processor column that v is visited. That's because each processor only stores partial edge lists of vertices. After this communication, each processor can traverse the column of according to the vertices and find out their neighbors to form the next frontier.[5]

In the "fold" phase, vertices in resulting next frontier are sent to their owner processors to form the new frontier locally. With 2D partitioning, these processors are in the same processor row.[5]

The main steps of BFS traversal in this 2D partitioning algorithm are (for each processor):

  1. expand phase (line 13–15): based on local vertices, send messages only to processors in the same processor column to tell them that these vertices are in the frontier, and receive the corresponding messages from these processors.
  2. (line 17–18): merge all received messages and form the neighbor set N. Note that not all vertices from the received messages should be put into the next frontier; some of them may have been visited already. The next frontier only contains vertices whose distance value is −1.
  3. fold phase (line 20–23): based on the local vertices in the next frontier, send messages to the owner processors of these vertices, which lie in the same processor row.
  4. (line 25–28): merge all received messages and update the distance values of the vertices in the next frontier.

The pseudo-code below, taken from the paper,[5] describes the 2D BFS algorithm in more detail:

1    define 2_D_distributed_memory_BFS( graph(V,E), source s):
2        // normal initialization
3        for all v in V do
4            d[v] = -1;
5        d[s] = 0; 
6        // begin BFS traversal
7        for l = 0 to infinite do:
8            F = {the set of local vertices with level l}
9            // all vertices traversed
10           if F = {} for all processors then:
11               terminate the loop
12           // traverse vertices by sending message to selected processor
13           for all processor q in this processor-column do:
14               Send F to processor q
15               Receive Fqr from q
16           // deal with the receiving information after the frontier traversal
17           Fr = Union{Fqr} for all q
18           N = {neighbors of vertices in Fr using edge lists on this processor}
19           // broadcast the neighbor vertices by sending message to their owner processor
20           for all processor q in this processor-row do:
21               Nq = {vertices in N owned by processor q}
22               Send Nq to processor q
23               Receive Nqr from q
24           // form the next frontier used for next layer traversal
25           Nr = Union{Nqr} for all q
26           // layer distance update
27           for v in Nr and d[v] == -1 do:
28               d[v] = l + 1

In 2D partitioning, only columns or rows of processors participate in communication in the "expand" or "fold" phase respectively.[5] This is the advantage of 2D partitioning over 1D partitioning, because all processors are involved in the all-to-all communication operation in 1D partitioning. Besides, 2D partitioning is also more flexible for load balancing, which makes a scalable and storage-efficient implementation much easier to achieve.

Implementation of optimization strategies

Aside from the basic ideas of parallel BFS, some optimization strategies can be used to speed up the parallel BFS algorithm and improve its efficiency, such as direction optimization, load-balancing mechanisms, and improved data structures.

Direction optimization

In the original top-down BFS, each vertex in the frontier examines all of its neighbors. This is sometimes inefficient when the graph has a low diameter[7] but some vertices have much higher degrees than average, as in a small-world graph.[8] As mentioned before, one benign race in parallel BFS is that, if more than one vertex in the frontier has common neighbor vertices, the distance of these neighbor vertices will be checked many times. Although the distance update is still correct with the help of synchronization, resources are wasted. In fact, to find the vertices for the next frontier, each unvisited vertex only needs to check whether any of its neighbors is in the frontier. This is also the core idea of direction optimization. Better still, each vertex would quickly find a parent by checking its incoming edges if a significant number of its neighbors are in the frontier.

In the paper,[8] the authors introduce a bottom-up BFS where each unvisited vertex only needs to check whether any of its potential parents is in the frontier. This can be determined efficiently if the frontier is represented by a bitmap. Compared to the top-down BFS, the bottom-up BFS reduces failed checks and avoids contention, because each unvisited vertex examines its potential parents itself.

However, the bottom-up BFS requires serializing the work on each vertex and only works better when a large fraction of the vertices are in the frontier. As a result, a direction-optimized BFS should combine the top-down and the bottom-up approaches. In particular, the BFS should start in the top-down direction, switch to the bottom-up BFS when the number of frontier vertices exceeds a given threshold, and switch back when it falls below another threshold.[8]
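The switching rule can be as simple as comparing the work of the two directions at each level. The sketch below (C++) follows the spirit of the heuristic, choosing bottom-up once the edges incident to the frontier become a large fraction of the still-unexplored edges; the function name and the default threshold are illustrative tuning choices, not values mandated by the paper.

// Decide the traversal direction for the next level.
// frontier_edges:   number of edges incident to the current frontier
// unexplored_edges: number of edges incident to still-unvisited vertices
// alpha:            tuning parameter; larger values switch to bottom-up earlier
enum class Direction { TopDown, BottomUp };

Direction choose_direction(long long frontier_edges, long long unexplored_edges,
                           double alpha = 14.0) {
    return (frontier_edges * alpha > unexplored_edges) ? Direction::BottomUp
                                                       : Direction::TopDown;
}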

Load balance

Load balancing is very important not only in parallel BFS but also in all parallel algorithms, because balanced work improves the benefit of parallelization. In fact, almost all designers of parallel BFS algorithms should observe and analyze the work partitioning of their algorithm and provide a load-balancing mechanism for it.

Randomization is a useful and simple way of achieving load balancing. For instance, a graph can be traversed by randomly shuffling all vertex identifiers prior to partitioning.[6]

Data structure

An example of the CSR (compressed sparse row) representation of a directed graph.
Pennant data structures for k = 0 to k = 3.
An example of a bag structure with 23 elements.

There are some special data structures that parallel BFS can benefit from, such as CSR (Compressed Sparse Row), bag-structure, bitmap and so on.

In the CSR, all adjacencies of a vertex are sorted and compactly stored in a contiguous chunk of memory, with the adjacencies of vertex i+1 next to the adjacencies of vertex i. In the example on the left, there are two arrays, C and R. Array C stores the adjacency lists of all nodes. Array R stores indices into C; the entry R[i] points to the beginning of the adjacency list of vertex i in array C. The CSR is extremely fast because accessing the adjacencies of a vertex takes only constant time. But it is only space-efficient for 1D partitioning.[6] More information about CSR can be found in.[9] For 2D partitioning, DCSC (Doubly Compressed Sparse Columns) for hyper-sparse matrices is more suitable.[10]
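A minimal C++ sketch of this layout, with the two arrays named R and C as in the description above, could look as follows; the struct and helper names are illustrative.

#include <vector>

// Compressed Sparse Row (CSR): R has |V| + 1 entries of row offsets, C has |E|
// entries holding the concatenated adjacency lists. The neighbors of vertex u
// occupy C[R[u]] .. C[R[u+1] - 1].
struct CSRGraph {
    std::vector<int> R;   // row offsets
    std::vector<int> C;   // destination vertex indices
};

// Visit every neighbor of vertex u; each edge is touched in constant time.
template <class F>
void for_each_neighbor(const CSRGraph& g, int u, F visit) {
    for (int k = g.R[u]; k < g.R[u + 1]; ++k)
        visit(g.C[k]);
}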

In the paper,[4] the authors develop a new data structure called the bag-structure. The bag-structure is constructed from the pennant data structure. A pennant is a tree of 2^k nodes, where k is a nonnegative integer. Each node x in this tree contains two pointers, x.left and x.right, to its children. The root of the tree has only a left child, which is a complete binary tree of the remaining elements.[4]

The bag-structure is a collection of pennants with a backbone array S. Each entry S[i] in S is either a null pointer or a pointer to a pennant of size 2^i. The insertion operation in a bag takes O(1) amortized time and the union of two bags takes Θ(lg n) time. The bag-split also takes Θ(lg n) time. With this bag-structure, parallel BFS is allowed to write the vertices of a layer into a single data structure in parallel and later traverse them efficiently in parallel.[4]
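The constant-time pennant union that underlies these operations is small enough to sketch directly (C++; field and function names are illustrative): two pennants of the same size 2^k are combined into one of size 2^(k+1) by rewiring just two pointers.

// A pennant of 2^k nodes: the root has only a left child, and that child is a
// complete binary tree holding the remaining 2^k - 1 elements.
struct PennantNode {
    int value;
    PennantNode* left  = nullptr;
    PennantNode* right = nullptr;
};

// Union of two pennants of equal size 2^k, producing one of size 2^(k+1).
// Runs in constant time: only two pointers change.
PennantNode* pennant_union(PennantNode* x, PennantNode* y) {
    y->right = x->left;   // y adopts x's complete subtree as its right child
    x->left  = y;         // y becomes the (now complete) left child of x
    return x;
}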

Moreover, a bitmap is a very useful data structure to record which vertices have already been visited, whether in the bottom-up BFS[11] or simply to check whether vertices have been visited in the top-down BFS.[9]
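A bitmap of visited vertices packs one bit per vertex into machine words, which keeps the structure small enough to stay cache-resident; a minimal sketch (C++, names illustrative) is shown below. A parallel implementation would additionally need atomic word updates, which are omitted here for brevity.

#include <cstdint>
#include <vector>

// One bit per vertex, packed into 64-bit words.
struct Bitmap {
    std::vector<std::uint64_t> words;
    explicit Bitmap(std::size_t n) : words((n + 63) / 64, 0) {}
    void set(std::size_t v)        { words[v >> 6] |= (std::uint64_t{1} << (v & 63)); }
    bool test(std::size_t v) const { return (words[v >> 6] >> (v & 63)) & 1; }
};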

Benchmarks

Graph500 is the first benchmark for data-intensive supercomputing problems.[1] This benchmark first generates a list of edge tuples, each with two endpoints. Then kernel 1 constructs an undirected graph, in which edge weights are not assigned if only kernel 2 runs afterwards. Users can choose to run BFS in kernel 2 and/or single-source shortest path in kernel 3 on the constructed graph. The results of these kernels are verified and their running times are measured.

Graph500 also provides two reference implementations for kernels 2 and 3. In the reference BFS, the exploration of vertices simply consists of sending messages to target processors to inform them of visited neighbors. There is no extra load-balancing method. For the synchronization, AML (Active Messages Library, an SPMD communication library built on top of MPI-3 and intended for fine-grained applications like Graph500) barriers ensure a consistent traversal after each layer. The reference BFS is only used for correctness verification of results. Thus, users should implement their own BFS algorithm based on their hardware. The choice of BFS is not constrained, as long as the output BFS tree is correct.

The correctness of a result is established by comparison with the result of the reference BFS. Because only 64 search keys are sampled to run kernel 2 and/or kernel 3, a result is also considered correct when it differs from the reference result only because the search key is not among the samples. These 64 search keys also run the kernel sequentially to compute the mean and variance, with which the performance of a single search is measured.

Different from TOP500, the performance metric in Graph500 is traversed edges per second (TEPS).

from Grokipedia
Parallel breadth-first search (parallel BFS) is an adaptation of the classic breadth-first search algorithm, which systematically explores the vertices of a graph level by level starting from a source vertex, discovering the shortest paths in unweighted graphs by processing nodes in order of increasing distance from the source. This parallelization enables efficient traversal on architectures including multi-core CPUs, GPUs, and distributed-memory systems, addressing the computational demands of massive graphs with billions of edges. Key implementations achieve near-linear speedups, such as 3–10× on modern multicore processors for small-diameter graphs, by optimizing work efficiency and reducing synchronization overheads.

The foundations of BFS trace back to serial formulations by Edward F. Moore in 1959 for maze pathfinding and C. Y. Lee in 1961 for integrated circuit routing, but parallel variants emerged later to leverage emerging parallel hardware. Seminal works include Yoo et al.'s 2005 scalable distributed algorithm on the BlueGene/L supercomputer, which used 2D edge partitioning to handle graphs with billions of vertices and edges across tens of thousands of processors. This was followed by Leiserson and Schardl's 2010 work-efficient PBFS using a "bag" data structure in Cilk++ for shared-memory multicore systems, achieving O((V + E)/P + D lg³(V/D)) runtime on P processors for graphs with V vertices, E edges, and diameter D. Further advancements, like Beamer et al.'s 2012 direction-optimizing BFS, introduced hybrid top-down and bottom-up traversals to minimize redundant edge examinations.

Parallel BFS faces inherent challenges due to graphs' irregularity, including load imbalance from skewed degree distributions, high communication costs in distributed settings, and synchronization barriers that limit scalability. To mitigate these, algorithms employ strategies such as 1D or 2D graph partitioning for better load balancing, non-atomic updates to avoid costly locks, and bitmap-based visited sets for improved cache locality and memory efficiency. Hybrid approaches dynamically switch between push-style (top-down) exploration of outgoing edges from the frontier and pull-style (bottom-up) checks of incoming edges to unvisited vertices, particularly effective when the frontier grows large relative to the remaining graph.

As a core primitive in graph analytics, parallel BFS underpins applications in web crawling, bioinformatics, and recommendation systems, and serves as a benchmark kernel in the Graph500 competition to evaluate performance on irregular workloads. Its scalability has enabled traversals of trillion-edge graphs in seconds on large clusters, highlighting its role in processing real-world "small-world" networks with low diameters.

Background on BFS

Breadth-first search (BFS) is a fundamental graph traversal algorithm that explores a graph level by level, starting from a designated source node. It systematically visits all nodes at a given distance from the source before moving to nodes farther away, making it ideal for finding shortest paths in unweighted graphs or performing level-order traversals. The algorithm relies on a queue data structure to manage the order of exploration, ensuring that nodes are processed in the order of their discovery. BFS and its application in finding connected components of graphs were invented in 1945 by Konrad Zuse in his PhD thesis on the Plankalkül programming language, though it was later reinvented and popularized in computer science literature during the 1950s and 1960s by researchers such as Edward F. Moore (1959) for maze pathfinding and C. Y. Lee (1961) for integrated circuit routing.

The core of the serial BFS algorithm involves initializing a queue with the source node, marking it as visited to avoid revisiting, and then iteratively dequeuing nodes, enqueueing their unvisited neighbors, and marking those neighbors as visited. This process continues until the queue is empty, ensuring every reachable node is visited exactly once. Pseudocode for the algorithm on an undirected graph represented as an adjacency list is as follows:

function BFS(graph, source):
    create a queue Q
    create a visited set or array
    Q.enqueue(source)
    visited.add(source)
    while Q is not empty:
        current = Q.dequeue()
        process(current)    // e.g., record distance or visit order
        for each neighbor in graph[current]:
            if neighbor not in visited:
                Q.enqueue(neighbor)
                visited.add(neighbor)

This implementation guarantees a level-by-level expansion, where each iteration of the outer loop corresponds to processing one level of the graph. In terms of complexity, serial BFS has a time complexity of O(V + E), where V is the number of vertices and E is the number of edges, as each vertex and edge is processed at most once. The space complexity is O(V) to store the queue and visited set in the worst case, such as in a linear graph. For example, consider a simple undirected graph with vertices A, B, C, D, and E, where A connects to B and C, B connects to D, and C connects to D and E. Starting from A, BFS first visits A (level 0), then enqueues and visits B and C (level 1), followed by D and E (level 2), illustrating the breadth-wise progression without revisiting any node. This sequential formulation serves as the baseline for understanding BFS but becomes inefficient for massive graphs with billions of vertices and edges, where parallel variants are necessary to achieve scalability.

Motivations for Parallelization

Serial BFS faces significant challenges when scaling to massive graphs with billions of vertices and edges, often exceeding the memory capacity of single processors and resulting in long runtimes due to irregular access patterns and the algorithm's O(|V| + |E|) complexity. These limitations are particularly acute for real-world graphs exhibiting scale-free or power-law degree distributions, where high-degree vertices amplify bottlenecks and computational demands. The need for parallelization arises from applications such as shortest-path computation in road networks, where BFS identifies minimal edge paths between nodes; influence maximization in social graphs, employing BFS to model the spread of influence from seed nodes; and model checking, utilizing BFS to exhaustively explore state spaces for property satisfaction. Parallel BFS addresses these issues by distributing the workload across multiple processors, substantially reducing wall-clock time and enabling efficient processing of large, irregular graphs that are infeasible serially. For example, serial BFS on a graph with 10^9 edges may take hours on a single processor when memory constraints necessitate disk I/O, compared to minutes or less in parallel on multi-processor systems. Early recognition of these challenges culminated in benchmarks like Graph500 (introduced in 2010), which positioned BFS as a core parallel primitive for assessing supercomputing capabilities on data-intensive graph workloads.

Core Parallelization Approaches

Level-Synchronous Parallel BFS

Level-synchronous parallel BFS is a foundational approach to parallelizing the algorithm, where graph traversal proceeds strictly level by level, ensuring that all vertices at distance k from the source are processed and their neighbors identified before advancing to distance k+1. This model relies on global synchronization mechanisms, such as barriers, to coordinate processors after each level's completion, maintaining the deterministic order of discovery inherent to serial BFS. In distributed-memory settings, vertices and edges are partitioned across processors, and communication primitives like all-to-all or all-gather are used to exchange boundary data for non-local neighbors, preventing premature processing of subsequent levels.

The algorithm adapts the serial BFS by maintaining a distributed frontier queue that represents the current level's vertices, which is collectively expanded in parallel before synchronization. A typical implementation initializes distances for all vertices to infinity except the source, then iteratively performs the following per level: processors traverse edges from their local portion of the current frontier, collect unvisited neighbors, deduplicate via a global visited array or bitmap, and update the next frontier through collective communication. Pseudocode for a distributed-memory variant (1D vertex partitioning) might resemble:

Initialize: dist[v] = ∞ for all v; dist[s] = 0; current_frontier = {s}; level = 0
While current_frontier is not empty:
    next_frontier = empty
    Barrier across all processors
    In parallel:
        For each u in local current_frontier:
            For each neighbor v of u:
                If dist[v] == ∞:
                    dist[v] = level + 1
                    Add v to global next_frontier (via communication)
    Aggregate and distribute next_frontier across processors
    current_frontier = next_frontier
    level += 1

This structure ensures load distribution but may incur communication overhead proportional to the boundary edges per level. The time complexity of level-synchronous parallel BFS is O(D + (V + E)/P), where D is the graph diameter, V is the number of vertices, E is the number of edges, and P is the number of processors, reflecting D synchronization steps plus the parallelizable work on vertices and edges divided by P. Space complexity is O(V/P + E/P) per processor, accounting for local storage of vertices, edges, and frontiers. These bounds assume balanced partitioning and efficient communication, with total work remaining O(V + E).

Early theoretical foundations for level-synchronous BFS trace to PRAM models in the 1980s, such as those surveyed by Quinn and Deo, which analyzed parallel graph traversals using concurrent reads and writes to shared memory for level expansion. By the 1990s, implementations transitioned from idealized PRAM simulations to practical shared- and distributed-memory systems, adapting barriers and message passing for real hardware like early MPP machines. This evolution emphasized the model's robustness for irregular graphs despite synchronization costs.

The primary advantages of level-synchronous parallel BFS lie in its simplicity, requiring minimal changes from serial BFS beyond synchronization and communication, and its determinism, which guarantees consistent level ordering across runs and facilitates debugging or verification. These traits make it a baseline for more advanced variants, particularly on systems with reliable global barriers.

Initial Steps in Parallelizing BFS

The initial parallelization of breadth-first search (BFS) on shared-memory systems focused on distributing the workload across multiple threads while managing shared state to prevent race conditions. A key step was replacing the single global queue used in serial BFS with per-thread queues, allowing each thread to maintain its own local queue of vertices to process independently and reducing contention on a shared global queue. To ensure correctness, atomic operations were employed for marking vertices as visited; for instance, a thread would atomically check if a vertex is unvisited and, if so, mark it and enqueue its neighbors, thereby avoiding multiple threads processing the same vertex simultaneously. Handling potential duplicates in enqueuing was addressed through the visited-array check, which inherently prevents redundant processing of vertices, though if the graph's adjacency lists contained duplicate edges (uncommon in simple graphs), additional detection could filter them before enqueuing to avoid unnecessary work.

Early implementations revealed challenges such as load imbalance arising from uneven vertex degree distributions in real-world graphs, where threads assigned high-degree vertices performed significantly more work than others, leading to idle time and suboptimal speedup. A simple multi-threaded asynchronous version without level synchronization can be outlined as follows, where each thread operates on its local queue and uses atomic operations for the shared visited marks:

initialize visited[v] = false for all v
choose a source s; enqueue s into thread 0's queue; visited[s] = true (atomic)
parallel for each thread t:
    while queue_t is not empty:
        dequeue u from queue_t
        for each neighbor v of u:
            if atomic_compare_and_swap(visited[v], false, true):
                enqueue v into some queue_t' (e.g., round-robin or local)

This approach ensures work is distributed dynamically but may still suffer from imbalance without further optimizations. Historical milestones in parallel BFS trace back to the 1980s and 1990s, when researchers prototyped shared-memory implementations adapting theoretical models like PRAM to practical architectures, including early efforts on connectivity and traversal problems. Seminal works, such as those exploring efficient parallel solutions for graph problems on unbounded parallel models, laid foundational techniques for these prototypes.

Shared-Memory Implementations

Threading Models and Synchronization

In shared-memory systems, parallel breadth-first search (BFS) implementations commonly employ threading models that leverage multi-core architectures to distribute the traversal workload. OpenMP provides a directive-based approach for shared-memory parallelization, enabling straightforward parallelization of loops over graph vertices or edges while automatically managing thread creation and termination in a fork-join manner. For finer-grained control, particularly in handling irregular graph structures, POSIX threads (pthreads) allow explicit thread management, where developers manually create threads to process subsets of the frontier and synchronize their progress to avoid excessive overhead from dynamic scheduling.

Synchronization is critical in these models to prevent race conditions during concurrent access to shared data structures, such as the visited array and the queue or frontier representing the current level. Atomic operations, like compare-and-swap (CAS), are widely used to update the visited array thread-safely, ensuring that a vertex is marked only once even if multiple threads discover it simultaneously. Locks, such as mutexes in pthreads or critical sections in OpenMP, protect shared queues to serialize enqueues and dequeues, though they can introduce contention; alternatives like per-thread local queues minimize this by deferring merges until level completion.

To preserve the level order inherent to BFS, where vertices are processed level by level in increasing distance from the source, implementations often utilize per-thread frontiers. Each thread maintains a private list of newly discovered vertices during edge exploration, allowing independent work within a level while ensuring that all threads synchronize at barriers (e.g., via OpenMP's implicit barriers or pthread barriers) before advancing to the next level, thus guaranteeing correct layering without cross-level interference. The time complexity of these level-synchronous approaches is typically O((V + E)/P + D), where V is the number of vertices, E the number of edges, P the number of threads, and D the graph diameter, reflecting parallelizable work per level plus sequential dependencies across levels; synchronization overhead, such as barrier costs, is often O(D log P) in practice but subsumed under the diameter term for dense graphs. An example implementation on multi-core CPUs, such as Intel Xeon processors, uses OpenMP to parallelize the vertex processing loop, achieving near-linear speedups up to 32 cores on benchmarks like synthetic R-MAT graphs, with atomic updates ensuring correctness amid high contention on power-law degree distributions.
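The CAS primitive mentioned above amounts to a few lines; the following sketch (C++, names illustrative) shows the claim operation that lets exactly one thread win the right to process a newly discovered vertex.

#include <atomic>
#include <cstddef>

// Returns true only for the single thread whose compare-and-swap flips
// visited[v] from false to true; every other thread sees false and skips v.
inline bool try_claim(std::atomic<bool>* visited, std::size_t v) {
    bool expected = false;
    return visited[v].compare_exchange_strong(expected, true);
}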

Work-Efficient Algorithms

In parallel breadth-first search (BFS) implementations for shared-memory systems, work-efficiency refers to achieving a total computational work complexity of O(V + E), where V is the number of vertices and E is the number of edges in the graph, matching the serial BFS complexity while enabling parallelism. This ensures no redundant operations beyond visiting each vertex and edge once, contrasting with less efficient approaches that may perform Θ(V·E) work due to race conditions or poor frontier management. Ideally, such algorithms also attain a parallel runtime of O((V + E)/P + D log V), where P is the number of processors and D is the graph diameter, balancing load across processors with logarithmic overhead for depth.

A seminal work-efficient parallel BFS algorithm was introduced by Leiserson and Schardl in 2010, employing a "bags-of-pennants" data structure to manage graph frontiers in a multithreaded setting using Cilk++. Pennants are specialized complete binary trees augmented with an extra root node, enabling constant-time merging of two pennants of equal size into a larger one without rebalancing, as the roots connect directly while preserving balance. A bag represents the frontier as a collection of distinct-sized pennants, akin to a binary number system where each size corresponds to a power-of-two level, allowing insertions and unions in amortized O(1) time per operation through local merges and occasional splits. This structure replaces the traditional FIFO queue, which suffers from nondeterminism in parallel reductions. The resulting PBFS algorithm exhibits data-race-free execution with time complexity O((V + E)/P + D lg³(V/D)) on P processors for graphs with bounded out-degree.

Recent refinements extend work-efficient BFS to GPUs, leveraging warp-level cooperation for thread execution within warps of 32 threads to reduce divergence and enhance frontier management. For instance, a 2023 framework introduces GPU-tailored data structures for dynamic graphs, enabling work-efficient BFS variants that exploit warp-cooperative execution models to process updates and traversals with minimal overhead, achieving scalable performance on large-scale graphs.

Distributed-Memory Implementations

One-Dimensional Partitioning

In one-dimensional (1D) partitioning for distributed-memory parallel breadth-first search (BFS), vertices are assigned to processors in a linear manner, such that each processor owns a contiguous block of approximately n/p vertices and their corresponding outgoing edges, where n is the total number of vertices and p is the number of processors. Edges connecting vertices across different processors are handled through halo exchanges, where each processor aggregates the edges of non-local vertices and communicates them to the respective owning processors. This scheme builds on level-synchronous parallel BFS by distributing the vertex set across processors while maintaining the core level-by-level traversal.

The algorithm proceeds with local BFS operations on each partition, followed by global communication to resolve cross-boundary neighbors. At each level, a processor maintains a local set F of newly visited vertices it owns, traverses their edges to generate a neighbor set N, and separates N into local and remote neighbors. Remote neighbors are sent to their owning processors via an all-to-all communication step, often implemented using MPI collectives such as Allgather, which distributes the updated information across all processors before advancing to the next level. This ensures that all vertices at a given level are processed synchronously, with local computations dominating when graph locality is high.

Communication costs in 1D partitioning arise primarily from the all-to-all exchanges, generating O(E/p) messages per level, where E is the total number of edges, with a cumulative data volume of approximately m(p−1)/p (where m = E/n is the average degree). These costs are typically bandwidth-bound and scale with the number of processors, as every processor participates in the exchange regardless of the graph's structure. The approach offers simplicity in implementation and partitioning, making it suitable for linear or one-dimensional graph structures where communication overhead remains low due to limited cross-partition edges. However, it performs poorly on two-dimensional-like sparse matrices or graphs with high-diameter connectivity, as the all-to-all pattern leads to excessive remote communication and load imbalance across processors.

Two-Dimensional Partitioning

In two-dimensional (2D) partitioning for distributed-memory parallel BFS, the graph's vertices are distributed across a processor grid of dimensions √P × √P.