Parallel breadth-first search

The breadth-first search (BFS) algorithm is a way to explore the vertices of a graph layer by layer. It is a basic algorithm in graph theory which can be used as a part of other graph algorithms. For instance, BFS is used by Dinic's algorithm to find the maximum flow in a graph. Moreover, BFS is one of the kernel algorithms in the Graph500 benchmark, a benchmark for data-intensive supercomputing problems. This article discusses the possibility of speeding up BFS through the use of parallel computing.

In the conventional sequential BFS algorithm, two data structures are created to store the frontier and the next frontier. The frontier contains all vertices that have the same distance (also called "level") from the source vertex; these are the vertices to be explored in the current step. Every neighbor of these vertices is checked, and those neighbors that have not been explored yet are discovered and put into the next frontier. At the beginning of the algorithm, a given source vertex s is the only vertex in the frontier. All direct neighbors of s are visited in the first step, and they form the next frontier. After each layer-traversal, the next frontier is switched to become the frontier, and newly discovered vertices are stored in the new next frontier. The following pseudo-code outlines the idea, in which the data structures for the frontier and the next frontier are called FS and NS respectively.
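    1    define bfs_sequential(graph(V,E), source s):
    2        for all v in V do
    3            d[v] = -1;
    4        d[s] = 0; level = 1; FS = {}; NS = {};
    5        push(s, FS);
    6        while FS != empty do
    7            for u in FS do
    8                for each neighbour v of u do
    9                    if d[v] = -1 then
    10                       push(v, NS);
    11                       d[v] = level;
    12           FS = NS, NS = {}, level = level + 1;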

As a simple and intuitive solution, the classic Parallel Random Access Machine (PRAM) approach is just an extension of the sequential algorithm shown above. The two for-loops (lines 7 and 8) can be executed in parallel. The update of the next frontier (line 10) and the increase of distance (line 11) need to be atomic. Atomic operations are operations that either run entirely to completion without interruption or have no effect at all ("all or nothing").
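A minimal sketch of this PRAM-style parallelization in C++ with OpenMP is shown below; the adjacency-list representation adj and the function name bfs_pram_style are illustrative assumptions, not part of the pseudo-code above. The loop over the frontier (line 7) is parallelized, the updates of lines 10 and 11 are protected by a critical section, and the implicit barrier at the end of the parallel loop provides the layer-by-layer synchronization discussed below.

    #include <vector>

    // Level-synchronous BFS, parallelizing the loop over the frontier
    // (pseudo-code line 7). Reading d[v] outside the critical section
    // (line 9) is the benign race discussed in the text; the re-check
    // inside the critical section keeps the result correct.
    std::vector<int> bfs_pram_style(const std::vector<std::vector<int>>& adj, int s) {
        std::vector<int> d(adj.size(), -1);
        std::vector<int> FS{s}, NS;
        d[s] = 0;
        int level = 1;
        while (!FS.empty()) {
            NS.clear();
            #pragma omp parallel for
            for (long i = 0; i < (long)FS.size(); ++i) {   // line 7, in parallel
                for (int v : adj[FS[i]]) {                 // line 8
                    if (d[v] == -1) {                      // line 9
                        #pragma omp critical
                        if (d[v] == -1) {                  // re-check under the lock
                            NS.push_back(v);               // line 10
                            d[v] = level;                  // line 11
                        }
                    }
                }
            }                                              // implicit barrier here
            FS.swap(NS);                                   // next frontier becomes frontier
            ++level;
        }
        return d;
    }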

However, there are two problems with this simple parallelization. First, the distance-checking (line 9) and distance-updating (line 11) operations introduce two benign races: a vertex may be a neighbor of more than one vertex in the frontier, so its distance may be examined and updated more than once. Although these races waste resources and lead to unnecessary overhead, they do not affect the correctness of BFS with the help of synchronization, so they are benign. Second, despite the speedup of each layer-traversal due to parallel processing, a barrier synchronization is needed after every layer in order to completely discover all neighbors of the vertices in the frontier. This layer-by-layer synchronization means that the number of communication steps needed is O(d), where O is the big O notation and d is the graph diameter, i.e. the longest distance between any two vertices.
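Instead of merely tolerating the benign race, it can also be eliminated with an atomic compare-and-swap on the distance value, so that exactly one thread claims each newly discovered vertex and no vertex enters the next frontier twice. A hedged sketch, assuming the distances are stored as std::atomic<int> values (an assumption, not part of the pseudo-code):

    #include <atomic>
    #include <vector>

    // Try to move d[v] from -1 ("unvisited") to the current level.
    // compare_exchange_strong succeeds for exactly one thread, so the
    // caller pushes v into NS at most once; threads that lose the race
    // see the claim fail and simply skip the vertex.
    inline bool try_claim(std::vector<std::atomic<int>>& d, int v, int level) {
        int expected = -1;
        return d[v].compare_exchange_strong(expected, level);
    }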

In the worst case, the asymptotic complexity of this simple parallelization is the same as that of the sequential algorithm. Better BFS parallelization can be achieved with optimizations such as mitigating the cost of barrier synchronization, balancing the load of neighbor discovery across processing entities, and improving the locality of memory references.

Compared to parallel BFS with distributed memory, shared memory provides higher memory bandwidth and lower latency, because all processors access the memory directly. This avoids the message-passing overhead that distributed memory requires.

However, the number of vertices in each layer and the number of neighbors of each vertex have been shown to be highly irregular, which leads to highly irregular memory accesses and work distribution in BFS. In parallel BFS, this irregularity reduces the benefit of parallelization because of load imbalance. As a result, it is very important to make shared-memory parallel BFS load-balanced. Moreover, exploiting data locality can also speed up the parallel execution.

Parallel BFS algorithms for shared memory can be divided into two types: container-centric approaches and vertex-centric approaches. In the container-centric approach, two data structures are created to store the current frontier and the next frontier, and the next frontier is switched to the current frontier at the end of each iteration. There is a trade-off between the cost of synchronization and data locality, depending on where the data is kept. The two containers can be held in every processing unit (such as a thread), which supports data locality but requires extra load-balancing mechanisms. Alternatively, they can be stored globally to provide implicit load balancing, where special data structures are used for concurrent access, which requires additional synchronization effort. The thread-local variant is sketched below.
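The following sketch illustrates the thread-local variant of the container-centric approach in C++ with OpenMP; the names next_frontier and local_NS are hypothetical. Each thread collects newly discovered vertices in its own buffer, which gives good data locality and avoids contention on a shared container, and the buffers are merged into the global next frontier once per level, which is where the extra load-balancing effort shows up:

    #include <atomic>
    #include <omp.h>
    #include <vector>

    // One level of container-centric BFS with per-thread next-frontier
    // buffers. Vertices are claimed with a compare-and-swap so each one
    // is discovered exactly once; the per-thread buffers are concatenated
    // after the parallel region (whose implicit barrier ends the level).
    std::vector<int> next_frontier(const std::vector<std::vector<int>>& adj,
                                   std::vector<std::atomic<int>>& d,
                                   const std::vector<int>& FS, int level) {
        std::vector<std::vector<int>> buffers(omp_get_max_threads());
        #pragma omp parallel
        {
            std::vector<int>& local_NS = buffers[omp_get_thread_num()];
            #pragma omp for nowait
            for (long i = 0; i < (long)FS.size(); ++i)
                for (int v : adj[FS[i]]) {
                    int expected = -1;                    // claim v exactly once
                    if (d[v].compare_exchange_strong(expected, level))
                        local_NS.push_back(v);
                }
        }                                                 // implicit barrier
        std::vector<int> NS;                              // merge step
        for (auto& buf : buffers)
            NS.insert(NS.end(), buf.begin(), buf.end());
        return NS;
    }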
