Work stealing
In parallel computing, work stealing is a scheduling strategy for multithreaded computer programs. It solves the problem of executing a dynamically multithreaded computation, one that can "spawn" new threads of execution, on a statically multithreaded computer, with a fixed number of processors (or cores). It does so efficiently in terms of execution time, memory usage, and inter-processor communication.
In a work stealing scheduler, each processor in a computer system has a queue of work items (computational tasks, threads) to perform. Each work item consists of a series of instructions, to be executed sequentially, but in the course of its execution, a work item may also spawn new work items that can feasibly be executed in parallel with its other work. These new items are initially put on the queue of the processor executing the work item. When a processor runs out of work, it looks at the queues of the other processors and "steals" their work items. In effect, work stealing distributes the scheduling work over idle processors, and as long as all processors have work to do, no scheduling overhead occurs.[1]
Work stealing contrasts with work sharing, another popular scheduling approach for dynamic multithreading, where each work item is scheduled onto a processor when it is spawned. Compared to this approach, work stealing reduces the amount of process migration between processors, because no such migration occurs when all processors have work to do.[2]
The idea of work stealing goes back to the implementation of the Multilisp programming language and work on parallel functional programming languages in the 1980s.[2] It is employed in the scheduler for the Cilk programming language,[3] the Java fork/join framework,[4] the .NET Task Parallel Library,[5] and the Rust Tokio runtime.[6][7]
Execution model
Work stealing is designed for a "strict" fork–join model of parallel computation, which means that a computation can be viewed as a directed acyclic graph with a single source (start of computation) and a single sink (end of computation). Each node in this graph represents either a fork or a join. Forks produce multiple logically parallel computations, variously called "threads"[2] or "strands".[8] Edges represent serial computation.[9][note 1]
As an example, consider the following trivial fork–join program in Cilk-like syntax:
function f(a, b):
    c ← fork g(a)
    d ← h(b)
    join
    return c + d

function g(a):
    return a × 2

function h(a):
    b ← fork g(a)
    c ← a + 1
    join
    return b + c
The function call f(1, 2) gives rise to the following computation graph:
In the graph, when two edges leave a node, the computations represented by the edge labels are logically parallel: they may be performed either in parallel, or sequentially. The computation may only proceed past a join node when the computations represented by its incoming edges are complete. The work of a scheduler, now, is to assign the computations (edges) to processors in a way that makes the entire computation run to completion in the correct order (as constrained by the join nodes), preferably as fast as possible.
Algorithm
The randomized version of the work stealing algorithm presented by Blumofe and Leiserson maintains several threads of execution and schedules these onto processors. Each of the processors has a double-ended queue (deque) of threads. Call the ends of the deque "top" and "bottom".
Each processor that has a current thread to execute, executes the instructions in the thread one by one, until it encounters an instruction that causes one of four "special" behaviors:[2]: 10
- A spawn instruction causes a new thread to be created. The current thread is placed at the bottom of the deque, and the processor starts executing the new thread.
- A stalling instruction is one that temporarily halts execution of its thread. The processor pops a thread off the bottom of its deque and starts executing that thread. If its deque is empty, it starts work stealing, explained below.
- An instruction may cause a thread to die. The behavior in this case is the same as for an instruction that stalls.
- An instruction may enable another thread. The other thread is pushed onto the bottom of the deque, but the processor continues execution of its current thread.
Initially, a computation consists of a single thread and is assigned to some processor, while the other processors start off idle. Any processor that becomes idle starts the actual process of work stealing, which means the following:
- it picks another processor uniformly at random;
- if the other processor's deque is non-empty, it pops the top-most thread off the deque and starts executing that;
- else, repeat.
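A minimal sketch of one processor's loop under these rules, written in C (every type and helper here is hypothetical, standing in for a real runtime's internals):

/* Hypothetical declarations, only to make the sketch self-contained. */
typedef struct thread thread_t;
typedef struct deque deque_t;
typedef enum { SPAWNED, STALLED, DIED, ENABLED } special_t;

extern thread_t *pop_bottom(deque_t *d);           /* owner end */
extern thread_t *steal_top(deque_t *d);            /* thief end */
extern void push_bottom(deque_t *d, thread_t *t);
extern special_t run_until_special(thread_t *t);
extern thread_t *spawned_child(thread_t *t);
extern thread_t *enabled_thread(thread_t *t);
extern int random_other(int self, int nprocs);

/* One worker following the Blumofe–Leiserson rules above. */
void worker(int self, deque_t **deques, int nprocs) {
    thread_t *cur = pop_bottom(deques[self]);
    for (;;) {
        while (cur == NULL) {                   /* idle: start stealing */
            int victim = random_other(self, nprocs);
            cur = steal_top(deques[victim]);    /* take the top-most thread */
        }
        switch (run_until_special(cur)) {
        case SPAWNED:               /* park the parent, run the new thread */
            push_bottom(deques[self], cur);
            cur = spawned_child(cur);
            break;
        case STALLED:
        case DIED:                  /* resume the bottom-most local thread */
            cur = pop_bottom(deques[self]);
            break;
        case ENABLED:               /* enabled thread goes to the bottom; */
            push_bottom(deques[self], enabled_thread(cur));
            break;                  /* the current thread keeps running */
        }
    }
}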
Child stealing vs. continuation stealing
Note that, in the rule for spawn, Blumofe and Leiserson suggest that the "parent" thread execute its new thread, as if performing a function call (in the C-like program f(x); g(y);, the function call to f completes before the call to g is performed). This is called "continuation stealing", because the continuation of the function can be stolen while the spawned thread is executed, and is the scheduling algorithm used in Cilk Plus.[8] It is not the only way to implement work stealing; the alternative strategy is called "child stealing" and is easier to implement as a library, without compiler support.[8] Child stealing is used by Threading Building Blocks, Microsoft's Task Parallel Library and OpenMP, although the latter gives the programmer control over which strategy is used.[8]
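The difference can be sketched at a single spawn site, spawn f(x); g(y); (written in C with hypothetical runtime calls; neither fragment is Cilk's or TBB's actual API):

/* Hypothetical declarations for the sketch. */
typedef struct deque deque_t;
typedef struct task task_t;
extern void push_bottom(deque_t *d, task_t *t);
extern task_t *make_task(void (*fn)(int), int arg);
extern task_t *current_continuation(void);
extern deque_t *my_deque;
extern void f(int x);
extern void g(int y);

/* Child stealing: the spawned call itself becomes the deque entry;
   the worker keeps executing the parent past the spawn point. */
void spawn_site_child_stealing(int x, int y) {
    push_bottom(my_deque, make_task(f, x));   /* f(x) is now stealable */
    g(y);                                     /* parent continues at once */
}

/* Continuation stealing: the worker runs f(x) immediately, as in an
   ordinary call; the parent's continuation becomes stealable instead. */
void spawn_site_continuation_stealing(int x, int y) {
    push_bottom(my_deque, current_continuation()); /* "g(y); ..." stealable */
    f(x);                                     /* dive into the child first */
}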
Efficiency
Several variants of work stealing have been proposed. The randomized variant due to Blumofe and Leiserson executes a parallel computation in expected time T₁/P + O(T∞) on P processors; here, T₁ is the work, or the amount of time required to run the computation on a serial computer, and T∞ is the span, the amount of time required on an infinitely parallel machine.[note 2] This means that, in expectation, the time required is at most a constant factor times the theoretical minimum.[2] However, the running time (in particular, the number of steals executed) can be exponential in T∞ in the worst case.[10] A localized variant, in which a processor attempts to steal back its own work whenever it is free, has also been analyzed theoretically and practically.[11][12]
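To make "a constant factor times the theoretical minimum" concrete, here is a worked instance of the bound with illustrative numbers (the numbers are not from the cited analysis):

T_P \le \frac{T_1}{P} + O(T_\infty)
\quad\text{e.g. } T_1 = 10^9,\ T_\infty = 10^6,\ P = 100:\quad
T_P \approx 10^7 + O(10^6)

which is within a small constant of the trivial lower bound max(T₁/P, T∞) = 10⁷.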
Space usage
A computation scheduled by the Blumofe–Leiserson version of work stealing uses O(S₁P) stack space, if S₁ were the stack usage of the same computation on a single processor,[2] fitting the authors' own earlier definition of space efficiency.[13] This bound requires continuation stealing; in a child stealing scheduler, it does not hold, as can be seen from the following example:[8]
for i = 0 to n:
    fork f(i)
join
In a child-stealing implementation, all "forked" calls to f are put in a work queue that thus grows to size n, which can be made arbitrarily large.
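To see why continuation stealing restores the bound for this loop, compare what each strategy makes stealable; a hedged C sketch reusing the hypothetical helpers from the earlier fragments (join_children and resume_own_continuation are likewise made-up names):

extern void join_children(void);
extern int resume_own_continuation(deque_t *d);

/* Child stealing: every iteration pushes a stealable child task, so
   the owner's deque grows to n entries before the join. */
void loop_child_stealing(int n) {
    for (int i = 0; i < n; i++)
        push_bottom(my_deque, make_task(f, i));
    join_children();
}

/* Continuation stealing: each iteration makes only the remainder of the
   loop stealable, then runs f(i) directly. The deque never holds more
   than one entry for this loop, so space stays proportional to the
   serial stack usage. */
void loop_continuation_stealing(int n) {
    for (int i = 0; i < n; i++) {
        push_bottom(my_deque, current_continuation()); /* "i+1 .. n" */
        f(i);
        if (!resume_own_continuation(my_deque))
            return;        /* a thief now owns iterations i+1 .. n */
    }
    join_children();
}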
Multiprogramming variant
The work stealing algorithm as outlined earlier, and its analysis, assume a computing environment where a computation is scheduled onto a set of dedicated processors. In a multiprogramming (multi-tasking) environment, the algorithm must be modified to instead schedule computation tasks onto a pool of worker threads, which in turn are scheduled onto the actual processors by an operating system scheduler. At any given time, the OS scheduler will assign to the work stealing process some number PA ≤ P of the P processors in the computer, because other processes may be using the remaining processors. In this setting, work stealing with a pool of P worker threads has the problem that workers acting as thieves may cause livelock: they may block the execution of workers that would actually spawn useful tasks.[14][15]
A variant of work stealing has been devised for this situation, which executes a computation in expected time

O(T₁/Pavg + T∞ P/Pavg),

where Pavg is the average number of processors allocated to the computation by the OS scheduler over the computation's running time.[16] The multiprogramming work-stealer differs from the traditional version in two respects:
- Its queues are non-blocking. While on dedicated processors, access to the queues can be synchronized using locks, this is not advisable in a multiprogramming environment since the operating system might preempt the worker thread holding the lock, blocking the progress of any other workers that try to access the same queue.
- Before each attempt to steal work, a worker thread calls a "yield" system call that yields the processor on which it is scheduled to the OS, in order to prevent starvation.
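A hedged C sketch of the resulting thief loop (sched_yield is the POSIX yield call; the deque helpers are hypothetical, as in the earlier fragments):

#include <sched.h>
#include <stddef.h>

typedef struct deque deque_t;
extern void *pop_bottom(deque_t *d);
extern void *steal_top(deque_t *d);
extern int random_other(int self, int nprocs);

/* Acquire work in a multiprogrammed setting: yield before each steal
   attempt so that a descheduled worker whose deque holds tasks gets a
   chance to run, avoiding the livelock described above. */
void *acquire_work(int self, deque_t **deques, int nprocs) {
    for (;;) {
        void *task = pop_bottom(deques[self]);
        if (task != NULL)
            return task;
        sched_yield();                        /* give the processor back */
        int victim = random_other(self, nprocs);
        task = steal_top(deques[victim]);     /* non-blocking steal */
        if (task != NULL)
            return task;
    }
}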
Attempts to improve on the multiprogramming work stealer have focused on cache locality issues[12] and improved queue data structures.[17]
Alternatives
Several scheduling algorithms for dynamically multithreaded computations compete with work stealing. Besides the traditional work sharing approach, there is a scheduler called parallel depth-first (PDF) that improves on the space bounds of work stealing,[18] as well as giving better performance in some situations where the cores of a chip multiprocessor share a cache.[1]
Notes
- ^ In the original presentation, serial computations were represented as nodes as well, and a directed edge represented the relation "is followed by".
- ^ See analysis of parallel algorithms for definitions.
References
- ^ a b Chen, Shimin; Gibbons, Phillip B.; Kozuch, Michael; Liaskovitis, Vasileios; Ailamaki, Anastassia; Blelloch, Guy E.; Falsafi, Babak; Fix, Limor; Hardavellas, Nikos; Mowry, Todd C.; Wilkerson, Chris (2007). Scheduling threads for constructive cache sharing on CMPs (PDF). Proc. ACM Symp. on Parallel Algorithms and Architectures. pp. 105–115.
- ^ a b c d e f Blumofe, Robert D.; Leiserson, Charles E. (1999). "Scheduling multithreaded computations by work stealing". J ACM. 46 (5): 720–748. doi:10.1145/324133.324234. S2CID 5428476.
- ^ Blumofe, Robert D.; Joerg, Christopher F.; Kuszmaul, Bradley C.; Leiserson, Charles E.; Randall, Keith H.; Zhou, Yuli (1996). "Cilk: An efficient multithreaded runtime system". Journal of Parallel and Distributed Computing. 37 (1): 55–69. doi:10.1006/jpdc.1996.0107. hdl:1721.1/149259.
- ^ Doug Lea (2000). A Java fork/join framework (PDF). ACM Conf. on Java.
- ^ Leijen, Daan; Schulte, Wolfram; Burckhardt, Sebastian (2009). "The Design of a Task Parallel Library". ACM SIGPLAN Notices. 44 (10): 227. CiteSeerX 10.1.1.146.4197. doi:10.1145/1639949.1640106.
- ^ "What is Tokio? · Tokio". tokio.rs. Retrieved 2020-05-27.
- ^ Krill, Paul (2021-01-08). "Tokio Rust runtime reaches 1.0 status". InfoWorld. Retrieved 2021-12-26.
- ^ a b c d e Robison, Arch (15 January 2014). A Primer on Scheduling Fork–Join Parallelism with Work Stealing (PDF) (Technical report). ISO/IEC JTC 1/SC 22/WG 21—The C++ Standards Committee. N3872.
- ^ Halpern, Pablo (24 September 2012). Strict Fork–Join Parallelism (PDF) (Technical report). ISO/IEC JTC 1/SC 22/WG 21—The C++ Standards Committee. N3409=12-0099.
- ^ Leiserson, Charles E.; Schardl, Tao B.; Suksompong, Warut (2016). "Upper Bounds on Number of Steals in Rooted Trees". Theory of Computing Systems. 58 (2): 223–240. arXiv:1706.08219. doi:10.1007/s00224-015-9613-9. S2CID 424692.
- ^ Suksompong, Warut; Leiserson, Charles E.; Schardl, Tao B. (2016). "On the efficiency of localized work stealing". Information Processing Letters. 116 (2): 100–106. arXiv:1804.04773. doi:10.1016/j.ipl.2015.10.002. S2CID 1180480.
- ^ a b Acar, Umut A.; Blelloch, Guy E.; Blumofe, Robert D. (2002). "The Data Locality of Work Stealing" (PDF). Theory of Computing Systems. 35 (3): 321–347. CiteSeerX 10.1.1.19.3459. doi:10.1007/s00224-002-1057-3. S2CID 10235838.
- ^ Blumofe, Robert D.; Leiserson, Charles E. (1998). "Space-efficient scheduling of multithreaded computations". SIAM J. Comput. 27 (1): 202–229. CiteSeerX 10.1.1.48.9822. doi:10.1137/s0097539793259471.
- ^ Ding, Xiaoning; Wang, Kaibo; Gibbons, Phillip B.; Zhang, Xiaodong (2012). BWS: Balanced Work Stealing for Time-Sharing Multicores (PDF). EuroSys.
- ^ Blumofe, Robert D.; Papadopoulos, Dionisios (1998). The Performance of Work Stealing in Multiprogrammed Environments (Technical report). University of Texas at Austin, Department of Computer Sciences. CiteSeerX 10.1.1.48.2247.
- ^ Arora, Nimar S.; Blumofe, Robert D.; Plaxton, C. Greg (2001). "Thread scheduling for multiprogrammed multiprocessors" (PDF). Theory of Computing Systems. 34 (2): 115–144. doi:10.1007/s002240011004.
- ^ Chase, David R.; Lev, Yosef (2005). Dynamic Circular Work-Stealing Deque. ACM Symp. on Parallelism in Algorithms and Architectures. CiteSeerX 10.1.1.170.1097.
- ^ Blelloch, Guy E.; Gibbons, Phillip B.; Matias, Yossi (1999). "Provably efficient scheduling for languages with fine-grained parallelism" (PDF). Journal of the ACM. 46 (2): 281–321. CiteSeerX 10.1.1.48.8238. doi:10.1145/301970.301974. S2CID 47102937.
Work stealing
Introduction
Definition and Motivation
Work stealing is a decentralized scheduling strategy employed in parallel computing to achieve load balancing in multithreaded programs, wherein idle processors proactively "steal" tasks from the local double-ended queues (deques) of busy processors. This approach contrasts with centralized scheduling mechanisms by distributing the responsibility for load redistribution across processors, enabling adaptive handling of computational workloads without requiring prior knowledge of task granularities or execution times.

The primary motivation for work stealing arises in environments with dynamic parallelism, where task sizes and creation patterns vary unpredictably, leading to potential load imbalances that static scheduling techniques cannot effectively mitigate. By allowing runtime adaptation, work stealing minimizes processor idle time and enhances overall throughput, particularly in applications exhibiting irregular parallelism such as recursive algorithms or data-driven computations. This dynamic redistribution proves essential for maintaining efficiency on multiprocessor systems, where uneven work distribution could otherwise result in significant performance degradation.

Key benefits of work stealing include its inherent fault tolerance, achieved through task locality that preserves data access patterns and reduces recovery overhead in the event of processor failures; scalability to large numbers of processors due to its decentralized nature; and low synchronization overhead facilitated by lock-free deque operations. For instance, in divide-and-conquer algorithms like parallel mergesort, uneven subtree sizes can cause some processors to finish early while others remain burdened, but work stealing allows idle processors to extract subtasks from the deques of busy ones, thereby balancing the load and ensuring efficient completion.

Historical Development
Work stealing emerged in the 1990s as a load-balancing strategy for parallel and multithreaded computing systems, addressing the challenges of dynamic workload distribution on multiprocessors. It was first formalized by Robert D. Blumofe and Charles E. Leiserson in their 1993 work on the Cilk runtime system at MIT, detailed in the paper "Space-Efficient Scheduling of Multithreaded Computations." This introduced a randomized work-stealing scheduler that ensures efficient execution of fully strict multithreaded computations by allowing idle processors to "steal" tasks from busy ones, while bounding space usage to linear in the computation's size. The approach was designed to minimize scheduling overhead and achieve near-optimal parallelism in systems like Cilk, which targeted fine-grained, dynamic parallelism for applications such as scientific computing and AI.[7]

Key theoretical advancements followed in 1999 with Blumofe and Leiserson's paper "Scheduling Multithreaded Computations by Work Stealing," published in the Journal of the ACM, which provided a rigorous analysis proving the algorithm's efficiency: a computation with work T₁ and span T∞ runs in expected time T₁/P + O(T∞) on P processors. This work solidified work stealing's guarantees for fully strict programs, influencing subsequent scheduler designs. Practical adoption accelerated in the 2000s; Doug Lea's Java Fork/Join framework, proposed in 2000 and integrated into Java 7 in 2011 via JSR-166, adapted work stealing for recursive task decomposition, enabling scalable parallelism in managed environments. Similarly, Intel's Threading Building Blocks (TBB) library, first released in 2007, incorporated work stealing in its task scheduler to support portable, high-performance parallelism in C++ applications.[1][8]

The technique evolved with extensions for robustness in shared-memory systems, notably the non-blocking work-stealing algorithm by Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton in 1998, which eliminated locks using compare-and-swap operations to support multiprogrammed multiprocessors and reduce contention. More recent innovations include formally verified implementations, such as the block-based work stealing (BWoS) design by Jiawei Wang et al. in 2023 at the USENIX Symposium on Operating Systems Design and Implementation (OSDI), which partitions per-core queues into blocks for improved scalability and linearizability guarantees under high contention. Influential systems beyond Cilk include modern runtimes like the Go programming language's scheduler, which has employed work stealing since Go 1.1 (2013), with significant improvements in Go 1.5, to balance goroutines across OS threads, and OpenCilk, an open-source extension of Cilk integrated with the LLVM compiler infrastructure since 2019 for optimized parallel code generation.[9]

Core Concepts
Execution Model
In the work stealing execution model, a parallel computation is executed by multiple processors, each maintaining its own double-ended queue (deque) to hold ready tasks, typically in the form of threads or subtasks.[10] When a processor spawns new tasks during execution, it pushes them onto the bottom of its local deque, following a last-in, first-out (LIFO) order for its own task consumption to prioritize recently created work.[11] Conversely, when an idle processor attempts to steal work from another, it pops tasks from the top of the victim's deque, enforcing a first-in, first-out (FIFO) order to balance the load by taking older, potentially larger subtasks.[10] This asymmetric access pattern (LIFO for owners, FIFO for thieves) facilitates efficient locality for the task creator while promoting parallelism through theft.[11]

Processors operate in one of two primary states: busy or idle. A busy processor repeatedly pops and executes tasks from the bottom of its own deque until it becomes empty, at which point it transitions to idle.[10] An idle processor randomly selects a victim processor and attempts to steal a task from the top of that victim's deque; if successful, it executes the stolen task and returns to the busy state, but if unsuccessful, it selects another victim at random and tries again, repeating the process until it finds a task to execute or the computation is complete.[11] This random selection of victims helps distribute load evenly across processors without centralized coordination.[10]

Synchronization in the execution model relies on lock-free atomic operations, such as compare-and-swap (CAS), to manage concurrent access to deques, avoiding the overhead and potential deadlocks associated with traditional locks.[11] For instance, push, pop, and steal operations on a deque are implemented atomically, so that a race between the owner and a thief over the last remaining task is resolved safely, with failed attempts handled through retries.[10]

The model is particularly suited to computations structured as recursive divide-and-conquer paradigms, where tasks spawn child subtasks upon encountering parallelism, adding them to the bottom of the owner's deque for immediate LIFO execution.[11] This dynamic task creation ensures that thieves primarily acquire ready, independent subtasks from the top of the deque, maintaining progress without violating dependencies in the computation graph.[10] Such interactions enable scalable parallelism in shared-memory multiprocessors, with each processor acting autonomously in the runtime environment.[11]

Task Representation and Deques
In work stealing, tasks are represented as closures or continuations that encapsulate the remaining computation along with any dependencies, enabling dynamic parallelism. A closure typically includes a function pointer, arguments, and a frame for local state, while a continuation captures the point in the program to resume after task execution. This structure supports operations such as spawn, which creates a child task for concurrent execution without blocking the parent, and sync, which suspends the current task until all spawned children complete, ensuring dependency resolution before proceeding.[1]

The foundational data structure for managing tasks is a double-ended queue (deque) assigned to each processor, allowing efficient local access and remote stealing. Deques are typically implemented as array-based structures with atomic pointers to the top and bottom indices, ensuring thread-safety in concurrent environments without locks. These pointers are updated using compare-and-swap (CAS) operations to handle races between the owning processor and potential thieves.[12]

The owner processor performs operations exclusively on the bottom of the deque: it pushes newly spawned tasks to the bottom for sequential execution and pops from the bottom to execute the next local task. In contrast, a thief (an idle processor) attempts to steal a task from the top of another processor's deque, succeeding only if the deque is non-empty, again using CAS to atomically update the top pointer and claim the task. This asymmetric access (bottom for owner, top for thieves) maintains the LIFO order for the owner while providing FIFO-like stealing for load balancing.[1][12]

This design offers key advantages, as bottom-only access by the owner prevents thieves from interfering with the currently active task, preserving the illusion of sequential execution. Furthermore, the array-based deque with CAS ensures amortized O(1) time complexity for push, pop, and steal operations, supporting scalable parallelism across multiprocessors.[1]
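As a concrete illustration, here is a minimal fixed-capacity deque of this kind in C11, loosely following the array-based design described above (and the Chase–Lev deque cited here). It is a sketch under simplifying assumptions: no resizing or overflow handling, and sequentially consistent atomics throughout, where a tuned implementation would relax the memory orderings:

#include <stdatomic.h>
#include <stddef.h>

#define DEQUE_CAP 1024              /* assumed power-of-two capacity */

typedef struct {
    atomic_long top;                /* next slot thieves take from */
    atomic_long bottom;             /* next free slot for the owner */
    void *tasks[DEQUE_CAP];
} deque_t;

/* Owner only: push a newly spawned task onto the bottom. */
void push_bottom(deque_t *q, void *task) {
    long b = atomic_load(&q->bottom);
    q->tasks[b % DEQUE_CAP] = task;
    atomic_store(&q->bottom, b + 1);
}

/* Owner only: pop the youngest task (LIFO). */
void *pop_bottom(deque_t *q) {
    long b = atomic_load(&q->bottom) - 1;
    atomic_store(&q->bottom, b);            /* claim slot b first */
    long t = atomic_load(&q->top);
    if (b < t) {                            /* deque was empty */
        atomic_store(&q->bottom, t);
        return NULL;
    }
    void *task = q->tasks[b % DEQUE_CAP];
    if (b > t)                              /* two or more tasks: no race */
        return task;
    long t0 = t;                            /* last task: race the thieves */
    if (!atomic_compare_exchange_strong(&q->top, &t, t0 + 1))
        task = NULL;                        /* a thief got there first */
    atomic_store(&q->bottom, t0 + 1);       /* deque is now empty */
    return task;
}

/* Thieves: steal the oldest task from the top (FIFO end). */
void *steal_top(deque_t *q) {
    long t = atomic_load(&q->top);
    long b = atomic_load(&q->bottom);
    if (t >= b)
        return NULL;                        /* nothing to steal */
    void *task = q->tasks[t % DEQUE_CAP];
    if (!atomic_compare_exchange_strong(&q->top, &t, t + 1))
        return NULL;                        /* lost the race; retry later */
    return task;
}

An idle worker loops over randomly chosen victims calling steal_top until it obtains a task, while the owner touches only push_bottom and pop_bottom, matching the asymmetric access pattern described above.

Algorithm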
Standard Work-Stealing Procedure
In the standard work-stealing procedure, each processor maintains its own double-ended queue (deque) of ready tasks and operates in a loop to execute tasks until global termination. The core steps involve: (1) if the local deque is non-empty, the processor pops a task from the bottom of its deque and executes it; (2) if the local deque is empty, the processor selects a random other processor and attempts to steal a task from the top of that processor's deque; (3) if the steal succeeds, the processor executes the stolen task; and (4) the processor repeats this process until all tasks are completed.[2]

The procedure incorporates operations for spawning child tasks and synchronizing with them. When a processor spawns a child task during execution, it pushes the child task onto the bottom of its local deque. Synchronization (sync) at a join point waits for all child tasks spawned since the last sync to complete, ensuring dependencies are resolved before proceeding. This is typically implemented using a join counter that tracks unfinished children; the worker continues executing tasks until the counter reaches zero. Implementations vary: stalling schedulers block the thread at sync until children complete, while greedy schedulers have the worker continue stealing tasks until the join condition is met, improving load balance.[13]

The following pseudocode outlines the detailed loop for a worker thread in the standard procedure, including spawn and sync mechanisms (simplified; actual implementations handle task frames and counters):

while (true) {
    if (local_deque not empty) {
        task = pop_bottom(local_deque);        // Atomic pop from bottom
        execute(task);
    } else {
        victim = random_other_processor();
        task = try_steal_top(victim_deque);    // Atomic steal from top
        if (task != null) {
            execute(task);
        } else {
            // Idle; check for termination
            if (all_deques_empty() && no_active_tasks()) break;
        }
    }
}

spawn(child_task) {
    push_bottom(local_deque, child_task);      // Push to bottom
}

sync() {
    // Greedy schedulers: keep running the main loop, processing tasks
    // until this task's join counter reaches zero (all children done).
    // Stalling schedulers: block until children complete.
}
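A hedged C sketch of the join-counter mechanism the sync() comment alludes to (hypothetical names; a real runtime ties the counter to the spawning task's frame):

#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_int outstanding; } frame_t;
typedef struct task task_t;

extern void push_bottom_local(task_t *child);  /* enqueue on own deque */
extern task_t *find_work(void);                /* pop locally, else steal */
extern void execute(task_t *t);

/* Spawn: count the child before making it stealable. */
void spawn(frame_t *parent, task_t *child) {
    atomic_fetch_add(&parent->outstanding, 1);
    push_bottom_local(child);
}

/* Run by whichever worker finishes a child of this frame. */
void child_done(frame_t *parent) {
    atomic_fetch_sub(&parent->outstanding, 1);
}

/* Greedy sync: stay busy on other tasks until every child spawned
   from this frame has completed. */
void sync(frame_t *f) {
    while (atomic_load(&f->outstanding) > 0) {
        task_t *t = find_work();
        if (t != NULL)
            execute(t);
    }
}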
