Task parallelism
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors. In contrast to data parallelism, which involves running the same task on different components of data, task parallelism is distinguished by running many different tasks at the same time on the same data.[1] A common type of task parallelism is pipelining, which consists of moving a single set of data through a series of separate tasks where each task can execute independently of the others.
Description
In a multiprocessor system, task parallelism is achieved when each processor executes a different thread (or process) on the same or different data. The threads may execute the same or different code. In the general case, different execution threads communicate with one another as they work, but this is not a requirement. Communication usually takes place by passing data from one thread to the next as part of a workflow.[2]
As a simple example, if a system is running code on a 2-processor system (CPUs "a" & "b") in a parallel environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task "A" and CPU "b" to do task "B" simultaneously, thereby reducing the run time of the execution. The tasks can be assigned using conditional statements as described below.
Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as opposed to the data (data parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.[3]
Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers such as databases. By running many threads at once, these applications are able to tolerate the high amounts of I/O and memory system latency their workloads can incur: while one thread is delayed waiting for a memory or disk access, other threads can do useful work.
The exploitation of thread-level parallelism has also begun to make inroads into the desktop market with the advent of multi-core microprocessors. This has occurred because, for various reasons, it has become increasingly impractical to increase either the clock speed or instructions per clock of a single core. If this trend continues, new applications will have to be designed to utilize multiple threads in order to benefit from the increase in potential computing power. This contrasts with previous microprocessor innovations in which existing code was automatically sped up by running it on a newer/faster computer.
Example
The pseudocode below illustrates task parallelism:
program:
...
if CPU = "a" then
    do task "A"
else if CPU = "b" then
    do task "B"
end if
...
end program
The goal of the program is to do some net total task ("A+B"). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.
- In an SPMD (single program, multiple data) system, both CPUs will execute the code.
- In a parallel environment, both will have access to the same data.
- The "if" clause differentiates between the CPUs. CPU "a" will read true on the "if" and CPU "b" will read true on the "else if", thus having their own task.
- Now, both CPU's execute separate code blocks simultaneously, performing different tasks simultaneously.
Code executed by CPU "a":
program:
...
do task "A"
...
end program
Code executed by CPU "b":
program:
...
do task "B"
...
end program
This concept can now be generalized to any number of processors.
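As a concrete illustration, the sketch below shows one way the same two-task decomposition might look in C++ using explicit threads instead of the SPMD conditional dispatch above; taskA and taskB are hypothetical stand-ins for whatever work "A" and "B" actually perform.

#include <iostream>
#include <thread>

// Hypothetical stand-ins for tasks "A" and "B".
void taskA() { std::cout << "task A done\n"; }
void taskB() { std::cout << "task B done\n"; }

int main() {
    // Give each task its own thread; the OS may schedule them on
    // different CPUs, so "A" and "B" can run at the same time.
    std::thread a(taskA);
    std::thread b(taskB);
    a.join();  // wait for task "A" to finish
    b.join();  // wait for task "B" to finish
}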
Language support
Task parallelism can be supported in general-purpose languages by either built-in facilities or libraries. Notable examples include:
- Ada: Tasks (built-in)
- C++ (Intel): Threading Building Blocks
- C++ (Intel): Cilk Plus
- C++ (Open Source/Apache 2.0): RaftLib
- C, C++, Objective-C, Swift (Apple): Grand Central Dispatch
- D: tasks and fibers
- Delphi (System.Threading.TParallel)
- Go: goroutines
- Java: Java concurrency
- .NET: Task Parallel Library
Examples of fine-grained task-parallel languages can be found in the realm of Hardware Description Languages like Verilog and VHDL.
References
- ^ Reinders, James (10 September 2007). "Understanding task and data parallelism". ZDNet. Retrieved 8 May 2017.
- ^ Quinn, Michael J. (2007). Parallel Programming in C with MPI and OpenMP (Tata McGraw-Hill ed.). New Delhi: Tata McGraw-Hill Pub. ISBN 978-0070582019.
- ^ Hicks, Michael. "Concurrency Basics" (PDF). University of Maryland: Department of Computer Science. Retrieved 8 May 2017.
Task parallelism
Core Concepts
Definition
Task parallelism is a form of parallel computing in which a computational problem is divided into multiple independent tasks—discrete units of work—that execute concurrently across different processors or cores to enhance throughput and reduce execution time compared to sequential processing.[8][9] This approach, also known as functional decomposition, emphasizes the concurrent performance of distinct operations rather than uniform processing of data elements.[8] The key principles of task parallelism revolve around task independence, enabling minimal inter-task communication and synchronization to maximize efficiency; the potential for dynamic task creation and assignment during runtime to adapt to workload variations; and a focus on coarse-grained work division, where tasks encompass larger, self-contained computations suitable for distribution across heterogeneous resources.[9] These principles distinguish task parallelism within the broader context of parallel computing, which involves the simultaneous use of multiple compute resources to solve problems that would otherwise require sequential execution on a single processor.[10] Task parallelism first appeared in early multiprocessing systems of the 1960s and 1970s, where concurrent task handling became essential for leveraging multiple processors.[11]

The basic workflow of task parallelism begins with the identification and decomposition of a program into independent tasks, followed by their concurrent execution on available processing units, and concludes with result aggregation if dependencies exist.[8] This process improves resource utilization by allowing tasks to proceed asynchronously, with overall performance limited primarily by the longest-running task and any inherent serial components.[9]
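A minimal C++ sketch of this decompose/execute/aggregate workflow is shown below, assuming std::async as the task-launch mechanism and a hypothetical process_chunk function as the independent unit of work.

#include <future>
#include <iostream>
#include <vector>

// Hypothetical independent unit of work.
int process_chunk(int id) { return id * id; }

int main() {
    // 1. Decompose the problem into independent tasks and launch them;
    //    each call may run concurrently on an available core.
    std::vector<std::future<int>> results;
    for (int id = 0; id < 4; ++id)
        results.push_back(std::async(std::launch::async, process_chunk, id));

    // 2. Aggregate the partial results once every task has completed.
    int total = 0;
    for (auto& f : results)
        total += f.get();
    std::cout << total << "\n";  // 0 + 1 + 4 + 9 = 14
}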
Historical Development
The concept of task parallelism emerged in the 1960s and 1970s amid early efforts to harness multiprocessing for high-performance computing. By the mid-1970s, foundational ideas for dynamic task execution were advanced through dataflow architectures, where computations are broken into independent tasks activated by data availability rather than rigid control flow. Jack B. Dennis and David P. Misunas proposed a preliminary architecture for a basic data-flow processor in 1975, enabling asynchronous task firing and influencing subsequent models of irregular parallelism.[12]

In the 1980s, task parallelism gained practical expression in programming languages designed for concurrent systems. The Ada programming language, standardized as Ada 83 in 1983 under U.S. Department of Defense sponsorship, introduced native tasking facilities—including task types, rendezvous for synchronization, and select statements for conditional execution—to support real-time and embedded applications with reliable concurrency.[13] This marked a shift toward structured, language-level support for dynamic task creation and interaction, building on earlier multiprocessing concepts from the 1970s.[14]

The 2000s accelerated adoption due to hardware trends, particularly the transition to multicore processors. Intel's announcement in 2005 of a pivot from single-core frequency scaling to multicore designs, exemplified by the release of dual-core Pentium processors, underscored the need for software paradigms like task parallelism to exploit on-chip concurrency effectively.[11] A pivotal milestone came with OpenMP 3.0 in May 2008, which added task constructs to the API, enabling programmers to define and schedule independent tasks dynamically for irregular workloads, evolving from earlier static loop-based parallelism.[15]

Influential contributions in distributed contexts further shaped the field, with Ian Foster's work in the 1990s and 2000s on grid computing promoting task-based models for large-scale, heterogeneous environments, as detailed in his 1995 book Designing and Building Parallel Programs.[16] Post-2010 developments, driven by cloud computing and accelerators, extended dynamic tasking to scalable frameworks, transitioning from static scheduling in early multiprocessing to adaptive, runtime-managed task graphs in modern systems.[17]

Implementation Mechanisms
Task Models and Abstractions
In task-parallel systems, computations are often modeled using a directed acyclic graph (DAG), where nodes represent individual tasks and directed edges indicate dependencies between them, ensuring that dependent tasks execute only after their predecessors complete. This model captures the structure of parallel workloads by avoiding cycles that could lead to deadlocks or undefined behavior, allowing runtimes to identify opportunities for concurrent execution. The DAG approach has become a foundational abstraction for expressing irregular parallelism in applications with varying dependency patterns.[18]

A key abstraction for handling asynchronous results in these models is the use of futures and promises. A future acts as a placeholder for a value that will be computed asynchronously by a task, enabling the main program to continue without blocking until the result is needed, while a promise serves as the mechanism for the completing task to deliver that value. Futures were introduced in the context of concurrent symbolic computation to support lazy evaluation in parallel Lisp environments, promoting fine-grained parallelism without explicit synchronization. Promises, as precursors to modern asynchronous constructs, were proposed to represent eventual results in applicative programming paradigms, decoupling computation from result access.[19][20]

Tasks in these systems are typically treated as first-class objects, meaning they can be dynamically created, manipulated, and managed like any other data entity, with well-defined states such as created, submitted, running, and completed. This abstraction allows programmers to submit tasks to a runtime scheduler and query their status, facilitating composable parallelism where tasks can depend on or spawn others. Dependency graphs, frequently implemented as DAGs, further refine this by explicitly encoding relationships, such as data or control dependencies, to guide execution order while maximizing concurrency.[21]

Important distinctions in task models include implicit versus explicit parallelism. In implicit models, the runtime or compiler automatically detects and exploits parallel opportunities from high-level specifications, reducing programmer burden but potentially limiting control over irregular workloads; explicit models require developers to annotate or define parallel regions and dependencies directly, offering precision at the cost of added complexity. Additionally, tasks are often designed as lightweight entities, similar to threads that share process resources like memory and file descriptors for low overhead, in contrast to heavyweight processes that maintain isolated address spaces and incur higher creation and context-switching costs.[22][23]

To illustrate basic task creation and management, consider the following pseudocode, which captures the essence of submitting a task and awaiting its result in a generic task-parallel runtime:
task T = create_task(compute_function, arguments);
submit(T, scheduler);
result = wait_for_completion(T);
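To make the future/promise abstraction concrete, here is a minimal C++ sketch (not tied to any particular task runtime): the promise is the producer side that a completing task fulfills, and the future is the consumer-side placeholder the main program waits on.

#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<int> p;                  // producer side: the task delivers the value here
    std::future<int> f = p.get_future();  // consumer side: placeholder for the eventual result

    // The "task" runs asynchronously and fulfills the promise when done.
    std::thread worker([&p] { p.set_value(6 * 7); });

    // The main program can keep working and blocks only when it needs the result.
    std::cout << "result = " << f.get() << "\n";  // prints 42
    worker.join();
}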
Scheduling and Synchronization
In task parallelism, scheduling involves assigning tasks to available processing resources to optimize execution time and resource utilization. Static scheduling pre-allocates tasks to processors based on a known task graph prior to runtime, assuming predictable execution times and no variations in load, which minimizes runtime overhead but can lead to imbalances if assumptions fail.[25] Dynamic scheduling, in contrast, adapts task assignments at runtime to current system conditions, such as varying task durations or processor loads, enabling better responsiveness in irregular workloads.[25] A prominent dynamic strategy is work-stealing, where idle processors "steal" tasks from busy processors' queues to balance load, as implemented in systems like Cilk; this approach ensures low contention and bounded overhead, with theoretical guarantees of O(log n) stealing attempts per task in a system with n tasks. Priority-based task queues extend these strategies by ordering tasks according to assigned priorities, often using heaps or multi-level queues, to ensure high-priority tasks (e.g., those on critical paths) execute first, reducing overall completion time in dependency-heavy graphs.[26] Task models like directed acyclic graphs (DAGs) inform these schedulers by representing dependencies, allowing runtime systems to select only ready (unblocked) tasks for assignment.

Synchronization in task parallelism coordinates task execution to respect dependencies and protect shared resources, using primitives tailored to minimize blocking. Barriers synchronize groups of tasks by requiring all to reach a common point before any proceeds, ensuring collective progress in phases like iterative algorithms.[10] Mutexes provide mutual exclusion for shared data access, preventing race conditions during critical sections, though they can introduce serialization if overused in fine-grained tasks.[10] Atomic operations offer lightweight synchronization for signaling task completion, such as incrementing counters or setting flags without full locks, enabling efficient dependency resolution via mechanisms like reference counting on futures or promises.[10]

Runtime considerations for scheduling and synchronization emphasize load balancing across cores, achieved through decentralized mechanisms like work-stealing deques per processor, which distribute tasks without central bottlenecks and adapt to heterogeneity. Handling task dependencies at runtime involves maintaining a ready-task pool derived from the DAG, where incoming edges are decremented upon predecessor completion (often atomically), releasing successors when counts reach zero; this ensures tasks execute only after prerequisites, with schedulers prioritizing ready tasks to minimize idle time.[27]

Challenges in these mechanisms include overhead from context switching, where frequent task migrations between cores incur costs for saving/restoring states, registers, and cache lines, potentially dominating execution for fine-grained tasks and reducing effective parallelism.[28] To quantify impact, consider the speedup equation, which measures parallel efficiency. Let T_s be the execution time of the sequential version, encompassing all computation without parallelism. In the parallel case, the execution time T_p includes the parallelized computation time divided by the number of processors p, plus synchronization and communication time T_sync and scheduling overhead T_sched (e.g., from stealing or priority-queue operations). Thus

T_p = W/p + T_sync + T_sched,

where W is the total computational work (approximating T_s if fully parallelizable). Speedup is then

S = T_s / T_p = T_s / (W/p + T_sync + T_sched).

Deriving from first principles, as p increases, S approaches p ideally if overheads are negligible, but T_sched grows with task granularity (e.g., O(1) per steal but multiplied by the number of steals), capping S below p; for instance, if T_sync + T_sched is constant, S asymptotes to T_s / (T_sync + T_sched). This formulation highlights how scheduling costs directly limit scalability in task-parallel systems.[10]
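The dependency-resolution step described above can be sketched with atomic predecessor counters, as in the following illustrative C++ fragment; the Task structure and ready_queue are hypothetical, and a real runtime would use synchronized or work-stealing queues.

#include <atomic>
#include <functional>
#include <vector>

// Illustrative node of a task DAG.
struct Task {
    std::function<void()> work;
    std::atomic<int> pending_preds{0};  // number of unfinished predecessors
    std::vector<Task*> successors;      // tasks that depend on this one
};

// Called when task t finishes: atomically decrement each successor's
// predecessor count and hand it to the scheduler once the count hits zero.
void on_task_complete(Task& t, std::vector<Task*>& ready_queue) {
    for (Task* s : t.successors) {
        // fetch_sub returns the previous value; 1 means this was the last
        // outstanding predecessor, so the successor is now ready to run.
        if (s->pending_preds.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            ready_queue.push_back(s);  // in a real scheduler this queue is itself synchronized
        }
    }
}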
Language and Framework Support
Support in C++ and Related Standards
The standardization of task parallelism in C++ emerged in response to the widespread adoption of multicore processors starting around 2005, which necessitated higher-level abstractions for scalable concurrent programming beyond low-level threads.[29] The C++ standards committee, through Working Group 21 (WG21), introduced foundational concurrency features in C++11 to enable asynchronous task execution, addressing the limitations of explicit thread management for irregular workloads on multicore systems.[30]

In C++11, the <future> header provides core primitives for task-based parallelism, including std::async, std::future, and std::packaged_task. std::async launches a callable object asynchronously, potentially on a new thread, and returns a std::future object that allows the caller to retrieve the result or handle exceptions once the task completes.[31] This mechanism supports deferred or concurrent execution policies, facilitating task parallelism without direct thread creation. std::future represents the shared state of an asynchronous operation, offering methods like wait() and get() to synchronize and access results, while std::packaged_task wraps a callable into a task that stores its outcome in a shared state accessible via a future.[32] These features were refined in C++17 with improvements to exception propagation and in C++20 with coroutines enhancing asynchronous task composition, though the core task model remains centered on futures for multicore scalability.[33]
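The following short example (a sketch, not normative usage) shows std::async and std::packaged_task side by side; heavy_computation is a placeholder workload.

#include <future>
#include <iostream>
#include <thread>

int heavy_computation(int x) { return x * x; }  // placeholder workload

int main() {
    // std::async launches the callable, potentially on another thread,
    // and returns a std::future for the eventual result.
    std::future<int> f1 = std::async(std::launch::async, heavy_computation, 6);

    // std::packaged_task wraps a callable so that its result is delivered
    // through a future; here it is handed to an explicitly created thread.
    std::packaged_task<int(int)> task(heavy_computation);
    std::future<int> f2 = task.get_future();
    std::thread t(std::move(task), 7);

    std::cout << f1.get() + f2.get() << "\n";  // 36 + 49 = 85
    t.join();
}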
OpenMP, a directive-based API for shared-memory parallelism, integrated task parallelism starting with version 3.0 in 2008 to handle dynamic, irregular workloads on multicore architectures.[34] The #pragma omp task directive generates a task from a structured block, allowing deferred execution by the runtime scheduler, while #pragma omp taskwait suspends the current task until all its child tasks complete, enabling dependency management without explicit synchronization.[35] This model, which builds on work-sharing constructs from earlier versions, promotes scalability by distributing tasks across threads dynamically, and it has been extended in subsequent standards like OpenMP 4.0 and 5.0 for better dependency graphs and device offloading, and in OpenMP 6.0 (released November 2024) with improved tasking support and features for easier parallel programming.[36][37]
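A typical use of these directives is the recursive Fibonacci sketch below, assuming an OpenMP-enabled C++ compiler (e.g. invoked with -fopenmp): each recursive call is spawned as a task and taskwait joins the children.

#include <iostream>

long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)  // spawn a child task for fib(n-1)
    x = fib(n - 1);
    #pragma omp task shared(y)  // spawn a child task for fib(n-2)
    y = fib(n - 2);
    #pragma omp taskwait        // wait for both child tasks to complete
    return x + y;
}

int main() {
    long result = 0;
    #pragma omp parallel  // create the thread team that executes tasks
    #pragma omp single    // one thread generates the initial task tree
    result = fib(20);
    std::cout << result << "\n";  // prints 6765
}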
Related to C++ standards, Intel's oneAPI Threading Building Blocks (oneTBB), formerly Threading Building Blocks, offers a library-based approach to task parallelism that complements standard features.[38] oneTBB provides task_group for enclosing parallel tasks executed by a work-stealing scheduler, ensuring load balancing on multicore systems, and flow graphs—a node-based model for composing task dependencies as directed acyclic graphs, suitable for pipeline and irregular parallelism.[39] Originally developed in 2007 to address post-multicore programming challenges, oneTBB has evolved into an open-source standard under the oneAPI initiative, integrating seamlessly with C++11+ concurrency primitives.[40]
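A minimal task_group sketch is shown below; the header path and namespace follow current oneTBB releases (older Threading Building Blocks versions use <tbb/task_group.h> and the tbb namespace), and the program must be linked against the TBB library.

#include <oneapi/tbb/task_group.h>
#include <iostream>

int main() {
    int a = 0, b = 0;
    oneapi::tbb::task_group g;
    g.run([&] { a = 6 * 7; });   // first task, scheduled by the work-stealing runtime
    g.run([&] { b = 7 * 7; });   // second task, may execute on another worker thread
    g.wait();                    // join point: both tasks have completed here
    std::cout << a + b << "\n";  // 42 + 49 = 91
}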
Support in Java and Other Languages
Java provides robust support for task parallelism through its java.util.concurrent package, introduced in Java 5, which includes the ExecutorService interface for managing thread pools and executing asynchronous tasks without directly handling threads. ExecutorService allows developers to submit tasks via methods like submit() or invokeAll(), enabling efficient distribution of work across a pool of worker threads, which helps in achieving parallelism for independent units of computation. This framework abstracts low-level thread management, promoting scalable task execution in multi-core environments.[41]

Building on this, Java 7 introduced the ForkJoinPool, a specialized ExecutorService implementation that employs a work-stealing scheduler to balance workloads dynamically among threads, particularly suited for recursive divide-and-conquer algorithms like parallel quicksort. In this model, idle threads "steal" tasks from busy threads' queues, minimizing synchronization overhead and maximizing CPU utilization for fine-grained tasks. Java 8 further enhanced task composition with CompletableFuture, a class that represents a pending completion stage and supports chaining operations (e.g., thenApply() for transformations or allOf() for combining multiple futures), facilitating non-blocking, asynchronous workflows that compose parallel tasks declaratively.[42][43]

Java 21 (released in 2023) advanced task parallelism significantly with virtual threads under Project Loom, which are lightweight, JVM-managed threads that map to carrier (platform) threads, allowing millions of them to run concurrently with minimal overhead compared to traditional platform threads. Virtual threads enable scalable task execution for I/O-bound applications by reducing context-switching costs, while structured concurrency constructs like StructuredTaskScope ensure safe grouping and cancellation of related tasks. However, in managed runtime environments like the JVM, garbage collection pauses—such as those in the parallel collector—can introduce stop-the-world interruptions, potentially degrading latency-sensitive task parallelism by halting all threads during heap reclamation.[44][45]

Beyond Java, other languages offer distinct paradigms for task parallelism. In Go, goroutines provide lightweight concurrency primitives, launched with the go keyword, that enable thousands of tasks to run multiplexed on a smaller number of OS threads managed by the Go runtime; channels facilitate safe communication and synchronization between goroutines, supporting patterns like fan-out/fan-in for parallel data processing. Python's concurrent.futures module, available since Python 3.2, abstracts task execution through ThreadPoolExecutor for I/O-bound parallelism and ProcessPoolExecutor for CPU-bound tasks, allowing submission of callables via submit() or map(), with futures for result retrieval, though the Global Interpreter Lock limits true thread parallelism.[46][47]

Rust emphasizes safe, high-performance task parallelism via its async/await syntax (stable since Rust 1.39 in 2019), where asynchronous functions spawn non-blocking tasks; the Tokio runtime, a popular async executor, schedules these tasks across a multi-threaded worker pool using work-stealing, enabling efficient parallelism for both I/O and compute-intensive workloads while leveraging Rust's ownership model to prevent data races.
Cross-language trends highlight the actor model, notably in Erlang, where lightweight processes act as isolated actors that communicate solely via asynchronous message passing, inherently supporting distributed task parallelism across nodes with fault tolerance through process supervision.[48]

Applications and Examples
Practical Examples
One practical application of task parallelism is in parallel image processing, where different operations such as edge detection on one section and region growing on another are assigned to separate tasks executing concurrently on available processors. This approach allows functionally diverse processing—such as applying distinct algorithms to image regions—to proceed with coordination for dependencies, enabling efficient utilization of multicore systems.[49]

The following pseudocode illustrates the task decomposition and submission for this image processing example, where the image is partitioned into tasks that can be executed in parallel:
function parallel_image_process(image):
    regions = decompose_image_into_regions(image)  // Partition image into regions for different operations
    for each region in regions:
        if region.type == "edge":
            submit_task(edge_detection, region)
        elif region.type == "grow":
            submit_task(region_growing, region)
    wait_for_all_tasks()  // Synchronize completion
    return combine_processed_image(regions)
A second common pattern is task-parallel Monte Carlo simulation, in which the total number of iterations is split into independent batches that run as separate tasks and whose partial results are aggregated at the end, as in the following pseudocode:
function parallel_monte_carlo_simulation(num_iterations, simulation_function):
    batches = partition_iterations(num_iterations)  // Divide total iterations into equal task batches
    for each batch in batches:
        submit_task(monte_carlo_batch, batch, simulation_function)  // Each task runs independent simulations
    wait_for_all_tasks()  // Synchronize completion
    partial_results = [get_result(batch) for batch in batches]
    return aggregate_results(partial_results)  // Combine for final estimate, e.g., via averaging
