Thread (computing)
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system.[1] In many cases, a thread is a component of a process.
The multiple threads of a given process may be executed concurrently (via multithreading capabilities), sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.
The implementation of threads and processes differs between operating systems.[2][page needed]
History
Threads made an early appearance under the name of "tasks" in IBM's batch processing operating system, OS/360, in 1967. It provided users with three available configurations of the OS/360 control system, of which multiprogramming with a variable number of tasks (MVT) was one. Saltzer (1966) credits Victor A. Vyssotsky with the term "thread".[3]
The Mach implementation of threads was described in the summer of 1986.[4] OS/2 1.0, released in 1987, supported threads.[5] The first version of Windows to have threads was Windows NT, released in 1993.
The use of threads in software applications became more common in the early 2000s as CPUs began to include multiple cores. Applications wishing to take advantage of multiple cores for performance had to employ concurrency in order to use them.[6]
Related concepts
Scheduling can be done at the kernel level or user level, and multitasking can be done preemptively or cooperatively. This yields a variety of related concepts.
Processes
At the kernel level, a process contains one or more kernel threads, which share the process's resources, such as memory and file handles – a process is a unit of resources, while a thread is a unit of scheduling and execution. Kernel scheduling is typically uniformly done preemptively or, less commonly, cooperatively. At the user level a process such as a runtime system can itself schedule multiple threads of execution. If these do not share data, as in Erlang, they are usually analogously called processes,[7] while if they share data they are usually called (user) threads, particularly if preemptively scheduled. Cooperatively scheduled user threads are known as fibers; different processes may schedule user threads differently. User threads may be executed by kernel threads in various ways (one-to-one, many-to-one, many-to-many). The term light-weight process variously refers to user threads or to kernel mechanisms for scheduling user threads onto kernel threads.
A process is a heavyweight unit of kernel scheduling, as creating, destroying, and switching processes is relatively expensive. Processes own resources allocated by the operating system. Resources include memory (for both code and data), file handles, sockets, device handles, windows, and a process control block. Processes are isolated by process isolation, and do not share address spaces or file resources except through explicit methods such as inheriting file handles or shared memory segments, or mapping the same file in a shared way – see Interprocess communication. Creating or destroying a process is relatively expensive, as resources must be acquired or released. Processes are typically preemptively multitasked, and process switching is relatively expensive, beyond basic cost of context switching, due to issues such as cache flushing (in particular, process switching changes virtual memory addressing, causing invalidation and thus flushing of an untagged translation lookaside buffer (TLB), notably on x86).
Kernel threads
A kernel thread is a lightweight unit of kernel scheduling. At least one kernel thread exists within each process. If multiple kernel threads exist within a process, then they share the same memory and file resources. Kernel threads are preemptively multitasked if the operating system's process scheduler is preemptive. Kernel threads do not own resources except for a stack, a copy of the registers including the program counter, and thread-local storage (if any), and are thus relatively cheap to create and destroy. Thread switching is also relatively cheap: it requires a context switch (saving and restoring registers and stack pointer), but does not change virtual memory and is thus cache-friendly (leaving the TLB valid). The kernel can assign one or more software threads to each core of a CPU (a core that supports hardware multithreading can run more than one software thread), and can swap out threads that get blocked. However, kernel threads take much longer than user threads to be swapped.
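To make the thread-local storage point concrete, the following minimal C sketch (POSIX threads plus the C11 _Thread_local storage class; the variable and thread names are illustrative) gives each thread its own copy of a counter while the rest of the address space stays shared:

#include <pthread.h>
#include <stdio.h>

/* Each thread gets its own copy of this variable (thread-local storage). */
static _Thread_local int tls_counter = 0;

static void *work(void *name) {
    for (int i = 0; i < 3; i++)
        tls_counter++;                     /* touches only this thread's copy */
    printf("%s: tls_counter = %d\n", (const char *)name, tls_counter);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, work, "thread A");
    pthread_create(&b, NULL, work, "thread B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("main: tls_counter = %d\n", tls_counter);  /* main's copy is still 0 */
    return 0;
}

Both worker threads report a count of 3, while the main thread's copy remains 0, since each copy lives in that thread's own storage.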
User threads
Threads are sometimes implemented in userspace libraries, thus called user threads. The kernel is unaware of them, so they are managed and scheduled in userspace. Some implementations base their user threads on top of several kernel threads, to benefit from multi-processor machines (M:N model). User threads as implemented by virtual machines are also called green threads.
As user thread implementations are typically entirely in userspace, context switching between user threads within the same process is extremely efficient because it does not require any interaction with the kernel at all: a context switch can be performed by locally saving the CPU registers used by the currently executing user thread or fiber and then loading the registers required by the user thread or fiber to be executed. Since scheduling occurs in userspace, the scheduling policy can be more easily tailored to the requirements of the program's workload.
However, the use of blocking system calls in user threads (as opposed to kernel threads) can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.
A common solution to this problem (used, in particular, by many green threads implementations) is providing an I/O API that implements an interface that blocks the calling thread, rather than the entire process, by using non-blocking I/O internally, and scheduling another user thread or fiber while the I/O operation is in progress. Similar solutions can be provided for other blocking system calls. Alternatively, the program can be written to avoid the use of synchronous I/O or other blocking system calls (in particular, using non-blocking I/O, including lambda continuations and/or async/await primitives[8]).
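The following C sketch shows the mechanism such an I/O layer typically builds on (assuming a POSIX system; the scheduling step is only indicated by a comment): the descriptor is switched to non-blocking mode, so a read that cannot complete returns EAGAIN instead of suspending the whole process, giving a user-level scheduler the chance to run another user thread or fiber in the meantime:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Switch a file descriptor to non-blocking mode. */
static int make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    return (flags == -1) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

int main(void) {
    char buf[256];
    make_nonblocking(STDIN_FILENO);

    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* No data yet: a green-threads runtime would switch to another
           runnable user thread here instead of blocking the process. */
        printf("read would block; run another user thread\n");
    } else if (n >= 0) {
        printf("read %zd bytes without blocking\n", n);
    }
    return 0;
}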
Fibers
Fibers are an even lighter unit of scheduling which are cooperatively scheduled: a running fiber must explicitly yield to allow another fiber to run, which makes their implementation much easier than kernel or user threads. A fiber can be scheduled to run in any thread in the same process. This permits applications to gain performance improvements by managing scheduling themselves, instead of relying on the kernel scheduler (which may not be tuned for the application). Some research implementations of the OpenMP parallel programming model implement their tasks through fibers.[9][10] Closely related to fibers are coroutines, with the distinction being that coroutines are a language-level construct, while fibers are a system-level construct.
Threads vs processes
Threads differ from traditional multitasking operating-system processes in several ways:
- processes are typically independent, while threads exist as subsets of a process
- processes carry considerably more state information than threads, whereas multiple threads within a process share process state as well as memory and other resources
- processes have separate address spaces, whereas threads share their address space
- processes interact only through system-provided inter-process communication mechanisms
- context switching between threads in the same process typically occurs faster than context switching between processes
Systems such as Windows NT and OS/2 are said to have cheap threads and expensive processes; in other operating systems there is not so great a difference except in the cost of an address-space switch, which on some architectures (notably x86) results in a translation lookaside buffer (TLB) flush.
Advantages and disadvantages of threads vs processes include:
- Lower resource consumption of threads: using threads, an application can operate using fewer resources than it would need when using multiple processes.
- Simplified sharing and communication of threads: unlike processes, which require a message passing or shared memory mechanism to perform inter-process communication (IPC), threads can communicate through data, code and files they already share.
- Thread crashes a process: due to threads sharing the same address space, an illegal operation performed by a thread can crash the entire process; therefore, one misbehaving thread can disrupt the processing of all the other threads in the application.
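As a minimal illustration of the sharing described above (a sketch using POSIX threads; the array and function names are illustrative), two threads write directly into the same global array, so the results are visible to the whole process without any IPC mechanism:

#include <pthread.h>
#include <stdio.h>

static int results[2];            /* one global array, visible to every thread */

static void *fill(void *arg) {
    long idx = (long)arg;
    results[idx] = ((int)idx + 1) * 100;   /* each thread writes its own slot */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, fill, (void *)0L);
    pthread_create(&t2, NULL, fill, (void *)1L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* No message passing or shared-memory segment was needed. */
    printf("%d %d\n", results[0], results[1]);
    return 0;
}

The flip side of this convenience is the last point above: an illegal memory access in either thread would terminate the entire process.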
Scheduling
Preemptive vs cooperative scheduling
Operating systems schedule threads either preemptively or cooperatively. Multi-user operating systems generally favor preemptive multithreading for its finer-grained control over execution time via context switching. However, preemptive scheduling may context-switch threads at moments unanticipated by programmers, thus causing lock convoy, priority inversion, or other side-effects. In contrast, cooperative multithreading relies on threads to relinquish control of execution, thus ensuring that threads run to completion. This can cause problems if a cooperatively-multitasked thread blocks by waiting on a resource or if it starves other threads by not yielding control of execution during intensive computation.
Single- vs multi-processor systems
Until the early 2000s, most desktop computers had only one single-core CPU, with no support for hardware threads, although threads were still used on such computers because switching between threads was generally still quicker than full-process context switches. In 2002, Intel added support for simultaneous multithreading to the Pentium 4 processor, under the name hyper-threading; in 2005, they introduced the dual-core Pentium D processor and AMD introduced the dual-core Athlon 64 X2 processor.
Systems with a single processor generally implement multithreading by time slicing: the central processing unit (CPU) switches between different software threads. This context switching usually occurs frequently enough that users perceive the threads or tasks as running in parallel (for popular server/desktop operating systems, maximum time slice of a thread, when other threads are waiting, is often limited to 100–200ms). On a multiprocessor or multi-core system, multiple threads can execute in parallel, with every processor or core executing a separate thread simultaneously; on a processor or core with hardware threads, separate software threads can also be executed concurrently by separate hardware threads.
Threading models
1:1 (kernel-level threading)
Threads created by the user in a 1:1 correspondence with schedulable entities in the kernel[11] are the simplest possible threading implementation. OS/2 and Win32 used this approach from the start, while on Linux the GNU C Library implements this approach (via the NPTL or older LinuxThreads). This approach is also used by Solaris, NetBSD, FreeBSD, macOS, and iOS.
M:1 (user-level threading)
An M:1 model implies that all application-level threads map to one kernel-level scheduled entity;[11] the kernel has no knowledge of the application threads. With this approach, context switching can be done very quickly and, in addition, it can be implemented even on simple kernels which do not support threading. One of the major drawbacks, however, is that it cannot benefit from the hardware acceleration on multithreaded processors or multi-processor computers: there is never more than one thread being scheduled at the same time.[11] For example: if one of the threads needs to execute an I/O request, the whole process is blocked and the threading advantage cannot be used. GNU Portable Threads uses user-level threading, as does State Threads.
M:N (hybrid threading)
M:N maps some M number of application threads onto some N number of kernel entities,[11] or "virtual processors." This is a compromise between kernel-level ("1:1") and user-level ("N:1") threading. In general, "M:N" threading systems are more complex to implement than either kernel or user threads, because changes to both kernel and user-space code are required. In the M:N implementation, the threading library is responsible for scheduling user threads on the available schedulable entities; this makes context switching of threads very fast, as it avoids system calls. However, this increases complexity and the likelihood of priority inversion, as well as suboptimal scheduling without extensive (and expensive) coordination between the userland scheduler and the kernel scheduler.
Hybrid implementation examples
- Scheduler activations used by older versions of the NetBSD native POSIX threads library implementation (an M:N model as opposed to a 1:1 kernel or userspace implementation model)
- Light-weight processes used by older versions of the Solaris operating system
- Marcel from the PM2 project.
- The OS for the Tera-Cray MTA-2
- The Glasgow Haskell Compiler (GHC) for the language Haskell uses lightweight threads which are scheduled on operating system threads.
History of threading models in Unix systems
SunOS 4.x implemented light-weight processes or LWPs. NetBSD 2.x+ and DragonFly BSD implement LWPs as kernel threads (1:1 model). SunOS 5.2 through SunOS 5.8, as well as NetBSD 2 to NetBSD 4, implemented a two-level model, multiplexing one or more user-level threads on each kernel thread (M:N model). SunOS 5.9 and later, as well as NetBSD 5, eliminated user threads support, returning to a 1:1 model.[12] FreeBSD 5 implemented the M:N model. FreeBSD 6 supported both 1:1 and M:N; users could choose which model a given program used via /etc/libmap.conf. Starting with FreeBSD 7, 1:1 became the default. FreeBSD 8 no longer supports the M:N model.
Single-threaded vs multithreaded programs
In computer programming, single-threading is the processing of one instruction at a time.[13] In the formal analysis of the variables' semantics and process state, the term single threading can be used differently to mean "backtracking within a single thread", which is common in the functional programming community.[14]
Multithreading is mainly found in multitasking operating systems. Multithreading is a widespread programming and execution model that allows multiple threads to exist within the context of one process. These threads share the process's resources, but are able to execute independently. The threaded programming model provides developers with a useful abstraction of concurrent execution. Multithreading can also be applied to one process to enable parallel execution on a multiprocessing system.
Multithreading libraries tend to provide a function call to create a new thread, which takes a function as a parameter. A concurrent thread is then created which starts running the passed function and ends when the function returns. The thread libraries also offer data synchronization functions.
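A minimal sketch of that pattern with POSIX threads (the function square and its argument are illustrative): the new thread starts in the passed function, ends when it returns, and the creating thread collects the result with pthread_join:

#include <pthread.h>
#include <stdio.h>

/* The function passed at thread creation: the thread starts here
   and ends when this function returns. */
static void *square(void *arg) {
    long n = (long)arg;
    return (void *)(n * n);       /* returned value is collected by pthread_join */
}

int main(void) {
    pthread_t tid;
    void *result;

    pthread_create(&tid, NULL, square, (void *)7L);  /* spawn the thread */
    pthread_join(tid, &result);                      /* wait for it to finish */
    printf("7 squared is %ld\n", (long)result);
    return 0;
}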
Threads and data synchronization
Threads in the same process share the same address space. This allows concurrently running code to couple tightly and conveniently exchange data without the overhead or complexity of an IPC. When shared between threads, however, even simple data structures become prone to race conditions if they require more than one CPU instruction to update: two threads may end up attempting to update the data structure at the same time and find it unexpectedly changing underfoot. Bugs caused by race conditions can be very difficult to reproduce and isolate.
To prevent this, threading application programming interfaces (APIs) offer synchronization primitives such as mutexes to lock data structures against concurrent access. On uniprocessor systems, a thread running into a locked mutex must sleep and hence trigger a context switch. On multi-processor systems, the thread may instead poll the mutex in a spinlock. Both of these may sap performance and force processors in symmetric multiprocessing (SMP) systems to contend for the memory bus, especially if the granularity of the locking is too fine.
Other synchronization APIs include condition variables, critical sections, semaphores, and monitors.
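A minimal POSIX-threads sketch of such locking (the counter and iteration count are illustrative): two threads increment a shared counter, and a mutex makes each read-modify-write atomic with respect to the other thread; without it, updates would frequently be lost:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                              /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* serialize access to the counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);   /* 2000000 with the mutex; usually less without */
    return 0;
}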
Thread pools
A popular programming pattern involving threads is that of thread pools where a set number of threads are created at startup and then wait for tasks to be assigned. When a new task arrives, an idle thread wakes up, completes the task and goes back to waiting. This avoids the relatively expensive thread creation and destruction functions for every task performed and takes thread management out of the application developer's hands, leaving it to a library or the operating system that is better suited to optimize thread management.
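The sketch below shows one way such a pool can be built with POSIX threads, under simplifying assumptions (fixed worker count, bounded circular task queue, no error handling); the names pool_submit and worker, and the sizes, are illustrative rather than any standard API:

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define QUEUE_CAP   64

typedef struct {
    void (*fn)(void *);   /* task body */
    void *arg;            /* task argument */
} task_t;

static task_t queue[QUEUE_CAP];
static int q_head = 0, q_tail = 0, q_len = 0;
static int shutting_down = 0;

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_nonfull  = PTHREAD_COND_INITIALIZER;

/* Enqueue a task; blocks while the queue is full (bounds resource usage). */
static void pool_submit(void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&q_lock);
    while (q_len == QUEUE_CAP)
        pthread_cond_wait(&q_nonfull, &q_lock);
    queue[q_tail] = (task_t){fn, arg};
    q_tail = (q_tail + 1) % QUEUE_CAP;
    q_len++;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

/* Worker loop: wait for a task, run it, go back to waiting. */
static void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_len == 0 && !shutting_down)
            pthread_cond_wait(&q_nonempty, &q_lock);
        if (q_len == 0 && shutting_down) {
            pthread_mutex_unlock(&q_lock);
            return NULL;
        }
        task_t t = queue[q_head];
        q_head = (q_head + 1) % QUEUE_CAP;
        q_len--;
        pthread_cond_signal(&q_nonfull);
        pthread_mutex_unlock(&q_lock);
        t.fn(t.arg);                  /* run the task outside the lock */
    }
}

static void print_task(void *arg) {
    /* pthread_t is opaque; the cast is only for illustrative output. */
    printf("task %ld handled by thread %lu\n",
           (long)arg, (unsigned long)pthread_self());
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);

    for (long i = 0; i < 20; i++)
        pool_submit(print_task, (void *)i);

    /* Signal shutdown and wait for workers to drain the queue and exit. */
    pthread_mutex_lock(&q_lock);
    shutting_down = 1;
    pthread_cond_broadcast(&q_nonempty);
    pthread_mutex_unlock(&q_lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}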
Multithreaded programs vs single-threaded programs pros and cons
Multithreaded applications have the following advantages over single-threaded ones:
- Responsiveness: multithreading can allow an application to remain responsive to input. In a one-thread program, if the main execution thread blocks on a long-running task, the entire application can appear to freeze. By moving such long-running tasks to a worker thread that runs concurrently with the main execution thread, it is possible for the application to remain responsive to user input while executing tasks in the background. On the other hand, in most cases multithreading is not the only way to keep a program responsive, with non-blocking I/O and/or Unix signals being available for obtaining similar results.[15]
- Parallelization: applications looking to use multicore or multi-CPU systems can use multithreading to split data and tasks into parallel subtasks and let the underlying architecture manage how the threads run, either concurrently on one core or in parallel on multiple cores. GPU computing environments like CUDA and OpenCL use the multithreading model where dozens to hundreds of threads run in parallel across data on a large number of cores. This, in turn, enables better system utilization, and (provided that synchronization costs don't eat the benefits up), can provide faster program execution.
Multithreaded applications have the following drawbacks:
- Synchronization complexity and related bugs: when using shared resources typical for threaded programs, the programmer must be careful to avoid race conditions and other non-intuitive behaviors. In order for data to be correctly manipulated, threads will often need to rendezvous in time in order to process the data in the correct order. Threads may also require mutually exclusive operations (often implemented using mutexes) to prevent common data from being read or overwritten in one thread while being modified by another. Careless use of such primitives can lead to deadlocks, livelocks or races over resources. As Edward A. Lee has written: "Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly non-deterministic, and the job of the programmer becomes one of pruning that nondeterminism."[16]
- Being untestable. In general, multithreaded programs are non-deterministic, and as a result, are untestable. In other words, a multithreaded program can easily have bugs which never manifest on a test system, manifesting only in production.[17][16] This can be alleviated by restricting inter-thread communications to certain well-defined patterns (such as message-passing).
- Synchronization costs. As a thread context switch on modern CPUs can cost up to 1 million CPU cycles,[18] writing efficient multithreaded programs is difficult. In particular, special attention has to be paid to keeping inter-thread synchronization from becoming too frequent.
Programming language support
[edit]Many programming languages support threading in some capacity.
- IBM PL/I(F) included support for multithreading (called multitasking) as early as the late 1960s, and this was continued in the Optimizing Compiler and later versions. The IBM Enterprise PL/I compiler introduced a new "thread" model API. Neither version was part of the PL/I standard.
- Many implementations of C and C++ support threading, and provide access to the native threading APIs of the operating system. A standardized interface for thread implementation is POSIX Threads (Pthreads), which is a set of C-function library calls. OS vendors are free to implement the interface as desired, but the application developer should be able to use the same interface across multiple platforms. Most Unix platforms, including Linux, support Pthreads. Microsoft Windows has its own set of thread functions in the process.h interface for multithreading, like beginthread.
- Some higher level (and usually cross-platform) programming languages, such as Java, Python, and .NET Framework languages, expose threading to developers while abstracting the platform specific differences in threading implementations in the runtime. Several other programming languages and language extensions also try to abstract the concept of concurrency and threading from the developer fully (Cilk, OpenMP, Message Passing Interface (MPI)). Some languages are designed for sequential parallelism instead (especially using GPUs), without requiring concurrency or threads (Ateji PX, CUDA).
- A few interpreted programming languages have implementations (e.g., Ruby MRI for Ruby, CPython for Python) which support threading and concurrency but not parallel execution of threads, due to a global interpreter lock (GIL). The GIL is a mutual exclusion lock held by the interpreter that prevents the interpreter from interpreting the application's code on two or more threads at once. This effectively limits the parallelism on multiple-core systems. It also limits performance for processor-bound threads (which require the processor), but doesn't affect I/O-bound or network-bound ones as much. Other implementations of interpreted programming languages, such as Tcl using the Thread extension, avoid the GIL limit by using an Apartment model where data and code must be explicitly "shared" between threads. In Tcl each thread has one or more interpreters.
- In programming models such as CUDA designed for data parallel computation, an array of threads runs the same code in parallel, with each thread using only its ID to find its data in memory. In essence, the application must be designed so that each thread performs the same operation on different segments of memory so that they can operate in parallel and use the GPU architecture.
- Hardware description languages such as Verilog have a different threading model that supports extremely large numbers of threads (for modeling hardware).
See also
References
- ^ Lamport, Leslie (September 1979). "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" (PDF). IEEE Transactions on Computers. C-28 (9): 690–691. Bibcode:1979ITCmp.100..690L. doi:10.1109/tc.1979.1675439. S2CID 5679366.
- ^ Tanenbaum, Andrew S. (1992). Modern Operating Systems. Prentice-Hall International Editions. ISBN 0-13-595752-4.
- ^ Saltzer, Jerome Howard (July 1966). Traffic Control in a Multiplexed Computer System (PDF) (Doctor of Science thesis). p. 20.
- ^ "Mach: A New Kernel Foundation for UNIX Development" (PDF). 1986 Summer USENIX Technical Conference & Exhibition Proceedings: 93. 1986.
- ^ "Microsoft Operating System/2 Programmer's Guide" (PDF).
- ^ Sutter, Herb (March 2005). "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Dr. Dobb's Journal. 30 (3).
- ^ "Erlang: 3.1 Processes".
- ^ Ignatchenko, Sergey. Eight Ways to Handle Non-blocking Returns in Message-passing Programs: from C++98 via C++11 to C++20. CPPCON. Archived from the original on 2020-11-25. Retrieved 2020-11-24.
- ^ Ferat, Manuel; Pereira, Romain; Roussel, Adrien; Carribault, Patrick; Steffenel, Luiz-Angelo; Gautier, Thierry (September 2022). "Enhancing MPI+OpenMP Task Based Applications for Heterogeneous Architectures with GPU support" (PDF). OpenMP in a Modern World: From Multi-device Support to Meta Programming. IWOMP 2022: 18th International Workshop on OpenMP. Lecture Notes in Computer Science. Vol. 13527. pp. 3–16. doi:10.1007/978-3-031-15922-0_1. ISBN 978-3-031-15921-3. S2CID 251692327.
- ^ Iwasaki, Shintaro; Amer, Abdelhalim; Taura, Kenjiro; Seo, Sangmin; Balaji, Pavan. BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads (PDF). The 28th International Conference on Parallel Architectures and Compilation Techniques.
- ^ a b c d Silberschatz, Abraham; Galvin, Peter Baer; Gagne, Greg (2013). Operating system concepts (9th ed.). Hoboken, N.J.: Wiley. pp. 170–171. ISBN 9781118063330.
- ^ "Multithreading in the Solaris Operating Environment" (PDF). 2002. Archived from the original (PDF) on February 26, 2009.
- ^ Menéndez, Raúl; Lowe, Doug (2001). Murach's CICS for the COBOL Programmer. Mike Murach & Associates. p. 512. ISBN 978-1-890774-09-7.
- ^ O'Hearn, Peter William; Tennent, R. D. (1997). ALGOL-like languages. Vol. 2. Birkhäuser Verlag. p. 157. ISBN 978-0-8176-3937-2.
- ^ Ignatchenko, Sergey (August 2010). "Single-Threading: Back to the Future?". Overload (97). ACCU: 16–19.
- ^ a b Lee, Edward (January 10, 2006). "The Problem with Threads". UC Berkeley.
- ^ Ignatchenko, Sergey (August 2015). "Multi-threading at Business-logic Level is Considered Harmful". Overload (128). ACCU: 4–7.
- ^ 'No Bugs' Hare (12 September 2016). "Operation Costs in CPU Clock Cycles".
Further reading
- David R. Butenhof: Programming with POSIX Threads, Addison-Wesley, ISBN 0-201-63392-2
- Bradford Nichols, Dick Buttlar, Jacqueline Proulx Farell: Pthreads Programming, O'Reilly & Associates, ISBN 1-56592-115-1
- Paul Hyde: Java Thread Programming, Sams, ISBN 0-672-31585-8
- Jim Beveridge, Robert Wiener: Multithreading Applications in Win32, Addison-Wesley, ISBN 0-201-44234-5
- Uresh Vahalia: Unix Internals: the New Frontiers, Prentice Hall, ISBN 0-13-101908-2
Fundamentals
Definition and Characteristics
In computing, a thread is defined as the smallest sequence of programmed instructions that can be scheduled for execution, serving as the basic unit of CPU utilization within an operating system.[1] It consists of essential components including a program counter to track the current instruction, a stack for managing local variables and function calls, a set of registers for temporary data storage, and a unique thread identifier.[7] Unlike larger execution units, threads enable fine-grained control over concurrent operations by allowing multiple such sequences to run seemingly simultaneously under the management of the scheduler.[8]

Threads exhibit key characteristics that distinguish them as lightweight entities compared to full processes. They share the resources of their parent process, such as the address space, code segment, data, open files, and signals, which minimizes overhead in creation and context switching.[1] This resource sharing facilitates concurrency within a single process, allowing multiple threads to perform independent tasks like handling user input, processing data, or managing I/O operations without the expense of duplicating process-level resources.[5] On multi-core systems, threads support true parallelism, where distinct cores can execute different threads simultaneously, enhancing overall system throughput for compute-intensive applications.[9]

Threads operate as subunits of processes, providing a mechanism for intra-process concurrency in contrast to standalone processes, which maintain isolated address spaces and resources for inter-process independence.[10] While processes represent complete programs with their own memory and execution context, threads leverage the process's environment to focus solely on execution flow, making them more efficient for modular, responsive program design.[11]

The basic lifecycle of a thread encompasses creation, execution, blocking, and termination. A thread is created (or spawned) by allocating its private components like the stack and registers within an existing process, transitioning to a ready or runnable state awaiting scheduler assignment.[12] During execution, it runs on the CPU until it completes, yields control voluntarily, or is preempted. Blocking occurs when the thread waits for an event, such as I/O completion or a resource lock, suspending its progress without terminating the process. Finally, termination happens upon finishing its instructions or explicit exit, freeing its resources while the process may continue with other threads.[13]

Threads Within Processes
In computing, a process serves as a container for one or more threads, where each thread represents a distinct sequence of instructions that can execute independently within the shared context of the process. All threads belonging to the same process share the process's address space, including code, data, and heap segments, as well as resources such as open files and global variables. This structural integration allows threads to operate as lightweight subunits of the process, enabling concurrent execution without the full overhead of separate process creation.[14][15]

The sharing of resources among threads within a process provides significant benefits, particularly in facilitating efficient communication and reducing operational overhead. By accessing the same memory space, threads can exchange data directly without the need for inter-process communication mechanisms like pipes or message queues, which are slower and more complex. This shared access minimizes the cost of thread creation and context switching compared to spawning new processes, as threads do not require duplicating the entire address space or reinitializing system resources.[15][1]

Despite this sharing, each thread maintains certain private elements to ensure isolation and prevent mutual interference during execution. Specifically, every thread has its own stack for local variables and function calls, a unique thread ID for identification by the operating system or runtime library, and an independent execution state including the program counter, registers, and stack pointer. These private components allow threads to maintain distinct control flows while leveraging the process's common resources.[14][15]

A practical example of threads operating within a process is found in a multithreaded web server, where a single server process manages multiple client requests concurrently. Each incoming request is assigned to a separate thread, which uses the shared code base and global data structures (such as a request queue) for processing, but executes on its own private stack to handle unique client-specific operations like parsing HTTP headers or generating responses. This approach enables the server to scale efficiently with increasing traffic without creating separate processes for each connection.[15][1]

Thread Types
Kernel-Level Threads
Kernel-level threads are threads managed directly by the operating system kernel, where the kernel creates and maintains a thread control block (TCB) for each thread to track its state, registers, stack, and scheduling information. Unlike user-level threads, which are handled by a user-space library invisible to the kernel, kernel-level threads are fully recognized by the operating system, allowing the kernel to treat them as schedulable entities akin to lightweight processes. This design requires system calls for thread creation (e.g., via APIs like pthread_create in POSIX), termination, and synchronization, with the kernel allocating kernel resources such as memory and handling synchronization primitives.[16][17]

In implementation, the kernel performs context switches between kernel-level threads by saving and restoring the TCB contents, which involves mode switches from user to kernel space and back, enabling the scheduler to preempt or yield threads based on priorities or time slices. This kernel involvement ensures that threads from the same process can run concurrently on multiprocessor systems, as the scheduler can assign them to different CPUs for true parallelism and load balancing. Furthermore, when a kernel-level thread blocks on an I/O operation or system call, the kernel immediately schedules another ready thread from the same process, avoiding suspension of the entire process.[16][18][17]

The primary advantages of kernel-level threads stem from their integration with the OS scheduler, providing efficient support for blocking operations and multiprocessor utilization without the limitations of user-level threads, where a single blocking call can stall all threads in a process. They facilitate better resource allocation, as the kernel can migrate threads across processors to optimize performance. However, these benefits come with significant drawbacks, including higher overhead for thread operations due to mandatory kernel transitions; for instance, each context switch incurs costs averaging around 3.8 microseconds for direct switching time, plus additional indirect costs from cache and TLB flushes that can extend total overhead to several microseconds or more. This makes kernel-level threads slower to create and switch compared to user-level alternatives, potentially impacting performance in applications with frequent thread management.[18][16]

Examples of kernel-level threads include the native threading implementation in Windows NT and later versions, where threads are scheduled directly by the kernel executive, and Linux's POSIX threads (pthreads), which map to kernel-level entities via the clone system call, enabling the OS to schedule them independently. These implementations are standard in modern Unix-like systems and Windows, supporting scalable concurrency for server applications and parallel computing tasks.[17][16]

User-Level Threads
User-level threads are execution units managed entirely by a user-space threading library within a process, without any awareness or involvement from the operating system's kernel; from the kernel's perspective, the process appears as a single thread of execution.[19] This approach allows multiple user-level threads to share the same address space and resources of the process, enabling efficient concurrency at the application level.

In implementation, the user-level library maintains thread control blocks (TCBs) in user space to track each thread's state, stack, and registers. The library multiplexes these threads onto one or a limited number of kernel entities, such as a single kernel thread or lightweight process, handling creation, scheduling, synchronization, and switching through library calls rather than system calls.[20] This user-space management avoids kernel intervention for routine operations, allowing thread switches to occur as simple procedure calls.

The primary advantages of user-level threads stem from their isolation from kernel overhead: thread creation and context switching are significantly faster, often by orders of magnitude compared to kernel-level threads, since they bypass system calls and mode switches.[20] Additionally, applications gain flexibility in scheduling, as the library can enforce custom policies optimized for specific workloads, such as priority-based or application-specific algorithms, without relying on the kernel's general-purpose scheduler.[19]

However, user-level threads have notable drawbacks. A blocking system call by one thread, such as for I/O, can suspend the entire process in the kernel, stalling all other threads and undermining concurrency.[19] They also exhibit limited scalability on multiprocessor systems, as the kernel schedules the hosting process as a monolithic unit, preventing true parallelism across multiple CPUs despite multiple user-level threads.[21] In contrast to kernel-level threads, this lack of kernel visibility leads to suboptimal resource allocation by the OS.

Examples of user-level thread implementations include green threads, which were employed in early versions of the Java Virtual Machine (JVM) to provide portable, lightweight concurrency without native OS support.[22] Another is the GNU Portable Threads (Pth) library, a POSIX/ANSI-C-based system for Unix platforms that supports non-preemptive, priority-based scheduling of multiple threads in user space.[23]

Lightweight Threads and Fibers
Fibers represent an ultra-lightweight form of execution context in computing, designed for manual control without reliance on operating system scheduling. Unlike traditional threads, fibers consist primarily of a user-allocated stack and a minimal execution state, such as processor registers and a program counter, omitting the full thread control block (TCB) found in kernel or user-level threads. This structure enables them to serve as building blocks for cooperative multitasking, where execution flows are paused and resumed explicitly by the application code, often to implement coroutines or asynchronous operations within a single host thread.[24]

Key characteristics of fibers include their stack-only footprint and programmer-driven switching, typically via API calls that save the current context and restore another without kernel intervention. They share the address space and resources of the hosting thread, making them ideal for scenarios like event loops or non-blocking I/O, where multiple logical execution paths must interleave without the overhead of full thread creation. For instance, in asynchronous programming, a fiber can yield control during I/O waits, allowing other fibers to proceed on the same thread. Fibers thus approximate coroutines, providing a lightweight alternative to heavier user-level threads, which incorporate an automated scheduler.[25][26]

The primary advantages of fibers lie in their minimal resource consumption and fast context switching, with overhead limited to the specified stack size—often just a few kilobytes per fiber—compared to the megabytes typical for OS threads. This efficiency supports high-concurrency applications, such as servers handling thousands of connections, by enabling massive parallelism within limited memory and without the latency of kernel-mode transitions. Studies on lightweight threading frameworks highlight how such mechanisms can outperform OS threads in nested parallelism and task-heavy workloads by reducing scheduling costs.[27][28]

However, fibers have notable limitations due to their cooperative nature: they lack preemption, requiring explicit yields from the programmer, which can lead to starvation if one fiber performs prolonged computation without relinquishing control. Blocking system calls within a fiber will suspend the entire hosting thread, potentially undermining concurrency, and they do not constitute true parallel execution since all fibers serialize on the host thread. These constraints make fibers unsuitable for CPU-bound tasks needing automatic time-slicing.[29][24]

Prominent examples include the Windows Fibers API, introduced in Windows NT, which provides functions like CreateFiber to allocate a fiber with a custom stack and SwitchToFiber for explicit context switching, facilitating user-mode multitasking in applications like database servers. In Lua, coroutines offer fiber-like functionality through primitives such as coroutine.create and coroutine.resume, enabling efficient implementation of generators, simulators, and cooperative schedulers in scripting environments.[25][30]
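As a sketch of the explicit-yield style described above, the POSIX ucontext interface (obsolescent in recent POSIX editions but still available on common systems) can express a single fiber with its own stack that switches back and forth with the main context; the names and the stack size are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_func(void) {
    printf("fiber: step 1\n");
    /* Explicitly yield back to the main context (cooperative scheduling). */
    swapcontext(&fiber_ctx, &main_ctx);
    printf("fiber: step 2\n");
    /* Returning ends the fiber; uc_link resumes main_ctx. */
}

int main(void) {
    char *stack = malloc(64 * 1024);      /* fiber's private stack */

    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = 64 * 1024;
    fiber_ctx.uc_link = &main_ctx;        /* where to go when the fiber returns */
    makecontext(&fiber_ctx, fiber_func, 0);

    printf("main: resuming fiber\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run fiber until it yields */
    printf("main: fiber yielded, resuming again\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run fiber to completion */
    printf("main: fiber finished\n");
    free(stack);
    return 0;
}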
Threading Models

One-to-One Model
The one-to-one threading model establishes a direct correspondence between each user-level thread and a kernel-level thread, such that creating a user thread in an application automatically spawns a corresponding kernel thread managed by the operating system.[31] In this approach, there is no multiplexing of user threads onto fewer kernel threads; instead, every user thread has its own dedicated kernel entity, enabling the kernel to treat them as independent schedulable units.[1]

Mechanically, the kernel assumes full responsibility for scheduling, context switching, and resource allocation for all threads, bypassing any user-space thread library involvement in these operations.[32] This direct mapping ensures that system calls and interrupts are handled at the kernel level without interference from other user threads in the same process, as each kernel thread operates autonomously.[33]

A key advantage of the one-to-one model is its ability to fully utilize multiprocessor systems, allowing multiple threads to execute in true parallel across available CPU cores without contention from shared kernel resources.[34] Additionally, if one thread blocks on a system call or I/O operation, the kernel can immediately schedule another thread from the same process, preventing the entire process from stalling and enhancing overall concurrency.[35] However, this model is resource-intensive because each user thread incurs the full overhead of kernel thread creation, including memory for thread control blocks and kernel stacks, which can limit scalability to typically a few thousand threads per process before performance degrades significantly.[36] Context switches between kernel threads also involve higher costs due to mode transitions between user and kernel space.[37]

Prominent implementations include the Native POSIX Thread Library (NPTL) in modern Linux kernels starting from version 2.6, which adopted this model for POSIX threads (pthreads) to improve scalability and standards compliance.[32] Windows operating systems, from Windows NT onward, employ a one-to-one model where user threads map directly to kernel threads managed by the executive scheduler.[25] Similarly, Solaris versions 9 and later transitioned to this model as the default for lightweight processes (LWPs), replacing earlier hybrid approaches to simplify thread management and boost performance on multiprocessors.[31]

Many-to-One Model
The many-to-one threading model, also known as the M:1 model, involves mapping multiple user-level threads to a single kernel-level thread, with thread creation, scheduling, and management performed entirely within a user-space library.[1] In this approach, the operating system's kernel remains unaware of the individual user threads and treats the process as having only one schedulable entity, the kernel thread. The user-level thread library implements its own scheduler to multiplex the user threads onto this single kernel thread, handling context switches and resource allocation without invoking kernel services for each operation.[1]

This model offers significant advantages in terms of efficiency and low overhead. Thread operations such as creation, termination, and switching occur in user space, avoiding the expense of kernel traps and system calls, which results in faster execution compared to kernel-managed threading.[38] Context switching between user threads is particularly rapid, as it involves minimal state saving and restoration without kernel involvement, making it suitable for applications requiring high concurrency with limited resources.[38]

However, the many-to-one model has notable drawbacks that limit its scalability and reliability. A critical issue arises during blocking system calls: if the single kernel thread blocks—for instance, on I/O operations—all multiplexed user threads are effectively blocked, as the kernel cannot schedule any of them independently.[1] Additionally, it provides no parallelism on multiprocessor systems, since only one kernel thread executes at a time, preventing true concurrent execution across multiple CPUs.[38] In contrast to the one-to-one model, which maps each user thread directly to a kernel thread for better blocking behavior and multiprocessor support, the many-to-one approach prioritizes low-cost multiplexing at the expense of these limitations.[1]

Examples of the many-to-one model include early implementations of POSIX threads (pthreads), where user-level libraries managed multiple threads atop a single kernel entity before the adoption of native kernel support.[39] It is also exemplified by green threads in early Java virtual machines, such as those in Solaris, which allowed applications to create numerous lightweight threads without kernel overhead.[40] While largely deprecated in modern general-purpose operating systems due to its blocking and scalability issues, the model persists in some embedded systems and runtime environments where resource constraints favor user-space control.[38]

Many-to-Many Model
The many-to-many threading model, also known as the two-level or hybrid model, implements an M:N mapping where multiple user-level threads (M) are dynamically assigned to multiple kernel-level threads (N), with N typically equal to or fewer than M to optimize resource usage.[41] This approach employs a hybrid scheduler that operates partly in user space and partly in kernel space, allowing the operating system to create a sufficient number of kernel threads to support parallelism without requiring a one-to-one correspondence for every user thread.[41]

In terms of mechanics, a user-level thread library multiplexes the user threads onto the available kernel threads, managing context switches and scheduling decisions at the user level for efficiency, while the kernel scheduler handles the execution of the kernel threads across available processors.[41] When a user thread blocks—such as during I/O operations—the user-level scheduler can reassign other runnable user threads to different kernel threads, preventing the entire process from stalling; this is facilitated by mechanisms like scheduler activations, where the kernel notifies the user-level scheduler of events such as thread blocking or processor availability.[42] The dynamic assignment ensures that the mapping adapts to system conditions, balancing load across kernel threads without kernel awareness of individual user threads.[43]

This model offers advantages in balancing the efficiency of user-level threading—such as low-cost context switching—with the parallelism of kernel-level threading, enabling better utilization of multiprocessor systems and handling blocking operations without halting the whole process.[41] It provides flexibility for applications needing many lightweight threads while avoiding the overhead of creating excessive kernel threads.[42] However, drawbacks include the complexity of implementation, as coordinating the user-kernel interaction requires sophisticated synchronization to maintain mapping integrity, and potential overhead from maintaining and updating the thread-to-thread mappings during runtime.[41]

Examples of the many-to-many model include the Kernel Scheduled Entities (KSE) system in FreeBSD 5.x, which supported hybrid user-kernel threading through kernel-provided scheduling entities that allowed user libraries to manage multiple threads across kernel threads.[43] Early versions of Solaris, prior to Solaris 9, utilized this model to map user threads to lightweight processes (kernel threads) for improved concurrency.[44] Modern variants appear in Java's virtual machine threads, introduced in Java 21, where virtual threads are lightweight constructs scheduled by the JVM onto a pool of carrier threads (platform threads) in an M:N fashion, enhancing scalability for high-throughput applications.[45]

Scheduling Mechanisms
Preemptive and Cooperative Scheduling
In cooperative scheduling, threads voluntarily relinquish control to the scheduler, typically through explicit yield calls or when blocking on I/O or synchronization primitives, allowing the next thread to execute without interruption from the operating system.[46] This approach simplifies implementation at the user level, as it avoids the need for kernel intervention during thread execution, resulting in lower overhead from context switches and better predictability for short, well-behaved tasks. However, it risks system unresponsiveness if a thread enters an infinite loop or performs lengthy computations without yielding, potentially leading to starvation of other threads or even application hangs.[47]

Preemptive scheduling, in contrast, enables the operating system to forcibly interrupt a running thread at any time, typically via hardware timer interrupts or when a higher-priority thread becomes ready, ensuring that no single thread can monopolize the CPU.[48] This mechanism promotes fairness and responsiveness by allowing context switches based on time slices or priority levels, making it suitable for interactive and real-time applications where predictable progress across threads is essential.[49] The scheduler evaluates thread priorities periodically—often every few milliseconds—and preempts the current thread if necessary, saving its state and loading the next one from the ready queue.[50]

The primary trade-offs between these approaches lie in overhead and reliability: cooperative scheduling incurs minimal runtime costs since threads manage their own switching, but it demands cooperative behavior from all threads, which is fragile in the presence of bugs or varying workloads.[51] Preemptive scheduling, while introducing overhead from frequent interrupts and context switches, guarantees progress and prevents monopolization, though it can complicate debugging due to non-deterministic execution order.[52] Hybrid models, such as those mixing cooperative execution within preemptive boundaries, have been explored to balance these aspects, but pure forms dominate standard implementations.

Most modern operating systems, including Linux and Windows, employ preemptive scheduling for kernel-level threads to ensure robust multitasking, with timer-driven interrupts enforcing time slices typically ranging from 1 to 100 milliseconds depending on priority.[48][25] In contrast, cooperative scheduling persists in lightweight constructs like Windows fibers, where execution is non-preemptive and divided manually among related tasks within a single kernel thread, offering efficiency for domain-specific concurrency without OS involvement.[53] Early systems, such as Windows 3.x, relied heavily on cooperative models but transitioned to preemptive for better stability in multithreaded environments.[54]

Scheduling on Uniprocessor and Multiprocessor Systems
In uniprocessor systems, thread scheduling achieves concurrency through time-slicing, where the operating system allocates fixed quanta of CPU time—typically 10 to 100 milliseconds—to each ready thread before performing a context switch to the next. This mechanism serializes execution on the single CPU, creating the illusion of simultaneous progress among multiple threads despite their inherent sequential processing. Context switches involve saving the state of the current thread (such as registers and program counter) and restoring the state of the next, incurring overhead that the scheduler minimizes by selecting appropriate quantum sizes.[55][56]

A common algorithm for uniprocessor thread scheduling is round-robin, which cycles through ready threads in a fixed order, granting each one a time slice in turn to ensure fair CPU sharing and prevent indefinite blocking of lower-priority threads. This approach is particularly effective for interactive applications, as it maintains responsiveness by regularly preempting threads, though excessive context switches can degrade performance if quanta are too short. In practice, the scheduler maintains a ready queue to manage thread selection, prioritizing based on factors like virtual runtime to approximate fairness.[55][57]

Multiprocessor systems extend uniprocessor techniques by distributing threads across multiple CPUs, incorporating processor affinity to bind threads to specific processors and preserve locality in caches and memory. Affinity scheduling reduces migration-induced overhead by favoring the last processor a thread ran on, thereby minimizing cache misses and improving data access speeds. Load balancing complements this by dynamically migrating threads from overloaded CPUs to idle ones, using techniques like periodic polling of runqueue lengths to maintain even utilization across processors.[58][59]

Key challenges in multiprocessor scheduling include the loss of cache affinity during thread migration, which invalidates cached data and increases latency, and considerations for Non-Uniform Memory Access (NUMA) architectures in large-scale systems where memory access times vary significantly between local and remote nodes. In NUMA environments, schedulers must balance load while aligning thread placement with memory affinity to avoid remote access penalties, often employing topology-aware policies to map threads to nodes with optimal data locality. These issues can amplify if not addressed, leading to suboptimal performance in memory-intensive workloads.[58][60][61]

For multiprocessor load balancing, work-stealing queues provide an efficient algorithm where each processor maintains a double-ended queue of tasks, allowing idle processors to "steal" tasks from the tail of busy processors' queues to equalize workload without centralized coordination. This decentralized approach, with expected O(1) steal operations, scales well to many cores and is exemplified in Java's Fork/Join framework, where it supports recursive task partitioning for parallel computations. Work-stealing minimizes synchronization overhead and adapts dynamically to varying thread loads, outperforming global queues in reducing contention.[62]

Modern operating systems illustrate these concepts in practice.
The Linux Earliest Eligible Virtual Deadline First (EEVDF) scheduler, which superseded the Completely Fair Scheduler (CFS) in kernel version 6.6 (2023), manages both uniprocessor and multiprocessor scenarios using per-CPU runqueues and a virtual deadline metric to select threads, ensuring fairness while supporting symmetric multiprocessing through load balancing and affinity hints.[63] Similarly, the Windows scheduler employs multiprocessor-aware policies, including ideal processor selection and soft affinity, to assign threads across CPUs, grouping related threads for efficient handling in symmetric environments. These implementations typically integrate preemptive scheduling to interrupt threads at timer intervals, enhancing responsiveness in multithreaded applications.[64]
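As an illustration of processor affinity on Linux with glibc (pthread_attr_setaffinity_np and sched_getcpu are GNU extensions, so this is a platform-specific sketch rather than portable POSIX), a worker thread is pinned to CPU 0 before it starts:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                          /* restrict the thread to CPU 0 */

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_create(&t, &attr, worker, NULL);   /* thread starts already pinned */
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}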
Multithreading Applications

Single-Threaded Versus Multithreaded Programs
Single-threaded programs operate using a single thread of execution within a process, resulting in sequential processing where tasks are handled one at a time without overlap.[4] This model limits concurrency, as the program cannot perform multiple operations simultaneously, making it suitable for straightforward, linear workflows.[1] For instance, command-line interface (CLI) tools like text file processors or basic scripts exemplify this approach, executing operations such as reading input, performing computations, and producing output in a strictly ordered manner.

In multithreaded programs, multiple threads share the same process address space and resources, enabling parallel execution of tasks that can overlap in time.[4] This concurrency is advantageous for I/O-bound tasks, where one thread can wait for input/output operations while others proceed, or for CPU-bound tasks that benefit from distribution across available processors.[1] Web servers represent a common use case, where separate threads handle incoming client requests concurrently, improving responsiveness without blocking the entire program on a single operation.[4] Similarly, applications like games or scientific simulations leverage multithreading to update graphics, process user inputs, and run background computations in parallel.

The design implications differ markedly between the two models. Single-threaded programs require no synchronization mechanisms, as there is no risk of concurrent access to shared data, simplifying development and debugging while avoiding potential issues like race conditions.[65] Multithreaded programs, by contrast, demand thread-safe code to manage shared resources, necessitating techniques such as mutexes or atomic operations to ensure data integrity across threads.[66] Transitioning from a single-threaded to a multithreaded design involves identifying independent tasks, then using thread creation APIs—like pthread_create in POSIX environments—to spawn and manage additional threads, thereby refactoring the program for concurrency.
Benefits and Drawbacks of Multithreading
Multithreading offers several key advantages in computing systems, primarily by enabling concurrent execution within a single process. One primary benefit is improved responsiveness, particularly in interactive applications, where a user-interface thread can continue processing events while other threads handle time-consuming operations like I/O or computations, preventing the application from appearing frozen.[67] Another advantage is efficient resource sharing, as threads within the same process access a common memory space and resources without the need for inter-process communication mechanisms, which reduces overhead compared to separate processes.[67] Additionally, multithreading enhances resource utilization on multiprocessor systems by allowing threads to execute in parallel across multiple cores, thereby exploiting hardware parallelism for CPU-bound tasks.[67] This economy of creation and switching—thread management is significantly cheaper than process management—further contributes to overall system efficiency.[67]
Despite these benefits, multithreading introduces notable drawbacks that can complicate software development and performance. A major issue is increased programming complexity, stemming from the need to manage shared resources and avoid problems like race conditions, where concurrent access to data leads to unpredictable outcomes.[10] Threads exhibit inherent nondeterminism due to unpredictable interleaving of executions, making program behavior difficult to reason about or verify, as even simple designs can produce varying results across runs.[10] Debugging multithreaded code is particularly challenging, as concurrency bugs such as deadlocks may remain undetected despite thorough testing; one reported case persisted undetected for four years in a mature system.[10]
Overhead associated with thread management also poses performance challenges. Creating and switching threads incurs costs, with context switching for kernel-level threads typically requiring several microseconds; direct costs averaged around 3.8 μs in 2007 measurements, but total costs can exceed 1,000 μs when cache misses from data-access patterns are factored in.[68] This overhead can degrade performance in scenarios with frequent switches, particularly in applications without sufficient parallelism.[68] Furthermore, scalability is limited by Amdahl's law, which states that the maximum speedup from multithreading is constrained by the fraction of the program that remains sequential, so that even with many processors the non-parallelizable portion bottlenecks overall gains—for instance, if 5% of a task is sequential, the theoretical speedup is capped at 20× regardless of core count (see the formula below).[69]
Multithreading is particularly well suited to I/O-intensive applications, where threads can overlap waiting periods to maintain responsiveness, but it offers limited value for purely CPU-bound workloads on single-core systems, where switching overhead outweighs the benefits of parallelism.[67] In contrast to single-threaded programs, which provide deterministic execution but poor concurrency, multithreaded designs trade simplicity for these performance gains in suitable contexts.[67]
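Stated as a formula (a standard form of Amdahl's law, where p denotes the parallelizable fraction of the work and n the number of processors):

\[ S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}. \]

With p = 0.95 (5% of the work sequential), the limit is 1/0.05 = 20, matching the cap cited above.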
Thread Pools and Resource Management
A thread pool is a collection of pre-initialized worker threads that execute tasks concurrently, allowing applications to manage concurrency without repeatedly creating and destroying threads. This design pattern addresses the high cost of thread lifecycle management in multithreaded environments by maintaining a fixed or dynamic set of reusable threads. Tasks are submitted to the pool via a queue, and idle threads are assigned to process them as they become available.[70][41]
The mechanics of thread pools involve a central task queue where incoming work units, such as runnables or callables, are enqueued for execution. Worker threads continuously poll this queue, dequeuing and performing tasks until the queue is empty or they are signaled to terminate. To prevent resource exhaustion, pools enforce bounds on their size—typically a minimum number of always-active threads and a maximum to cap concurrency—ensuring that the system does not spawn excessive threads during peak loads. For example, if the pool reaches its maximum size, additional tasks wait in the queue rather than triggering new thread creation. Dynamic pools may adjust thread counts based on demand, while fixed pools maintain a constant size for predictability.[71][72]
Thread pools offer significant benefits by minimizing the overhead of thread creation, which involves kernel allocation and context switching, thereby improving performance in scenarios with frequent short-lived tasks. They also provide fine-grained control over resource usage: by limiting the maximum number of threads—often tuned to the number of CPU cores—pools avoid overwhelming the system with too much parallelism, reducing context-switching costs and memory footprint. This approach enhances scalability in server applications, where unbounded threading could lead to thrashing under high load.[70][41][73]
Implementations typically rely on executor frameworks that abstract thread management, task scheduling, and rejection policies for overflow conditions. Developers configure parameters such as core pool size, maximum pool size, and queue capacity, often monitoring metrics such as queue length and thread utilization to tune sizing for specific workloads—for example, aligning the pool with the available CPU cores for compute-intensive tasks. In the Java ecosystem, the ExecutorService interface, backed by ThreadPoolExecutor, serves as a standard implementation for submitting tasks and shutting down pools gracefully; a configuration sketch follows. The Apache Tomcat server employs shared thread pools via its Executor component to process HTTP requests, configurable with minimum and maximum thread counts to balance throughput and responsiveness.[71][74][72]
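A minimal Java sketch of such a configuration using ThreadPoolExecutor; the specific sizes, timeout, and queue capacity are illustrative values, not recommendations:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A bounded pool with explicit core size, maximum size, and queue capacity,
// matching the tuning parameters described above.
public class PoolDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4,                                 // core pool size: threads kept alive when idle
                8,                                 // maximum pool size: upper bound under load
                30, TimeUnit.SECONDS,              // idle timeout for threads above the core size
                new ArrayBlockingQueue<>(100));    // bounded task queue; overflow triggers the
                                                   // pool's rejection policy (AbortPolicy by default)
        for (int i = 0; i < 50; i++) {
            final int taskId = i;
            pool.submit(() -> System.out.println(
                    "task " + taskId + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown();                           // stop accepting tasks, let queued work drain
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```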
Synchronization and Concurrency
Thread Synchronization Techniques
Thread synchronization techniques are essential mechanisms in multithreaded computing for coordinating access to shared resources among concurrent threads, ensuring data consistency and preventing race conditions. These techniques provide primitives for mutual exclusion, signaling, and coordination, allowing threads to operate safely in parallel without interfering with each other. Common approaches include locking mechanisms, signaling tools, and lock-free methods, each suited to specific scenarios such as protecting critical sections or managing producer-consumer interactions.
Mutexes, or mutual-exclusion locks, are fundamental synchronization primitives that enforce exclusive access to a shared resource by allowing only one thread at a time to hold the lock. A thread acquires the mutex before entering a critical section and releases it upon exit, blocking other threads attempting to acquire the same mutex until it becomes available. In POSIX threads (pthreads), the pthread_mutex_lock() function implements this by atomically blocking the calling thread if the mutex is already locked, while pthread_mutex_unlock() releases it, potentially waking a waiting thread according to the system's scheduling policy.[75] Mutexes are designed as low-level building blocks for higher-level synchronization, with implementations often relying on hardware support for atomic operations to minimize overhead.[75]
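The same pattern expressed as a Java sketch, with ReentrantLock standing in for a pthreads mutex; the shared counter is an illustrative resource:

```java
import java.util.concurrent.locks.ReentrantLock;

// Only one thread at a time executes the critical section that updates the counter.
public class MutexDemo {
    private static final ReentrantLock lock = new ReentrantLock();
    private static long counter = 0;

    static void increment() {
        lock.lock();            // blocks if another thread holds the lock
        try {
            counter++;          // critical section: exclusive access to shared state
        } finally {
            lock.unlock();      // always release, even if the critical section throws
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> { for (int j = 0; j < 100_000; j++) increment(); });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(counter); // 400000, with no lost updates
    }
}
```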
Semaphores extend mutex functionality by supporting counting, enabling synchronization for scenarios involving multiple resources or producer-consumer patterns. Introduced by Edsger Dijkstra in the 1960s, a semaphore maintains a non-negative integer counter that threads can increment with a signal operation (V) or decrement with a wait operation (P), blocking if the counter is zero.[76] Binary semaphores, with a counter limited to 0 or 1, function similarly to mutexes for mutual exclusion, while counting semaphores allow up to N threads to access a resource pool. In practice, semaphores coordinate bounded buffers in producer-consumer systems, where producers signal availability and consumers wait for data without busy-waiting.[76]
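A sketch of the bounded-buffer pattern in Java, assuming java.util.concurrent.Semaphore as the counting-semaphore implementation; the capacity and item type are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Bounded buffer coordinated by semaphores: "empty" counts free slots, "full" counts
// available items, and a binary semaphore guards the queue itself.
public class BoundedBuffer {
    private static final int CAPACITY = 8;
    private static final Queue<Integer> buffer = new ArrayDeque<>();
    private static final Semaphore empty = new Semaphore(CAPACITY); // free slots
    private static final Semaphore full  = new Semaphore(0);        // items present
    private static final Semaphore mutex = new Semaphore(1);        // binary: protects buffer

    static void produce(int item) throws InterruptedException {
        empty.acquire();        // P: wait for a free slot
        mutex.acquire();
        buffer.add(item);
        mutex.release();
        full.release();         // V: signal that an item is available
    }

    static int consume() throws InterruptedException {
        full.acquire();         // P: wait for an item
        mutex.acquire();
        int item = buffer.remove();
        mutex.release();
        empty.release();        // V: signal that a slot is free
        return item;
    }

    public static void main(String[] args) {
        new Thread(() -> {
            try { for (int i = 0; i < 20; i++) produce(i); } catch (InterruptedException ignored) {}
        }).start();
        new Thread(() -> {
            try { for (int i = 0; i < 20; i++) System.out.println(consume()); } catch (InterruptedException ignored) {}
        }).start();
    }
}
```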
Condition variables complement mutexes by providing a way for threads to wait for specific conditions on shared data, avoiding inefficient polling. A condition variable is always used in conjunction with an associated mutex protecting the shared variables; a thread waits by releasing the mutex and blocking until signaled, then reacquires the mutex upon wakeup. In pthreads, pthread_cond_wait() atomically releases the mutex and suspends the thread until pthread_cond_signal() or pthread_cond_broadcast() is called, ensuring the waiting thread rechecks the condition predicate to handle spurious wakeups. This mechanism is crucial for efficient synchronization in event-driven scenarios, such as queue management.
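An analogous Java sketch using ReentrantLock and Condition, with await/signal playing the roles of pthread_cond_wait and pthread_cond_signal; the job queue is an illustrative shared structure:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Condition.await() atomically releases the lock and blocks; the predicate is
// rechecked in a loop to tolerate spurious wakeups.
public class WaitForWork {
    private final Lock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Queue<String> queue = new ArrayDeque<>();

    public void put(String job) {
        lock.lock();
        try {
            queue.add(job);
            notEmpty.signal();          // wake one waiting consumer
        } finally {
            lock.unlock();
        }
    }

    public String take() throws InterruptedException {
        lock.lock();
        try {
            while (queue.isEmpty()) {   // recheck the predicate after every wakeup
                notEmpty.await();       // releases the lock while blocked, reacquires on return
            }
            return queue.remove();
        } finally {
            lock.unlock();
        }
    }
}
```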
Atomic operations offer lock-free synchronization by performing read-modify-write actions on shared variables indivisibly, without using explicit locks. The compare-and-swap (CAS) instruction is a cornerstone, atomically checking whether a memory location holds an expected value and, if so, replacing it with a new value; otherwise, it fails without modifying the location.[77] CAS enables non-blocking algorithms such as lock-free queues, where threads retry operations on failure, reducing contention in high-throughput environments. Hardware support for CAS, available in most modern processors, ensures progress under reasonable contention, though algorithms may add exponential backoff to reduce contention and version tags or similar schemes to guard against the ABA problem.[77]
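A small Java sketch of a CAS retry loop using AtomicInteger.compareAndSet; in practice AtomicInteger already provides incrementAndGet, so the explicit loop here is purely illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free increment built directly on compare-and-swap: read the current value,
// compute the new one, and retry if another thread changed the variable in between.
public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) { // succeeds only if still == current
                return next;
            }
            // CAS failed: another thread won the race; loop and retry
        }
    }
}
```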
Barriers synchronize threads at specific points, ensuring that all participants reach a milestone before any proceeds, a pattern common in parallel algorithms such as matrix multiplication. In pthreads, pthread_barrier_wait() blocks until the required number of threads arrive, then releases them all simultaneously. Read-write locks optimize scenarios with many readers but exclusive writers, allowing concurrent reads while granting writers exclusive access. POSIX pthread_rwlock_rdlock() permits multiple readers, whereas pthread_rwlock_wrlock() grants exclusive access for modifications, improving throughput in read-heavy workloads such as database caching.
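A combined Java sketch of both primitives, with CyclicBarrier and ReentrantReadWriteLock as counterparts of the POSIX calls above; the two-phase workload is hypothetical:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A CyclicBarrier releases all parties only once every worker has reached the
// synchronization point; the read-write lock admits concurrent readers while
// writers get exclusive access.
public class PhasedWorkers {
    private static final int WORKERS = 4;
    private static final CyclicBarrier barrier = new CyclicBarrier(WORKERS);
    private static final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private static int sharedResult = 0;

    public static void main(String[] args) {
        for (int i = 0; i < WORKERS; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    // Phase 1: each worker contributes under the write lock.
                    rw.writeLock().lock();
                    try { sharedResult += id; } finally { rw.writeLock().unlock(); }

                    barrier.await();   // no thread proceeds until all have finished phase 1

                    // Phase 2: all workers may read the combined result concurrently.
                    rw.readLock().lock();
                    try {
                        System.out.println("worker " + id + " sees " + sharedResult);
                    } finally {
                        rw.readLock().unlock();
                    }
                } catch (InterruptedException | BrokenBarrierException ignored) {
                }
            }).start();
        }
    }
}
```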
Practical implementations include POSIX pthreads for C/C++ systems and Java's synchronized keyword, which implicitly acquires an intrinsic lock on an object or class for methods or blocks.[78] In Java, marking a method as synchronized ensures only one thread executes it at a time per object instance, simplifying mutual exclusion for shared state.[78]
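For example, a minimal synchronized counter (an illustrative class, not a library type):

```java
// Both methods lock the same Counter instance, so a thread inside increment()
// excludes any thread entering get() on that instance.
public class Counter {
    private long count = 0;

    public synchronized void increment() { // acquires this object's intrinsic lock
        count++;
    }

    public synchronized long get() {       // same monitor, so reads see a consistent value
        return count;
    }
}
```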
Best practices for thread synchronization emphasize minimizing the duration of critical sections to reduce lock hold times and contention, as prolonged locks can lead to performance degradation in multiprocessor systems. Developers should prefer higher-level abstractions like futures or promises for asynchronous coordination, which encapsulate synchronization internally and promote scalable designs over low-level primitives.[79]
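A brief Java sketch of that higher-level style using CompletableFuture; the pipeline stages are placeholders:

```java
import java.util.concurrent.CompletableFuture;

// The pipeline expresses the dependency between steps; the library handles the
// underlying synchronization, so no explicit locks are needed here.
public class FutureDemo {
    public static void main(String[] args) {
        CompletableFuture<Integer> result =
                CompletableFuture.supplyAsync(() -> 42)   // runs on a pool thread
                                 .thenApply(v -> v * 2)   // runs when the value is ready
                                 .exceptionally(ex -> -1); // error path, no manual coordination
        System.out.println(result.join()); // 84
    }
}
```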
Common Data Race and Deadlock Issues
A data race occurs when two or more threads concurrently access the same shared memory location, at least one access is a write, and no appropriate synchronization orders the accesses, leading to nondeterministic behavior and potential data corruption.[80] Such races can produce subtle bugs that are difficult to reproduce, as the outcome depends on thread scheduling and timing, often manifesting as incorrect computations or crashes in multithreaded applications.[81]
Deadlocks arise in multithreaded systems when two or more threads are permanently blocked, each waiting for resources held by the others in a circular dependency, preventing any progress. For a deadlock to occur, four necessary conditions (the Coffman conditions) must hold simultaneously: mutual exclusion, where resources cannot be shared and must be held exclusively; hold-and-wait, where a thread holds at least one resource while waiting for others; no preemption, meaning resources cannot be forcibly taken from a thread; and circular wait, where a cycle of threads exists in which each waits for a resource held by the next.
Beyond deadlocks, livelocks represent another concurrency issue in which threads are actively running but unable to make meaningful progress, often because repeated attempts to resolve a conflict keep failing in a loop, such as two threads politely yielding to each other indefinitely.[82] Starvation, by contrast, happens when a thread is perpetually denied access to necessary resources, typically because higher-priority threads or unfair scheduling mechanisms continually preempt it, leading to indefinite delays in execution.[82]
Detection of data races and deadlocks often relies on dynamic tools such as ThreadSanitizer (TSan), which instruments code at compile time to track memory accesses and synchronization events, reporting races with few false positives by combining happens-before and lockset analyses at runtime.[83] Lockset-based detectors such as Eraser, along with static and hybrid analyses that examine code without executing it, identify potential races by modeling lock usage and access patterns, though they may produce false positives.[81] For deadlocks, runtime monitoring can detect circular waits by building a graph of resource dependencies, while static methods verify their absence through model checking.
Avoidance strategies for data races emphasize proper synchronization, such as using atomic operations or mutexes to protect shared data so that all accesses are serialized.[80] Deadlock prevention targets breaking one of the Coffman conditions, for instance by imposing a global lock ordering to eliminate circular waits (see the sketch below) or by using timeouts on resource requests to avoid indefinite holds. The Banker's algorithm provides a systematic avoidance method by simulating resource allocations in advance to ensure the system remains in a safe state—one in which some ordering of threads can run to completion without deadlock—and was originally developed for operating-system resource management. For livelocks and starvation, techniques include randomized backoff to break symmetry between conflicting threads and priority-inheritance protocols to guarantee bounded waits for lower-priority threads.[82]
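As a sketch of the lock-ordering strategy in Java (the Account class and identity-hash ordering are illustrative; a production version would also need a tie-breaker for hash collisions):

```java
import java.util.concurrent.locks.ReentrantLock;

// Deadlock avoidance by global lock ordering: both transfer directions acquire the
// two account locks in the same order, so no circular wait can form.
public class Transfers {
    static class Account {
        final ReentrantLock lock = new ReentrantLock();
        long balance;
        Account(long balance) { this.balance = balance; }
    }

    static void transfer(Account from, Account to, long amount) {
        // Impose a total order on the locks (here via identity hash) before acquiring them.
        Account first  = System.identityHashCode(from) <= System.identityHashCode(to) ? from : to;
        Account second = (first == from) ? to : from;

        first.lock.lock();
        second.lock.lock();
        try {
            from.balance -= amount;
            to.balance += amount;
        } finally {
            second.lock.unlock();
            first.lock.unlock();
        }
    }
}
```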
Historical Development
Early Concepts and Influences
The early motivations for concepts that would evolve into threading stemmed from the limitations of batch processing systems in the 1950s and early 1960s, where computers executed one job at a time, resulting in substantial CPU idle time during input/output (I/O) operations and interrupts. To mitigate this, multiprogramming techniques were developed to keep the CPU busy by overlapping computation with I/O, allowing multiple programs to reside in memory and switching execution contexts upon interrupts, thus marking the transition from single-job execution to basic multitasking. This approach was essential for improving resource utilization in environments dominated by slow peripheral devices.[84]
In the 1960s, time-sharing systems further advanced these ideas to support interactive multi-user access, influencing foundational threading concepts. The MULTICS project, launched in 1965 by MIT's Project MAC, Bell Labs, and General Electric, emphasized hierarchical memory protection and concurrent resource sharing to enable efficient timesharing, with early experimentation in lightweight concurrency such as Max Smith's 1970 prototype using multiple stacks within a single process for background tasks. Complementing this, Edsger W. Dijkstra's THE multiprogramming system, designed in 1965–1966 and detailed in 1968, organized the operating system into cooperating sequential processes across layers, using semaphores for synchronization to manage concurrency without hardware support for parallelism.[85][86]
Key contributions from prominent figures solidified these foundations. Dijkstra's 1965 work on cooperating sequential processes introduced semaphores as a primitive for coordinating concurrent activities, providing a mechanism to avoid race conditions in multiprogrammed environments and becoming a cornerstone of later threading synchronization. In parallel, Dennis Ritchie's contributions to Unix in the late 1960s and early 1970s, alongside Ken Thompson, established a process model in which the fork() system call creates a child process with a copy of the parent's address space, promoting simple, relatively lightweight execution units that influenced threading by prioritizing simplicity in concurrency and inter-process communication.[87][88]
Preceding formal threads, simpler units like subroutines and coroutines served as conceptual precursors by enabling modular and cooperative control flow. Subroutines, a staple since the 1940s, allowed reusable code blocks with their own local state, laying the groundwork for independent execution paths within a program. Coroutines, termed by Melvin E. Conway in 1958 and elaborated in his 1963 paper, generalized subroutines to support symmetric, non-preemptive yielding and resumption between routines, facilitating structured concurrency in assembly-level programs without full OS involvement.
Evolution in Operating Systems
The evolution of threads in operating systems began in earnest during the 1990s, driven by the need for efficient concurrency in Unix-like systems. The POSIX standard played a pivotal role in this development: IEEE Std 1003.1c-1995 defined the Pthreads API as an extension to the Portable Operating System Interface (POSIX.1), providing a standardized interface for creating and managing threads in C programs.[89] This standardization facilitated portability across Unix variants and marked a shift toward kernel-supported threading models, which improved performance over purely user-level implementations by allowing direct kernel scheduling and reducing context-switch overhead.[90]
In Linux, early threading relied on the LinuxThreads library, whose limitations—including its dependence on a manager thread, deviations from POSIX semantics, and poor scalability on multiprocessors—prompted a move to a strict one-to-one (1:1) model. This culminated in the Native POSIX Thread Library (NPTL), introduced with Linux kernel 2.6 in 2003, which maps each user thread directly to a kernel thread for better responsiveness and efficiency, particularly in handling blocking system calls without stalling other threads.[32] NPTL addressed these scalability issues, enabling thousands of threads with minimal overhead, and became the default POSIX threads implementation in subsequent distributions.
Microsoft Windows introduced kernel-level threads with Windows NT 3.1 in 1993, where the executive's scheduler dispatches threads as the basic unit of execution, supporting multiprocessor parallelism and preemptive multitasking from the outset.[91] Later versions, beginning with Windows 7, added user-mode scheduling (UMS) through dedicated APIs, allowing applications to manage their own thread scheduling for specific workloads, though kernel threads remained the foundation for system-wide concurrency.[92]
Other Unix derivatives adopted their own threading architectures in the 1990s and 2000s. Sun Microsystems' Solaris implemented lightweight processes (LWPs) starting with Solaris 2 in the early 1990s, serving as kernel-visible entities that bridge user threads and kernel threads in a many-to-many (M:N) model, enabling efficient multiplexing and reduced kernel overhead for I/O-bound applications.[93] In FreeBSD, the 2000s saw the adoption of a hybrid model via Kernel Scheduled Entities (KSE), proposed in 2000 and integrated as the default threading library by 2004, combining user-level threads with kernel entities to balance flexibility and performance on symmetric multiprocessors.[43][94]
Key milestones of this era included the broader industry shift from user-level to kernel-level threads for superior performance, as kernel threads avoid the pitfalls of user-level blocking—where one thread's system call could halt the entire process—and enable true parallelism on multicore hardware.[95] POSIX standardization ensured interoperability, influencing implementations across systems and paving the way for scalable concurrency.
By 2025, trends emphasize asynchronous enhancements such as Linux's io_uring interface, introduced in 2019 and expanded with features like multishot operations—for example, multishot receive in Linux kernel 6.0 (2022)—enabling efficient handling of multiple I/O events and reducing overhead in high-throughput applications.[96] Additionally, Rust's memory-safe concurrency model has begun influencing kernel development, with Rust code integrated into the Linux kernel since version 6.1 to mitigate race conditions in drivers and subsystems, enhancing thread safety without compromising performance.[97][98]
Language and Library Support
Built-in Language Features
Programming languages provide built-in support for threading through core syntactic constructs and semantics that enable concurrent execution without relying solely on external libraries. In imperative languages like Java, the Thread class, part of the core java.lang package since Java 1.0 in 1996, allows direct creation and management of threads by extending the class or implementing the Runnable interface.[99] The synchronized keyword, also introduced in Java 1.0, provides intrinsic locking for methods and blocks to ensure thread safety by preventing concurrent access to shared resources.[78] Starting with Java 21 in September 2023, virtual threads were added as a built-in feature, enabling the creation of lightweight, JVM-managed threads via methods like Thread.ofVirtual(); they support high-throughput concurrency for tasks that are frequently blocked on I/O, allowing millions of threads with low overhead while maintaining compatibility with existing code.[45] Similarly, C# integrates concurrency through the Task Parallel Library (TPL), introduced in .NET 4.0 in 2010, which builds on the Task type to simplify parallel operations, though core thread management still uses the Thread class from System.Threading, available since .NET 1.0 in 2002.[100]
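A short sketch of the Java 21 virtual-thread API described above; the task bodies are placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Each submitted task gets its own lightweight, JVM-scheduled thread, which is
// practical even for very large numbers of mostly-blocking tasks.
public class VirtualThreadsDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() ->
                System.out.println("running on " + Thread.currentThread()));
        vt.join();

        // Executor that creates one new virtual thread per task (Java 21+).
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                final int id = i;
                executor.submit(() -> {
                    Thread.sleep(100);   // blocking here parks the virtual thread cheaply
                    return id;
                });
            }
        } // close() waits for submitted tasks to complete
    }
}
```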
Drawing on the communicating sequential processes (CSP) model, Go incorporates goroutines as a lightweight concurrency primitive, launched via the go keyword since Go 1.0 in 2012, allowing thousands of concurrent executions with minimal overhead managed by the runtime.[101] Channels, a built-in typed construct in Go since its 1.0 release, enable safe communication and synchronization between goroutines by passing ownership of data, promoting a "share by communicating" model over shared memory.[102]
At the low level, C and C++ long lacked native language support for threading, instead depending on platform-specific libraries such as POSIX threads (pthreads), which require explicit inclusion and linking; C11 and C++11 later added standard threading facilities (the optional <threads.h> header and std::thread, respectively), though these remain library features rather than dedicated keywords.[103] In contrast, Rust has provided threading via the std::thread module since its 1.0 stable release in 2015, but its distinctive built-in contribution is the ownership system together with the Send and Sync traits, which enforce compile-time guarantees against data races and unsafe shared access across threads.[104][105]
Over time, languages have evolved toward asynchronous concurrency models for better handling of I/O-bound tasks. Python introduced the async and await keywords in version 3.5 in 2015 via PEP 492, enabling cooperative multitasking through native coroutines that integrate seamlessly with event loops, shifting from traditional preemptive threading to non-blocking execution.[106] This progression reflects a broader trend in language design toward safer, more efficient concurrency primitives that reduce boilerplate while maintaining performance.