Threads
Chapter 4
Thread Creation
The simplest creation function in user mode is CreateThread. This function creates a thread in the current process, accepting the following arguments:
An optional security attributes structure - This specifies the security descriptor to attach to the newly created thread. It also specifies whether the thread handle is to be created as inheritable.
An optional stack size - If zero is specified, a default is taken from the executable's header. This always applies to the first thread in a user-mode process.
A function pointer - This serves as the entry point for the new thread's execution.
An optional argument - This is to pass to the thread's function.
Optional flags - One controls whether the thread starts suspended (CREATE_SUSPENDED). The other controls the interpretation of the stack size argument (initial committed size or maximum reserved size).
On successful completion, a non-zero handle is returned for the new thread and, if requested by the caller, the unique thread ID.
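To make these parameters concrete, here is a minimal sketch of calling CreateThread; the thread function, its argument, and the error handling are illustrative:

```c
#include <windows.h>
#include <stdio.h>

// Illustrative thread entry point; the signature is dictated by the API.
DWORD WINAPI ThreadFunc(LPVOID param) {
    printf("Hello from thread %lu, arg=%s\n",
           GetCurrentThreadId(), (const char *)param);
    return 0;
}

int main(void) {
    DWORD tid;
    // NULL security attributes, default stack size, no flags (runs immediately).
    HANDLE hThread = CreateThread(NULL, 0, ThreadFunc, (LPVOID)"hello", 0, &tid);
    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Created thread with ID %lu\n", tid);
    WaitForSingleObject(hThread, INFINITE);  // wait for the thread to exit
    CloseHandle(hThread);
    return 0;
}
```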
An extended thread creation function is CreateRemoteThread. This function accepts an extra argument (the first), which is a handle to a target process where the thread is to be created. You can use this function to inject a thread into another process.
One common use of this technique is for a debugger to force a break in a debugged process. The debugger injects the thread, which immediately causes a breakpoint by calling the DebugBreak function. Another common use of this technique is for one process to obtain internal information about another process, which is easier when running within the target process context.
The final function worth mentioning here is CreateRemoteThreadEx, which is a superset of CreateThread and CreateRemoteThread. In fact, the implementation of CreateThread and CreateRemoteThread simply calls CreateRemoteThreadEx with the appropriate defaults. CreateRemoteThreadEx adds the ability to provide an attribute list (similar to the STARTUPINFOEX structure's role with an additional member over STARTUPINFO when creating processes). Examples of attributes include setting the ideal processor and group affinity.
If all goes well, CreateRemoteThreadEx eventually calls NtCreateThreadEx in Ntdll.dll. This makes the usual transition to kernel mode, where execution continues in the executive function NtCreateThreadEx. There, the kernel-mode part of thread creation occurs.
Exiting a kernel thread's function does not automatically destroy the thread object. Instead, drivers must call PsTerminateSystemThread from within the thread function to properly terminate the thread. Consequently, this function never returns.
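As a sketch of the kernel-mode case (assuming a WDM driver context; the routine name other than the documented Ps functions is illustrative):

```c
#include <ntddk.h>

// Illustrative system thread routine. Simply returning from this function
// would not destroy the thread object; the explicit call below does.
VOID DriverThreadRoutine(PVOID Context)
{
    UNREFERENCED_PARAMETER(Context);

    // ... perform the thread's work here ...

    PsTerminateSystemThread(STATUS_SUCCESS);  // never returns
}

// Elsewhere in the driver, the thread would be created with something like:
//   HANDLE threadHandle;
//   PsCreateSystemThread(&threadHandle, THREAD_ALL_ACCESS, NULL, NULL, NULL,
//                        DriverThreadRoutine, NULL);
```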
Thread Internals
At the operating-system (OS) level, a Windows thread is represented by an executive thread object. The executive thread object encapsulates an ETHREAD structure, which in turn contains a KTHREAD structure as its first member.
The ETHREAD structure and the other structures it points to exist in the system address space. The only exception is the thread environment block (TEB), which exists in the process address space (similar to a PEB, because user-mode components need to access it).


The Windows subsystem process (Csrss) maintains a parallel structure for each thread created in a Windows subsystem application, called the CSR_THREAD.
For threads that have called a Windows subsystem USER or GDI function, the kernel-mode portion of the Windows subsystem (Win32k.sys) maintains a per-thread data structure (W32THREAD) that the KTHREAD structure points to.
The first member of the ETHREAD is called Tcb. This is short for thread control block, which is a structure of type KTHREAD. Following that are the thread identification information, the process identification information (including a pointer to the owning process so that its environment information can be accessed), security information in the form of a pointer to the access token and impersonation information, fields relating to Asynchronous Local Procedure Call (ALPC) messages, pending I/O requests (IRPs), and Windows 10-specific fields related to power management and CPU Sets.
Internally, the TEB is made up of a header called the Thread Information Block (TIB), which exists mainly for compatibility with OS/2 and Win9x applications. The TIB also allows exception and stack information to be kept in a smaller structure when creating new threads by using an initial TIB.

The TEB stores context information for the image loader and various Windows DLLs. Because these components run in user mode, they need a data structure writable from user mode. That's why this structure exists in the process address space instead of in the system space, where it would be writable only from kernel mode.
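As a quick illustration, user-mode code can locate its own TEB through the documented NtCurrentTeb routine; this minimal sketch reads the NT_TIB header fields described above:

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    // The TIB (NT_TIB) is the first member of the TEB, so the TEB pointer
    // can be viewed through the documented NT_TIB header.
    PNT_TIB tib = (PNT_TIB)NtCurrentTeb();
    printf("TEB address : %p\n", (void *)tib);
    printf("Stack base  : %p\n", tib->StackBase);
    printf("Stack limit : %p\n", tib->StackLimit);
    return 0;
}
```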
The CSR_THREAD is analogous to the CSR_PROCESS data structure, but applied to threads; it is maintained by each Csrss process within a session and identifies the Windows subsystem threads running within it. CSR_THREAD stores a handle that Csrss keeps for the thread, various flags, the client ID (thread ID and process ID), and a copy of the thread's creation time. Note that threads are registered with Csrss when they send their first message to Csrss, typically due to some API that requires notifying Csrss of some operation or condition.

Finally, the W32THREAD structure is analogous to the W32PROCESS data structure, but applied to threads. This structure mainly contains information useful for the GDI subsystem (brushes and device context attributes) and DirectX, as well as for the User Mode Print Driver (UMPD) framework that vendors use to write user-mode printer drivers. It also contains a rendering state useful for desktop compositing and anti-aliasing.

Birth Of A Thread
A thread's life cycle starts when a process creates a new thread. The request filters down to the Windows executive, where the process manager allocates space for a thread object and calls the kernel to initialize the thread control block (KTHREAD).
The following steps are taken inside the CreateRemoteThreadEx function in Kernel32.dll to create a Windows thread:
1. The function converts the Windows API parameters to native flags and builds a native structure describing object parameters (OBJECT_ATTRIBUTES).
2. It builds an attribute list with two entries: client ID and TEB address.
3. It determines whether the thread is created in the calling process or another process indicated by the handle passed in. If the handle is equal to the pseudo handle returned from GetCurrentProcess (with a value of -1), then it's the same process. If the process handle is different, it could still be a valid handle to the same process, so a call is made to NtQueryInformationProcess (in Ntdll) to find out whether that is indeed the case (a user-mode approximation is sketched after this list).
4. It calls NtCreateThreadEx (in Ntdll) to make the transition to the executive in kernel mode and continues inside a function with the same name and arguments.
5. NtCreateThreadEx (inside the executive) creates and initializes the user-mode thread context (its structure is architecture-specific) and then calls PspCreateThread to create a suspended executive thread object. (For a description of the steps performed by this function, see the descriptions of stage 3 and stage 5 in Chapter 3 in the section "Flow of CreateProcess.") Then the function returns, eventually ending back in user mode at CreateRemoteThreadEx.
6. CreateRemoteThreadEx allocates an activation context for the thread used by side-by-side assembly support. It then queries the activation stack to see if it requires activation and activates it if needed. The activation stack pointer is saved in the new thread's TEB.
7. Unless the caller created the thread with the CREATE_SUSPENDED flag set, the thread is now resumed so that it can be scheduled for execution. When the thread starts running, it executes the steps described in Chapter 3 in the section "Stage 7: Performing process initialization in the context of the new process" before calling the actual user's specified start address.
8. The thread handle and the thread ID are returned to the caller.
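The same-process check in step 3 can be approximated in user mode with documented APIs; this sketch substitutes GetProcessId for the NtQueryInformationProcess call that Kernel32 makes internally:

```c
#include <windows.h>

// Sketch: decide whether a process handle refers to the calling process.
BOOL IsCurrentProcess(HANDLE hProcess)
{
    if (hProcess == GetCurrentProcess())  // pseudo handle with a value of -1
        return TRUE;
    // A real handle may still refer to this process; compare process IDs.
    return GetProcessId(hProcess) == GetCurrentProcessId();
}
```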
Protected Process Thread Limitations
Limitations of Protected Process and Protected Process Light (PPL) also apply to the threads inside the process. This ensures that the actual code running inside the protected process cannot be hijacked or otherwise affected through standard Windows functions, which require access rights that are not granted for protected process threads. In fact, the only permissions granted are THREAD_SUSPEND_RESUME and THREAD_SET/QUERY_LIMITED_INFORMATION.
Group-Based Scheduling
We will cover standard thread-based scheduling in Windows later in this chapter. Standard thread-based scheduling serves general user and server scenarios well.
However, because thread-based scheduling attempts to fairly share the processor or processors only among competing threads of the same priority, it does not account for higher-level requirements such as the distribution of threads to users and the potential for certain users to benefit from more overall CPU time at the expense of other users.
This is problematic in terminal-services environments, in which dozens of users compete for CPU time. If only thread-based scheduling is used, a single high-priority thread from a given user has the potential to starve threads from all users on the machine.
Windows 8 and Server 2012 introduced a group-based scheduling mechanism, built around the concept of a scheduling group (KSCHEDULING_GROUP). A scheduling group maintains a policy, scheduling parameters, and a list of kernel scheduling control blocks (KSCBs), one per processor, that are part of the scheduling group. Conversely, a thread points to the scheduling group it belongs to; if that pointer is null, the thread is outside any scheduling group's control.

For example, threads T1, T2, and T3 might belong to a scheduling group, while thread T4 does not.
Here are some terms related to group scheduling:
Generation - This is the amount of time over which to track CPU usage.
Quota - This is the amount of CPU usage allowed to a group per generation. Over quota means the group has used up all its budget. Under quota means the group has not used its full budget.
Weight - This is the relative importance of a group, between 1 and 9, where the default is 5.
Fair-share scheduling - With this type of scheduling, idle cycles can be given to threads that are over quota if no under-quota threads want to run.
The KSCB structure contains CPU-related information as follows:
Cycle usage for this generation
Long-term average cycle usage, so that a burst of thread activity can be distinguished from a true hog
Control flags such as hard capping, which means that even if CPU time is available above the assigned quota, it will not be used to give the thread extra CPU time
Ready queues, based on the standard priorities (0 to 15 only because real-time threads are never part of a scheduling group)
An important parameter maintained by a scheduling group is called rank, which can be considered a scheduling priority of the entire group of threads. A rank with a value of 0 is the highest. A higher-rank number means the group has used more CPU time and so is less likely to get more CPU time.
Rank always trumps priority. This means that given two threads with different ranks, the lower value rank is preferred, regardless of priority. Equal-rank threads are compared based on priority. The rank is adjusted periodically as cycle usage increases.
Rank 0 is the highest (so it always wins out) against a higher number rank, and is implicit for some threads. This can indicate one of the following:
The thread is not in any scheduling group ("normal" threads)
Under-quota threads
Real-time priority threads (16 through 31)
Threads executing at IRQL APC_LEVEL (1) within a kernel critical or guarded region
If a scheduling group exists, the lowest value rank wins out, followed by priority (if ranks are equal), followed by the first arriving thread (if priorities are equal; round-robin at quantum end).
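This selection order can be summarized with an illustrative sketch (not kernel code; the types and names are invented for clarity):

```c
// Compare two candidate threads inside scheduling groups: lower rank wins,
// then higher priority, then the earlier arrival (round-robin at quantum end).
typedef struct {
    int rank;      // 0 is the highest rank
    int priority;  // 0-15 within a scheduling group
} CANDIDATE;

int PreferFirst(const CANDIDATE *a, const CANDIDATE *b)
{
    if (a->rank != b->rank)
        return a->rank < b->rank;          // lower rank value wins
    if (a->priority != b->priority)
        return a->priority > b->priority;  // then higher priority wins
    return 1;                              // then first-arriving thread wins
}
```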
Dynamic Fair Share Scheduling
Dynamic fair share scheduling (DFSS) is a mechanism that can be used to fairly distribute CPU time among sessions running on a machine. It prevents one session from potentially monopolizing the CPU if some threads running under that session have a relatively high priority and run a lot.
During the very last parts of system initialization, as the registry SOFTWARE hive is initialized by Smss.exe, the process manager initiates the final post-boot initialization in PsBootPhaseComplete, which calls PspIsDfssEnabled. Here, the system decides which of the two CPU quota mechanisms (DFSS or legacy) will be employed.
For DFSS to be enabled, the EnableCpuQuota registry value must be set to a non-zero value in both of the quota keys. The first of these is HKLM\SOFTWARE\Policies\Microsoft\Windows\Session Manager\Quota System, for the policy-based setting. The second is HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Quota System, under the system key. This determines whether the system supports the functionality (which, by default, is set to TRUE on Windows Server with the Remote Desktop role).
If DFSS is enabled, the PsCpuFairShareEnabled global variable is set to TRUE, which makes all threads belong to scheduling groups (except session 0 processes).
DFSS configuration parameters are read from the aforementioned keys by a call to PspReadDfssConfigurationValues and stored in global variables. These keys are monitored by the system. If modified, the notification callback calls PspReadDfssConfigurationValues again to update the configuration values.

After DFSS is enabled, whenever a new session is created (other than session 0), MiSessionObjectCreate allocates a scheduling group associated with the session with the default weight of 5, which is the middle ground between the minimum of 1 and the maximum of 9.
A scheduling group manages either DFSS or CPU rate-control information, based on a policy structure (KSCHEDULING_GROUP_POLICY) that is part of the scheduling group. The Type member indicates whether the group is configured for DFSS (WeightBased=0) or rate control (RateControl=1). MiSessionObjectCreate calls KeInsertSchedulingGroup to insert the scheduling group into a global system list (maintained in the global variable KiSchedulingGroupList, needed for weight recalculation if processors are hot-added). The resulting scheduling group is also pointed to by the SESSION_OBJECT structure for the particular session.
CPU Rate Limits
DFSS works by automatically placing new threads inside the session-scheduling group. This is fine for a terminal-services scenario, but is not good enough as a general mechanism to limit the CPU time of threads or processes.
The scheduling-group infrastructure can be used in a more granular fashion by using a job object. One of the limitations you can place on a job is a CPU rate control, which you do by calling SetInformationJobObject with JobObjectCpuRateControlInformation as the job information class and a structure of type JOBOBJECT_CPU_RATE_CONTROL_INFORMATION containing the actual control data (a sketch follows the list below).
The structure contains a set of flags that enable you to apply one of three settings to limit CPU time:
CPU rate - This value can be between 1 and 10000 and represents a percent multiplied by 100 (for example, for 40 percent the value should be 4000).
Weight-based - This value can be between 1 and 9, relative to the weight of other jobs. (DFSS is configured with this setting.)
Minimum and maximum CPU rates - These values are specified similarly to the first option. When the threads in the job reach the maximum percentage specified in the measuring interval (600 ms by default), they cannot get any more CPU time until the next interval begins. You can use a control flag to specify whether to use hard capping to enforce the limit even if there is spare CPU time available.
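For example, the following minimal sketch (Windows 8 or later assumed) hard-caps the processes in a job at 40 percent CPU:

```c
#define _WIN32_WINNT 0x0602   // CPU rate control requires Windows 8+
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE job = CreateJobObject(NULL, NULL);
    if (job == NULL) return 1;

    JOBOBJECT_CPU_RATE_CONTROL_INFORMATION info = { 0 };
    info.ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE |
                        JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
    info.CpuRate = 4000;  // percent multiplied by 100, so 40 percent

    if (!SetInformationJobObject(job, JobObjectCpuRateControlInformation,
                                 &info, sizeof(info))) {
        printf("SetInformationJobObject failed: %lu\n", GetLastError());
        return 1;
    }
    // Processes assigned with AssignProcessToJobObject are now rate-limited.
    CloseHandle(job);
    return 0;
}
```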
Dynamic Processor Addition and Replacement
Dynamic processor support for CPU hot-swapping is provided through the HAL, which notifies the kernel of a new processor on the system through the KeStartDynamicProcessor function. This routine does similar work to that performed when the system detects more than one processor at startup and needs to initialize the structures related to them.
When a dynamic processor is added, various system components perform some additional work. For example, the memory manager allocates new pages and memory structures optimized for the CPU. It also initializes a new DPC kernel stack while the kernel initializes the global descriptor table (GDT), the interrupt dispatch table (IDT), the processor control region (PCR), the processor control block (PRCB), and other related structures for the processor.
Other executive parts of the kernel are also called, mostly to initialize the per-processor look-aside lists for the processor that was added. For example, the I/O manager, executive look-aside list code, cache manager, and object manager all use per-processor look-aside lists for their frequently allocated structures.
Finally, the kernel initializes threaded DPC support for the processor and adjusts exported kernel variables to report the new processor. Different memory-manager masks and process seeds based on processor counts are also updated, and processor features need to be updated for the new processor to match the rest of the system; for example, enabling virtualization support on the newly added processor. The initialization sequence completes with the notification to the Windows Hardware Error Architecture (WHEA) component that a new processor is online.
The HAL is also involved in this process. It is called once to start the dynamic processor after the kernel is aware of it, and called again after the kernel has finished initialization of the processor.
Drivers are notified of the newly added processor using the default executive callback object, ProcessorAdd, with which drivers can register for notifications. Once drivers are notified, the final kernel component called is the Plug and Play manager, which adds the processor to the system's device node and rebalances interrupts so that the new processor can handle interrupts that were already registered for other processors.
Applications do not take advantage of a dynamically added processor by default. They must request it. This is done to prevent potential race conditions or misdistribution of work.
The SetProcessAffinityUpdateMode and QueryProcessAffinityUpdateMode Windows APIs (which use the undocumented NtSetInformationProcess and NtQueryInformationProcess system calls) tell the process manager that these applications should have their affinity updated (by setting the AffinityUpdateEnable flag in EPROCESS) or that they do not want to deal with affinity updates (by setting the AffinityPermanent flag in EPROCESS). This is a one-time change. After an application has told the system that its affinity is permanent, it cannot later change its mind and request affinity updates.
As part of KeStartDynamicProcessor, a new step has been added after interrupts are rebalanced: calling the process manager to perform affinity updates through PsUpdateActiveProcessAffinity. Some Windows core processes and services already have affinity updates enabled, while third-party software will need to be recompiled to take advantage of the new API call. The System process, Svchost processes, and Smss are all compatible with dynamic processor addition.
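A process can opt in with a single call; this minimal sketch uses the documented flag:

```c
#define _WIN32_WINNT 0x0601   // SetProcessAffinityUpdateMode requires Windows 7+
#include <windows.h>

int main(void) {
    // Opt the current process in to automatic affinity updates so it can be
    // scheduled on processors that are added dynamically later.
    if (!SetProcessAffinityUpdateMode(GetCurrentProcess(),
                                      PROCESS_AFFINITY_ENABLE_AUTO_UPDATE)) {
        return 1;  // fails if the affinity was already made permanent
    }
    return 0;
}
```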
Worker Factories (Thread Pool)
Worker factories are the internal mechanism used to implement user-mode thread pools. Most of the functionality required to support the user-mode thread pool implementation in Windows is now located in the kernel. Ntdll.dll merely provides the interfaces and high-level APIs required for interfacing with the worker factory kernel code.
This kernel thread pool functionality in Windows is managed by an object manager type called TpWorkerFactory, as well as four native system calls for managing the factory and its workers (NtCreateWorkerFactory, NtWorkerFactoryWorkerReady, NtReleaseWorkerFactoryWorker, and NtShutdownWorkerFactory); two query/set native calls (NtQueryInformationWorkerFactory and NtSetInformationWorkerFactory); and a wait call (NtWaitForWorkViaWorkerFactory).
Just like other native system calls, these calls provide user mode with a handle to the TpWorkerFactory object, which means developers work with opaque descriptors: a TP_POOL pointer for a thread pool and other opaque pointers for objects created from a pool, including TP_WORK (work callback), TP_TIMER (timer callback), and TP_WAIT (wait callback). These structures hold various pieces of information, such as the handle to the TpWorkerFactory object.
A worker factory will create a new thread whenever all of the following conditions are met:
Dynamic thread creation is enabled.
The number of available workers is lower than the maximum number of workers configured for the factory (default of 500).
The worker factory has bound objects (for example, an ALPC port that this worker thread is waiting on) or a thread has been activated into the pool.
There are pending I/O request packets associated with a worker thread.
In addition, it will terminate threads whenever they've become idle for more than 10 seconds (by default).
The "release" worker factory call (which queues work) is a wrapper around
IoSetIoCompletionEx, which increases pending work, while the "wait" call is a wrapper aroundIoRemoveIoCompletion. Both these routines call into the kernel queue implementation.Therefore, the job of the worker factory code is to manage either a persistent, static, or dynamic thread pool; wrap the I/O completion port model into interfaces that try to prevent stalled worker queues by automatically creating dynamic threads; and simplify global cleanup and termination operations during a factory shutdown request.
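From user mode, all of this is reached through the documented thread-pool API. The following minimal sketch creates a private pool, bounds its dynamic thread creation, and submits one work item:

```c
#include <windows.h>
#include <stdio.h>

// Work callback, invoked on a worker-factory thread.
VOID CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE inst, PVOID ctx, PTP_WORK work)
{
    UNREFERENCED_PARAMETER(inst);
    UNREFERENCED_PARAMETER(work);
    printf("Work item on thread %lu: %s\n",
           GetCurrentThreadId(), (const char *)ctx);
}

int main(void) {
    PTP_POOL pool = CreateThreadpool(NULL);   // opaque TP_POOL pointer
    if (pool == NULL) return 1;
    SetThreadpoolThreadMaximum(pool, 4);      // bound dynamic thread creation
    SetThreadpoolThreadMinimum(pool, 1);

    TP_CALLBACK_ENVIRON env;
    InitializeThreadpoolEnvironment(&env);
    SetThreadpoolCallbackPool(&env, pool);    // bind callbacks to this pool

    PTP_WORK work = CreateThreadpoolWork(WorkCallback, (PVOID)"hello", &env);
    if (work == NULL) return 1;
    SubmitThreadpoolWork(work);               // queues through the worker factory
    WaitForThreadpoolWorkCallbacks(work, FALSE);

    CloseThreadpoolWork(work);
    DestroyThreadpoolEnvironment(&env);
    CloseThreadpool(pool);
    return 0;
}
```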
The executive function that creates the worker factory, NtCreateWorkerFactory, accepts several arguments that allow customization of the thread pool, such as the maximum threads to create and the initial committed and reserved stack sizes.
The CreateThreadpool Windows API, however, uses the default stack sizes embedded in the executable image (just like a default CreateThread would). The Windows API does not provide a way to override these defaults. This is somewhat unfortunate, as in many cases thread-pool threads don't require deep call stacks, and it would be beneficial to allocate smaller stacks.
The NtQueryInformationWorkerFactory API dumps almost every field in the worker factory structure.
Thread Scheduling
Windows implements a priority-driven, preemptive thread scheduling system. At least one of the highest-priority runnable (ready) threads always runs, with the caveat that certain high-priority threads ready to run might be limited by the processors on which they are allowed or preferred to run, a phenomenon called processor affinity.
Processor affinity is defined based on a given processor group, which collects up to 64 processors. By default, threads can run only on available processors within the processor group associated with the process. (This is to maintain compatibility with older versions of Windows, which supported only 64 processors.)
Developers can alter processor affinity by using the appropriate APIs or by setting an affinity mask in the image header, and users can use tools to change affinity at run time or at process creation.
However, although multiple threads in a process can be associated with different groups, a thread on its own can run only on the processors available within its assigned group.
Additionally, developers can choose to create group-aware applications, which use extended scheduling APIs to associate logical processors on different groups with the affinity of their threads. Doing so converts the process into a multigroup process that can theoretically run its threads on any available processor within the machine.
After a thread is selected to run, it runs for an amount of time called a quantum. A quantum is the length of time a thread is allowed to run before another thread at the same priority level is given a turn to run.
Quantum values can vary from system to system and process to process for any of three reasons:
System configuration settings (long or short quantums, variable or fixed quantums, and priority separation)
Foreground or background status of the process
Use of the job object to alter the quantum
A thread might not get to complete its quantum, however, because Windows implements a preemptive scheduler. That is, if another thread with a higher priority becomes ready to run, the currently running thread might be preempted before finishing its time slice. In fact, a thread can be selected to run next and be preempted before even beginning its quantum!
The Windows scheduling code is implemented in the kernel. There's no single "scheduler" module or routine, however. The code is spread throughout the kernel in which scheduling-related events occur. The routines that perform these duties are collectively called the kernel's dispatcher.
The following events might require thread dispatching:
A thread becomes ready to execute.
A thread leaves the running state because its time quantum ends, it terminates, it yields execution, or it enters a wait state.
A thread's priority changes, either because of a system service call or because Windows itself changes the priority value.
A thread's processor affinity changes so that it will no longer run on the processor on which it was running.
After a logical processor has selected a new thread to run, it eventually performs a context switch to it. A context switch is the procedure of saving the volatile processor state associated with a running thread, loading another thread's volatile state, and starting the new thread's execution.
Windows schedules at the thread granularity level. Scheduling decisions are made strictly on a thread basis; no consideration is given to what process the thread belongs to.
Priority Levels
Windows uses 32 priority levels internally, ranging from 0 to 31 (31 is the highest):
Sixteen real-time levels (16 through 31)
Sixteen variable levels (0 through 15), out of which level 0 is reserved for the zero page thread.

Thread priority levels are assigned from two different perspectives: those of the Windows API and those of the Windows kernel. The Windows API first organizes processes by the priority class to which they are assigned at creation (the numbers in parentheses represent the internal PROCESS_PRIORITY_CLASS index recognized by the kernel):
Real-Time (4)
High (3)
Above Normal (6)
Normal (2)
Below Normal (5)
Idle (1)
The Windows API SetPriorityClass allows changing a process's priority class to one of these levels.
The Windows API then assigns a relative priority to the individual threads within those processes. Here, the numbers represent a priority delta that is applied to the process base priority:
Time-Critical (15)
Highest (2)
Above-Normal (1)
Normal (0)
Below-Normal (-1)
Lowest (-2)
Idle (-15)
Time-Critical and Idle levels (+15 and -15) are called saturation values and represent specific levels that are applied rather than true offsets. These values can be passed to the SetThreadPriority Windows API to change a thread's relative priority.
Therefore, in the Windows API, each thread has a base priority that is a function of its process priority class and its relative thread priority. In the kernel, the process priority class is converted to a base priority by using the PspPriorityTable global array and the PROCESS_PRIORITY_CLASS indices shown earlier, which sets priorities of 4, 8, 13, 14, 6, and 10, respectively. (This is a fixed mapping that cannot be changed.) The relative thread priority is then applied as a differential to this base priority. For example, a Highest thread will receive a thread base priority of two levels higher than the base priority of its process.


The Time-Critical and Idle relative thread priorities maintain their respective values regardless of the process priority class (unless it is Real-Time) because the Windows API requests saturation of the priority from the kernel, by passing in +16 or -16 as the requested relative priority. The formula used to get these values is as follows (HIGH_PRIORITY equals 31):

Time-Critical: (HIGH_PRIORITY + 1) / 2 = 16
Idle: -((HIGH_PRIORITY + 1) / 2) = -16
These values are then recognized by the kernel as a request for saturation, and the Saturation field in KTHREAD is set. For positive saturation, this causes the thread to receive the highest possible priority within its priority class (dynamic or real-time); for negative saturation, it's the lowest possible one. Additionally, future requests to change the base priority of the process will no longer affect the base priority of these threads because saturated threads are skipped in the processing code.
Regardless of how the thread's priority came to be by using the Windows API, from the point of view of the scheduler, only the final result matters.
Whereas a process has only a single base priority value, each thread has two priority values: current (dynamic) and base. Scheduling decisions are made based on the current priority (which can be altered using priority boosts). Windows never adjusts the priority of threads in the Real-Time range (16 through 31), so they always have the same base and current priority.
A thread's initial base priority is inherited from the process base priority. A process, by default, inherits its base priority from the process that created it. You can override this behavior in the CreateProcess function or by using the command-line start command. You can also change a process priority after it is created by using the SetPriorityClass function or by using various tools that expose that function, such as Task Manager or Process Explorer. Changing the priority of a process changes the thread priorities up or down, but their relative settings remain the same.
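As a brief illustration of the Windows API view, this sketch raises the process priority class and then applies a relative thread priority; per the mapping above, the High class base of 13 plus the +1 Above-Normal delta yields a thread base priority of 14:

```c
#include <windows.h>

int main(void) {
    // Move this process to the High priority class (base priority 13)...
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    // ...and give the current thread a +1 relative priority (base 14).
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
    return 0;
}
```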
Normally, user applications and services start with a normal base priority, so their initial thread typically executes at priority level 8. However, some Windows system processes (such as the Session Manager, Service Control Manager, and local security authentication process) have a base process priority slightly higher than the default for the Normal class (8). This higher default value ensures that the threads in these processes will all start at a higher priority than the default value of 8.
Real-Time Priorities
You can raise or lower thread priorities within the dynamic range in any application. However, you must have the increase scheduling priority privilege (SeIncreaseBasePriorityPrivilege) to enter the Real-Time range.
Be aware that many important Windows kernel-mode system threads run in the Real-Time priority range, so if threads spend excessive time running in this range, they might block critical system functions (such as in the memory manager, cache manager, or some device drivers).
Using the standard Windows APIs, once a process has entered the Real-Time range, all its threads (even Idle ones) must run at one of the Real-Time priority levels. It is thus impossible to mix real-time and dynamic threads within the same process through standard interfaces. This is because the SetThreadPriority API calls the native NtSetInformationThread API with the ThreadBasePriority information class, which allows priorities to remain only in the same range.
Furthermore, this information class allows priority changes only in the recognized Windows API deltas of -2 to 2 (or Time-Critical/Idle) unless the request comes from CSRSS or another real-time process. In other words, this means that a real-time process can pick thread priorities anywhere between 16 and 31, even though the standard Windows API relative thread priorities would seem to limit its choices based on the table that was shown earlier.
As mentioned, calling SetThreadPriority with one of a set of special values causes a call to NtSetInformationThread with the ThreadActualBasePriority information class, through which the kernel base priority for the thread can be set directly, including in the dynamic range for a real-time process.
Thread States
The thread states are as follows:
Ready - A thread in the ready state is waiting to execute or to be in-swapped after completing a wait. When looking for a thread to execute, the dispatcher considers only the threads in the ready state.
Deferred ready - This state is used for threads that have been selected to run on a specific processor but have not actually started running there. This state exists so that the kernel can minimize the amount of time the per-processor lock on the scheduling database is held.
Standby - A thread in this state has been selected to run next on a particular processor. When the correct conditions exist, the dispatcher performs a context switch to this thread. Only one thread can be in the standby state for each processor on the system. Note that a thread can be preempted out of the standby state before it ever executes (if, for example, a higher-priority thread becomes runnable before the standby thread begins execution).
Running - After the dispatcher performs a context switch to a thread, the thread enters the running state and executes. The thread's execution continues until its quantum ends (and another thread at the same priority is ready to run), it is preempted by a higher-priority thread, it terminates, it yields execution, or it voluntarily enters the waiting state.
Waiting - A thread can enter the waiting state in several ways: A thread can voluntarily wait for an object to synchronize its execution, the OS can wait on the thread's behalf (such as to resolve a paging I/O), or an environment subsystem can direct the thread to suspend itself. When the thread's wait ends, depending on its priority, the thread either begins running immediately or is moved back to the ready state.
Transition - A thread enters the transition state if it is ready for execution but its kernel stack is paged out of memory. After its kernel stack is brought back into memory, the thread enters the ready state.
Terminated - When a thread finishes executing, it enters this state. After the thread is terminated, the executive thread object (the data structure in system memory that describes the thread) might or might not be deallocated. The object manager sets the policy regarding when to delete the object. For example, the object remains if there are any open handles to the thread. A thread can also enter the terminated state from other states if it's killed explicitly by some other thread - for example, by calling the TerminateThread Windows API.
Initialized - This state is used internally while a thread is being created.

Each state has an internal numeric value, which can be viewed with a tool such as Performance Monitor. The ready and deferred ready states are represented as one. This reflects the fact that the deferred ready state acts as a temporary placeholder for the scheduling routines. This is true for the standby state as well.
Dispatcher Database
To make thread-scheduling decisions, the kernel maintains a set of data structures known collectively as the dispatcher database. The dispatcher database keeps track of which threads are waiting to execute and which processors are executing which threads.
To improve scalability, including thread-dispatching concurrency, Windows multiprocessor systems have per-processor dispatcher ready queues and shared processor group queues. In this way, each CPU can check its own shared ready queue for the next thread to run without having to lock the system-wide ready queues.

Starting with Windows 8 and Windows Server 2012, a shared ready queue and ready summary are used for a group of processors. This enables the system to make better decisions about which processor to use next for that group of processors. (The per-CPU ready queues are still there and used for threads with affinity constraints.)
The maximum group size is four processors. If the number of logical processors is greater than four, more than one group is created, with the available processors spread evenly among the groups.
The ready queues, ready summary, and some other information are stored in a kernel structure named KSHARED_READY_QUEUE that is stored in the processor control block (PRCB). Although it exists for every processor, it's used only on the first processor of each processor group, which shares it with the rest of the processors in that group.
The dispatcher ready queues (ReadyListHead in KSHARED_READY_QUEUE) contain the threads that are in the ready state, waiting to be scheduled for execution. There is one queue for each of the 32 priority levels.
To speed up the selection of which thread to run or preempt, Windows maintains a 32-bit bitmask called the ready summary (ReadySummary). Each bit set indicates one or more threads in the ready queue for that priority level (bit 0 represents priority 0, bit 1 priority 1, and so on). Instead of scanning each ready list to see whether it is empty or not (which would make scheduling decisions dependent on the number of different priority threads), a single bit scan is performed as a native processor command to find the highest bit set. Regardless of the number of threads in the ready queue, this operation takes a constant amount of time.
The dispatcher database is synchronized by raising IRQL to DISPATCH_LEVEL (2). Raising IRQL in this way prevents other threads from interrupting thread dispatching on the processor because threads normally run at IRQL 0 or 1.
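The constant-time selection can be illustrated with a compiler intrinsic; this sketch is user-mode MSVC code for demonstration, not the kernel's implementation:

```c
#include <intrin.h>

// Find the highest-priority non-empty ready queue with one bit scan.
int HighestReadyPriority(unsigned long readySummary)
{
    unsigned long index;
    if (_BitScanReverse(&index, readySummary))  // highest set bit
        return (int)index;                      // priority level 0-31
    return -1;                                  // no ready threads
}
```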
Quantum
A quantum is the amount of time a thread is permitted to run before Windows checks to see whether another thread at the same priority is waiting to run. If a thread completes its quantum and there are no other threads at its priority, Windows permits the thread to run for another quantum.
On client versions of Windows, threads run for two clock intervals by default. On server systems, threads run for 12 clock intervals by default. The rationale for the longer default value on server systems is to minimize context switching. By having a longer quantum, server applications that wake up because of a client request have a better chance of completing the request and going back into a wait state before their quantum ends.
The length of the clock interval varies according to the hardware platform. The frequency of the clock interrupts is up to the HAL, not the kernel. This clock interval is stored in the kernel variable KeMaximumIncrement as hundreds of nanoseconds.
Although threads run in units of clock intervals, the system does not use the count of clock ticks as the gauge for how long a thread has run and whether its quantum has expired. This is because thread run-time accounting is based on processor cycles. When the system starts up, it multiplies the processor speed (CPU clock cycles per second) in hertz (Hz) by the number of seconds it takes for one clock tick to fire (based on the KeMaximumIncrement value described earlier) to calculate the number of clock cycles to which each quantum is equivalent. This value is stored in the kernel variable KiCyclesPerClockQuantum.
The result of this accounting method is that threads do not actually run for a quantum number based on clock ticks. Instead, they run for a quantum target, which represents an estimate of what the number of CPU clock cycles the thread has consumed should be when its turn would be given up. This target should be equal to an equivalent number of clock interval timer ticks.
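The arithmetic can be made concrete with assumed example values: a 2 GHz processor and a 15.625 ms clock interval (KeMaximumIncrement = 156250 in 100 ns units); both numbers are illustrative, not fixed system values:

```c
#include <stdio.h>

int main(void) {
    unsigned long long cyclesPerSecond = 2000000000ULL;  // assumed 2 GHz CPU
    unsigned long long maxIncrement100ns = 156250ULL;    // assumed 15.625 ms tick

    double tickSeconds = maxIncrement100ns * 100e-9;     // 0.015625 s
    unsigned long long cyclesPerTick =
        (unsigned long long)(cyclesPerSecond * tickSeconds);

    // A quantum unit is one-third of a clock tick (see "Quantum Accounting").
    unsigned long long cyclesPerQuantum = cyclesPerTick / 3;

    printf("Cycles per clock tick   : %llu\n", cyclesPerTick);    // 31250000
    printf("Cycles per quantum unit : %llu\n", cyclesPerQuantum); // 10416666
    return 0;
}
```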
Quantum Accounting
Each process has a quantum reset value in the process control block (KPROCESS). This value is used when creating new threads inside the process and is duplicated in the thread control block (KTHREAD), which is then used when giving a thread a new quantum target. The quantum reset value is stored in terms of actual quantum units, which are then multiplied by the number of clock cycles per quantum, resulting in the quantum target.
As a thread runs, CPU clock cycles are charged at different events, such as context switches, interrupts, and certain scheduling decisions. If, at a clock interval timer interrupt, the number of CPU clock cycles charged has reached (or passed) the quantum target, quantum end processing is triggered. If there is another thread at the same priority waiting to run, a context switch occurs to the next thread in the ready queue.
Internally, a quantum unit is represented as one-third of a clock tick. That is, one clock tick equals three quantums. This means that on client Windows systems, threads have a quantum reset value of 6 (2 × 3) and that server systems have a quantum reset value of 36 (12 × 3) by default. For this reason, the KiCyclesPerClockQuantum value is divided by 3 at the end of the calculation previously described, because the original value describes only CPU clock cycles per clock interval timer tick.
The reason a quantum was stored internally as a fraction of a clock tick rather than as an entire tick was to allow for partial quantum decay-on-wait completion on versions of Windows prior to Windows Vista. Prior versions used the clock interval timer for quantum expiration. If this adjustment had not been made, it would have been possible for threads to never have their quantums reduced. For example, if a thread ran, entered a wait state, ran again, and entered another wait state but was never the currently running thread when the clock interval timer fired, it would never have its quantum charged for the time it was running. Because threads now have CPU clock cycles charged instead of quantums, and because this no longer depends on the clock interval timer, these adjustments are not required.
Variable Quantums