Threads

Chapter 4

Thread Creation

  • The simplest creation function in user mode is CreateThread. This function creates a thread in the current process, accepting the following arguments:

    • An optional security attributes structure - This specifies the security descriptor to attach to the newly created thread. It also specifies whether the thread handle is to be created as inheritable.

    • An optional stack size - If zero is specified, a default is taken from the executable's header. This always applies to the first thread in a user-mode process.

    • A function pointer - This serves as the entry point for the new thread's execution.

    • An optional argument - This is to pass to the thread's function.

    • Optional flags - One controls whether the thread starts suspended (CREATE_SUSPENDED). The other controls the interpretation of the stack size argument (initial committed size or maximum reserved size).

  • On successful completion, a non-zero handle is returned for the new thread and, if requested by the caller, the unique thread ID.
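The flag-driven interpretation of the stack-size argument described above can be sketched as follows. CREATE_SUSPENDED (0x4) and STACK_SIZE_PARAM_IS_A_RESERVATION (0x10000) are the documented Windows constant values; the default sizes here are illustrative stand-ins for what would come from the executable's header.

```python
# Sketch: how CreateThread's creation flags select the meaning of the
# stack-size argument. Flag values are the documented Windows constants;
# the default sizes are illustrative, not real header values.
CREATE_SUSPENDED = 0x00000004
STACK_SIZE_PARAM_IS_A_RESERVATION = 0x00010000

def interpret_stack_size(stack_size, flags,
                         default_commit=0x1000, default_reserve=0x100000):
    """Return (committed, reserved) stack sizes for the new thread."""
    if stack_size == 0:
        # Zero means: take both defaults from the executable's header.
        return default_commit, default_reserve
    if flags & STACK_SIZE_PARAM_IS_A_RESERVATION:
        # The argument is the maximum reserved size.
        return default_commit, stack_size
    # Otherwise the argument is the initially committed size.
    return stack_size, default_reserve

def starts_suspended(flags):
    return bool(flags & CREATE_SUSPENDED)
```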

  • An extended thread creation function is CreateRemoteThread. This function accepts an extra argument (the first), which is a handle to a target process where the thread is to be created. You can use this function to inject a thread into another process.

  • One common use of this technique is for a debugger to force a break in a debugged process. The debugger injects the thread, which immediately causes a breakpoint by calling the DebugBreak function. Another common use of this technique is for one process to obtain internal information about another process, which is easier when running within the target process context.

  • The final function worth mentioning here is CreateRemoteThreadEx, which is a superset of CreateThread and CreateRemoteThread. In fact, the implementation of CreateThread and CreateRemoteThread simply calls CreateRemoteThreadEx with the appropriate defaults. CreateRemoteThreadEx adds the ability to provide an attribute list (similar to the STARTUPINFOEX structure's role with an additional member over STARTUPINFO when creating processes). Examples of attributes include setting the ideal processor and group affinity.

  • If all goes well, CreateRemoteThreadEx eventually calls NtCreateThreadEx in Ntdll.dll. This makes the usual transition to kernel mode, where execution continues in the executive function NtCreateThreadEx. There, the kernel mode part of thread creation occurs.

  • Exiting a kernel thread's function does not automatically destroy the thread object. Instead, drivers must call PsTerminateSystemThread from within the thread function to properly terminate the thread. Consequently, this function never returns.

Thread Internals

  • At the operating-system (OS) level, a Windows thread is represented by an executive thread object. The executive thread object encapsulates an ETHREAD structure, which in turn contains a KTHREAD structure as its first member.

  • The ETHREAD structure and the other structures it points to exist in the system address space. The only exception is the thread environment block (TEB), which exists in the process address space (similar to a PEB, because user-mode components need to access it).

ETHREAD Structure
KTHREAD Structure
  • The Windows subsystem process (Csrss) maintains a parallel structure for each thread created in a Windows subsystem application, called the CSR_THREAD.

  • For threads that have called a Windows subsystem USER or GDI function, the kernel-mode portion of the Windows subsystem (Win32k.sys) maintains a per-thread data structure (W32THREAD) that the KTHREAD structure points to.

  • The first member of the ETHREAD is called Tcb. This is short for thread control block, which is a structure of type KTHREAD. Following that are the thread identification information, the process identification information (including a pointer to the owning process so that its environment information can be accessed), security information in the form of a pointer to the access token and impersonation information, fields relating to Asynchronous Local Procedure Call (ALPC) messages, pending I/O requests (IRPs), and Windows 10–specific fields related to power management and CPU Sets.

  • Internally, the TEB starts with a header called the Thread Information Block (TIB), which mainly exists for compatibility with OS/2 and Win9x applications. It also allows exception and stack information to be kept in a smaller structure when creating new threads by using an initial TIB.

Thread Environment Block Structure
  • The TEB stores context information for the image loader and various Windows DLLs. Because these components run in user mode, they need a data structure writable from user mode. That's why this structure exists in the process address space instead of in the system space, where it would be writable only from kernel mode.

  • The CSR_THREAD is analogous to the CSR_PROCESS data structure but applies to threads. It is maintained by each Csrss process within a session and identifies the Windows subsystem threads running within it. CSR_THREAD stores a handle that Csrss keeps for the thread, various flags, the client ID (thread ID and process ID), and a copy of the thread's creation time.

  • Note that threads are registered with Csrss when they send their first message to Csrss, typically due to some API that requires notifying Csrss of some operation or condition.

CSR_THREAD Structure
  • Finally, the W32THREAD structure is analogous to the data structure of W32PROCESS, but it's applied to threads. This structure mainly contains information useful for the GDI subsystem (brushes and Device Context attributes) and DirectX, as well as for the User Mode Print Driver (UMPD) framework that vendors use to write user-mode printer drivers. It also contains a rendering state useful for desktop compositing and anti-aliasing.

W32THREAD Structure

Birth Of A Thread

  • A thread's life cycle starts when a process creates a new thread. The request filters down to the Windows executive, where the process manager allocates space for a thread object and calls the kernel to initialize the thread control block (KTHREAD).

  • The following steps are taken inside the CreateRemoteThreadEx function in Kernel32.dll to create a Windows thread:

    1. The function converts the Windows API parameters to native flags and builds a native structure describing object parameters (OBJECT_ATTRIBUTES).

    2. It builds an attribute list with two entries: client ID and TEB address.

    3. It determines whether the thread is created in the calling process or another process indicated by the handle passed in. If the handle is equal to the pseudo handle returned from GetCurrentProcess (with a value of -1), then it's the same process. If the process handle is different, it could still be a valid handle to the same process, so a call is made to NtQueryInformationProcess (in Ntdll) to find out whether that is indeed the case.

    4. It calls NtCreateThreadEx (in Ntdll) to make the transition to the executive in kernel mode and continues inside a function with the same name and arguments.

    5. NtCreateThreadEx (inside the executive) creates and initializes the user-mode thread context (its structure is architecture-specific) and then calls PspCreateThread to create a suspended executive thread object. (For a description of the steps performed by this function, see the descriptions of stage 3 and stage 5 in Chapter 3 in the section "Flow of CreateProcess.") Then the function returns, eventually ending back in user mode at CreateRemoteThreadEx.

    6. CreateRemoteThreadEx allocates an activation context for the thread used by side-by-side assembly support. It then queries the activation stack to see if it requires activation and activates it if needed. The activation stack pointer is saved in the new thread's TEB.

    7. Unless the caller created the thread with the CREATE_SUSPENDED flag set, the thread is now resumed so that it can be scheduled for execution. When the thread starts running, it executes the steps described in Chapter 3 in the section "Stage 7: performing process initialization in the context of the new process" before calling the actual user's specified start address.

    8. The thread handle and the thread ID are returned to the caller.
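The same-process determination in step 3 can be sketched as follows. GetCurrentProcess returns the pseudo handle -1; any other handle value requires a query to compare process IDs. Here `query_pid` is a hypothetical stand-in for the NtQueryInformationProcess call.

```python
# Sketch of step 3: deciding whether the target handle refers to the
# calling process. -1 is the documented pseudo-handle value returned by
# GetCurrentProcess; query_pid models NtQueryInformationProcess.
CURRENT_PROCESS_PSEUDO_HANDLE = -1

def is_same_process(handle, own_pid, query_pid):
    if handle == CURRENT_PROCESS_PSEUDO_HANDLE:
        return True
    # A distinct, valid handle can still name the caller's own process,
    # so the process ID behind the handle must be compared.
    return query_pid(handle) == own_pid
```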

Protected Process Thread Limitations

  • Limitations of Protected Process and Protected Process Light (PPL) also apply to the threads inside the process. This ensures that the actual code running inside the protected process cannot be hijacked or otherwise affected through standard Windows functions, which require access rights that are not granted for protected process threads. In fact, the only permissions granted are THREAD_SUSPEND_RESUME and THREAD_SET/QUERY_LIMITED_INFORMATION.

Group-Based Scheduling

  • Standard thread-based scheduling in Windows, which we will cover in detail later, reliably serves general user and server scenarios.

  • However, because thread-based scheduling attempts to fairly share the processor or processors only among competing threads of the same priority, it does not account for higher-level requirements such as the distribution of threads to users and the potential for certain users to benefit from more overall CPU time at the expense of other users.

  • This is problematic in terminal-services environments, in which dozens of users compete for CPU time. If only thread-based scheduling is used, a single high-priority thread from a given user has the potential to starve threads from all users on the machine.

  • Windows 8 and Server 2012 introduced a group-based scheduling mechanism, built around the concept of a scheduling group (KSCHEDULING_GROUP). A scheduling group maintains a policy, scheduling parameters, and a list of kernel scheduling control blocks (KSCBs), one per processor, that are part of the scheduling group.

  • Conversely, each thread holds a pointer to the scheduling group it belongs to. If that pointer is null, the thread is outside any scheduling group's control.

  • In this figure, threads T1, T2, and T3 belong to the scheduling group, while thread T4 does not.

  • Here are some terms related to group scheduling:

    • Generation - This is the amount of time over which to track CPU usage.

    • Quota - This is the amount of CPU usage allowed to a group per generation. Over quota means the group has used up all its budget. Under quota means the group has not used its full budget.

    • Weight - This is the relative importance of a group, between 1 and 9, where the default is 5.

    • Fair-share scheduling - With this type of scheduling, idle cycles can be given to threads that are over quota if no under-quota threads want to run.
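The generation/quota bookkeeping described by these terms can be sketched as follows. Class and method names are illustrative, not kernel fields; only the weight range (1 to 9, default 5) and the over/under-quota semantics come from the text.

```python
# Sketch of per-generation quota accounting for a scheduling group:
# a generation is the tracking window, the quota is the per-generation
# CPU budget, and the budget resets when a new generation begins.
class SchedulingGroup:
    def __init__(self, quota_cycles, weight=5):
        assert 1 <= weight <= 9          # default weight is 5
        self.quota_cycles = quota_cycles
        self.weight = weight
        self.used_cycles = 0

    def charge(self, cycles):
        self.used_cycles += cycles       # cycle usage for this generation

    def over_quota(self):
        return self.used_cycles >= self.quota_cycles

    def new_generation(self):
        self.used_cycles = 0             # budget resets each generation
```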

  • The KSCB structure contains CPU-related information as follows:

    • Cycle usage for this generation

    • Long-term average cycle usage, so that a burst of thread activity can be distinguished from a true hog

    • Control flags such as hard capping, which means that even if CPU time is available above the assigned quota, it will not be used to give the thread extra CPU time

    • Ready queues, based on the standard priorities (0 to 15 only because real-time threads are never part of a scheduling group)

  • An important parameter maintained by a scheduling group is called rank, which can be considered a scheduling priority of the entire group of threads. A rank with a value of 0 is the highest. A higher-rank number means the group has used more CPU time and so is less likely to get more CPU time.

  • Rank always trumps priority. This means that given two threads with different ranks, the lower value rank is preferred, regardless of priority. Equal-rank threads are compared based on priority. The rank is adjusted periodically as cycle usage increases.

  • Rank 0 is the highest rank: it always wins against any higher-numbered rank, and it is implicit for some threads. A rank of 0 can indicate one of the following:

    • The thread is not in any scheduling group ("normal" threads)

    • Under-quota threads

    • Real-time priority threads (16–31)

    • Threads executing at IRQL APC_LEVEL (1) within a kernel critical or guarded region

  • If a scheduling group exists, the lowest value rank wins out, followed by priority (if ranks are equal), followed by the first arriving thread (if priorities are equal; round-robin at quantum end).
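The selection rule above can be sketched as a comparison key: lower rank wins outright, ties on rank fall back to higher priority, and ties on both fall back to arrival order (round-robin). This is a simplified model of the decision, not the dispatcher's actual code.

```python
# Sketch of rank-then-priority selection: rank 0 models threads outside
# any group, under-quota threads, real-time threads, and APC_LEVEL threads.
def pick_next(candidates):
    """candidates: list of (rank, priority, arrival) tuples."""
    # Lower rank first, then higher priority, then earlier arrival.
    return min(candidates, key=lambda t: (t[0], -t[1], t[2]))
```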

Dynamic Fair Share Scheduling

  • Dynamic fair share scheduling (DFSS) is a mechanism that can be used to fairly distribute CPU time among sessions running on a machine. It prevents one session from potentially monopolizing the CPU if some threads running under that session have a relatively high priority and run a lot.

  • During the very last parts of system initialization, as the registry SOFTWARE hive is initialized by Smss.exe, the process manager initiates the final post-boot initialization in PsBootPhaseComplete, which calls PspIsDfssEnabled. Here, the system decides which of the two CPU quota mechanisms (DFSS or legacy) will be employed.

  • For DFSS to be enabled, the EnableCpuQuota registry value must be set to a non-zero value in both of the quota keys. The first of these is HKLM\SOFTWARE\Policies\Microsoft\Windows\Session Manager\Quota System, for the policy-based setting. The second is HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Quota System, under the system key. This determines whether the system supports the functionality (which, by default, is set to TRUE on Windows Server with the Remote Desktop role).
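The two registry values above could be expressed as a .reg fragment like the following. The value name and key paths come from the text; the REG_DWORD type and the value 1 are assumptions (the text only requires a non-zero value).

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\Session Manager\Quota System]
"EnableCpuQuota"=dword:00000001

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Quota System]
"EnableCpuQuota"=dword:00000001
```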

  • If DFSS is enabled, the PsCpuFairShareEnabled global variable is set to TRUE, which makes all threads belong to scheduling groups (except session 0 processes).

  • DFSS configuration parameters are read from the aforementioned keys by a call to PspReadDfssConfigurationValues and stored in global variables. These keys are monitored by the system. If modified, the notification callback calls PspReadDfssConfigurationValues again to update the configuration values.

  • After DFSS is enabled, whenever a new session is created (other than session 0), MiSessionObjectCreate allocates a scheduling group associated with the session with the default weight of 5, which is the middle ground between the minimum of 1 and the maximum of 9.

  • A scheduling group manages either DFSS or CPU rate-control information, based on a policy structure (KSCHEDULING_GROUP_POLICY) that is part of the scheduling group. The Type member indicates whether it's configured for DFSS (WeightBased=0) or rate control (RateControl=1).

  • MiSessionObjectCreate calls KeInsertSchedulingGroup to insert the scheduling group into a global system list (maintained in the global variable KiSchedulingGroupList, needed for weight recalculation if processors are hot-added). The resulting scheduling group is also pointed to by the SESSION_OBJECT structure for the particular session.

CPU rate limits

  • DFSS works by automatically placing new threads inside the session-scheduling group. This is fine for a terminal-services scenario, but is not good enough as a general mechanism to limit the CPU time of threads or processes.

  • The scheduling-group infrastructure can be used in a more granular fashion by using a job object. One of the limitations you can place on a job is a CPU rate control, which you do by calling SetInformationJobObject with JobObjectCpuRateControlInformation as the job information class and a structure of type JOBOBJECT_CPU_RATE_CONTROL_INFORMATION containing the actual control data.

  • The structure contains a set of flags that enable you to apply one of three settings to limit CPU time:

    • CPU rate - This value can be between 1 and 10000 and represents a percent multiplied by 100 (for example, for 40 percent the value should be 4000).

    • Weight-based - This value can be between 1 and 9, relative to the weight of other jobs. (DFSS is configured with this setting.)

    • Minimum and maximum CPU rates - These values are specified similarly to the first option. When the threads in the job reach the maximum percentage specified in the measuring interval (600 ms by default), they cannot get any more CPU time until the next interval begins. You can use a control flag to specify whether to use hard capping to enforce the limit even if there is spare CPU time available.
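The percent-times-100 encoding and the weight range described above can be sketched as follows. Function names are illustrative helpers, not SDK APIs; only the 1–10000 and 1–9 ranges come from the text.

```python
# Sketch of the value encodings for JOBOBJECT_CPU_RATE_CONTROL_INFORMATION:
# a CPU rate is a percentage multiplied by 100, a weight is 1..9.
def encode_cpu_rate(percent):
    value = int(percent * 100)           # e.g., 40% -> 4000
    if not 1 <= value <= 10000:
        raise ValueError("rate must encode to a value between 1 and 10000")
    return value

def validate_weight(weight):
    if not 1 <= weight <= 9:             # relative to other jobs' weights
        raise ValueError("weight must be between 1 and 9")
    return weight
```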

Dynamic processor addition and replacement

  • Dynamic processor support for CPU hot-swapping is provided through the HAL, which notifies the kernel of a new processor on the system through the KeStartDynamicProcessor function. This routine does similar work to that performed when the system detects more than one processor at startup and needs to initialize the structures related to them.

  • When a dynamic processor is added, various system components perform some additional work. For example, the memory manager allocates new pages and memory structures optimized for the CPU. It also initializes a new DPC kernel stack while the kernel initializes the global descriptor table (GDT), the interrupt dispatch table (IDT), the processor control region (PCR), the process control block (PRCB), and other related structures for the processor.

  • Other executive parts of the kernel are also called, mostly to initialize the per-processor look-aside lists for the processor that was added. For example, the I/O manager, executive look-aside list code, cache manager, and object manager all use per-processor look-aside lists for their frequently allocated structures.

  • Finally, the kernel initializes threaded DPC support for the processor and adjusts exported kernel variables to report the new processor. Different memory-manager masks and process seeds based on processor counts are also updated, and processor features need to be updated for the new processor to match the rest of the system; for example, enabling virtualization support on the newly added processor. The initialization sequence completes with the notification to the Windows Hardware Error Architecture (WHEA) component that a new processor is online.

  • The HAL is also involved in this process. It is called once to start the dynamic processor after the kernel is aware of it, and called again after the kernel has finished initialization of the processor.

  • Drivers are notified of the newly added processor using the default executive callback object, ProcessorAdd, with which drivers can register for notifications. Once drivers are notified, the final kernel component called is the Plug and Play manager, which adds the processor to the system's device node and rebalances interrupts so that the new processor can handle interrupts that were already registered for other processors.

  • Applications do not take advantage of a dynamically added processor by default. They must request it. This is done to prevent potential race conditions or misdistribution of work.

  • The SetProcessAffinityUpdateMode and QueryProcessAffinityUpdateMode Windows APIs (which use the undocumented NtSet/QueryInformationProcess system calls) tell the process manager that these applications should have their affinity updated (by setting the AffinityUpdateEnable flag in EPROCESS) or that they do not want to deal with affinity updates (by setting the AffinityPermanent flag in EPROCESS). This is a one-time change. After an application has told the system that its affinity is permanent, it cannot later change its mind and request affinity updates.
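The one-way nature of this setting can be sketched as a small state machine. The flag names echo the EPROCESS flags mentioned in the text; the class itself is purely illustrative.

```python
# Sketch: once a process marks its affinity permanent, it can no longer
# opt back in to dynamic affinity updates.
class AffinityUpdateMode:
    def __init__(self):
        self.affinity_update_enable = False
        self.affinity_permanent = False

    def enable_updates(self):
        if self.affinity_permanent:
            return False                 # permanent choice cannot be undone
        self.affinity_update_enable = True
        return True

    def make_permanent(self):
        self.affinity_permanent = True
        self.affinity_update_enable = False
```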

  • As part of KeStartDynamicProcessor, a new step has been added after interrupts are rebalanced: calling the process manager to perform affinity updates through PsUpdateActiveProcessAffinity. Some Windows core processes and services already have affinity updates enabled, while third-party software will need to be recompiled to take advantage of the new API call. The System process, Svchost processes, and Smss are all compatible with dynamic processor addition.

Worker Factories (Thread Pool)

  • Worker factories are the internal mechanism used to implement user-mode thread pools. Most of the functionality required to support the user-mode thread pool implementation in Windows is now located in the kernel. Ntdll.dll merely provides the interfaces and high-level APIs required for interfacing with the worker factory kernel code.

  • This kernel thread pool functionality in Windows is managed by an object manager type called TpWorkerFactory, as well as four native system calls for managing the factory and its workers (NtCreateWorkerFactory, NtWorkerFactoryWorkerReady, NtReleaseWorkerFactoryWorker, and NtShutdownWorkerFactory); two query/set native calls (NtQueryInformationWorkerFactory and NtSetInformationWorkerFactory); and a wait call (NtWaitForWorkViaWorkerFactory).

  • Just like other native system calls, these calls provide user mode with a handle to the TpWorkerFactory object, which means developers work with opaque descriptors: a TP_POOL pointer for a thread pool and other opaque pointers for objects created from a pool, including TP_WORK (work callback), TP_TIMER (timer callback), TP_WAIT (wait callbacks), etc. These structures hold various pieces of information, such as the handle to the TpWorkerFactory object.

  • A worker factory will create a new thread whenever all of the following conditions are met:

    • Dynamic thread creation is enabled.

    • The number of available workers is lower than the maximum number of workers configured for the factory (default of 500).

    • The worker factory has bound objects (for example, an ALPC port that this worker thread is waiting on) or a thread has been activated into the pool.

    • There are pending I/O request packets associated with a worker thread.

  • In addition, it will terminate threads whenever they've become idle for more than 10 seconds (by default).
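The dynamic-thread-creation test implied by the list above can be sketched as a conjunction of its four conditions. The default limits (500 workers, 10-second idle timeout) come from the text; the function itself is an illustrative model.

```python
# Sketch: a worker factory creates a new worker thread only when every
# one of the listed conditions holds at the same time.
MAX_WORKERS_DEFAULT = 500
IDLE_TIMEOUT_SECONDS = 10   # idle workers are terminated after this (default)

def should_create_worker(dynamic_enabled, available_workers,
                         has_bound_objects_or_activation, pending_irps,
                         max_workers=MAX_WORKERS_DEFAULT):
    return (dynamic_enabled
            and available_workers < max_workers
            and has_bound_objects_or_activation
            and pending_irps > 0)
```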

  • The "release" worker factory call (which queues work) is a wrapper around IoSetIoCompletionEx, which increases pending work, while the "wait" call is a wrapper around IoRemoveIoCompletion. Both these routines call into the kernel queue implementation.

  • Therefore, the job of the worker factory code is to manage either a persistent, static, or dynamic thread pool; wrap the I/O completion port model into interfaces that try to prevent stalled worker queues by automatically creating dynamic threads; and simplify global cleanup and termination operations during a factory shutdown request.

  • The executive function that creates the worker factory, NtCreateWorkerFactory, accepts several arguments that allow customization of the thread pool, such as the maximum threads to create and the initial committed and reserved stack sizes.

  • The CreateThreadpool Windows API, however, uses the default stack sizes embedded in the executable image (just as a default CreateThread would), and it does not provide a way to override these defaults. This is somewhat unfortunate, as in many cases thread-pool threads don't require deep call stacks, and it would be beneficial to allocate smaller stacks.

  • The NtQueryInformationWorkerFactory API dumps almost every field in the worker factory structure.

Thread Scheduling

  • Windows implements a priority-driven, preemptive thread scheduling system. At least one of the highest-priority runnable (ready) threads always runs, with the caveat that certain high-priority threads ready to run might be limited by the processors on which they are allowed or preferred to run, a phenomenon called processor affinity.

  • Processor affinity is defined based on a given processor group, which collects up to 64 processors. By default, threads can run only on available processors within the processor group associated with the process. (This is to maintain compatibility with older versions of Windows, which supported only 64 processors.)

  • Developers can alter processor affinity by using the appropriate APIs or by setting an affinity mask in the image header, and users can use tools to change affinity at run time or at process creation.

  • However, although multiple threads in a process can be associated with different groups, a thread on its own can run only on the processors available within its assigned group.

  • Additionally, developers can choose to create group-aware applications, which use extended scheduling APIs to associate logical processors on different groups with the affinity of their threads. Doing so converts the process into a multigroup process that can theoretically run its threads on any available processor within the machine.
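Because a processor group holds at most 64 logical processors, a group-relative affinity fits in a 64-bit mask paired with a group number (mirroring the idea behind the Windows GROUP_AFFINITY structure). The helper below is an illustrative sketch, not the SDK definition.

```python
# Sketch of group-relative affinity: a (group, mask) pair where each bit
# of the 64-bit mask selects one logical processor within that group.
def group_affinity(group, processors):
    mask = 0
    for p in processors:
        if not 0 <= p < 64:
            raise ValueError("processor index must be group-relative (0..63)")
        mask |= 1 << p
    return group, mask
```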

  • After a thread is selected to run, it runs for an amount of time called a quantum. A quantum is the length of time a thread is allowed to run before another thread at the same priority level is given a turn to run.

  • Quantum values can vary from system to system and process to process for any of three reasons:

    • System configuration settings (long or short quantums, variable or fixed quantums, and priority separation)

    • Foreground or background status of the process

    • Use of the job object to alter the quantum

  • A thread might not get to complete its quantum, however, because Windows implements a preemptive scheduler. That is, if another thread with a higher priority becomes ready to run, the currently running thread might be preempted before finishing its time slice. In fact, a thread can be selected to run next and be preempted before even beginning its quantum!

  • The Windows scheduling code is implemented in the kernel. There's no single "scheduler" module or routine, however. The code is spread throughout the kernel in which scheduling-related events occur. The routines that perform these duties are collectively called the kernel's dispatcher.

  • The following events might require thread dispatching:

    • A thread becomes ready to execute.

    • A thread leaves the running state because its time quantum ends, it terminates, it yields execution, or it enters a wait state.

    • A thread's priority changes, either because of a system service call or because Windows itself changes the priority value.

    • A thread's processor affinity changes so that it will no longer run on the processor on which it was running.

  • After a logical processor has selected a new thread to run, it eventually performs a context switch to it. A context switch is the procedure of saving the volatile processor state associated with a running thread, loading another thread's volatile state, and starting the new thread's execution.

  • Windows schedules at the thread granularity level. Scheduling decisions are made strictly on a thread basis; no consideration is given to what process the thread belongs to.

Priority Levels

  • Windows uses 32 priority levels internally, ranging from 0 to 31 (31 is the highest):

    • Sixteen real-time levels (16 through 31)

    • Sixteen variable levels (0 through 15), out of which level 0 is reserved for the zero page thread.

Thread Priority Levels
  • Thread priority levels are assigned from two different perspectives: those of the Windows API and those of the Windows kernel. The Windows API first organizes processes by the priority class to which they are assigned at creation (the numbers in parentheses represent the internal PROCESS_PRIORITY_CLASS index recognized by the kernel):

    • Real-Time (4)

    • High (3)

    • Above Normal (6)

    • Normal (2)

    • Below Normal (5)

    • Idle (1)

  • The Windows API SetPriorityClass allows changing a process's priority class to one of these levels.

  • It then assigns a relative priority to the individual threads within those processes. Here, the numbers represent a priority delta that is applied to the process base priority:

    • Time-Critical (15)

    • Highest (2)

    • Above-Normal (1)

    • Normal (0)

    • Below-Normal (–1)

    • Lowest (–2)

    • Idle (–15)

  • Time-Critical and Idle levels (+15 and –15) are called saturation values and represent specific levels that are applied rather than true offsets. These values can be passed to the SetThreadPriority Windows API to change a thread's relative priority.

  • Therefore, in the Windows API, each thread has a base priority that is a function of its process priority class and its relative thread priority. In the kernel, the process priority class is converted to a base priority by using the PspPriorityTable global array and the PROCESS_PRIORITY_CLASS indices shown earlier, which sets priorities of 4, 8, 13, 24, 6, and 10, respectively. (This is a fixed mapping that cannot be changed.)

  • The relative thread priority is then applied as a differential to this base priority. For example, a Highest thread will receive a thread base priority of two levels higher than the base priority of its process.

Available Thread Priorities
  • The Time-Critical and Idle relative thread priorities maintain their respective values regardless of the process priority class (unless it is Real-Time) because the Windows API requests saturation of the priority from the kernel, by passing in +16 or –16 as the requested relative priority. The formula used to get these values is as follows (HIGH_PRIORITY equals 31):

    Time-Critical: Saturation = (HIGH_PRIORITY + 1) / 2 = +16
    Idle: Saturation = –(HIGH_PRIORITY + 1) / 2 = –16

  • These values are then recognized by the kernel as a request for saturation, and the Saturation field in KTHREAD is set. For positive saturation, this causes the thread to receive the highest possible priority within its priority class (dynamic or real-time); for negative saturation, it's the lowest possible one. Additionally, future requests to change the base priority of the process will no longer affect the base priority of these threads because saturated threads are skipped in the processing code.
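The class-to-base mapping and saturation behavior described above can be sketched as follows. The base-priority table values (indices 1 through 6) and the +16/–16 saturation semantics come from the text; the function itself is a simplified model, not kernel code.

```python
# Sketch: kernel conversion of a process priority class index to a base
# priority, plus application of the relative thread priority. +16/-16
# request saturation to the top/bottom of the class's range.
BASE_FROM_CLASS_INDEX = {1: 4, 2: 8, 3: 13, 4: 24, 5: 6, 6: 10}
# indices: 1=Idle, 2=Normal, 3=High, 4=Real-Time, 5=Below Normal, 6=Above Normal

def thread_base_priority(class_index, relative):
    realtime = class_index == 4
    if relative == 16:                   # positive saturation
        return 31 if realtime else 15
    if relative == -16:                  # negative saturation
        return 16 if realtime else 1
    return BASE_FROM_CLASS_INDEX[class_index] + relative
```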

  • Regardless of how the thread's priority came to be by using the Windows API, from the point of view of the scheduler, only the final result matters.

  • Whereas a process has only a single base priority value, each thread has two priority values: current (dynamic) and base. Scheduling decisions are made based on the current priority (which can be altered using priority boosts). Windows never adjusts the priority of threads in the Real-Time range (16 through 31), so they always have the same base and current priority.

  • A thread's initial base priority is inherited from the process base priority. A process, by default, inherits its base priority from the process that created it. You can override this behavior in the call to CreateProcess or by using the command-line start command. You can also change a process priority after it is created by using the SetPriorityClass function or by using various tools that expose that function, such as Task Manager or Process Explorer.

  • Changing the priority of a process changes the thread priorities up or down, but their relative settings remain the same.

  • Normally, user applications and services start with a normal base priority, so their initial thread typically executes at priority level 8. However, some Windows system processes (such as the Session Manager, Service Control Manager, and local security authentication process) have a base process priority slightly higher than the default for the Normal class (8). This higher default value ensures that the threads in these processes will all start at a higher priority than the default value of 8.

Real-Time Priorities

  • You can raise or lower thread priorities within the dynamic range in any application. However, you must have the increase scheduling priority privilege (SeIncreaseBasePriorityPrivilege) to enter the Real-Time range.

  • Be aware that many important Windows kernel-mode system threads run in the Real-Time priority range, so if threads spend excessive time running in this range, they might block critical system functions (such as in the memory manager, cache manager, or some device drivers).

  • Using the standard Windows APIs, once a process has entered the Real-Time range, all its threads (even Idle ones) must run at one of the Real-Time priority levels. It is thus impossible to mix real-time and dynamic threads within the same process through standard interfaces. This is because the SetThreadPriority API calls the native NtSetInformationThread API with the ThreadBasePriority information class, which allows priorities to remain only in the same range.

  • Furthermore, this information class allows priority changes only in the recognized Windows API deltas of –2 to 2 (or Time-Critical/Idle) unless the request comes from CSRSS or another real-time process. In other words, this means that a real-time process can pick thread priorities anywhere between 16 and 31, even though the standard Windows API relative thread priorities would seem to limit its choices based on the table that was shown earlier.

  • As mentioned, calling SetThreadPriority with one of a set of special values causes a call to NtSetInformationThread with the ThreadActualBasePriority information class, through which the kernel base priority for the thread can be set directly, including in the dynamic range for a real-time process.

Thread States

  • The thread states are as follows:

    • Ready A thread in the ready state is waiting to execute or to be in-swapped after completing a wait. When looking for a thread to execute, the dispatcher considers only the threads in the ready state.

    • Deferred ready This state is used for threads that have been selected to run on a specific processor but have not actually started running there. This state exists so that the kernel can minimize the amount of time the per-processor lock on the scheduling database is held.

    • Standby A thread in this state has been selected to run next on a particular processor. When the correct conditions exist, the dispatcher performs a context switch to this thread. Only one thread can be in the standby state for each processor on the system. Note that a thread can be preempted out of the standby state before it ever executes (if, for example, a higher-priority thread becomes runnable before the standby thread begins execution).

    • Running After the dispatcher performs a context switch to a thread, the thread enters the running state and executes. The thread’s execution continues until its quantum ends (and another thread at the same priority is ready to run), it is preempted by a higher-priority thread, it terminates, it yields execution, or it voluntarily enters the waiting state.

    • Waiting A thread can enter the waiting state in several ways: A thread can voluntarily wait for an object to synchronize its execution, the OS can wait on the thread’s behalf (such as to resolve a paging I/O), or an environment subsystem can direct the thread to suspend itself. When the thread’s wait ends, depending on its priority, the thread either begins running immediately or is moved back to the ready state.

    • Transition A thread enters the transition state if it is ready for execution but its kernel stack is paged out of memory. After its kernel stack is brought back into memory, the thread enters the ready state.

    • Terminated When a thread finishes executing, it enters this state. After the thread is terminated, the executive thread object (the data structure in system memory that describes the thread) might or might not be deallocated. The object manager sets the policy regarding when to delete the object. For example, the object remains if there are any open handles to the thread. A thread can also enter the terminated state from other states if it’s killed explicitly by some other thread, for example, by calling the TerminateThread Windows API.

    • Initialized This state is used internally while a thread is being created.

  • Each state has an internal numeric value, which can be viewed with a tool such as Performance Monitor. The ready and deferred ready states are represented as one. This reflects the fact that the deferred ready state acts as a temporary placeholder for the scheduling routines. This is true for the standby state as well.
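The internal state values (as reported, for example, by Performance Monitor's Thread State counter) can be sketched as an enumeration; the values follow the documented KTHREAD state encoding:

```python
# Internal thread-state values per the documented KTHREAD state encoding.
from enum import IntEnum

class ThreadState(IntEnum):
    Initialized   = 0
    Ready         = 1
    Running       = 2
    Standby       = 3
    Terminated    = 4
    Waiting       = 5
    Transition    = 6
    DeferredReady = 7

print(ThreadState.Waiting.value)  # prints 5
```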

Dispatcher Database

  • To make thread-scheduling decisions, the kernel maintains a set of data structures known collectively as the dispatcher database. The dispatcher database keeps track of which threads are waiting to execute and which processors are executing which threads.

  • To improve scalability, including thread-dispatching concurrency, Windows multiprocessor systems have per-processor dispatcher ready queues and shared processor group queues. In this way, each CPU can check its own shared ready queue for the next thread to run without having to lock the system-wide ready queues.

Multiprocessor Dispatcher Database
  • Starting with Windows 8 and Windows Server 2012, a shared ready queue and ready summary are used for a group of processors. This enables the system to make better decisions about which processor to use next for that group of processors. (The per-CPU ready queues are still there and used for threads with affinity constraints.)

  • The maximum group size is 4 processors. If there are more than four logical processors, multiple groups are created, and the available processors are spread evenly among them.

  • The ready queues, the ready summary, and some other information are stored in a kernel structure named KSHARED_READY_QUEUE, which is itself stored in the Processor Control Block (PRCB). Although the structure exists for every processor, it is used only on the first processor of each processor group, which shares it with the rest of the processors in that group.

  • The dispatcher ready queues (ReadyListHead in KSHARED_READY_QUEUE) contain the threads that are in the ready state, waiting to be scheduled for execution. There is one queue for each of the 32 priority levels.

  • To speed up the selection of which thread to run or preempt, Windows maintains a 32-bit bitmask called the ready summary (ReadySummary). Each set bit indicates one or more threads in the ready queue for that priority level (bit 0 represents priority 0, bit 1 priority 1, and so on). Instead of scanning each ready list to see whether it is empty (which would make scheduling decisions dependent on the number of different priority threads), a single bit scan, implemented as a native processor instruction, is performed to find the highest set bit. Regardless of the number of threads in the ready queues, this operation takes a constant amount of time.
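The bit-scan described above can be sketched as follows; `int.bit_length` stands in for the processor's bit-scan instruction, and the helper name is hypothetical:

```python
# Sketch of the ReadySummary bitmask: bit n set means at least one ready
# thread exists at priority n. Finding the highest set bit is constant-time
# regardless of how many threads are ready.

def highest_ready_priority(ready_summary):
    if ready_summary == 0:
        return None  # no ready threads at any priority
    return ready_summary.bit_length() - 1  # index of the highest set bit

# Threads ready at priorities 0, 8, and 10:
summary = (1 << 0) | (1 << 8) | (1 << 10)
print(highest_ready_priority(summary))  # prints 10
```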

  • The dispatcher database is synchronized by raising IRQL to DISPATCH_LEVEL (2). Raising IRQL in this way prevents other threads from interrupting thread dispatching on the processor because threads normally run at IRQL 0 or 1.

Quantum

  • A quantum is the amount of time a thread is permitted to run before Windows checks to see whether another thread at the same priority is waiting to run. If a thread completes its quantum and there are no other threads at its priority, Windows permits the thread to run for another quantum.

  • On client versions of Windows, threads run for two clock intervals by default. On server systems, threads run for 12 clock intervals by default. The rationale for the longer default value on server systems is to minimize context switching. By having a longer quantum, server applications that wake up because of a client request have a better chance of completing the request and going back into a wait state before their quantum ends.

  • The length of the clock interval varies according to the hardware platform. The frequency of the clock interrupts is up to the HAL, not the kernel. This clock interval is stored in the kernel variable KeMaximumIncrement in 100-nanosecond units.

  • Although threads run in units of clock intervals, the system does not use the count of clock ticks as the gauge for how long a thread has run and whether its quantum has expired. This is because thread run-time accounting is based on processor cycles. When the system starts up, it multiplies the processor speed (CPU clock cycles per second) in hertz (Hz) by the number of seconds it takes for one clock tick to fire (based on the KeMaximumIncrement value described earlier) to calculate the number of clock cycles to which each quantum is equivalent. This value is stored in the kernel variable KiCyclesPerClockQuantum.

  • The result of this accounting method is that threads do not actually run for a quantum number based on clock ticks. Instead, they run for a quantum target, an estimate of the number of CPU clock cycles the thread should have consumed by the time its turn is given up. This target should be equal to an equivalent number of clock interval timer ticks.
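The cycles-per-tick calculation above can be sketched numerically. The CPU frequency and clock interval below are illustrative values, not fixed constants, and the variable names merely echo the kernel variables they correspond to:

```python
# Sketch of the KiCyclesPerClockQuantum derivation described above.
cpu_hz = 2_000_000_000           # illustrative: a 2 GHz processor
ke_maximum_increment = 156_250   # illustrative: 15.625 ms in 100 ns units

# One second is 10,000,000 hundred-nanosecond units; integer math keeps it exact.
cycles_per_clock_tick = cpu_hz * ke_maximum_increment // 10_000_000
print(cycles_per_clock_tick)     # prints 31250000 (cycles per clock interval)
```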

Quantum Accounting

  • Each process has a quantum reset value in the process control block (KPROCESS). This value is used when creating new threads inside the process and is duplicated in the thread control block (KTHREAD), which is then used when giving a thread a new quantum target. The quantum reset value is stored in terms of actual quantum units, which are then multiplied by the number of clock cycles per quantum, resulting in the quantum target.

  • As a thread runs, CPU clock cycles are charged at different events, such as context switches, interrupts, and certain scheduling decisions. If, at a clock interval timer interrupt, the number of CPU clock cycles charged has reached (or passed) the quantum target, quantum end processing is triggered. If there is another thread at the same priority waiting to run, a context switch occurs to the next thread in the ready queue.

  • Internally, a quantum unit is represented as one-third of a clock tick. That is, one clock tick equals three quantums. This means that on client Windows systems, threads have a quantum reset value of 6 (2 Γ— 3) and that server systems have a quantum reset value of 36 (12 Γ— 3) by default. For this reason, the KiCyclesPerClockQuantum value is divided by 3 at the end of the calculation previously described, because the original value describes only CPU clock cycles per clock interval timer tick.
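The quantum-unit arithmetic above can be sketched as follows. The helper names are hypothetical, and the 31,250,000 cycles-per-tick figure is the illustrative value from the earlier calculation (dividing it by 3 gives cycles per quantum unit, as the text describes):

```python
# Sketch of quantum accounting: quantum units are thirds of a clock tick.
QUANTUM_UNITS_PER_TICK = 3

def quantum_reset_value(clock_ticks):
    """Quantum reset value stored in KPROCESS/KTHREAD, in quantum units."""
    return clock_ticks * QUANTUM_UNITS_PER_TICK

def quantum_target(reset_value, cycles_per_quantum_unit):
    """CPU clock cycles a thread may consume before quantum end."""
    return reset_value * cycles_per_quantum_unit

client = quantum_reset_value(2)    # client default: 2 ticks -> 6 units
server = quantum_reset_value(12)   # server default: 12 ticks -> 36 units
print(client, server)              # prints 6 36
print(quantum_target(client, 31_250_000 // 3))  # prints 62499996
```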

  • The reason a quantum was stored internally as a fraction of a clock tick rather than as an entire tick was to allow for partial quantum decay-on-wait completion on versions of Windows prior to Windows Vista. Prior versions used the clock interval timer for quantum expiration. If this adjustment had not been made, it would have been possible for threads to never have their quantums reduced. For example, if a thread ran, entered a wait state, ran again, and entered another wait state but was never the currently running thread when the clock interval timer fired, it would never have its quantum charged for the time it was running. Because threads now have CPU clock cycles charged instead of quantums, and because this no longer depends on the clock interval timer, these adjustments are not required.

Variable Quantums
