Compact NUMA-aware Locks
20250208927 ยท 2025-06-26
Inventors
Cpc classification
G06F9/52
PHYSICS
International classification
Abstract
A computer comprising multiple processors and non-uniform memory implements multiple threads that perform a lock operation using a shared lock structure that includes a pointer to a tail of a first-in-first-out (FIFO) queue of threads waiting to acquire the lock. To acquire the lock, a thread allocates and appends a data structure to the FIFO queue. The lock is released by selecting and notifying a waiting thread to which control is transferred, with the thread selected executing on the same processor socket as the thread controlling the lock. A secondary queue of threads is managed for threads deferred during the selection process and maintained within the data structures of the waiting threads such that no memory is required within the lock structure. If no threads executing on the same processor socket are waiting for the lock, entries in the secondary queue are transferred to the FIFO queue preserving FIFO order.
Claims
1.-20. (canceled)
21. A method, comprising: maintaining a plurality of first-in-first-out (FIFO) queues individually comprising one or more threads waiting to allocate a lock, the plurality of queues comprising a first queue and a second queue, wherein the second queue comprises one or more threads executing on a non-preferred processor; transferring control of the lock by a controlling thread, comprising: selecting a thread from the plurality of FIFO queues according to a fairness requirement, wherein the thread is selected from the first queue if fairness is not required and the thread is selected from the second queue if fairness is required; and writing, to a wait structure of the selected thread, an address of a first wait structure of the second queue to transfer control of the lock to the selected thread.
22. The method of claim 21, wherein the first queue comprises one or more threads executing on a preferred processor, wherein the controlling thread executes on the preferred processor, and wherein the non-preferred processor comprises different memory access characteristics than the preferred processor.
23. The method of claim 21, further comprising: moving respective wait structures of the one or more threads executing on the non-preferred processor from the second queue to the first queue responsive to determining that fairness is required.
24. The method of claim 23, wherein moving the respective wait structures of the one or more threads executing on the non-preferred processor from the second queue to the first queue comprises inserting the respective wait structures at the head of the first queue to ensure fairness, and wherein the respective wait structures of the one or more threads are linked together preserving the order in which they were enqueued in the secondary queue.
25. The method of claim 21, wherein maintaining the plurality of FIFO queues comprises: adding, by a particular thread, a wait structure to the first queue to wait for allocation of a lock, the wait structure comprising a wait field initialized to a zero value; and moving the wait structure to the second queue responsive to determining that the particular thread executes on the non-preferred processor.
26. The method of claim 25, wherein moving the wait structure to the second queue comprises inserting the wait structure at the tail of the second queue to ensure fairness.
27. The method of claim 25, wherein adding the wait structure to the first queue comprises writing an identifier of a processor executing the particular thread to the wait structure, and wherein determining that the particular thread executes on the non-preferred processor comprises determining that an identifier of the non-preferred processor matches the written identifier.
28. One or more non-transitory computer-accessible storage media storing program instructions that when executed on or across a plurality of processors cause the plurality of processors to perform: maintaining a plurality of first-in-first-out (FIFO) queues individually comprising one or more threads waiting to allocate a lock, the plurality of queues comprising a first queue and a second queue, wherein the second queue comprises one or more threads executing on a non-preferred processor; transferring control of the lock by a controlling thread, comprising: selecting a thread from the plurality of FIFO queues according to a fairness requirement, wherein the thread is selected from the first queue if fairness is not required and the thread is selected from the second queue if fairness is required; and writing, to a wait structure of the selected thread, an address of a first wait structure of the second queue to transfer control of the lock to the selected thread.
29. The one or more non-transitory computer-accessible storage media of claim 28, wherein the first queue comprises one or more threads executing on a preferred processor, wherein the controlling thread executes on the preferred processor, and wherein the non-preferred processor comprises different memory access characteristics than the preferred processor.
30. The one or more non-transitory computer-accessible storage media of claim 29, the program instructions that when executed on or across the plurality of processors cause the plurality of processors to further perform: moving respective wait structures of the one or more threads executing on the non-preferred processor from the second queue to the first queue responsive to determining that fairness is required.
31. The one or more non-transitory computer-accessible storage media of claim 30, wherein moving the respective wait structures of the one or more threads executing on the non-preferred processor from the second queue to the first queue comprises inserting the respective wait structures at the head of the first queue to ensure fairness, and wherein the respective wait structures of the one or more threads are linked together preserving the order in which they were enqueued in the secondary queue.
32. The one or more non-transitory computer-accessible storage media of claim 28, wherein maintaining the plurality of FIFO queues comprises: adding, by a particular thread, a wait structure to the first queue to wait for allocation of a lock, the wait structure comprising a wait field initialized to a zero value; and moving the wait structure to the second queue responsive to determining that the particular thread executes on the non-preferred processor.
33. The one or more non-transitory computer-accessible storage media of claim 32, wherein moving the wait structure to the second queue comprises inserting the wait structure at the tail of the second queue to ensure fairness.
34. The one or more non-transitory computer-accessible storage media of claim 32, wherein adding the wait structure to the first queue comprises writing an identifier of a processor executing the particular thread to the wait structure, and wherein determining that the particular thread executes on the non-preferred processor comprises determining that an identifier of the non-preferred processor matches the written identifier.
35. A system, comprising: a plurality of processors and a memory comprising program instructions executable by the plurality of processors implement an application configured to: maintain a plurality of first-in-first-out (FIFO) queues individually comprising one or more threads waiting to allocate a lock, the plurality of queues comprising a first queue and a second queue, wherein the second queue comprises one or more threads executing on a non-preferred processor; transfer control of the lock by a controlling thread, wherein to transfer control of the lock the controlling thread is configured to: select a thread from the plurality of FIFO queues according to a fairness requirement, wherein the thread is selected from the first queue if fairness is not required and the thread is selected from the second queue if fairness is required; and write, to a wait structure of the selected thread, an address of a first wait structure of the second queue to transfer control of the lock to the selected thread.
36. The system of claim 35, wherein the first queue comprises one or more threads executing on a preferred processor, wherein the controlling thread executes on the preferred processor, and wherein the non-preferred processor comprises different memory access characteristics than the preferred processor.
37. The system of claim 35, the application further configured to: move respective wait structures of the one or more threads executing on the non-preferred processor from the second queue to the first queue responsive to determining that fairness is required.
38. The system of claim 37, wherein to move the respective wait structures of the one or more threads executing on the non-preferred processor from the second queue to the first queue the application is configured to insert the respective wait structures at the head of the first queue to ensure fairness, and wherein the respective wait structures of the one or more threads are linked together preserving the order in which they were enqueued in the secondary queue.
39. The system of claim 35, wherein to maintain the plurality of FIFO queues the application is configured to: add, by a particular thread, a wait structure to the first queue to wait for allocation of a lock, the wait structure comprising a wait field initialized to a zero value; and move the wait structure to the second queue responsive to determining that the particular thread executes on the non-preferred processor.
40. The system of claim 39, wherein to move the wait structure to the second queue the application is configured to insert the wait structure at the tail of the second queue to ensure fairness.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0007]
[0008]
[0009]
[0010]
[0011] allocation of a lock.
[0012]
[0013]
[0014]
[0015]
[0016] While the disclosure is described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the disclosure is not limited to embodiments or drawings described. It should be understood that the drawings and detailed description hereto are not intended to limit the disclosure to the particular form disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. Any headings used herein are for organizational purposes only and are not meant to limit the scope of the description or the claims. As used herein, the word may is used in a permissive sense (i.e., meaning having the potential to) rather than the mandatory sense (i.e. meaning must). Similarly, the words include, including, and includes mean including, but not limited to.
[0017] Various units, circuits, or other components may be described as configured to perform a task or tasks. In such contexts, configured to is a broad recitation of structure generally meaning having circuitry that performs the task or tasks during operation. As such, the unit/circuit/component can be configured to perform the task even when the unit/circuit/component is not currently on. In general, the circuitry that forms the structure corresponding to configured to may include hardware circuits. Similarly, various units/circuits/components may be described as performing a task or tasks, for convenience in the description. Such descriptions should be interpreted as including the phrase configured to. Reciting a unit/circuit/component that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. 112(f) interpretation for that unit/circuit/component.
[0018] This specification includes references to one embodiment or an embodiment. The appearances of the phrases in one embodiment or in an embodiment do not necessarily refer to the same embodiment, although embodiments that include any combination of the features are generally contemplated, unless expressly disclaimed herein. Particular features, structures, or characteristics may be combined in any suitable manner consistent with this disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[0019] Locks are used by concurrently running processes (or threads) to acquire exclusive access to shared data. Studies have shown that the performance of such software quite often depends directly on the efficiency of the locks it employs and the evolution of lock implementations is tightly coupled with the evolution of computing architectures. Modern architectures feature an increasing number of CPU nodes (or sockets), each including locally attached memory, a fast local cache and multiple processing units (or cores). Accesses by a core to a local memory or local cache may be significantly faster than accesses to the remote memory or cache lines residing on another node, a characteristic known as NUMA, or Non-Uniform Memory Access. As a result, software methods may be broadly categorized as either NUMA-aware methods that are adapted to these characteristics or NUMA-oblivious methods that are unaware of NUMA performance issues.
[0020] NUMA-aware locking methods have been developed which prefer lock ownership to remain within the same socket. This NUMA-aware locking approach decreases remote cache misses and associated inter-socket traffic as it increases the chance that the lock data, as well as the subsequently accessed shared data, will be cached locally to the socket on which a lock holder is running.
[0021] While NUMA-aware locks may offer performance benefits over other locking approaches, characteristics of these locks hamper adoption. While existing NUMA-oblivious lock implementations may require only a single memory word per locking context, NUMA-aware locks are hierarchical in nature, thus requiring a thread to acquire multiple low-level locks before it can enter a critical section and consume memory space proportional to the number of processor sockets. These requirements reduce performance in low-contention or no contention locking applications and preclude their use in certain environments sensitive to memory use.
[0022] Various embodiments of the present invention implement compact NUMA-aware locking requiring the acquisition of only a single lock and the performance of only a single atomic operation per lock acquisition while requiring only a single word of memory per locking context. Thus, the present lock implementations mitigate the performance disadvantages of existing NUMA-aware locks while enabling their use in memory-sensitive applications.
[0023]
[0024] Threads which desire to allocate the lock add a representative data structure onto a First-In-First-Out (FIFO) queue. A second queue is managed by the locking operation that contains data structures for threads that have been deferred for later allocation. The thread-specific data structure and the lock data structure are shown in
[0025]
[0026] Additionally, the two remaining waiting threads, T3 303 and T4 304, are represented in a secondary queue as they have been deferred for executing on a socket other than the socket executing T1 301. The existence of this secondary queue is indicated by the spin field 210 of Tl 301 containing a pointer to the data structure representing T3 303. In addition, the tail field 230 of T3 303 contains a pointer to the data structure of T4 304, enabling rapid access to the last element in the secondary queue.
[0027] The spin field 210 of cna_node 200 may contain one of three possible states, a wait state indicating that the lock has not yet been acquired, an active state indicating that the lock has been acquired but that no secondary queue exists, and a pointer state indicating that the lock has been acquired and that a secondary queue exists. In the case of a pointer state, the spin value contains a pointer to the head of the secondary queue. Different implementations of the spin field 210 may be envisioned in various embodiments. For example, in a preferred embodiment, a wait state may be indicated with a value of zero and an active state indicated with a value of one. In many modern computer architectures, neither a value of zero or one is a valid pointer address as a zero value indicates a NULL pointer and even byte alignments of multi-element data structures such as the cna_node structure 200 are easily ensured. Thus, a single pointer field may contain all three required states without restriction on the range of memory addresses allowable for cna_node data structures. Other encodings of the spin field 210 may also be employed and it should be understood that the conventions described above are not intended to be limiting.
[0028] The ID field 220 of cna_node 200 may contain a value indicating a processor, or processor complex or socket, for which an affinity to transfer control of the lock may be desired. Examples would include individual processor cores or individual processor sockets of a multi-socket system where multiple cores within the same socket share similar access to memory and cache. In some embodiments, this information may be obtained through a system call while in other embodiments it may be obtained by directly executing one or more CPU instructions. Other means of obtaining values for the ID field 220 may also be employed and it should be understood that the techniques described above are not intended to be limiting.
[0029]
[0030] Next, the thread initializes the next 240 and tail 230 fields of the data structure to 0 and initializes the ID field 220 to an invalid socket ID, such as 1. By initializing the ID field 220 in this manner, latency for low-contention or no-contention operation may be reduced should the determination of socket ID, as discussed above, be costly. Alternatively, the ID field 220 may be initialized to the proper socket ID value should the operation impose modest execution cost.
[0031] Next, the thread atomically records the existing value of the lockTail field 260 of the cna_lock structure 250 and writes the address of its allocated data structure to the lockTail field 260. In some embodiments this may be performed by an atomic swap instruction commonly available in modern processor architectures while in other embodiments it may be performed using a sequence of instructions executed indivisibly. It should be understood that any of a variety of techniques may be employed to perform this sequence of operations atomically and that the atomic swap instruction described above is not intended to be limiting.
[0032] Next, the lockTail value recorded in step 430 is compared with 0 in step 440. If the previous lockTail value is 0, then it is indicated that no elements previously existing on the FIFO queue and the lock is therefore unallocated. In this case, the lock allocation request is complete and the thread has successfully acquired the lock. If, however, the lockTail value recorded in step 430 is not 0, execution proceeds to step 450.
[0033] Next, the thread records the processor ID in the ID field 230 in step 450 as described above. Once the ID field is recorded, execution proceeds to step 460 where the thread links the data structure into the FIFO queue that was previously determined to exist in step 440. The address of the data structure is written into the next field 240 of the data structure identified by the lockTail value recorded in step 430. Execution proceeds to step 470 where the thread waits for the spin field 210 to indicate that the thread has acquired the lock.
[0034] Once a thread has acquired the lock, the thread may perform any application-specific operations for which the lock provides synchronized access. Upon completion of these operations, the thread releases ownership of the lock.
[0035] If no threads exist in the FIFO queue, the thread determines if waiting threads exist in the secondary queue in step 520. This is indicated by a pointer state stored in the spin field 210 of the thread, as discussed above. If waiting threads exist in the secondary queue, the secondary queue is transferred to the FIFO queue by setting the lockTail to the tail value 240 of the first waiting thread identified by the pointer state of the spin field 210 of the thread in step 530. Two situations may occur, the current value of lockTail 260 may point to the data structure of the thread or it may point to another data structure indicating that another thread is in the process of adding itself to the FIFO queue for allocation of the lock. For this reason, an atomic compare-and-swap (CAS) instruction is used. This instruction is commonly available in modern computer architectures. The CAS instruction replaces the contents of a memory operation with a new value if and only if the existing contents of the memory location match a provided third value. If the current contents of the lockTail field 260 contain a pointer to the data structure of the thread, the lockTail field 260 is written with a pointer to the data structure of the last waiting thread in the secondary queue and the operation proceeds to step 575 through decision step 535. If, however, the current contents of the lockTail field 260 does not contain a pointer to the data structure of the thread, the CAS instruction fails indicating that another thread is in the process of enqueuing into the FIFO queue. In this event, the thread waits for the waiting thread to appear in the FIFO queue in step 527 by wait for its next field 240 to contain a non-zero value. Once this occurs, execution proceeds to step 540.
[0036] If, however, no threads exist in the secondary queue then the lock may become free. In this case, execution proceeds to step 525 where the thread attempts to set lockTail 260 to 0 indicating that the lock is free. Two situations may occur, the current value of lockTail 260 may point to the data structure of the thread or it may point to another data structure indicating that another thread is in the process of adding itself to the FIFO queue for allocation of the lock. For this reason, an atomic compare-and-swap (CAS) instruction is again used. If the current contents of the lockTail field 260 contain a pointer to the data structure of the thread, the lockTail field 260 is written with a value of 0 and the operation is complete. If, however, the current contents of the lockTail field 260 does not contain a pointer to the data structure of the thread, the CAS instruction fails indicating that another thread is in the process of enqueuing into the FIFO queue. In this event, the thread waits for the waiting thread to appear in the FIFO queue in step 527 by wait for its next field 240 to contain a non-zero value. Once this occurs, execution proceeds to step 540.
[0037] Once execution proceeds to step 540, at least one waiting thread exists in the FIFO queue. First, the thread determines if a switch to another processor ID is required to ensure desirable fairness. This determination may be made in a number of ways. In one embodiment, a count of the number of waiting threads deferred may be maintained. If the number of deferred threads exceeds a threshold, a fairness requirement is determined and execution proceeds to step 560. It should be understood, however, that any of a variety of techniques may be employed to determine that fairness may be required and that method described above is not intended to be limiting.
[0038] If fairness is not required, execution proceeds to step 550 where a successor thread is identified which executes on the same socket as the thread. This process is detailed in
[0039] If, however, a successor thread is not found, then a processor switch must occur. First, the thread determines if threads exist in the secondary queue in step 560, as these threads must take priority over threads remaining in the FIFO queue. If no such threads exist, execution proceeds to step 565 where the spin value 210 of the next node in the FIFO queue is written with a value of 1. At this point, ownership of the lock has been transferred and the operation is complete.
[0040] If, however, threads exist in the secondary queue, the threads remaining in the FIFO queue are transferred to the tail of the secondary queue in step 570 and the first node in the secondary queue is written with a value of 1 in step 575. At this point, ownership of the lock has been transferred and the operation is complete.
[0041]
[0042]
[0043] secondary queue. These four threads execute on two different sockets, T1 701 and T2 702 executing on socket ID 0 as identified by their respective ID fields 220, and T3 703 and T4 704 executing on socket ID 1 as identified by its respective ID field 220. Thread T1 701 currently controls the lock, as indicated by a non-zero value in its respective spin field 210. The first waiting thread in the FIFO queue is T2 702 as indicated by the next field 240 of T1 701 containing a pointer to T2 702. The second waiting thread in the FIFO queue is T3 703 as indicated by the next field 240 of T2 702 containing a pointer to T3 703. The third and final waiting thread in the FIFO queue is T4 704 as indicated by the next field 240 of T3 703 containing a pointer to T4 704. As T4 704 is the final waiting thread, the lockTail field 710 of the cna_lock structure 250 contains a pointer to T4 704. Furthermore, the tail fields 230 of all threads in the lock queue have been initialized to 0.
[0044]
[0045]
[0046] the existence of a secondary queue for which thread T2 702 is the first and only element. The first waiting thread in the FIFO queue is T4 704 as indicated by the next field 240 of T3 703 containing a pointer to T4 704. The second and final waiting thread in the FIFO queue is T1 701 as indicated by the next field 240 of T4 704 containing a pointer to T1 701. As T1 701 is the final waiting thread, the lockTail field 710 of the cna_lock structure 250 contains a pointer to T1 701. Finally, as thread T2 702 is the only element in the secondary queue, the tail field 230 of thread T2 702 contains a pointer to thread T2 702.
[0047]
[0048]
[0049]
[0050] Some of the mechanisms described herein may be provided as a computer program product, or software, that may include a non-transitory, computer-readable storage medium having stored thereon instructions which may be used to program a computer system 400 (or other electronic devices) to perform a process according to various embodiments. A computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable storage medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, or other types of medium suitable for storing program instructions. In addition, program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.)
[0051] In various embodiments, computer system 800 may include one or more processors 860; each may include multiple cores, any of which may be single-or multi-threaded. For example, multiple processor cores may be included in a single processor chip (e.g., a single processor 860), and multiple processor chips may be included in computer system 800. Each of the processors 860 may include a cache or a hierarchy of caches 870, in various embodiments. For example, each processor chip 860 may include multiple LI caches (e.g., one per processor core) and one or more other caches (which may be shared by the processor cores on a single processor). The computer system 800 may also include one or more storage devices 850 (e.g. optical storage, magnetic storage, hard drive, tape drive, solid state memory, etc.) and one or more system memories 810 (e.g., one or more of cache, SRAM, DRAM, RDRAM, EDO RAM, DDR RAM, SDRAM, Rambus RAM, EEPROM, etc.). In some embodiments, one or more of the storage device(s) 450 may be implemented as a module on a memory bus (e.g., on interconnect 840) that is similar in form and/or function to a single in-line memory module (SIMM) or to a dual in-line memory module (DIMM). Various embodiments may include fewer or additional components not illustrated in
[0052] The one or more processors 860, the storage device(s) 850, and the system memory 810 may be coupled to the system interconnect 840. One or more of the system memories 810 may contain application data 828 and program instructions 820. Application data 828 may contain various data structures to implement enhanced ticket locks while Program instructions 820 may be executable to implement one or more applications 822, shared libraries 824, and/or operating systems 826.
[0053] Program instructions 820 may be encoded in platform native binary, any interpreted language such as Java byte-code, or in any other language such as C/C++, the Java programming language, etc., or in any combination thereof. In various embodiments,
[0054] applications 822, operating system 826, and/or shared libraries 824 may each be implemented in any of various programming languages or methods. For example, in one embodiment, operating system 826 may be based on the Java programming language, while in other embodiments it may be written using the C or C++ programming languages.
[0055] Similarly, applications 822 may be written using the Java programming language, C, C++, or another programming language, according to various embodiments. Moreover, in some embodiments, applications 822, operating system 826, and/shared libraries 824 may not be implemented using the same programming language. For example, applications 822 may be C++ based, while shared libraries 824 may be developed using C.
[0056] Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. For example, although many of the embodiments are described in terms of particular types of operations that support synchronization within multi-threaded applications that access particular shared resources, it should be noted that the techniques and mechanisms disclosed herein for accessing and/or operating on shared resources may be applicable in other contexts in which applications access and/or operate on different types of shared resources than those described in the examples herein. It is intended that the following claims be interpreted to embrace all such variations and modifications.
[0057] In conclusion, embodiments of a compact NUMA-aware lock are disclosed. These embodiments require only a single word of memory per lock and are therefore useful to provide NUMA-aware locking semantics in applications that are sensitive to memory grown in locking contexts. While similar to existing locking approaches such as the MCS lock and possessing similar benefits, these locking embodiments additionally organize waiting threads in two queues, one composed of threads running on the same processor socket as the current lock holder and another composed of threads running on a different processor socket(s). This enables single-threaded performance comparable to existing locking implementations, such as MCS, while significantly outperforming those implementations under high lock contention, thus achieving the same or better performance without memory growth encountered in traditional NUMA-aware locks.