CROSS-BLADE CACHE SLOT DONATION
20220121571 · 2022-04-21
Assignee
Inventors
- John Creed (Innishannon, IE)
- Steve Ivester (Worcester, MA, US)
- John Krasner (Coventry, RI, US)
- Kaustubh Sahasrabudhe (Westborough, MA, US)
CPC classification
- G06F15/161 (Physics)
- G06F15/17331 (Physics)
- G06F9/542 (Physics)
International classification
- G06F12/06 (Physics)
- G06F15/173 (Physics)
- G06F9/30 (Physics)
Abstract
Remote cache slots are donated in a storage array without requiring a cache slot starved compute node to search for candidates in remote portions of a shared memory. One or more donor compute nodes create donor cache slots that are reserved for donation. The cache slot starved compute node broadcasts a message to the donor compute nodes indicating a need for donor cache slots. The donor compute nodes provide donor cache slots to the cache slot starved compute node in response to the message. The message may be broadcast by updating a mask of compute node operational status in the shared memory. The donor cache slots may be provided by providing pointers to the donor cache slots.
Claims
1. An apparatus comprising: a data storage system comprising: a plurality of non-volatile drives; and a plurality of interconnected compute nodes that present at least one logical production volume to hosts and manage access to the drives, each of the compute nodes comprising a local memory and being configured to allocate a portion of that local memory to a shared memory that can be accessed by each of the compute nodes of the plurality of compute nodes, the shared memory comprising cache slots that are used to store logical production volume data for servicing input-output commands (IOs) to the logical production volume, the cache slots being accessible by each of the plurality of compute nodes; wherein a first one of the compute nodes is configured to create donor cache slots that are available for donation to other ones of the compute nodes for storage of logical production volume data that is accessible by each of the plurality of compute nodes, a second one of the compute nodes is configured to generate a message that indicates a need for donor cache slots, and the first compute node is configured to provide at least some of the donor cache slots to the second compute node in response to the message, whereby the second compute node acquires remote donor cache slots for storage of logical production volume data that is accessible by all of the compute nodes without searching for candidates in remote portions of the shared memory.
2. The apparatus of claim 1 wherein the first compute node is configured to provide the donor cache slots to the second compute node by providing pointers to the donor cache slots.
3. The apparatus of claim 2 wherein the data storage system further comprises a plurality of worker threads that maintain statistical data indicative of operational status of each of the compute nodes.
4. The apparatus of claim 3 wherein the statistical data comprises one or more of local cache slot allocation rate, current number of local dirty cache slots, current depth of local shared slot queues, and fall-through time (FTT).
5. The apparatus of claim 4 wherein the statistical data is maintained in a Cache_Donation_Source Board-Mask in the shared memory.
6. The apparatus of claim 5 wherein the message is broadcast by updating the Cache_Donation_Source Board-Mask in the shared memory.
7. The apparatus of claim 6 wherein the first compute node calculates a number of donor cache slots to create based on the statistical data.
8. A method for acquiring remote donor cache slots for storage of logical production volume data that is accessible by each of a plurality of interconnected compute nodes without searching for candidates in remote portions of a shared memory in a data storage system comprising a plurality of non-volatile drives, wherein the plurality of interconnected compute nodes present at least one logical production volume to hosts and manage access to the drives, each of the compute nodes comprising a local memory and being configured to allocate a portion of that local memory to the shared memory that can be accessed by each of the compute nodes, the shared memory comprising cache slots that are used to store logical production volume data for servicing input-output commands (IOs) to the logical production volume, the cache slots being accessible by each of the plurality of compute nodes, the method comprising: a first one of the compute nodes creating donor cache slots that are available for donation to other ones of the compute nodes for storage of logical production volume data that is accessible by each of the plurality of compute nodes; a second one of the compute nodes generating a message that indicates a need for donor cache slots; and the first compute node providing at least some of the donor cache slots to the second compute node in response to the message.
9. The method of claim 8 comprising the first compute node providing the donor cache slots to the second compute node by providing pointers to the donor cache slots.
10. The method of claim 9 comprising a plurality of worker threads maintaining statistical data indicative of operational status of each of the compute nodes.
11. The method of claim 10 wherein maintaining the statistical data comprises maintaining one or more of local cache slot allocation rate, current number of local dirty cache slots, current depth of local shared slot queues, and fall-through time (FTT).
12. The method of claim 11 comprising maintaining the statistical data in a Cache_Donation_Source Board-Mask in the shared memory.
13. The method of claim 12 comprising broadcasting the message by updating the Cache_Donation_Source Board-Mask in the shared memory.
14. The method of claim 13 comprising calculating a number of donor cache slots to create based on the statistical data.
15. A computer-readable storage medium storing instructions that when executed by a compute node cause the compute node to perform a method for acquiring remote donor cache slots for storage of logical production volume data that is accessible by each of a plurality of interconnected compute nodes without searching for candidates in remote portions of a shared memory in a data storage system comprising a plurality of non-volatile drives, wherein the plurality of interconnected compute nodes present at least one logical production volume to hosts and manage access to the drives, each of the compute nodes comprising a local memory and being configured to allocate a portion of the local memory to the shared memory that can be accessed by each of the compute nodes, the shared memory comprising cache slots that are used to store logical production volume data for servicing input-output commands (IOs) to the logical production volume, the cache slots being accessible by each of the plurality of compute nodes, the method comprising: creating donor cache slots that are available for donation to other ones of the compute nodes for storage of logical production volume data that is accessible by each of the plurality of compute nodes; generating a message that indicates a need for donor cache slots; and providing at least some of the donor cache slots to a second one of the compute nodes in response to the message.
16. The computer-readable storage medium of claim 15 wherein the method comprises providing the donor cache slots by providing pointers to the donor cache slots.
17. The computer-readable storage medium of claim 16 wherein the method comprises a plurality of worker threads maintaining statistical data indicative of operational status of each of the compute nodes.
18. The computer-readable storage medium of claim 17 wherein maintaining the statistical data comprises maintaining one or more of local cache slot allocation rate, current number of local dirty cache slots, current depth of local shared slot queues, and fall-through time (FTT).
19. The computer-readable storage medium of claim 18 wherein the method comprises maintaining the statistical data in a Cache_Donation_Source Board-Mask in the shared memory.
20. The computer-readable storage medium of claim 19 wherein the method comprises broadcasting the message by updating the Cache_Donation_Source Board-Mask in the shared memory.
Description
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION
[0012] All examples, aspects, and features mentioned in this disclosure can be combined in any technically possible way. The terminology used in this disclosure is intended to be interpreted broadly within the limits of subject matter eligibility. The terms “disk” and “drive” are used interchangeably herein and are not intended to refer to any specific type of non-volatile electronic storage media. The terms “logical” and “virtual” are used to refer to features that are abstractions of other features, e.g. and without limitation abstractions of tangible features. The term “physical” is used to refer to tangible features that possibly include, but are not limited to, electronic hardware. For example, multiple virtual computers could operate simultaneously on one physical computer. The term “logic,” if used herein, refers to special purpose physical circuit elements, firmware, software, computer instructions that are stored on a non-transitory computer-readable medium and implemented by multi-purpose tangible processors, alone or in any combination. Aspects of the inventive concepts are described as being implemented in a data storage system that includes host servers and a storage array. Such implementations should not be viewed as limiting. Those of ordinary skill in the art will recognize that there are a wide variety of implementations of the inventive concepts in view of the teachings of the present disclosure.
[0013] Some aspects, features, and implementations described herein may include machines such as computers, electronic components, optical components, and processes such as computer-implemented procedures and steps. It will be apparent to those of ordinary skill in the art that the computer-implemented procedures and steps may be stored as computer-executable instructions on a non-transitory computer-readable medium. Furthermore, it will be understood by those of ordinary skill in the art that the computer-executable instructions may be executed on a variety of tangible processor devices, i.e. physical hardware. For practical reasons, not every step, device, and component that may be part of a computer or data storage system is described herein. Those of ordinary skill in the art will recognize such steps, devices, and components in view of the teachings of the present disclosure and the knowledge generally available to those of ordinary skill in the art. The corresponding machines and processes are therefore enabled and within the scope of the disclosure.
[0015] The storage array 100, which is depicted in a simplified data center environment with two host servers 103 that run host applications, is one example of a storage area network (SAN). The host servers 103 may be implemented as individual physical computing devices, virtual machines running on the same hardware platform under control of a hypervisor, or in containers on the same hardware platform. The storage array 100 includes one or more bricks 104. Each brick includes an engine 106 and one or more drive array enclosures (DAEs) 108. Each engine 106 includes a pair of interconnected compute nodes 112, 114 that are arranged in a failover relationship and may be referred to as “storage directors” or simply “directors.” Although it is known in the art to refer to the compute nodes of a SAN as “hosts,” that naming convention is avoided in this disclosure to help distinguish the network server hosts 103 from the compute nodes 112, 114. Nevertheless, the host applications could run on the compute nodes, e.g. on virtual machines or in containers. Each compute node includes resources such as at least one multi-core processor 116 and local memory 118. The processor may include central processing units (CPUs), graphics processing units (GPUs), or both. The local memory 118 may include volatile media such as dynamic random-access memory (DRAM), non-volatile memory (NVM) such as storage class memory (SCM), or both. Each compute node allocates a portion of its local memory 118 to a shared memory 210 that can be accessed by each of the compute nodes.
[0016] Data associated with instances of a host application running on the hosts 103 is maintained persistently on the managed drives 101. The managed drives 101 are not discoverable by the hosts 103 but the storage array creates logical storage devices known as production volumes 140, 142 that can be discovered and accessed by the hosts. Without limitation, a production volume may alternatively be referred to as a storage object, source device, production device, or production LUN, where the logical unit number (LUN) is a number used to identify logical storage volumes in accordance with the small computer system interface (SCSI) protocol. From the perspective of the hosts 103, each production volume 140, 142 is a single drive having a set of contiguous fixed-size logical block addresses (LBAs) on which data used by the instances of the host application resides. However, the host application data is stored at non-contiguous addresses on various managed drives 101, e.g. at ranges of addresses distributed on multiple drives or multiple ranges of addresses on one drive. The compute nodes maintain metadata that maps between the production volumes 140, 142 and the managed drives 101 in order to process IOs from the hosts.
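To illustrate the mapping that this metadata provides, the sketch below resolves a production volume LBA to a managed drive address through a per-volume extent table. It is a minimal illustration only; the extent granularity, the structure layout, and the names volume_map_t and resolve_lba are assumptions and are not taken from this disclosure.

```c
/* Illustrative sketch only: resolving a production volume LBA to a managed
 * drive location through per-volume mapping metadata. Granularity, layout,
 * and names are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define EXTENT_BLOCKS 2048u      /* assumed mapping granularity, in blocks */

typedef struct {
    uint16_t drive_id;           /* which managed drive holds this extent */
    uint64_t drive_lba;          /* starting address on that drive        */
} extent_map_t;

typedef struct {
    extent_map_t *extents;       /* one entry per EXTENT_BLOCKS of the volume */
    size_t        extent_count;
} volume_map_t;

/* Translate a production volume LBA to its backing drive location.
 * Returns 0 on success, -1 if the LBA is past the end of the volume. */
int resolve_lba(const volume_map_t *vol, uint64_t lba,
                uint16_t *drive_id, uint64_t *drive_lba)
{
    size_t extent = (size_t)(lba / EXTENT_BLOCKS);
    if (extent >= vol->extent_count)
        return -1;
    *drive_id  = vol->extents[extent].drive_id;
    *drive_lba = vol->extents[extent].drive_lba + (lba % EXTENT_BLOCKS);
    return 0;
}
```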
[0019] Donor cache slots are created and held in reserve by compute nodes based on operational status. Each compute node 300, 302, 304, 306 maintains operational status metrics 338, 340, 342, 344 such as one or more of recent cache slot allocation rate, current number of write pending or dirty cache slots, current depth of local shared slot queues, and recent fall-through time (FTT). The recent cache slot allocation rate indicates how many local cache misses occurred within a predetermined window of time, e.g. the past S seconds or M minutes. The current number of write pending (WP) or dirty cache slots indicates how many of the local cache slots contain changed data that must be destaged to the managed drives before the associated cache slot can be recycled. A smaller number indicates better suitability for creation of donor cache slots. The current depth of the local shared slot queues indicates the number of free cache slots required to service new IOs. The depth of the local shared slot queues also indicates the state of the race condition that exists between worker thread recycling and IO workload. A shorter depth indicates better suitability for creation of donor cache slots. Recent FTT indicates the average time that back-end tracks (BE TRKs) are resident in the local cache slots before being recycled, e.g. the time between being written to a cache slot and being flushed or destaged from the cache slot by a worker thread. A larger FTT indicates better suitability for creation of donor cache slots.
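As a rough illustration of how these metrics could feed the donor cache slot calculation, the sketch below derives a donor slot count from a node's local status. The thresholds, the weighting, and the names node_status_t and donor_slot_count are assumptions for illustration; the disclosure does not specify a particular formula.

```c
/* Illustrative sketch only: deriving a donor cache slot count from the
 * per-node operational status metrics described above. The thresholds
 * and weighting are assumptions, not the disclosed calculation. */
#include <stdint.h>

typedef struct {
    uint32_t alloc_rate;         /* local cache misses in the recent window   */
    uint32_t dirty_slots;        /* write-pending slots awaiting destage      */
    uint32_t shared_queue_depth; /* free slots needed to service queued IOs   */
    uint32_t ftt_ms;             /* recent fall-through time, in milliseconds */
} node_status_t;

/* Return how many local cache slots to reserve as donor slots. A busy node
 * (many dirty slots, deep queues, FTT below the array average) reserves
 * none; an idle node reserves a fraction of its idle capacity. */
uint32_t donor_slot_count(const node_status_t *s,
                          uint32_t total_local_slots,
                          uint32_t array_avg_ftt_ms)
{
    if (s->dirty_slots > total_local_slots / 2) return 0; /* too many dirty slots          */
    if (s->shared_queue_depth > 0)              return 0; /* free slots are needed locally */
    if (s->ftt_ms < array_avg_ftt_ms)           return 0; /* slots are recycled quickly    */

    uint32_t idle  = total_local_slots - s->dirty_slots;
    uint32_t share = idle / 10;        /* baseline: donate up to 10% of idle slots */
    if (s->alloc_rate > 0)
        share /= s->alloc_rate;        /* shrink the reserve as local misses grow  */
    return share;
}
```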
[0020] The operational status metrics 338, 340, 342, 344 are captured and written by the worker thread of each compute node to the shared memory 210 and used to calculate how many donor cache slots, if any, to create. In the illustrated example, compute nodes 300, 302, and 304 each generate a different quantity of donor cache slots 332, 334, 336 based on local operational status, while cache slot starved compute node 306 has no donor cache slots. The operational status information and donor cache slot information collectively form part of the Cache_Donation_Source Board-Mask. The cache slot starved compute node 306 generates a cache slot donation target message 346 that is broadcast to the other compute nodes 300, 302, 304. The message may be broadcast by writing to the Cache_Donation_Source Board-Mask. In response to the message, one or more of the potential remote cache slot donor compute nodes provides remote cache slots to the cache slot starved compute node 306. In the illustrated example, compute node 302 is shown donating remote cache slots to compute node 306. Donation of remote cache slots may include providing pointers to the locations of the remote cache slots in the shared memory. The remote cache slots can be accessed by the cache slot starved compute node 306 using DMA or RDMA. The local worker thread for the remote cache slot, e.g. WT 326 for the remote cache slots donated by compute node 302, eventually recycles the donated remote cache slots.
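A minimal sketch of the broadcast-and-donate exchange is shown below, assuming a shared structure with one need bit per compute node and a per-node mailbox of donated slot pointers. The layout of donation_board_t, the mailbox size, and the atomics used are illustrative assumptions; the disclosure only states that the message is broadcast via the Cache_Donation_Source Board-Mask and that donations are made by providing pointers to the donor cache slots.

```c
/* Illustrative sketch only: a cache slot starved node raises its bit in a
 * shared mask, and a donor answers by depositing pointers to reserved cache
 * slots in the starved node's mailbox. Layout and names are assumptions. */
#include <stdatomic.h>
#include <stdint.h>

#define MAX_NODES    16
#define MAILBOX_SIZE 64

typedef struct cache_slot cache_slot_t;   /* opaque cache slot in shared memory */

typedef struct {
    _Atomic uint32_t need_mask;                          /* one bit per starved node  */
    cache_slot_t    *donated[MAX_NODES][MAILBOX_SIZE];   /* per-node donation mailbox */
    _Atomic uint32_t donated_count[MAX_NODES];           /* slots deposited per node  */
} donation_board_t;

/* Cache slot starved node: broadcast need by updating the shared mask. */
void broadcast_need(donation_board_t *board, unsigned self)
{
    atomic_fetch_or(&board->need_mask, 1u << self);
}

/* Donor node: answer outstanding requests with pointers to reserved slots. */
void service_requests(donation_board_t *board, unsigned self,
                      cache_slot_t **reserve, uint32_t *reserve_count)
{
    uint32_t mask = atomic_load(&board->need_mask);
    for (unsigned node = 0; node < MAX_NODES && *reserve_count > 0; node++) {
        if (node == self || !(mask & (1u << node)))
            continue;
        uint32_t idx = atomic_fetch_add(&board->donated_count[node], 1);
        if (idx < MAILBOX_SIZE)
            board->donated[node][idx] = reserve[--(*reserve_count)];
    }
}
```

In this sketch the starved node would then read its mailbox, clear its bit in the mask, and access the donated slots via DMA or RDMA as described above, while the donor's local worker thread remains responsible for eventually recycling them.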
[0021] The number of cache slots to be queued as donor slots is limited to avoid degrading performance of the donor compute node. Capability to donate cache slots is based on per-director cache statistics, e.g. eliminating as candidates directors that have more than a predetermined number of write-pending (WP) slots, are above an 85% out-of-pool (dirty) slots limit, or have a local FTT that is below a predetermined level relative to the storage array average FTT for a specific segment. Per-director DSA statistics, pre-determined pass/fail criteria for each emulation on the director, maximum work queues or some other indicator of spare cycles, and per-slice DSA statistics for the remaining emulations may also be used. Director cache statistics are not necessarily static, so the number of donor cache slots maintained by a director may be dynamically adjusted.
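The donor-eligibility filter described in this paragraph might be sketched as follows. The 85% out-of-pool (dirty) slot limit comes from the text above; the remaining thresholds, the director_stats_t layout, and the function name may_donate are assumptions for illustration.

```c
/* Illustrative sketch only: per-director eligibility check for cache slot
 * donation. The 85% dirty limit mirrors the description; other thresholds
 * are assumed parameters. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t write_pending;      /* dirty (WP) slots awaiting destage       */
    uint32_t pool_slots;         /* total cache slots in the local pool     */
    uint32_t ftt_ms;             /* local fall-through time for the segment */
    uint32_t spare_work_queues;  /* rough indicator of spare cycles         */
} director_stats_t;

bool may_donate(const director_stats_t *d,
                uint32_t wp_limit,           /* assumed per-director WP cap             */
                uint32_t array_avg_ftt_ms,   /* array-wide average FTT for the segment  */
                uint32_t min_ftt_pct)        /* local FTT must be at least this % of it */
{
    if (d->write_pending > wp_limit)
        return false;                                              /* too many WP slots       */
    if ((uint64_t)d->write_pending * 100 > (uint64_t)d->pool_slots * 85)
        return false;                                              /* over 85% dirty limit    */
    if ((uint64_t)d->ftt_ms * 100 < (uint64_t)array_avg_ftt_ms * min_ftt_pct)
        return false;                                              /* FTT too low vs. average */
    return d->spare_work_queues > 0;                               /* needs spare cycles      */
}
```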
[0025] Specific examples have been presented to provide context and convey inventive concepts. The specific examples are not to be considered as limiting. A wide variety of modifications may be made without departing from the scope of the inventive concepts described herein. Moreover, the features, aspects, and implementations described herein may be combined in any technically possible way. Accordingly, modifications and combinations are within the scope of the following claims.