DE-DUPLICATION OF CLIENT-SIDE DATA CACHE FOR VIRTUAL DISKS
20170329530 · 2017-11-16
Inventors
Cpc classification
G06F3/0665
PHYSICS
International classification
Abstract
A computer receives a write request including an offset within a virtual disk. The computer writes the data block to a remote platform and calculates a hash value of the data. If the hash value does not exist in a first table of a block cache of the computer, the computer adds a pair to the first table: hash value/block cache data offset. Next, the computer adds a pair in a second table of the block cache: virtual disk offset of the data/hash value. A read request uses these tables to find the data in the cache without accessing the storage platform. The read consults the second table to find the hash value corresponding to the virtual disk offset of the block. The hash value is used as a key into the first table to find the block cache data offset of the data; the data is read from the block cache at that offset.
Claims
1. A method of writing a block of data to a virtual disk on a remote storage platform, said method comprising receiving a write request to write said block of data from a computer server to said remote storage platform, said write request including an offset within said virtual disk; writing said block of data to a storage node of said storage platform; calculating a hash value of said block of data using a hash function; determining whether said hash value exists in a first metadata table of a block cache of said computer server; and when it is determined that said hash value exists in said first metadata table, adding an entry in a second metadata table of said block cache including said virtual disk offset and said hash value as a key/value pair.
2. A method as recited in claim 1 wherein said first metadata table includes as a key/value pair said hash value and a block cache data offset that indicates where within said block cache that said block of data exists.
3. A method as recited in claim 1 wherein said block cache is in persistent storage of said computer server.
4. A method as recited in claim 1 further comprising: not writing said block of data into said block cache after said determining.
5. A method as recited in claim 1 further comprising: receiving said write request at a virtual machine of said computer server from an application executing upon said computer server.
6. A method as recited in claim 1 wherein said virtual disk offset entry in said second metadata table includes a name of said virtual disk.
7. A method as recited in claim 1 wherein said block cache does not include duplicates of any data block within said block cache.
8. A method of writing a block of data to a virtual disk on a remote storage platform, said method comprising receiving a write request to write said block of data from a computer server to said remote storage platform, said write request including an offset within said virtual disk; writing said block of data to a storage node of said storage platform; calculating a hash value of said block of data using a hash function; determining whether said hash value exists in a first metadata table of a block cache of said computer server; when it is determined that said hash value does not exist in said first metadata table, writing said block of data into said block cache at a block cache data offset and storing said hash value and said block cache data offset as a key/value pair in said first metadata table; and adding an entry in a second metadata table of said block cache including said virtual disk offset and said hash value as a key/value pair.
9. A method as recited in claim 8 wherein said block cache is in persistent storage of said computer server.
10. A method as recited in claim 8 further comprising: receiving said write request at a virtual machine of said computer server from an application executing upon said computer server.
11. A method as recited in claim 8 wherein said virtual disk offset entry in said second metadata table includes a name of said virtual disk.
12. A method as recited in claim 8 wherein said block cache does not include duplicates of any data block within said block cache.
13. A method of reading a block of data from a virtual disk on a remote storage platform, said method comprising receiving, at a computer server, a read request to read said block of data from said remote storage platform, said read request including an offset within said virtual disk; determining whether said virtual disk offset exists as an entry in a first metadata table of a block cache of said computer server; when it is determined that said virtual disk offset exists in said first metadata table, retrieving a unique identifier corresponding to said virtual disk offset in said entry; accessing a second metadata table of said block cache and retrieving a block cache data offset using said unique identifier as a key; and reading said block of data from said block cache at said block cache data offset.
14. A method as recited in claim 13 wherein said block cache is in persistent storage of said computer server.
15. A method as recited in claim 13 further comprising: not reading said block of data from said remote storage platform after said determining.
16. A method as recited in claim 13 further comprising: receiving said read request at a virtual machine of said computer server from an application executing upon said computer server.
17. A method as recited in claim 13 wherein said virtual disk offset entry in said first metadata table includes a name of said virtual disk.
18. A method as recited in claim 16 further comprising: returning said block of data to said application.
19. A method as recited in claim 13 wherein said block cache does not include duplicates of any data block within said block cache.
20. A method of reading a block of data from a virtual disk on a remote storage platform, said method comprising receiving, at a virtual machine of a computer server, a read request to read said block of data from said remote storage platform, said read request including an offset within said virtual disk; determining whether said virtual disk offset exists as an entry in a first metadata table of a block cache of said computer server; when it is determined that said virtual disk offset does not exist in said first metadata table, reading said block of data from said remote storage platform; and returning said block of data to a software application executing upon said computer server.
21. A method as recited in claim 20 wherein said first metadata table includes key/value pairs, wherein said keys are offsets within said virtual disk and wherein said values are unique identifiers that each identify a block of data within said block cache.
22. A method as recited in claim 20 wherein said block cache is in persistent storage of said computer server.
23. A method as recited in claim 20 wherein said virtual disk offset entry in said second metadata table includes a name of said virtual disk.
24. A method as recited in claim 20 wherein said block cache does not include duplicates of any data block within said block cache.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Storage System
[0024]
[0025] Computer nodes 30-40 are shown logically being grouped together, although they may be spread across data centers and may be in different geographic locations. A management console 40 used for provisioning virtual disks within the storage platform communicates with the platform over a link 44. Any number of remotely located computer servers 50-52 each typically executes a hypervisor in order to host any number of virtual machines. Server computers 50-52 form what is typically referred to as a compute farm. As shown, these virtual machines may be implementing any of a variety of applications such as a database server, an e-mail server, etc., including applications from companies such as Oracle, Microsoft, etc. These applications write data to and read data from the storage platform using a suitable storage protocol such as iSCSI or NFS, although each application will not be aware that data is being transferred over link 54 using a different protocol.
[0026] Management console 40 is any suitable computer able to communicate over an Internet connection or link 44 with storage platform 20. When an administrator wishes to manage the storage platform (e.g., provisioning a virtual disk, snapshots, revert, clone, analyze metrics, determine health of cluster, etc.) he or she uses the management console to access the storage platform and is put in communication with a management console routine executing as part of a software module on any one of the computer nodes within the platform. The management console routine is typically a Web server application.
[0027] In order to provision a new virtual disk within storage platform 20 for a particular application running on a virtual machine, the virtual disk is first created and then attached to a particular virtual machine. In order to create a virtual disk, a user uses the management console to first select the size of the virtual disk (e.g., 100 GB), and then selects the individual policies that will apply to that virtual disk. For example, the user selects a replication factor, a data center aware policy and other policies concerning whether or not to compress the data, the type of disk storage, etc. Once the virtual disk has been created, it is then attached to a particular virtual machine within one of the computer servers 50-52 and the provisioning process is complete.
[0028] Advantageously, storage platform 20 is able to simulate prior art central storage nodes (such as the VMAX and CLARiiON products from EMC, VMware products, etc.) and the virtual machines and software applications will be unaware that they are communicating with storage platform 20 instead of a prior art central storage node. In addition, the provisioning process can be completed on the order of minutes or less, rather than in four to eight weeks as was typical with prior art techniques. The advantage is that one only needs to add metadata concerning a new virtual disk in order to provision the disk and have the disk ready to perform writes and reads.
Provision Virtual Disk
[0029] Typically, an administrator is aware that a particular software application desires a virtual disk within the platform and is aware of the characteristics that the virtual disk should have. The administrator first uses the management console to access the platform and connect with the management console Web server on any one of the computer nodes within the platform. The administrator chooses the characteristics of the new virtual disk such as a name; a size; a replication factor; a residence; compressed; a replication policy; cache enabled (a quality-of-service choice); and a disk type (indicating whether the virtual disk is of a block type—the iSCSI protocol—or of a file type—the NFS protocol).
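The characteristics listed above can be pictured as a single record. The following is only an illustrative sketch: the class name, field names, and example values are assumptions made for this example and are not taken from the platform; only the list of characteristics comes from the text.

```python
from dataclasses import dataclass
from enum import Enum


class DiskType(Enum):
    BLOCK = "iscsi"  # block type, served over the iSCSI protocol
    FILE = "nfs"     # file type, served over the NFS protocol


@dataclass(frozen=True)
class VirtualDiskInfo:
    """Hypothetical record of the characteristics chosen at provisioning time."""
    name: str
    size_bytes: int
    replication_factor: int
    residence: str           # assumed free-form label; the text does not define it
    compressed: bool
    replication_policy: str  # assumed free-form label
    cache_enabled: bool      # per-disk switch for the client-side cache
    disk_type: DiskType


# Example values are invented for illustration only.
disk = VirtualDiskInfo(
    name="vdisk-db01",
    size_bytes=100 * 2**30,  # 100 GB, binary units assumed
    replication_factor=3,
    residence="example-residence",
    compressed=False,
    replication_policy="example-policy",
    cache_enabled=True,
    disk_type=DiskType.BLOCK,
)
```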
[0030] As mentioned above, one of the characteristics for the virtual disk that may be chosen is whether or not the client-side cache of the local computer should be enabled for that virtual disk. Applications that do not read or write frequently may not desire the cache to be enabled (as writing to the cache can add overhead), while applications that read and write frequently may desire the cache to be enabled. Cache enablement, thus, is an optional feature that may be turned on or off for each virtual disk.
[0031] Once chosen, these characteristics are stored as “virtual disk information” metadata onto a computer node within the storage platform and may be replicated. In this fashion, the virtual disk metadata has been stored upon metadata nodes within the platform (which might be different from the nodes where the actual data of the virtual disk will be stored). In addition, the identities of the storage nodes which store this metadata for the virtual disk are also sent to the controller virtual machine for placing into a cache.
[0032] The virtual disk that has been created is also attached to a virtual machine of the compute farm. In this step, the administrator is aware of which virtual machine on which computer of the compute farm needs the virtual disk. Thus, information regarding the newly created virtual disk (i.e., name, space available, virtual disk information, etc.) is sent from the management console routine to the appropriate computer within the compute farm. The information is provided to a controller virtual machine which stores the information in a cache, ready for use when the virtual machine needs to write or to read data. The administrator also supplies the name of the virtual disk to the application that will use it.
[0033]
[0034] Similar to a traditional hard disk, as data is written to the virtual disk at a particular offset 340 (ranging from 0 up to the size of the virtual disk) the virtual disk will fill up symbolically from left to right. Each container of data will be stored upon a particular node or nodes within the storage platform that are chosen during the write process. In the example of
Controller Virtual Machine
[0035]
[0036] As shown, server 51 includes a hypervisor and virtual machines 182 and 186 that desire to perform I/O handling using the iSCSI protocol 187 or the NFS protocol 183. Server 51 also includes a specialized controller virtual machine (CVM) 180 that is adapted to handle communications with the virtual machines using either protocol (and others), yet communicates with the storage platform using a proprietary protocol 189. Protocol 189 may be any suitable protocol for passing data between storage platform 20 and a remote computer server 51 such as TCP. In addition, the CVM may also communicate with public cloud storage using the same or different protocol 191. Advantageously, the CVM need not communicate any “liveness” information between itself and the computer nodes of the platform. There is no need for any CVM to track the status of nodes in the cluster. The CVM need only talk to a node in the platform, which is then able to route requests to other nodes and public storage nodes.
[0037] The CVM also uses a memory cache 181 on the computer server 51. In communication with computer server 51 and with CVM 180 are also any number of solid-state disks (or other similar persistent storage) 195 that will be explained in greater detail below. These disks may be used as a data cache to store data blocks that are written into storage platform 20 and then to rapidly retrieve these data blocks instead of retrieving them from the remote storage platform.
[0038] CVM 180 handles different protocols by simulating an entity that the protocol would expect. For example, when communicating under the iSCSI block protocol, the CVM responds to an iSCSI Initiator by behaving as an iSCSI Target. In other words, when virtual machine 186 performs I/O handling, it is the iSCSI Initiator and the controller virtual machine is the iSCSI Target. When an application is using the block protocol, the CVM masquerades as the iSCSI Target, traps the iSCSI CDBs, translates this information into its own protocol, and then communicates this information to the storage platform. Thus, when the CVM presents itself as an iSCSI Target, the application may simply talk to a block device as it would do normally.
[0039] Similarly, when communicating with an NFS client, the CVM behaves as an NFS server. When virtual machine 182 performs I/O handling the controller virtual machine is the NFS server and the NFS client (on behalf of virtual machine 182) executes either in the hypervisor of computer server 51 or in the operating system kernel of virtual machine 182. Thus, when an application is using the NFS protocol, the CVM masquerades as an NFS server, captures NFS packets, and then communicates this information to the storage platform using its own protocol.
[0040] An application is unaware that the CVM is trapping and intercepting its calls under the iSCSI or NFS protocol, or that the CVM even exists. One advantage is that an application need not be changed in order to write to and read from the storage platform. Use of the CVM allows an application executing upon a virtual machine to continue using the protocol it expects, yet allows these applications on the various computer servers to write data to and read data from the same storage platform 20.
[0041] Replicas of a virtual disk may be stored within public cloud storage 190. As known in the art, public cloud storage refers to those data centers operated by enterprises that allow the public to store data for a fee. Included within these data centers are those known as Amazon Web Services and Google Compute. A write request will include an identifier for each computer node to which a replica should be written. For example, nodes may be identified by their IP address. Thus, the computer node within the platform that first fields the write request from the CVM will then route the data to be written to nodes identified by their IP addresses. Any replica that should be sent to the public cloud can then simply be sent to the DNS name of a particular node, and that request (and its data) is then routed to the appropriate public storage cloud. Any suitable computer router within the storage platform may handle this operation.
Client-Side Cache
[0042] As mentioned above, a client machine, such as computer 51, uses a data cache 195 to store blocks of data that it has written to storage platform 20 so that those blocks can be retrieved more quickly when a read is performed. The present invention provides an apparatus and technique for efficiently caching data on the client side so that during a read operation from a software application it may not be necessary to access the remote storage platform 20. One advantage of the present invention is that very large data caches are supported and that blocks of data are stored efficiently. The invention facilitates very large data caches because it de-duplicates data in the cache as well, which in turn increases the effective cache capacity by a factor equal to the de-duplication ratio.
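The capacity claim can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the de-duplication ratio is taken to be the number of logical blocks divided by the number of unique blocks, and MD5 is used as the block identity as in the write procedure described later in this document.

```python
import hashlib


def dedup_ratio(blocks):
    """Logical block count divided by unique block count (1.0 if empty)."""
    if not blocks:
        return 1.0
    unique = {hashlib.md5(b).digest() for b in blocks}
    return len(blocks) / len(unique)


def effective_capacity(physical_bytes, ratio):
    """Logical bytes a cache of the given physical size can represent."""
    return physical_bytes * ratio


# Four logical blocks, but only two distinct contents.
blocks = [b"\x00" * 4096, b"\x00" * 4096, b"\x01" * 4096, b"\x01" * 4096]
ratio = dedup_ratio(blocks)
capacity = effective_capacity(256 * 2**30, ratio)
```

With these example blocks the ratio is 2, so a 256 GB physical cache would hold the equivalent of 512 GB of logical data.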
[0043]
[0044]
[0045]
Write Using Client-Side Cache
[0046]
[0047] In step 504 the virtual machine (on behalf of its software application) that desires to write data into the storage platform sends a write request including the data to be written to a particular virtual disk. The request may originate from a virtual machine on the same computer as the CVM, or from a virtual machine on a different computer. As mentioned, a write request may originate with any of the applications on one of computer servers 50-52 and may use any of a variety of storage protocols. The write request typically takes the form: write (offset, size, virtual disk name). The parameter “virtual disk name” is the name of the virtual disk. The parameter “offset” is an offset within the virtual disk (i.e., a value from 0 up to the size of the virtual disk), and the parameter “size” is the length of the data to be written in bytes. As mentioned above, the CVM will trap or capture this write request sent by the application (in the block protocol or NFS protocol, for example).
[0048] Next, in step 508 the CVM calculates the MD5 of each block within the data to be written. Blocks may be of any size, although typically the size is 4 k bytes. After all of the message digests have been calculated (or perhaps after each one is calculated), in step 512 the CVM performs a lookup in metadata 410 of the block cache 195 to determine if each MD5 exists within table 440 in order to prevent duplicates from being stored. If an MD5 exists, this indicates that the exact same block of data has already been written into the client-side cache 195 (for any virtual disk accessed by that CVM) and that it will not be necessary to write that block of data again into the cache. If the MD5 does not exist, this indicates that the block of data does not exist within the block cache yet and that the data block should be written to the cache. It is possible that, within the data requested to be written, some blocks already exist within the block cache and some do not. It is also possible that the MD5s for certain blocks will be the same (e.g., if all of these blocks are entirely filled with zeros). For each query of table 440 with an MD5, the result returned is whether or not the MD5 exists, and if it exists, the block cache data offset 448.
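Steps 508 and 512 can be sketched as follows. This is a minimal illustration, not the actual implementation: table 440 is modeled as an in-memory dictionary from MD5 digest to block cache data offset, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4096  # typical block size given in the text


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Divide the data of a write request into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def query_table_440(table_440: Dict[bytes, int],
                    block: bytes) -> Tuple[bool, Optional[int]]:
    """Step 512: return (exists, block cache data offset or None)."""
    digest = hashlib.md5(block).digest()  # step 508
    offset = table_440.get(digest)
    return (offset is not None, offset)


zero_block = b"\x00" * BLOCK_SIZE
# Assume a zero-filled block is already cached at data offset 0.
table_440 = {hashlib.md5(zero_block).digest(): 0}

data = zero_block * 2 + b"\x01" * BLOCK_SIZE
results = [query_table_440(table_440, b) for b in split_into_blocks(data)]
```

In this example the two zero-filled blocks produce the same MD5 and are both reported as present, while the third block is reported as absent and would need to be written to the cache.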
[0049] For those blocks of data that do not already exist within the block cache, step 516 will write those unique blocks to the data region 420 of the block cache and return the block cache data offset where each block was written in data 420.
[0050] Next, for those unique blocks written in step 516 their metadata will be updated in step 520. In step 520 the CVM updates table 440 with the MD5 of each block written to the block cache and its corresponding block cache data offset, so that the block can later be found in the block cache using its MD5.
[0051] In step 512 if, for any block of data, its MD5 does already exist in table 440, this indicates that the block of data does exist in the block cache, and control moves to step 524. In step 524, table 480 is updated for every block of data in the write request. This table will be updated to include the virtual disk offset of each block along with its corresponding MD5. Knowing the offset from the write request and the block size, it is a simple matter to calculate the virtual disk offset for each block of the write request. In this fashion, the MD5s for all blocks of the write request will be available in table 480 by using the virtual disk offset for each block as a key, which will be useful when reading data from the storage platform and using this client-side cache. In addition, by performing the check in step 512, duplicate blocks of data are not written to the cache.
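The write procedure of steps 508 through 524 can be summarized in a short sketch. The class below is a hypothetical in-memory model: the data region 420 is a byte array, tables 440 and 480 are dictionaries, and table 480 is keyed by the virtual disk name together with the virtual disk offset. Writing to the remote storage platform, persistence to solid-state disk, and eviction are deliberately left out.

```python
import hashlib

BLOCK_SIZE = 4096  # typical block size given in the text


class BlockCache:
    """Illustrative model of block cache 195 (not the actual implementation)."""

    def __init__(self):
        self.data = bytearray()  # data region 420
        self.table_440 = {}      # MD5 -> block cache data offset
        self.table_480 = {}      # (virtual disk name, offset) -> MD5

    def write(self, vdisk, offset, payload):
        for i in range(0, len(payload), BLOCK_SIZE):
            block = payload[i:i + BLOCK_SIZE]
            digest = hashlib.md5(block).digest()       # step 508
            if digest not in self.table_440:           # step 512
                cache_offset = len(self.data)          # step 516
                self.data += block
                self.table_440[digest] = cache_offset  # step 520
            # Step 524 runs for every block, cached or not.
            self.table_480[(vdisk, offset + i)] = digest


cache = BlockCache()
zeros = b"\x00" * BLOCK_SIZE
ones = b"\x01" * BLOCK_SIZE
cache.write("vdisk-a", 0, zeros + zeros + ones)  # two unique blocks stored
cache.write("vdisk-b", 4096, ones)               # duplicate, nothing stored
```

After both writes, the data region holds only two blocks even though four block offsets across two virtual disks are recorded in table 480.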
Read Using Client-Side Cache
[0052]
[0053] In step 604 the virtual machine that desires to read data from the storage platform sends a read request from a particular application to the desired virtual disk. As explained above, the controller virtual machine will then trap or capture the request (depending upon whether it is a block request or an NFS request) and then typically places a request into its own protocol before sending the request to the storage platform.
[0054] As mentioned, a read request may originate with any of the virtual machines on computers 50-52 (for example) and may use any of a variety of storage protocols. The read request typically takes the form: read (offset, size, virtual disk name). The parameter “virtual disk name” is the name of a virtual disk on the storage platform. The parameter “offset” is an offset within the virtual disk (i.e., a value from 0 up to the size of the virtual disk), and the parameter “size” is the length of the data to be read in bytes.
[0055] The CVM is aware of which virtual disks have the client-side cache enabled, and, if so, before sending the read request to the storage platform, the CVM will first check its block cache 195 to determine whether any of the blocks to be read are already present within this cache. Thus, in step 608, the CVM divides up the read request into blocks; e.g., a request of size 64 k is divided up into sixteen blocks of 4 k each, each block having a corresponding offset within the named virtual disk. Thus, an offset within the named virtual disk is calculated for each block of data.
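The division of step 608 amounts to simple arithmetic on the request offset and size. The following sketch assumes that the request offset and size are multiples of the block size; the function name is hypothetical.

```python
BLOCK_SIZE = 4 * 1024  # 4 k blocks, as in the example above


def block_offsets(offset, size, block_size=BLOCK_SIZE):
    """Return the virtual disk offset of each block covered by a request."""
    return list(range(offset, offset + size, block_size))


# A 64 k read starting at virtual disk offset 0.
offsets = block_offsets(0, 64 * 1024)
```

For the 64 k example, this yields sixteen offsets, from 0 through 61440 in steps of 4096.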
[0056] Step 612 then checks metadata 410 to determine whether an entry exists in table 480 for each of the calculated offsets of the named virtual disk. If an entry exists, this means that the corresponding data block has been stored in the client-side cache and the MD5 488 corresponding to that entry is returned to the CVM. Thus, in step 616 the CVM consults table 440 using the returned MD5 in order to obtain the block cache data offset for that particular block within data 420. Once obtained, the data block is simply read from the block cache at the block cache data offset, thus obviating the need to read a data block from the remote storage platform 20.
[0057] If an entry does not exist in table 480 for any of the calculated offsets for the named virtual disk, this means that the corresponding data block has not been previously stored in the client-side cache and that the data block must be read from the remote storage platform. Accordingly, in step 620 a read request for that particular data block is sent to the storage platform which then returns the data block.
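Steps 612 through 620 can be sketched for a single block as follows. This is an illustrative model only: the tables are dictionaries, table 480 is keyed by virtual disk name and offset, the two tables are assumed to be consistent with each other, and read_remote is a hypothetical stand-in for a request to the storage platform.

```python
import hashlib


def read_block(vdisk, vdisk_offset, table_480, table_440, cache_data,
               read_remote, block_size=4096):
    """Return one block, from the block cache if possible."""
    digest = table_480.get((vdisk, vdisk_offset))  # step 612
    if digest is not None:
        cache_offset = table_440[digest]           # step 616
        return bytes(cache_data[cache_offset:cache_offset + block_size])
    return read_remote(vdisk, vdisk_offset, block_size)  # step 620


# Example state: one block cached for "vdisk-a" at virtual disk offset 0.
block = b"A" * 4096
digest = hashlib.md5(block).digest()
table_440 = {digest: 0}
table_480 = {("vdisk-a", 0): digest}
cache_data = bytearray(block)
remote_calls = []


def read_remote(vdisk, vdisk_offset, size):
    """Stand-in for the storage platform; records each remote read."""
    remote_calls.append((vdisk, vdisk_offset))
    return b"R" * size


hit = read_block("vdisk-a", 0, table_480, table_440, cache_data, read_remote)
miss = read_block("vdisk-a", 4096, table_480, table_440, cache_data, read_remote)
```

The first read is served entirely from the cache; only the second, for an offset with no entry in table 480, reaches the stand-in for the remote platform.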
[0058] It is possible that within a given read request there may be some data blocks that have been stored in the client-side cache and some that have not. Thus, for those data blocks that must be read from the storage platform, the CVM may choose to read those data blocks from the remote storage platform one at a time, or may choose to send a single, combined read request. Those data blocks that do exist within the client-side cache may also be read one by one, or the CVM may issue a single read request for all of those blocks at one time.
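One possible way to form combined read requests is to merge blocks with adjacent offsets into contiguous runs. The sketch below shows that approach as an assumption; the text does not specify how a combined request is built, and the function name is hypothetical.

```python
def coalesce(offsets, block_size=4096):
    """Merge block offsets into (start, length) runs of adjacent blocks."""
    runs = []
    for off in sorted(offsets):
        if runs and runs[-1][0] + runs[-1][1] == off:
            # This block directly follows the previous run; extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + block_size)
        else:
            runs.append((off, block_size))
    return runs


# Blocks at 0 and 4096 are adjacent; the block at 12288 stands alone.
missing = [0, 4096, 12288]
requests = coalesce(missing)
```

Here three missing blocks are fetched with two requests: one of 8192 bytes starting at offset 0 and one of 4096 bytes starting at offset 12288. The same grouping could be applied to the blocks that are present in the client-side cache.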
[0059] In step 624, after collecting both the data blocks read from the storage platform and the data blocks read from the block cache, the CVM then returns this data corresponding to the original read request to the requesting virtual machine using the appropriate protocol, again masquerading either as a block device or as an NFS device depending upon the protocol used by the particular application.
Computer System Embodiment
[0060]
[0061]
[0062] CPU 922 is also coupled to a variety of input/output devices such as display 904, keyboard 910, mouse 912 and speakers 930. In general, an input/output device may be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. CPU 922 optionally may be coupled to another computer or telecommunications network using network interface 940. With such a network interface, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Furthermore, method embodiments of the present invention may execute solely upon CPU 922 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.
[0063] In addition, embodiments of the present invention further relate to computer storage products with a computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter.
[0064] Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Therefore, the described embodiments should be taken as illustrative and not restrictive, and the invention should not be limited to the details given herein but should be defined by the following claims and their full scope of equivalents.