DATA MIGRATION
20230132830 · 2023-05-04
Cpc classification
G06F3/0607
PHYSICS
Abstract
A method for performing a data migration from a source storage system to a destination storage system includes performing an intermediate incremental synchronization of data items further comprising: i) scanning the source and destination storage system thereby obtaining a source and destination data item list; ii) retrieving stored status records of the respective data items indicative for a last known synchronization state of the respective data items; iii) generating commands for performing the intermediate incremental synchronization based on the source and destination data item list and the status records; iv) executing the commands; v) obtaining results of the executed commands; and vi) updating the status records with the results.
Claims
1.-14. (canceled)
15. A computer-implemented method for performing a data migration from a source storage system to a destination storage system; the method comprising performing an intermediate incremental synchronization of data items further comprising: scanning the source and destination storage system thereby obtaining a source and destination data item list; retrieving stored status records of the respective data items indicative for a last known synchronization state of the respective data items; generating commands for performing the intermediate incremental synchronization based on the source and destination data item list and the status records; executing the commands; obtaining results of the executed commands; and updating the status records with the results.
16. The method according to claim 15 wherein the status record of a data item comprises: a source change timestamp indicative of a moment in time on which the data item was last changed on the source storage system; and a destination change timestamp indicative of a moment in time on which the data item was last changed on the destination storage system.
17. The method according to claim 16 wherein the source and destination data item list comprise the change timestamp of the data item in the respective source and destination storage system, and the generating the commands comprises comparing the change timestamp from the data item list with the change timestamp from the status record.
18. The method according to claim 15 wherein the status record of a data item comprises at least one of: a message digest of the data item; a size of the data item; a type of the data item; an owner of the data item; access permissions of the data item; and retention information of the data item.
19. The method according to claim 15 wherein the status record of a data item comprises a synchronization status selectable from a group comprising: a first status option indicative of a valid synchronization of the respective data item; and a second status option indicative of a synchronization mismatch of the respective data item.
20. The method according to claim 15 further comprising performing an initial synchronization of data items thereby creating the status records.
21. The method according to claim 20 wherein the performing the initial synchronization further comprises: scanning the source storage system thereby obtaining an initial source data item list; generating commands for performing the initial synchronization based on the scanning; executing the commands; obtaining results of the executed commands; and creating the status records based on the results.
22. The method according to claim 19 wherein the destination storage system already comprises data items before the data migration; and wherein the performing the initial synchronization further comprises: scanning the source and destination storage system thereby obtaining an initial source and initial destination data item list; generating commands for performing the initial synchronization based on the scanning; executing the commands; obtaining results of the executed commands; and creating the status records with the results.
23. The method according to claim 15 further comprising performing a final cutover synchronization of data items thereby obtaining final status records.
24. The method according to claim 23 further comprising a data migration verification step based on the final status records.
25. The method according to claim 23 further comprising: obtaining information for protecting one or more of the data items by a write once read many, WORM, state; and applying the WORM state to the one or more of the data items on the destination storage system based on the final status records and the data item lists obtained during final cutover synchronization.
26. A controller comprising at least one processor and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the controller to perform the method according to claim 15.
27. A computer program product comprising computer-executable instructions for causing an apparatus to perform at least the method according to claim 15.
28. A computer readable storage medium comprising computer-executable instructions for performing the method according to claim 15 when the program is run on a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0050] Some example embodiments will now be described with reference to the accompanying drawings.
DETAILED DESCRIPTION OF EMBODIMENT(S)
[0058] The current disclosure relates to data migration between data storage systems and more particularly to data migration from a source storage system to a destination storage system.
[0059] A data item may for example correspond to a file system item such as a file or a directory within a hierarchical or structured file system. Various protocols may be used for accessing such file system items such as for example the Apple Filing Protocol (AFP), the Web Distributed Authoring and Versioning (WebDAV) protocol, the Server Message Block (SMB) protocol, the Common Internet File System (CIFS) protocol, the File Transfer Protocol (FTP), the Network File System (NFS) and the SSH file transfer protocol (SFTP).
[0060] A data item may for example correspond to an object of an object addressable storage system. Such an object comprises a key and a value wherein the key serves as a unique identifier of the value which holds the actual data that is stored. Data can be retrieved from an object addressable storage system by providing the unique identifier upon which the associated data, i.e. the value, is returned. Because of the key-value storage, an object addressable storage system stores data in an unstructured manner as opposed to for example a file system. The object addressable storage system may be a cloud based object addressable storage system that is interfaceable by a pre-defined application programming interface (API) over a computer network such as the Internet. An example of a cloud based object addressable storage system is Amazon S3 or Amazon Simple Storage Service as offered by Amazon Web Services (AWS) that provides such object addressable storage through a web-based API. Another example is Google Cloud Storage offered by Google providing RESTful object storage on the Google Cloud Platform infrastructure.
[0062] At some point in time, a data migration is started. Before and during the migration data storage may still be provided from the source data storage system to users. During the migration the destination storage system is populated with copies of the data items. At the end of the migration, during the cutover or switchover, all user access is denied to both source and destination storage systems or the users have read-only access to the source storage system and the remaining unsynchronized data items are synchronized to the destination storage system. Then, all users are given access to the destination storage while the source storage system can be decommissioned. By the cutover during which access is denied, data integrity is guaranteed.
[0063] In a first step 401 an initial synchronization is performed. In
[0064] The initial synchronization 301 may comprise a copy of all data items on the source to the destination. In the first step 401 all data items making up the data are thus copied from the source storage system 100 to the destination storage system 120. Data items that are likely to change before the cutover may also be excluded from the initial synchronization 301. As these data items are still likely to change, a new copy will have to be made before or during the cutover anyway. Therefore, by excluding such data portions from the initial copy, the initial copy takes less time to perform and network bandwidth is saved.
[0065] After performing the initial synchronization in step 401, one or more incremental synchronizations 302 to 306 are made until the start of the actual cutover. During an incremental synchronization differences between the source and destination system 100 and 120 are identified. These differences are then translated to commands such that the destination is again synchronized with the source. In
[0066] The step 402 of performing the incremental synchronizations may be repeated several times until the cutover 404. Step 402 may be repeated at least until the transfer size of the incremental synchronizations has reached a steady state 403. In
[0067] Then, in step 404, the actual cutover synchronization 311 is performed during a certain maintenance window 322, preferably after the steady state 403 is reached. During this maintenance window 322, all access to the data is denied or only read-only access is granted and a final cutover synchronization 311 is performed.
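The sequence of steps 401 to 403 can be sketched as a small driver loop. This is only an illustrative sketch: the disclosure does not define how the steady state 403 is detected, so the rule used here (the last few incremental transfer sizes lie within a tolerance of their mean) and the names `is_steady`, `run_until_steady`, `window` and `tolerance` are assumptions.

```python
from typing import Callable, List, Tuple


def is_steady(sizes: List[int], window: int = 3, tolerance: float = 0.10) -> bool:
    """Assumed steady-state rule: the last `window` transfer sizes all lie
    within `tolerance` (relative) of their own mean."""
    if len(sizes) < window:
        return False
    recent = sizes[-window:]
    mean = sum(recent) / window
    if mean == 0:
        return True
    return all(abs(size - mean) <= tolerance * mean for size in recent)


def run_until_steady(
    sync_once: Callable[[], int],
    max_rounds: int = 20,
    window: int = 3,
    tolerance: float = 0.10,
) -> Tuple[int, List[int]]:
    """Run one initial synchronization (step 401), then repeat incremental
    synchronizations (step 402) until the transfer size looks steady (403).

    `sync_once` is a hypothetical callable that performs one synchronization
    and returns the number of bytes it transferred.
    """
    initial = sync_once()
    incremental: List[int] = []
    for _ in range(max_rounds):
        incremental.append(sync_once())
        if is_steady(incremental, window, tolerance):
            break
    return initial, incremental


# Simulated transfer sizes: a large initial copy, then shrinking increments.
feed = iter([1000, 400, 200, 105, 100, 98, 97])
initial_size, incremental_sizes = run_until_steady(lambda: next(feed))
```

Once the loop returns, the cutover synchronization of step 404 would be scheduled in a maintenance window; that scheduling is outside this sketch.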
[0069] The synchronization 204 then proceeds to a next step 203 wherein the source and destination data item lists 201, 211 are compared with status records 210. These status records are stored as a list or report 209 and comprise information on the synchronization of the data items, i.e. information on the last known synchronization state of the file system items. Based on the data item lists 201, 202 and the status records 210, a set of commands 204 is generated to perform the synchronization, i.e. commands that are to be executed on the source and/or destination storage system.
[0070] A status record in the status report 209 may comprise the following fields:
[0071] a data item path;
[0072] a synchronization status;
[0073] stream name;
[0074] a data item type;
[0075] a data item size;
[0076] a data item content digest;
[0077] a source change timestamp at the moment the data item was last migrated to the destination;
[0078] a destination change timestamp after the data item was migrated to the destination;
[0079] information on the status;
[0080] security permissions;
[0081] owner information;
[0082] retention information; and
[0083] additional metadata associated with the data item.
[0084] The above example is applicable for file system items. Similar fields may be defined for data items in object storage systems. In some situations, one or more fields may be undefined. For example, a file system item that has ‘directory’ as data item type may have no size or content digest. The data item path is the unique identifier of the data item and may for example correspond to the file system path relative to the root of the migration, i.e. relative to the highest directory in the migration. The status field is indicative of the status of the data item as it was known when last synchronized. According to an embodiment, the status field may take any of the values as shown in Table 1 below.
TABLE 1 Possible values for the status field
Status | Description
IN_SYNC | The data item is synchronized between source and destination.
EXCLUDED | The data item is present on the source but excluded from the migration scope and deleted from the destination.
EXCLUDED AND RETAINED | The data item is excluded from the migration scope, but not deleted from the destination.
RETAINED | The data item is present on the destination, but not on the source, and it was not deleted.
OUT OF SYNC | The data item was synchronized at a certain point in time, but the source changed and for some reason the change was not propagated to the destination.
UNKNOWN | The synchronization state of the data item is unknown.
A status record may provide additional information about the item, depending on the status, in the ‘information’ field. For data items that are OUT OF SYNC it may give the reason why the item is out of sync, for example that the destination storage system does not allow a data item having a specific name, e.g. a very long name or one using special characters. For data items that are UNKNOWN it may comprise the reason why the item status is unknown, for example because there was a scan error on the source storage system.
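As an illustration, the status record fields and the status values of Table 1 could be modelled as follows. This is a sketch only: the class and attribute names are assumptions chosen for readability, and every field except the path and the status is optional because one or more fields may be undefined for a given data item.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class SyncStatus(Enum):
    """Status values taken from Table 1."""
    IN_SYNC = "IN_SYNC"
    EXCLUDED = "EXCLUDED"
    EXCLUDED_AND_RETAINED = "EXCLUDED AND RETAINED"
    RETAINED = "RETAINED"
    OUT_OF_SYNC = "OUT OF SYNC"
    UNKNOWN = "UNKNOWN"


@dataclass
class StatusRecord:
    """One entry of the status report; attribute names are hypothetical."""
    path: str                                     # unique identifier, relative to the migration root
    status: SyncStatus
    stream_name: Optional[str] = None
    item_type: Optional[str] = None               # e.g. 'FILE' or 'DIRECTORY'
    size: Optional[int] = None                    # undefined for directories
    content_digest: Optional[str] = None          # undefined for directories
    source_change_ts: Optional[str] = None        # timestamp string, e.g. ISO 8601
    destination_change_ts: Optional[str] = None
    information: Optional[str] = None             # reason for OUT OF SYNC or UNKNOWN
    permissions: Optional[str] = None
    owner: Optional[str] = None
    retention: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)


# A directory has no size or digest; a failed item carries a reason.
directory = StatusRecord(path="projects/reports", status=SyncStatus.IN_SYNC,
                         item_type="DIRECTORY")
failed = StatusRecord(path="projects/a?b.txt", status=SyncStatus.OUT_OF_SYNC,
                      item_type="FILE",
                      information="name not allowed on destination")
```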
[0085] The data item type contains a value that defines the type of data item, for example ‘FILE’ for a file, ‘DIRECTORY’ for a directory, ‘SYMBOLIC_LINK’ for a symbolic link, ‘PIPE’ for a named pipe, ‘SOCKET’ for a named Unix domain socket, ‘BLOCK_DEVICE’ for a block device file type that provides buffered access to hardware devices, ‘CHAR_DEVICE’ for a character special file or character device that provides unbuffered, direct access to a hardware device, and ‘MOUNT_POINT’ for a mount point or location in the storage system for accessing a partition of a storage device. The data item size corresponds to the number of bytes that would be returned when reading the data item from start to end. The data item content digest contains the digest of the content. To obtain a digest value, a hashing algorithm may be used. The hashing algorithm may then also be used to verify migrated content during the last cutover synchronization. Different algorithms may be used, such as for example MD5, SHA-1, SHA-256 and SHA-512, which generate hexadecimal digest values of respectively 32, 40, 64 and 128 characters. Table 2 below shows an illustrative example of possible combinations of a data item type, data item size and data item content digest.
TABLE 2 Different data item types and related data item size and data item content digest
Data item type | Data item size | Data item content digest
Directory | |
File | Nr of bytes in file | Digest of file content
Alternate data stream (ADS) | Nr of bytes in ADS | Digest of ADS content
Named attribute | Nr of bytes in named attribute | Digest of named attribute content
Symbolic link | Nr of bytes in target path | Digest of target path (substitute name on Windows), converted to UTF-8
Named pipe (FIFO) | |
Named unix domain socket | |
Block device file | |
Character device file | |
Mount point | |
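The size and digest rules of Table 2 can be sketched with Python's standard `hashlib` module. The function names and the simplification to two content-bearing types ('FILE' and 'SYMBOLIC_LINK') are assumptions made for this illustration; alternate data streams and named attributes would be handled like file content.

```python
import hashlib
from typing import Optional, Tuple, Union

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def content_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of `data` for the given algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def size_and_digest(
    item_type: str,
    payload: Union[bytes, str, None] = None,
    algorithm: str = "sha256",
) -> Tuple[Optional[int], Optional[str]]:
    """Return (size, digest) for a data item, or (None, None) when the type
    has no content (directories, pipes, sockets, devices, mount points).

    For a symbolic link the payload is the target path, which is encoded
    as UTF-8 before it is measured and hashed.
    """
    if item_type == "SYMBOLIC_LINK" and isinstance(payload, str):
        payload = payload.encode("utf-8")
    if item_type not in ("FILE", "SYMBOLIC_LINK") or not isinstance(payload, bytes):
        return None, None
    return len(payload), content_digest(payload, algorithm)


# Digest lengths in hexadecimal characters for each algorithm.
digest_lengths = {name: len(content_digest(b"example", name)) for name in ALGORITHMS}
```

Comparing the digest stored in the status record with a digest recomputed on the destination is one way the same function could serve content verification during the cutover synchronization.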
[0086] The change timestamp in both source and destination is useful because the data item is mutable during the migration at both the source and destination. The change timestamp identifies which version of the data item was copied from source to destination and further allows detecting any subsequent changes to the source or destination data item outside the scope of the migration, i.e. apart from the changes done by generated commands 204. In most file systems, the change timestamp is updated by the filesystem itself every time the data of the data item or metadata of the data item is altered. Moreover, this updating is performed automatically and cannot be set by a user or user program. Timestamps may be formatted in a standard, relatively compact and human readable ISO 8601 format with second, millisecond or nanosecond resolution, depending on the protocol used.
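A minimal sketch of how the two change timestamps could be used follows. The function names are assumptions. Note that Python's `datetime` resolves at most to microseconds, so the nanosecond resolution mentioned above would need a different representation.

```python
from datetime import datetime, timezone
from typing import Dict


def to_iso8601(ts: datetime, resolution: str = "seconds") -> str:
    """Format a timezone-aware timestamp as ISO 8601 in UTC.

    `resolution` may be 'seconds', 'milliseconds' or 'microseconds'.
    """
    return ts.astimezone(timezone.utc).isoformat(timespec=resolution)


def changed_outside_migration(
    recorded_source_ts: str,
    recorded_destination_ts: str,
    scanned_source_ts: str,
    scanned_destination_ts: str,
) -> Dict[str, bool]:
    """Compare the timestamps stored in a status record with the ones found
    by the latest scan. A difference means the item was altered after the
    last synchronization, on the source, the destination, or both."""
    return {
        "source": scanned_source_ts != recorded_source_ts,
        "destination": scanned_destination_ts != recorded_destination_ts,
    }


stamp = datetime(2023, 5, 4, 12, 0, 0, 123456, tzinfo=timezone.utc)
as_seconds = to_iso8601(stamp)
as_millis = to_iso8601(stamp, "milliseconds")
```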
[0087] Based on the status records 210 and the scan results 200, 202 a list of commands 204 is generated that are to be executed in a next step 205. A command may be an action that is to be performed on the source or destination storage system, e.g. to delete a data item on the source and/or destination, to copy a data item from source to destination, to update metadata associated with a data item, etc. A command may also be an action that does not directly change the status of the source or destination storage system, but that will update the status report 209 during a later step 208, i.e. update a status record of a data item.
[0088] Table 3 below shows different types of commands 204 that may be generated by step 203 according to an example embodiment.
TABLE 3 Possible commands generated from scan results and status records.
Command | Description
COPY_NEW | Copy a data item from source to destination for the first time.
COPY | Update a data item that already exists on the destination.
COPY_METADATA | Only copy or update the metadata associated with a data item on the destination.
DELETE | Delete the data item.
REPAIR | Do everything necessary for synchronizing the data item from source to destination.
VERIFY | Verify the data item and report differences for that data item between source and destination.
COPY_WORM | Copies WORM-related settings for the data item, e.g. the retention period and commit state.
REPORT_ERROR | Report an error so it can be propagated to the status list.
REPORT_EXCLUDED | Report a data item as excluded so it can be propagated to the status list.
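To make the command types of Table 3 concrete, the sketch below derives a few of them from two scan results and the source change timestamps stored in the status records. The decision rules are a deliberately simplified assumption and not the full rule set of the disclosure; the function and parameter names are hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet, List, Tuple


class Command(Enum):
    """Command names taken from Table 3."""
    COPY_NEW = "COPY_NEW"
    COPY = "COPY"
    COPY_METADATA = "COPY_METADATA"
    DELETE = "DELETE"
    REPAIR = "REPAIR"
    VERIFY = "VERIFY"
    COPY_WORM = "COPY_WORM"
    REPORT_ERROR = "REPORT_ERROR"
    REPORT_EXCLUDED = "REPORT_EXCLUDED"


def generate_commands(
    source: Dict[str, str],            # path -> change timestamp from the source scan
    destination: Dict[str, str],       # path -> change timestamp from the destination scan
    recorded_source_ts: Dict[str, str],  # path -> source timestamp in the status record
    excluded: FrozenSet[str] = frozenset(),
) -> List[Tuple[Command, str]]:
    """Simplified, assumed rules:
    excluded path -> REPORT_EXCLUDED; only on source -> COPY_NEW;
    only on destination -> DELETE; on both with a source timestamp that
    differs from the status record -> COPY; otherwise no command."""
    commands: List[Tuple[Command, str]] = []
    for path in sorted(set(source) | set(destination)):
        if path in excluded:
            commands.append((Command.REPORT_EXCLUDED, path))
        elif path not in destination:
            commands.append((Command.COPY_NEW, path))
        elif path not in source:
            commands.append((Command.DELETE, path))
        elif source[path] != recorded_source_ts.get(path):
            commands.append((Command.COPY, path))
    return commands


commands = generate_commands(
    source={"a": "t2", "b": "t1", "c": "t1", "tmp": "t9"},
    destination={"b": "d1", "c": "d1", "old": "d0"},
    recorded_source_ts={"b": "t1", "c": "t0"},
    excluded=frozenset({"tmp"}),
)
```

In this example item `b` is unchanged since its last synchronization, so no command is generated for it.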
[0089] Then, in a next step 205, the list of generated commands 204 is executed. Besides the execution itself, this step also generates a result list 212. The results in the list 212 are then used in a next step 206 to determine a merge list 207, i.e. a list with updates for the status list 209. In a next merge step 208 the status list 209 is then updated based on this merge list 207. Table 4 below shows possible entries of merge list 207 depending on the executed command and the result of the command as both specified in the result list 212.
TABLE 4 Possible commands generated from scan results and status records.
Command executed | Command result | Merge list entry
COPY_NEW | SUCCESS | Create new status record
COPY_NEW | SKIPPED | Data item is out of sync with the source
COPY | SUCCESS | Update status record
COPY | SKIPPED | Data item is out of sync with the source
REPAIR | SUCCESS | Update status record
COPY_METADATA | SUCCESS | Update status record with metadata
COPY_METADATA | SKIPPED | Data item is out of sync with the source
COPY_WORM | SUCCESS | Update status records with WORM result
DELETE | SKIPPED | Data item is retained
DELETE | SUCCESS | Delete the status record
REPORT_EXCLUDED | | Data item is excluded
<any> | FAILURE | Unknown
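The mapping of Table 4 lends itself to a lookup keyed on the executed command and its result. The sketch below mirrors the table entries as plain strings; the names `MERGE_RULES` and `merge_entry` are assumptions, and returning `None` for combinations the table does not list is a choice made for this illustration.

```python
from typing import Dict, Optional, Tuple

# (command, result) -> merge list entry, copied from Table 4.
MERGE_RULES: Dict[Tuple[str, str], str] = {
    ("COPY_NEW", "SUCCESS"): "Create new status record",
    ("COPY_NEW", "SKIPPED"): "Data item is out of sync with the source",
    ("COPY", "SUCCESS"): "Update status record",
    ("COPY", "SKIPPED"): "Data item is out of sync with the source",
    ("REPAIR", "SUCCESS"): "Update status record",
    ("COPY_METADATA", "SUCCESS"): "Update status record with metadata",
    ("COPY_METADATA", "SKIPPED"): "Data item is out of sync with the source",
    ("COPY_WORM", "SUCCESS"): "Update status records with WORM result",
    ("DELETE", "SKIPPED"): "Data item is retained",
    ("DELETE", "SUCCESS"): "Delete the status record",
}


def merge_entry(command: str, result: Optional[str]) -> Optional[str]:
    """Return the merge list entry for one line of the result list."""
    if result == "FAILURE":
        return "Unknown"                 # the <any> / FAILURE row
    if command == "REPORT_EXCLUDED":
        return "Data item is excluded"   # this row lists no result
    return MERGE_RULES.get((command, result))  # None if not in Table 4


result_list = [("COPY_NEW", "SUCCESS"), ("DELETE", "SKIPPED"), ("COPY", "FAILURE")]
merge_list = [merge_entry(command, result) for command, result in result_list]
```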
This first classification is only based on the scan results 501, 511 and thus based on information that is available by the scanning, e.g. the data item type, timestamps of the data item and the size of the data item.
[0099] Then, the method proceeds to the next step 523 wherein an initial set of intermediate commands is generated based on the classification step 522, e.g. COPY_NEW, COPY, COPY_METADATA, DELETE, REPLACE, EXCLUDE, VERIFY, COPY_WORM. Also, other parameters needed for the execution of the commands are provided. These commands are then forwarded to the next step 524 wherein the intermediate commands are updated based on the retrieved status records 510. For example, when there is no command generated for a data item, then a REPAIR is generated if there is no status record or if there is a status record that is not IN_SYNC. Also, a COPY_METADATA command is converted to a REPAIR command if there is no status record or the status record is not IN_SYNC.
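The two refinement rules given as examples for step 524 can be sketched as follows. Only those two rules are implemented; the function name, the use of `None` for "no command generated" and the dictionary shapes are assumptions.

```python
from typing import Dict, Optional


def refine_commands(
    intermediate: Dict[str, Optional[str]],  # path -> intermediate command, None if none was generated
    statuses: Dict[str, str],                # path -> status from the retrieved status record
) -> Dict[str, Optional[str]]:
    """Update intermediate commands using the status records:
    a missing command or a COPY_METADATA command becomes REPAIR when the
    data item has no status record or its record is not IN_SYNC."""
    refined: Dict[str, Optional[str]] = {}
    for path, command in intermediate.items():
        in_sync = statuses.get(path) == "IN_SYNC"  # a missing record counts as not in sync
        if command in (None, "COPY_METADATA") and not in_sync:
            command = "REPAIR"
        refined[path] = command
    return refined


refined = refine_commands(
    intermediate={"a": None, "b": None, "c": "COPY_METADATA",
                  "d": "COPY_METADATA", "e": "COPY"},
    statuses={"a": "IN_SYNC", "c": "OUT OF SYNC", "d": "IN_SYNC"},
)
```

Here `b` has no status record and `c` is recorded as OUT OF SYNC, so both end up with a REPAIR command, while `a`, `d` and `e` keep their intermediate result.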
[0101] When performing a first synchronization, i.e. the base or initial synchronization, the destination storage system 120 will normally have no data items, and there will be no status list 209. During such an initial synchronization, the steps of
[0102] Alternatively, when performing a first synchronization, there might already be data items present on the destination storage system 120. These data items may for example be the result of a previously failed data migration attempt. During such an initial synchronization, the steps of
[0103] During the final cutover synchronization, the steps of
[0104] During the final cutover synchronization, the data items that need to be protected from further changes may have a write once read many, WORM, state assigned to them. This may be done based on the final status list 209, whereby the relevant data items are identified from this list and it is verified whether the data item on the destination has not been altered. Then, the WORM state is updated for these data items on the destination storage, thereby rendering them immutable.
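The selection of data items that may receive the WORM state can be sketched as a check against the final status records. The check used here (status IN_SYNC and an unchanged destination change timestamp) is an assumed interpretation of "has not been altered", and all names are hypothetical. Applying the WORM state itself depends on the destination storage system and is not shown.

```python
from typing import Dict, Iterable, List, Tuple


def select_worm_targets(
    final_records: Dict[str, Tuple[str, str]],  # path -> (status, recorded destination change timestamp)
    destination_scan: Dict[str, str],           # path -> destination change timestamp from the scan
    worm_paths: Iterable[str],                  # data items that should be protected
) -> Tuple[List[str], List[str]]:
    """Split the requested paths into those that can be made immutable and
    those that must be skipped because they are not verifiably unaltered."""
    to_protect: List[str] = []
    skipped: List[str] = []
    for path in worm_paths:
        record = final_records.get(path)
        unaltered = (
            record is not None
            and record[0] == "IN_SYNC"
            and destination_scan.get(path) == record[1]
        )
        (to_protect if unaltered else skipped).append(path)
    return to_protect, skipped


to_protect, skipped = select_worm_targets(
    final_records={"ledger.db": ("IN_SYNC", "t5"),
                   "audit.log": ("IN_SYNC", "t5"),
                   "draft.txt": ("OUT OF SYNC", "t3")},
    destination_scan={"ledger.db": "t5", "audit.log": "t6", "draft.txt": "t3"},
    worm_paths=["ledger.db", "audit.log", "draft.txt", "missing.bin"],
)
```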
[0105] The steps as described above may be performed by a suitable computing system or controller that has access to the source and destination storage system. To this end, the steps may be performed from within storage system 100 or 120. The execution of the commands according to step 205 may further be performed in parallel by different computing systems to speed up the execution of the commands.
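Parallel execution of the commands of step 205 can be illustrated with a thread pool from Python's standard library. This is a stand-in: the text above refers to different computing systems, whereas the sketch uses threads within a single process. The names `execute_in_parallel` and `run` are assumptions, and mapping any exception to a FAILURE result is a choice made for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Cmd = Tuple[str, str]  # (command name, data item path)


def execute_in_parallel(
    commands: List[Cmd],
    run: Callable[[Cmd], str],
    max_workers: int = 4,
) -> List[Tuple[Cmd, str]]:
    """Execute commands concurrently and return a result list in the same
    order as the input. `run` is a hypothetical callable that performs one
    command and returns 'SUCCESS' or 'SKIPPED'."""

    def safe_run(command: Cmd) -> Tuple[Cmd, str]:
        try:
            return command, run(command)
        except Exception:
            return command, "FAILURE"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_run, commands))


def fake_run(command: Cmd) -> str:
    """Test double: fails for one path, succeeds for the rest."""
    if command[1] == "bad/name":
        raise OSError("destination rejected the name")
    return "SUCCESS"


results = execute_in_parallel(
    [("COPY_NEW", "a"), ("COPY", "bad/name"), ("DELETE", "old")], fake_run
)
```

Because `Executor.map` yields results in input order, the result list lines up with the command list regardless of which command finishes first.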
[0106] Although the present invention has been illustrated by reference to specific embodiments, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied with various changes and modifications without departing from the scope thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the scope of the claims are therefore intended to be embraced therein.
[0107] It will furthermore be understood by the reader of this patent application that the words “comprising” or “comprise” do not exclude other elements or steps, that the words “a” or “an” do not exclude a plurality, and that a single element, such as a computer system, a processor, or another integrated unit may fulfil the functions of several means recited in the claims. Any reference signs in the claims shall not be construed as limiting the respective claims concerned. The terms “first”, “second”, “third”, “a”, “b”, “c”, and the like, when used in the description or in the claims are introduced to distinguish between similar elements or steps and are not necessarily describing a sequential or chronological order. Similarly, the terms “top”, “bottom”, “over”, “under”, and the like are introduced for descriptive purposes and not necessarily to denote relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and embodiments of the invention are capable of operating according to the present invention in other sequences, or in orientations different from the one(s) described or illustrated above.