Information processing apparatus, information processing method, storage system and non-transitory computer readable storage media
09891992 ยท 2018-02-13
Assignee
Inventors
Cpc classification
G06F2211/1023
PHYSICS
G06F11/1076
PHYSICS
G06F2211/1028
PHYSICS
International classification
Abstract
An information processing apparatus can prevent performance deterioration, and maintain fault tolerance, in a storage system having storage nodes of different capacities. The apparatus includes a data writing unit to divide received data into divided data, generate a parity data usable when re-configuring the received data having an error, and write divided data and parity data in storage nodes. The apparatus includes a relocation unit to assign a relocation position of the data based on a predetermined condition and store the data in the assigned storage nodes. The apparatus includes a data reading unit to read the divided data so as not to read parity data stored in the storage nodes by identifying the parity data.
Claims
1. An information processing apparatus, comprising: a data writing unit to divide received data into a plurality of divided data, generate a plurality of parity data which is in a plurality of nodes and usable when re-configuring the received data having an error, and write divided data and parity data in a plurality of storage nodes, the divided data in one node can be relocated to the other node and assignment of the parity data can be changed; a relocation unit to assign a relocation position of the data written by the data writing unit to the plurality of storage nodes based on a predetermined condition and store the data in the assigned storage nodes in such a way as to relocate the divided data from one node to another node and to change the assignment of the location of the parity data; and a data reading unit to read the divided data so as not to read parity data stored in the plurality of storage nodes by identifying the parity data, wherein when a number of the divided data having been assigned to a given storage node has become the same as a number of parity data generated by the data writing unit, the relocation unit does not assign the divided data to the given storage node and instead assigns the divided data to a storage node that has a next system number in relation to a system number of the given storage node, wherein, when assigning a relocation position to the plurality of storage nodes, if there is a storage node of the plurality of storage nodes which has been assigned a smaller number of divided data than the number of divided data assigned to another storage node of the plurality of storage nodes and which is targeted to assign divided data of interest by the relocation unit, the relocation unit assigns the relocation position so that the parity data is stored in the storage node which is targeted to assign the divided data of interest by the relocation unit, otherwise the relocation unit assigns the relocation position so that the divided data is stored in the storage node which is targeted to assign the divided data of interest by the relocation unit, wherein the data writing unit, the relocation unit, and the data writing unit cooperatively prevent a risk that fault tolerance cannot be maintained as to the parity data.
2. The information processing apparatus according to claim 1, wherein the relocation unit assigns a relocation position so as to store a larger amount of data in a large capacity storage node among the plurality of storage nodes.
3. The information processing apparatus according to claim 2, wherein the relocation unit assigns a relocation position so as to store a larger amount of the parity data in the large capacity storage node.
4. The information processing apparatus according to claim 1, wherein, when assigning a relocation position to the plurality of storage nodes, upon data not having been assigned a relocation position yet including only the divided data or only the parity data, the relocation unit assigns a relocation position so as to store remaining data in the plurality of storage nodes.
5. The information processing apparatus according to claim 1, wherein, when reading the divided data stored in the plurality of storage nodes, upon divided data not to be able to be read existing in the divided data, the data reading unit reads the parity data.
6. A storage system, comprising the information processing apparatus according to claim 1 and the storage node being connected with each other via a network.
7. The information processing apparatus according to claim 1, wherein data loss is prevented up to a coincident number of disk failures equal to the number of the divided data.
8. An information processing method, comprising: dividing received data into a plurality of divided data, generating a plurality of parity data which is in a plurality of nodes and usable when re-configuring the received data having an error, and writing the divided data and the parity data to a plurality of storage nodes, the divided data in one node can be relocated to the other node and assignment of the parity data can be changed; assigning a relocation position of the data written by the data writing unit to the plurality of storage nodes based on a predetermined condition and store the data in the assigned storage nodes in such a way as to relocate the divided data from one node to another node and to change the assignment of the location of the parity data; and reading the divided data so as not to read parity data stored in the plurality of storage nodes by identifying the parity data, wherein when a number of the divided data having been assigned to a given storage node has become the same as a number of parity data that has been generated, the divided data is not assigned to the given storage node and instead the divided data is assigned to a storage node that has a next system number in relation to a system number of the given storage node, wherein, when assigning a relocation position to the plurality of storage nodes, if there is a storage node of the plurality of storage nodes which has been assigned a smaller number of divided data than the number of divided data assigned to another storage node of the plurality of storage nodes and which is targeted to assign divided data of interest, assigning the relocation position so that the parity data is stored in the storage node which is targeted to assign the divided data of interest, otherwise assigning the relocation position so that the divided data is stored in the storage node which is targeted to assign the divided data of interest, wherein the method prevents a risk that fault tolerance cannot be maintained as to the parity data.
9. A non-transitory computer readable storage medium storing a computer program causing a computer to execute: processing of dividing received data into a plurality of divided data, generating a plurality of parity data which is in a plurality of nodes and usable when re-configuring the received data having an error, and transmitting the divided data and the parity data to a plurality of storage nodes, the divided data in one node can be relocated to the other node and assignment of the parity data can be changed; processing of assigning a relocation position of the data written by the data writing unit to the plurality of storage nodes based on a predetermined condition and store the data in the assigned storage nodes in such a way as to relocated the divided data from one node to another node and to change the assignment of the location of the parity data; and processing of reading the divided data so as not to read parity data stored in the plurality of storage nodes by identifying the parity data, wherein when a number of the divided data having been assigned to a given storage node has become the same as a number of parity data that has been generated, the divided data is not assigned to the given storage node and instead the divided data is assigned to a storage node that has a next system number of the given storage node, wherein, when assigning a relocation position to the plurality of storage nodes, if there is a storage node of the plurality of storage nodes which has been assigned a smaller number of divided data than the number of divided data assigned to another storage node of the plurality of storage nodes and which is targeted to assign divided data of interest, assigning the relocation position so that the parity data is stored in the storage node which is targeted to assign the divided data of interest, otherwise assigning the relocation position so that the divided data is stored in the storage node which is targeted to assign the divided data of interest, wherein execution of the program prevents a risk that fault tolerance cannot be maintained as to the parity data.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
EXEMPLARY EMBODIMENT
First Exemplary Embodiment
(20) First, a storage system 1 according to the first exemplary embodiment of the present invention will be described.
(21)
(22) The access node 100 and the storage node 200 of which the storage system 1 according to this exemplary embodiment is composed are realized by the CPU 10 executing a computer program (hereinafter, referred to as program) stored in the memory 12 or the HDD 14. Alternatively, the information processing apparatus 1 may be realized by the CPU 10 executing a program stored in the storage medium 24. A program executed in the CPU 10 may be acquired from outside via the communication IF 16 or the reader/writer 22. The hardware exemplary configuration of the information processing apparatus 1 shown in
(23) Next, the access node 100 will be described.
(24) The communication unit 110 is an interface to other nodes which form a client and the storage system 1.
(25) The data processing unit 120 performs processing such as reading and writing data. The data processing unit 120 has a data transmission/reception unit 122, a data writing unit 124, a relocation unit 126 and a data reading unit 128.
(26) The data transmission/reception unit 122 sends and receives data to and from a client and other nodes. For example, the data transmission/reception unit 122 performs processing to send and receive data to and from a client, by using a protocol for file transfer such as NFS (Network File System) and CIFS (Common Internet File System).
(27) The data writing unit 124 performs processing to write data into the storage node 200. Description about details of data writing processing which the data writing unit 124 performs will be made later.
(28) The relocation unit 126 relocates data written in the storage node 200 by the data writing unit 124. Description of details of relocation processing which the relocation unit 126 performs will be made later.
(29) The data reading unit 128 performs processing to read data relocated in the storage node 200 by the relocation unit 126. Description of details of data reading processing which the data reading unit 128 performs will be made later.
(30) The node information processing unit 130 manages an operating status and configuration information of the other nodes. The node information processing unit 130 has a configuration changing unit 132 and a status monitoring unit 134.
(31) The configuration changing unit 132 updates configuration information which is stored in the storage device 140 mentioned later, when there is a change in the configuration of nodes due to performing addition, deletion or the like of a node in the storage system 1. For example, when a node is added newly, the configuration changing unit 132 makes, by updating configuration information, the node added newly available in data writing processing after the update. Also, when a node is added newly, the configuration changing unit 132 may perform relocation processing after updating configuration information in order to reduce a disproportion of data between nodes.
(32) The status monitoring unit 134 updates operation information stored in the storage device 140 when a disk failure, a node failure or the like occurs in the storage system 1. For example, the status monitoring unit 134 may transmit the updated operation information on the local node to other nodes. The status monitoring unit 134 may watch operation information on other nodes.
(33) The storage device 140 can memorize configuration information which is information such as the number, capacities and the like of physical nodes and virtual nodes, operation information which is information such as a disk failure, a node failure or the like, and relocation information which is used for relocation processing, and so on. Specifically, configuration information includes: physical node information which is physical information such as the number of nodes and the like; or capacity unit node information which is information on a virtual node defined according to the capacity of a storage node, failure unit node information which defines a minimum unit of data lost (such as a hard disk and RAID (Redundant Arrays of Inexpensive Disks)) as a virtual node, and so on. Operation information may be used in order not to use a disk or a node in which a failure has occurred in data writing processing. Operation information may be used in order to restore data having been lost due to a disk failure or a node failure.
(34) Next, the storage node 200 will be described.
(35) The data processing unit 220 has a data transmission/reception unit 222, a data writing unit 224, a relocation unit 226 and a data reading unit 228. The data processing unit 220 basically performs processing similar to the processing of the data processing unit 120 provided in the access node 100 mentioned above. Operations of the data transmission/reception unit 222, the data writing unit 224, the relocation unit 226 and the data reading unit 228 will be described simply respectively. The data transmission/reception unit 222 performs processing to send and receive data to and from the access node 100. The data writing unit 224 perform processing to write data received from the access node 100 into the data storage unit 250. The relocation unit 226 performs relocation processing to move data written in the data storage unit 250 to another storage node 200. The data reading unit 228 performs processing to read data written in the data storage unit 250.
(36) The data storage unit 250 stores metadata which holds information on a storage position of data and the like; and data transmitted from the access node 100. Here, the data storage unit 250 will be described with reference to
(37)
(38) Further, in the storage system 1, a virtual node including at least one failure unit node 254, which is referred to as a capacity unit node 252, is defined for each specific capacity. For example, it is supposed that the capacity of a data storage unit 250A provided in a storage node 200A (the capacity of the storage node A) is three times as large as the capacity of the storage node 200B. In this case, a capacity unit node 252A possessed by the data storage unit 250A provided in the storage node 200A (the capacity unit node 252 held in the storage node 200A) possesses capacity unit node 252 of the number three times as large as the number of capacity unit node 252 that the storage node 200B possesses. Further, it is enabled to read from and write to a hard disk by reading from and writing to the capacity unit node 252.
(39) Next, details about the data writing processing, relocation processing and data reading processing mentioned above will be described.
(40) =Data Writing Processing=
(41) The data writing processing in the access node 100 will be described with reference to
(42)
(43) As shown in
(44) In Step S104, the data writing unit 124 further divides the divided data block 510 into m pieces. Here, data made by being divided into m pieces is called divided data (original fragment (OF)) 520. A number (a data block number) is assigned to a data block 510 for description of processing mentioned later. The operations in Step S102 and Step S104 will be called a division processing operation.
(45) In Step S106, the data writing unit 124 generates n pieces of data of parity for error correction by using an error correction code based on the original fragment 520 divided by the division processing operation. Here, the generated n pieces of data of parity are called parity data (parity fragment (PF)) 530. The original fragment 520 and the parity fragment 530 are collectively called as a fragment (F) 540. The operation in Step S106 will be called a coding processing operation.
(46) In Step S108, the data writing unit 124 assigns transmission destinations of (m+n) pieces of fragment 540 generated by the division processing operation and the coding processing operation. An assignment method of transmission destinations of pieces of fragment 540 will be described below with reference to
(47)
(48) In Step S110, the data writing unit 124 transmits each fragment 540 to a storage node 200 having been assigned. The operations in Step S108 and Step S110 will be called a distributed writing processing operation.
(49) In this exemplary embodiment, although it has been explained that the access node 100 carries out the data writing processing operation, it may be carried out by a client. Specifically, a client may carry out the data writing processing operation by installing an application for the data writing processing operation in the client, and transmit the fragment 540 to the storage node 200 directly.
(50) Next, data writing processing in the storage node 200 will be described with reference to
(51)
(52) As shown in
(53)
(54) In Step S204, the data writing unit 224 stores the pieces of fragment 540 in the assigned capacity unit nodes. Specifically, the data writing unit 224 stores the fragment 540 in a disk by writing it in the capacity unit node 252.
(55) In Step S206, the data writing unit 224 notifies the other storage nodes 200 of position information on a fragment 540 that has been written (information which indicates the storage node 200 and the capacity unit node 252 storing the fragment 540).
(56) In Step S208, the data writing unit 224 stores storage position information on the fragment 540 into the data storage unit 250 provided in all storage nodes 200 as metadata. That is, it means that all storage nodes 200 store similar metadata. Here, the metadata will be described with reference to
(57) In Step S210, the data writing unit 224 notifies the access node 100 of completion of the writing processing. Also, in addition to the completion notification of the writing processing, the data writing unit 224 notifies the access node 100 of information representing that data block k is a data block for which relocation processing has not been carried out (relocation information).
(58) =Relocation Processing=
(59) Relocation processing will be described with reference to
(60)
(61) As shown in
(62) In Step S304, the relocation unit 126 identifies the storage position of each fragment 540. Specifically, by reading metadata stored in the storage node 200, the relocation unit 126 identifies the storage positions of pieces of fragment 540 constituting the data block 510 to be a target of relocation processing. Here, because similar metadata is stored in all storage nodes 200 as it has been described in data writing processing, the relocation unit 126 should simply read metadata stored in any one of the storage nodes 200.
(63) In Step S306, the relocation unit 126 assigns a storage position of each fragment 540 after relocation. Specifically, the relocation unit 126 determines the storage node 200 to be the new assignment destination of the fragment 540 first. Next, in the determined storage node 200, the relocation unit 126 determines to which capacity unit node 252 the fragment 540 is assigned. An assignment method after relocation of the fragment 540 will be described below with reference to
(64)
(65) Next, the relocation unit 126 finds a remainder made by dividing the data block number by the number of storage nodes, and identifies the storage node 200 having a storage node number the same as the calculated remainder. In the capacity unit nodes 252 provided in the identified storage node 200, the relocation unit 126 identifies a capacity unit node 252 having the smallest system number. The relocation unit 126 assigns pieces of fragment 540 in order of system number starting from the specified capacity unit node 252. Although the relocation unit 126 assigns pieces of fragment 540 in order of system number, it is supposed that an assignment destination is assigned to the storage node 200, not to the capacity unit node 252.
(66) A storage location of the fragment 540 after relocation is determined as mentioned above basically. However, the relocation unit 126 may determine a storage location of the fragment 540 by adding the following condition.
(67) ==Condition 1==
(68)
(69) For example, it is supposed that the number of pieces of parity fragment 530 included in the pieces of fragment 540 of which the data block 510 of the target of relocation processing is three. As shown in
(70) That is, when the fragment 540 (F8) is assigned by the relocation unit 126 based on the basic assignment method, it is assigned to the storage node 200 having the capacity unit node 252 (N12) (a portion represented by the dotted line in
(71) ==Condition 2==
(72)
(73) As a result, when the number of pieces of original fragment 520 that have been already assigned to each of other storage nodes 200 is larger than or equal to the number of pieces of original fragment 520 that have been already assigned to the storage node 200 to which a fragment 540 is being tried to be assigned next, the relocation unit 126 assigns the next original fragment 520 to the storage node 200 being tried. On the other hand, when there exists a storage node 200 with the number of pieces of original fragment 520 that have been already assigned that is smaller than the number of pieces of original fragment 520 that have been already assigned to the storage node 200 to which the fragment 540 is being tried to be assigned next, the relocation unit 126 assigns a parity fragment 530 to the storage node being tried.
(74) For example, it is supposed that F1-F4 shown in
(75) When pieces of fragment 540 that have not been assigned include only the original fragment 520 or only the parity fragment 530, respectively, the relocation unit 126 assigns remaining pieces of the fragment 540 while ignoring the condition 2.
(76) Finally, the relocation unit 126 assigns the fragment 540 assigned to each storage node 200 to a capacity unit node 252 provided in the each storage node 200. The assignment method of the fragment 540 to the each capacity unit node 252 is similar to the method in Step S202 of
(77) In Step S308, the relocation unit 126 determines the fragment 540 of a movement object. Specifically, the relocation unit 126 compares the storage position of the fragment 540 assigned in Step S306 and the storage position of the fragment 540 before relocation. As a result of comparison, the relocation unit 126 determines the fragment 540 for which movement is needed actually.
(78)
(79) In Step S310, the relocation unit 126 transmits a movement instruction of the fragment 540 to the storage node 200. Specifically, the relocation unit 126 transmits to the storage node 200 an instruction which to move the fragment 540 as decided in Step S308.
(80) The relocation processing mentioned above has been described such that it is carried out by issuing an instruction to the storage node 200 by the access node 100. However, the access node 100 does not need to issue an instruction necessarily. For example, in relocation processing, when there is an instruction of relocation processing from a certain storage node 200, it is possible to carry out relocation processing without communicating with the access node 100 by storing relocation information in the storage device 240 provided in the storage node 200.
(81) Also, in the relocation processing mentioned above, relocation processing of a fragment is carried out so that a failure of up to one storage node 200 in a plurality of storage nodes 200 can be endured. Moreover, in relocation processing, it is also possible to make failures of a plurality of storage nodes 200 be able to be endured. Specifically, by reducing the maximum number of pieces of fragment 540 per one piece of storage node 200, it becomes possible to cope with failures of a plurality of storage nodes 200. At that time, a structure by which how many failed storage nodes 200 can be handled can be changed by setting may be adopted.
(82) =Data Reading Processing=
(83) Next, data reading processing will be described with reference to
(84)
(85) As shown in
(86) In Step S404, the data reading unit 128 identifies pieces of data block 510 to be a target of reading. Specifically, the data reading unit 128 reads metadata stored in the storage node 200, and identifies the data block 510 corresponding to the data requested by the client. The data reading unit 128 should simply read metadata stored in any one piece of storage node 200 because similar metadata is stored in all storage nodes 200 as it has been described in the data writing processing and the relocation processing.
(87) In Step S406, the data reading unit 128 identifies pieces of original fragment 520 to be a target of reading. Specifically, the data reading unit 128 identifies the storage positions of the pieces of original fragment 520 constituting a data block 510 in question identified in Step S404 based on the metadata.
(88) In Step S408, the data transmission/reception unit 122 transmits a Read request of the original fragment 520 to be the target of reading to the storage node 200.
(89) Here, operations for transmitting the original fragment 520 for which a Read request has been issued to the access node 100 from the storage node 200 will be described. In data reading processing,
(90) In Step S410, the storage node 200 restores data using the received pieces of original fragment 520.
(91) As it has been described above, the storage system 1 according to the first exemplary embodiment of the present invention can prevent performance deterioration and keep fault tolerance in a storage system having storage nodes of different capacities.
(92) Specifically, the storage system 1 can prevent performance deterioration of data writing. A reason of this is that the data writing unit 124 writes data equally to storage nodes at the time of data writing processing. Another reason is that the relocation unit 126 performs relocation processing after data writing processing by the data writing unit 124 has been completed.
(93) The storage system 1 can prevent performance deterioration of data reading. A reason of this is that, on the occasion of data reading processing by the data reading unit 128, the parity fragment 530 is not read, and only the original fragment 520 is read. Another reason is that, by the relocation unit 126 performing relocation processing so that the storage node 200 that has a larger capacity may store a larger number of pieces of parity fragment 530 (that is, relocation processing is performed so that the number of pieces of original fragment 520 in each storage node 200 may become equal), a load is not made to concentrate on the storage node 200 of a large capacity at the time of data reading processing.
(94) According to the storage system 1, when the number of pieces of parity fragment 530 is n, data loss is not caused up to n coincident failures of disks. Further, according to the storage system 1, even if those n coincident failures occur in one storage node, data loss is not caused. The reason is that the relocation unit 126 performs relocation processing according to a predetermined condition so that fault tolerance may not be lost.
(95) Even in a case where the number of pieces of storage nodes 200 is not large compared with the number of fragment 540, the present invention can applied to the storage system 1. The reason is that data is stored by defining a virtual node that is the capacity unit node 252. Therefore, by increasing the number of virtual nodes, the number of nodes for storing virtually can be adjusted.
Second Exemplary Embodiment
(96) An information processing apparatus 400 according to the second exemplary embodiment of the present invention will be described.
(97) A data writing unit 424 divides received data into a plurality of divided data, generates a plurality of pieces of parity data that can be used when the received data in which an error has occurred is re-configured, and transmits the divided pieces of data and the pieces of parity data to a plurality of storage nodes 200.
(98) A relocation unit 426 assigns relocation positions of the data written by the data writing unit to the plurality of storage nodes 200 based on a predetermined condition, and stores the data in the assigned storage nodes.
(99) By identifying parity data stored in the plurality of storage node 200, a data reading unit 428 reads the divided pieces of data so as not to read the parity data.
(100) As it has been described above, in a storage system having storage nodes of different capacities, the information processing apparatus 400 according to the second exemplary embodiment of the present invention can prevent performance deterioration and keep fault tolerance.
(101) The previous description of embodiments is provided to enable a person skilled in the art to make and use the present invention. Moreover, various modifications to these exemplary embodiments will be readily apparent to those skilled in the art, and the generic principles and specific examples defined herein may be applied to other embodiments without the use of inventive faculty. Therefore, the present invention is not intended to be limited to the exemplary embodiments described herein but is to be accorded the widest scope as defined by the limitations of the claims and equivalents.
(102) Further, it is noted that the inventor's intent is to retain all equivalents of the claimed invention even if the claims are amended during prosecution.
INDUSTRIAL APPLICABILITY
(103) As an example of utilization of the present invention, a grid storage system which integrates a plurality of storage devices into one virtual storage device is considered. Also, it can be thought that the present invention can be used for storage having a duplication exclusion function.
DESCRIPTION OF SYMBOLS
(104) 1 Storage system 10 CPU 12 Memory 14 HDD 16 Communication IF 18 Input unit 20 Output unit 22 Reader/writer 24 Storage medium 26 Bus 100 Access node 200 Storage node 300 Network 110, 210 Communication unit 120, 220 Data processing unit 122, 222 Data transmission/reception unit 124, 224, 424 Data writing unit 126, 226, 426 Relocation unit 128, 228, 428 Data reading unit 130, 230 Node information processing unit 132, 232 Configuration changing unit 134, 234 Status monitoring unit 140, 240 Storage device 250 Data storage unit 252 Capacity unit node 254 Failure unit node 500 Data 510 Data block 520 Divided data (original fragment). 530 Parity data (parity fragment) 540 Fragment