Data storage method and method for executing an application with reduced access time to the stored data
11561934 · 2023-01-24
CPC classification
G06F16/183 (PHYSICS)
G06F3/0605 (PHYSICS)
International classification
G06F16/00 (PHYSICS)
Abstract
The invention concerns a storage method for storing, on data servers (3, 4), data file (5, 61 to 64) slices (51 to 58) from the execution of a plurality of processes (65 to 68) of one or more applications (83, 85), comprising: distributing the stored data file (5, 61 to 64) slices (51 to 58) over different data servers (3, 4), characterized in that: this distribution is carried out in such a way that the data file (5, 61 to 64) slices (51 to 58) likely to be subsequently accessed simultaneously by different application (83, 85) processes (65 to 68) are stored on different data servers (3, 4) so as to reduce the subsequent access, to each of all or part of these data servers (3, 4) by too many application (83, 85) processes (65 to 68) simultaneously, and in that: the determination of the data file (5, 61 to 64) slices (51 to 58) likely to be subsequently accessed simultaneously by different application (83, 85) processes (65 to 68) has been carried out, during a prior phase of executing these application (83, 85) processes (65 to 68), by observing the behavior of these application (83, 85) processes (65 to 68) in order to access these stored data file (5, 61 to 64) slices (51 to 58) over time.
Claims
1. A storage method for storing, on data servers, data file slices from the execution of several processes of one or more applications, comprising: distributing the data file slices over different data servers, where distributing results in the data file slices subsequently accessed simultaneously by different application processes being stored on different data servers so as to reduce the subsequent access, to each of all or part of these data servers by too many application processes simultaneously, wherein a determination of the data file slices accessed simultaneously by different application processes has been carried out, during a prior phase of executing these application processes, by observing the behavior of these application processes in order to access these stored data file slices over time.
2. The storage method according to claim 1, wherein: distributing the stored data file comprises distributing the stored data file slices over different data server storage spaces, where the distributing results in the data file slices subsequently accessed simultaneously by different application processes being stored on different data server storage spaces, so as to reduce the subsequent access, to each of all or part of these storage spaces by too many application processes simultaneously.
3. The storage method according to claim 1, wherein the determination of the data file slices simultaneously accessed by different application processes has been carried out, during a prior phase of execution of these processes for a single application or else application by application in the case of a plurality of applications, by observing the behavior of these processes of a single application at the same time in order to access these stored data file slices over time.
4. The storage method according to claim 1, wherein the one or more applications are portable to other types of data storage servers.
5. The storage method according to claim 1, wherein the processes of the application include repetitive calculations.
6. The storage method according to claim 5, wherein the repetitive calculations include calculations of weather forecasts.
7. The storage method according to claim 1, wherein the one or more applications are executed within a network comprising at least 5000 calculation nodes.
8. The storage method according to claim 1, wherein a maximum file slice size that can be stored in one go and a maximum number of storage spaces on which this file can be stored are associated with each file, the maximum file slice size being less than 2 MB and the maximum number of storage spaces being less than 10.
9. The storage method according to claim 1, wherein the distribution of stored data file slices over different data servers or over different storage spaces of different data servers is carried out by a library of functions which: intercepts the creation of data files, and carries out the storage of the slices of these data files on the storage spaces on the data servers associated therewith during said prior phase of execution of the application processes.
10. A method for executing several processes of one or more applications, comprising: a first phase of observing the running of said processes and a manner in which said processes access, over time, stored data during said execution, during which a determination of data file slices accessed simultaneously by different application processes is carried out, a second phase of parametrization of the storage of the data by said processes on data servers and on storage spaces thereof, associating, with the data file slices storage spaces on the data servers, a third phase of distributing stored data file slices over different data servers and over the storage spaces thereof, this distribution being carried out such that the data file slices subsequently simultaneously accessed by different application processes are stored on different data servers, where appropriate on different storage spaces of these data servers so as to reduce the subsequent access, to each of all or part of these data servers, where appropriate of these storage spaces, by too many application processes simultaneously.
11. A storage method for storing, on data servers, data from the execution of several processes, comprising: distributing stored data over different data servers, distributing resulting in groups of data subsequently accessed simultaneously by different processes being stored on different data servers so as to reduce subsequent access, to each of all or part of these data servers by too many processes simultaneously, wherein determining which groups of data are accessed simultaneously by different processes has been carried out during a prior phase of executing these processes by observing the behavior of these processes in order to access these groups of stored data over time.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE INVENTION
(11) A calculation node 1 runs a calculation and occasionally needs to read and/or write data on data servers 3 and 4 using a metadata server 2. The data server 3 (OSS, for “object storage server”) comprises a plurality of data disk spaces 31 to 33 (OST, for “object storage target”). The data server 4 comprises a plurality of data disk spaces 41 to 43. The metadata server 2 (MDS, for “metadata server”) comprises a plurality of metadata disk spaces 21 to 23 (MDT, for “metadata target”).
(12) The protocol used by the calculation node to communicate with servers 2 to 4 is, for example, that of Lustre (an open-source parallel file system). The calculation node 1 sends a file open request 11 to the metadata server 2. The metadata server 2 returns a response 12 containing file attributes and identifiers. These attributes include a segmentation block size of the data file to be stored and also a list of data servers or even a list of data disk spaces. The calculation node 1 sends the data server 3 a data read request 13 or a data write request 14. The data server 3 reads or writes the data on one of the data disk spaces 31 to 33. The calculation node 1 sends the data server 4 a data read request 15 or a data write request 16. The data server 4 reads or writes the data on one of the data disk spaces 41 to 43.
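The exchange described above can be sketched in a few lines. This is an illustrative model only; the class names, the response fields, and the target names are assumptions for the sketch, not the actual Lustre RPC interface.

```python
# Illustrative model of the open/read/write exchange described above.
# Class names, response fields, and target names are hypothetical.

class MetadataServer:
    """Stands in for the MDS (server 2): answers open requests with layout attributes."""
    def open_file(self, path):
        # Response 12: segmentation block size and the list of disk spaces to use.
        return {"stripe_size": 1 << 20,                     # 1 MB slices
                "targets": ["OST31", "OST32", "OST41", "OST42"]}

class ComputeNode:
    """Stands in for calculation node 1."""
    def __init__(self, mds):
        self.mds = mds

    def write(self, path, offset, data):
        attrs = self.mds.open_file(path)                    # request 11 / response 12
        slice_index = offset // attrs["stripe_size"]        # which slice holds this byte
        target = attrs["targets"][slice_index % len(attrs["targets"])]
        return target                                       # requests 14/16 would go here

node = ComputeNode(MetadataServer())
print(node.write("/scratch/file5", offset=5 << 20, data=b"..."))   # OST32
```

The point of the sketch is that the client, not the data servers, computes slice placement from the attributes returned at open time.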
(13) Nowadays, the largest computers, also referred to as supercomputers, are composed of several thousand independent calculation nodes, such as the calculation node 1, collectively executing one or more parallel applications, i.e. an application is executed on a large number of calculation nodes 1.
(14) These supercomputers generally use, to access input data and to write the results, a file system which is itself parallel, generally composed of several tens of servers such as the data servers 3 and 4.
(16) The calculation node 1 has a file 5 of data to be stored. The file 5 is segmented into eight slices 51 to 58. The data storage strategy is a random strategy, i.e. the file slices are stored randomly over all the data disk spaces which are assigned to the data file 5, in this case the data disk spaces 31 and 32 of the data server 3 and the data disk spaces 41 and 42 of the data server 4. The data server 3 stores the slices 51 and 55 on the data disk space 31 and the slices 52 and 56 on the data disk space 32. The data server 4 stores the slices 53 and 57 on the data disk space 41 and the slices 54 and 58 on the data disk space 42.
(17) During the creation of a file 5, the file system must choose which data servers 3 or 4 and which disk spaces 31 to 33 and/or 41 to 43 will be used to store its contents. The algorithms used nowadays are based on a random method (“round robin”), with the aim of randomly distributing the data over all the data servers 3 and 4 and promoting uniform filling.
(18) The principle is that an elementary data slice size (“stripe”) and a number of disk spaces 31 to 43 for each file 5 to be created are fixed in advance.
(19) With these two values fixed respectively, for example, at 1 megabyte for the size of the slice and at four for the number of disk spaces 31 and 32 and 41 and 42 on data servers 3 and 4, it is then possible to have the random distribution as depicted in
(21) Four data files 61 to 64 each containing four data slices must be stored on four data servers 100 to 103. With a random storage strategy, the storage carried out is for example the following.
(22) For the file 61, the first slice 611 is stored on the server 100, the second slice 612 is stored on the server 101, the third slice 613 is stored on the server 102, the fourth slice 614 is stored on the server 103.
(23) For the file 62, the first slice 621 is stored on the server 103, the second slice 622 is stored on the server 100, the third slice 623 is stored on the server 101, the fourth slice 624 is stored on the server 102.
(24) For the file 63, the first slice 631 is stored on the server 102, the second slice 632 is stored on the server 103, the third slice 633 is stored on the server 100, the fourth slice 634 is stored on the server 101.
(25) For the file 64, the first slice 641 is stored on the server 101, the second slice 642 is stored on the server 102, the third slice 643 is stored on the server 103, the fourth slice 644 is stored on the server 100.
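The four placements listed above follow a single rule: each file starts its round robin one server earlier than the previous file. A minimal sketch of that rule, with the server and file numbering used in the text (indices are 0-based, so file 61 is index 0 and slice 611 is index 0):

```python
SERVERS = [100, 101, 102, 103]

def placement(file_index, slice_index):
    """Round-robin placement with a per-file offset, reproducing the
    layout listed above: each file shifts its starting server back by one."""
    return SERVERS[(slice_index - file_index) % len(SERVERS)]

# File 62 (index 1): slices land on 103, 100, 101, 102, as listed above.
print([placement(1, i) for i in range(4)])   # [103, 100, 101, 102]
```

Running the function over all four files and four slices reproduces the full table of paragraphs (22) to (25).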
(26) Scientific applications have different typical behaviors regarding access to data. A frequent typical behavior is the mode referred to as “file per process”, in which each process of the parallel application, and there are up to several tens of thousands thereof, creates a file 5 in order to store the results it has calculated therein. In general, a subsequent step aggregates these partial results in order to more easily utilize them thereafter.
(27) During the creation of such an amount of files 5 by an application, it is common for a plurality of files 5 to be created on the same disk spaces 31 to 43, given the limited number thereof compared to the number of calculation nodes 1. However, since the selection of the disk spaces 31 to 43 is made at the moment the file 5 is created, this mechanism cannot take into account the profiles of future accesses to the files 5, and in particular the fact that these accesses are carried out simultaneously by a plurality of processes on their respective files 5, which has the effect of splitting the access performance across all the processes using it.
(28) The conventional solution to this problem, as explained previously, consists in distributing the data from all the files 5 by segmenting each one into slices 51 to 58 distributed in turn over each of the disk spaces 31 and 32 and 41 and 42 of servers 3 and 4. However, this mechanism does not always make it possible to respond to every usage case, and some cases remain particularly problematic.
(29) Consider the example of a parallel application with four application processes A, B, C and D. Each of these processes stores its data respectively in the files 61 to 64, which are all segmented into four slices and distributed over four data servers 100 to 103. In order to avoid the problems mentioned previously, the distribution of the data over the servers 100 to 103 is offset differently in each of the files 61 to 64, as shown in
(31) Four application processes 65 to 68 will have to access their respective files, stored on the servers 100 to 103, over time, whether to read or to write data, as detailed in
(32) The data represent a local version of a data matrix model with four columns. Should the size of a column be of the same order of magnitude as that of a file slice, or even just a multiple of a file slice, there is a correspondence between the columns of the matrix and the storage slices of the files 61 to 64.
(34) The four application processes 65 to 68 will have to access their respective files, stored on the servers 100 to 103, over time, whether to read or to write data, as detailed in
(35) For reasons of data interdependency, at the end of the calculation, each of the processes 65 to 68 begins to write, into its result file, the data of the column corresponding to its row in the parallel calculation application. The result is that all the processes “attack” the same data server at the same time, in this case the server 100.
(36) The immediate consequence of this behavior is that the instantaneous throughput of the server 100 is divided by four compared to the optimal case. This is a particularly detrimental case, involving a particular application with particularly problematic data structure sizes. More commonly, however, an application will be found writing its data to a server approximately ten times more slowly than it could.
(38) The table of
(39) The solution proposed by this embodiment in order to solve this problem consists in directing the creation of the files 61 to 64 over a set of data servers 100 to 103 on the basis of prior observations of executions of the processes 65 to 68 of the application(s), in order to determine their future behavior in terms of simultaneous accesses to the files 61 to 64. It thus becomes possible to position the files 61 to 64, or even the slices of the files 61 to 64, accessed simultaneously by the processes 65 to 68, on data servers 100 to 103 which are different if possible, or at least different for the majority of the file slices, or to carry out a uniform, or at least more uniform, distribution of the files 61 to 64, or of their slices, accessed simultaneously over the available data servers 100 to 103.
(40) The solution proposed by this embodiment of the invention is therefore based on the possibility of observing the behavior of an application, or of the processes 65 to 68 of application(s), during multiple executions and of storing it in a knowledge base in order to extract therefrom an ideal profile for the distribution of the files 61 to 64 created on the data servers 100 to 103.
(41) In the field of the scientific calculation of supercomputers, it is common for the same application to be executed numerous times during a study campaign. For example, weather prediction bodies carry out the same calculations every few hours using the most up-to-date measurements of physical parameters.
(42) Once the knowledge base is formed, an analysis can be carried out to detect the simultaneous accesses made by the processes 65 to 68 of the application(s). The accesses are classified by periods of access and by regions of files accessed, in the form of a slice number, as depicted in the table of
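The classification step can be sketched as a grouping of the observed access records by period; the record layout and field names below are assumptions for illustration, not the patent's actual data model. The example log reflects the pattern described above: in each period, each process touches a different slice of its own file.

```python
from collections import defaultdict

# Hypothetical observation records: (period, process, file, slice_number).
log = [(0, 65, 61, 1), (0, 66, 62, 2), (0, 67, 63, 3), (0, 68, 64, 4),
       (1, 65, 61, 2), (1, 66, 62, 3), (1, 67, 63, 4), (1, 68, 64, 1)]

def simultaneous_accesses(log):
    """Group the observed accesses by period: slices listed under the same
    period are accessed simultaneously by different processes and should
    therefore end up on different data servers."""
    by_period = defaultdict(list)
    for period, process, file_id, slice_no in log:
        by_period[period].append((file_id, slice_no))
    return dict(by_period)

print(simultaneous_accesses(log)[0])
# [(61, 1), (62, 2), (63, 3), (64, 4)]
```

Each group produced by this analysis is a constraint for the placement phase: its members must not share a data server.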
(44) This time, by virtue of the prior observation phase, and taking into account the access needs identified over time, a different strategy is chosen: instead of a predetermined random strategy, a strategy determined after the fact and adapted to the pre-identified access requirements. This strategy, perfectly suited to the processes of the applications considered here, could not have been guessed without a prior observation phase: storing all the same-numbered slices of each file on the same server is not a commonly used layout because, being perfectly symmetrical, it would on the contrary be considered a more likely cause of bottlenecks in the exchange of data between application processes and storage servers.
(45) More specifically, the data storage carried out is the same for all the slices of all the files. This storage is as follows.
(46) For the file 61, the first slice 611 is stored on the server 100, the second slice 612 is stored on the server 101, the third slice 613 is stored on the server 102, the fourth slice 614 is stored on the server 103.
(47) For the file 62, the first slice 621 is stored on the server 100, the second slice 622 is stored on the server 101, the third slice 623 is stored on the server 102, the fourth slice 624 is stored on the server 103.
(48) For the file 63, the first slice 631 is stored on the server 100, the second slice 632 is stored on the server 101, the third slice 633 is stored on the server 102, the fourth slice 634 is stored on the server 103.
(49) For the file 64, the first slice 641 is stored on the server 100, the second slice 642 is stored on the server 101, the third slice 643 is stored on the server 102, the fourth slice 644 is stored on the server 103.
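The gain can be checked by replaying the problematic access pattern (each process writing the slice matching its own rank, per the description above) against both layouts. The helper names are illustrative; the two placement functions encode, respectively, the offset round robin of paragraphs (22) to (25) and the aligned placement of paragraphs (46) to (49).

```python
from collections import Counter

SERVERS = [100, 101, 102, 103]

def offset_layout(file_index, slice_index):
    """Original placement: round robin shifted back by one server per file."""
    return SERVERS[(slice_index - file_index) % 4]

def aligned_layout(file_index, slice_index):
    """Placement chosen after observation: slice n of every file on server n."""
    return SERVERS[slice_index % 4]

# Problematic period: process p writes slice p of its own file (p = 0..3, 0-based).
accesses = [(p, p) for p in range(4)]

for layout in (offset_layout, aligned_layout):
    load = Counter(layout(f, s) for f, s in accesses)
    print(layout.__name__, dict(load))
# offset_layout {100: 4}                      -> all four processes hit server 100
# aligned_layout {100: 1, 101: 1, 102: 1, 103: 1}  -> one process per server
```

The replay shows the effect described in the text: the offset layout concentrates the whole period on the server 100, while the aligned layout spreads the four simultaneous accesses over the four servers.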
(50) The strategy will consist here in deducing, from the table of
(51) In order to obtain this placement, which would not be naturally generated by the file system, a mechanism is inserted which indicates to it how to carry out this ideal placement on the data servers 100 to 103 at the moment of creation of the file by the processes 65 to 68 of the application(s). This mechanism may in particular be implemented in the form of a library of functions intercepting the file creations of the processes of the application and carrying out this operation with predetermined parameters. This library will have access to the ideal-placement information for the files 61 to 64 developed from the analysis of previous executions.
(53) Different applications 83, or different processes of application(s), are designed so that their behavior is observed during the running of their execution. During this observation phase, observation software 82 observes the behavior of these applications 83 in order to determine the profile of each of the applications 83. Once determined, the profile of each application 83 is stored in a space 81 for archiving the profiles of the applications 83.
(54) There will therefore be two phases in the implementation of the strategy proposed by this embodiment of the invention. In a first phase, represented in
(56) Optimal parametrization software 84 reads the profiles of the applications in the space 81 for archiving the profiles of the applications 83. Using each of these profiles, this parametrization software 84 parametrizes each application so that it becomes an application 85 designed for the acceleration of its behavior, and more specifically for the reduction of its data exchange time with the data storage servers.
(57) In this second phase, depicted in
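The two phases can be tied together in a minimal end-to-end sketch: an observation pass produces a profile, the profile is archived, and a later parametrization pass turns it into placement parameters. All function names, the profile format, and the log layout are assumptions for illustration, standing in for elements 82, 81 and 84 respectively.

```python
import json, os, tempfile

def observe(access_log):
    """First phase (stands in for observation software 82): derive a profile
    from a recorded execution. Records are (period, file, slice_number)."""
    periods = {}
    for period, file_id, slice_no in access_log:
        periods.setdefault(period, []).append([file_id, slice_no])
    return {"simultaneous": periods}

def archive(profile, store):
    """Stands in for the archiving space 81: persist the profile."""
    path = os.path.join(store, "profile.json")
    with open(path, "w") as fh:
        json.dump(profile, fh)
    return path

def parametrize(path, servers):
    """Second phase (stands in for parametrization software 84): spread each
    group of simultaneously accessed slices over distinct servers."""
    with open(path) as fh:
        profile = json.load(fh)
    placement = {}
    for slices in profile["simultaneous"].values():
        for rank, (file_id, slice_no) in enumerate(slices):
            placement[(file_id, slice_no)] = servers[rank % len(servers)]
    return placement

# Slices of files 61 to 64 observed colliding in the same period.
log = [(0, 61, 1), (0, 62, 2), (0, 63, 3), (0, 64, 4)]
store = tempfile.mkdtemp()
plan = parametrize(archive(observe(log), store), [100, 101, 102, 103])
print(plan[(61, 1)], plan[(64, 4)])   # 100 103
```

Each colliding group ends up spread over distinct servers, which is the property the observed executions are used to guarantee.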
(58) Naturally, this invention is not limited to the examples and embodiments described and shown, but rather is subject to numerous variations accessible to the person skilled in the art.