System and method for automatically managing storage resources of a big data platform
11507622 · 2022-11-22
Assignee
Inventors
- Dan Grebenisan (Whitby, CA)
- Yue Ma (Mississauga, CA)
- Peter Sykora (Aurora, CA)
- Gordon Manway Lam (Richmond Hill, CA)
- Sarvjot Kaur Kang (Toronto, CA)
- Sai Macherla (Toronto, CA)
Cpc classification
G06F18/214
PHYSICS
International classification
G06F7/00
PHYSICS
Abstract
There is provided a computer-implemented method for automatically managing storage resources of a distributed file system comprising: obtaining actual past storage usage data of a first directory from a plurality of directories of the distributed file system to a current time; detecting a space quota limit for the first directory and associated with a pre-defined expected future time; determining, from the actual past storage usage data, projected storage usage data of the first directory over a future time period; obtaining an aggregated correction coefficient providing an indication of aggregated projected storage usage needs of remaining other directories relative to the first directory; in response to determining an expected value of the projected storage usage data at the expected future time is inconsistent with the space quota limit, adjusting the space quota limit to a new quota limit based on the expected value weighted by the aggregated correction coefficient.
Claims
1. A computer implemented method for automatically managing storage resources of a distributed file system, the method comprising: obtaining actual past storage usage data of a first directory from a plurality of directories of the distributed file system, the actual past storage usage data representative of storage usage at the first directory over a defined time period extending from a past time to a current time; detecting a space quota limit for the first directory, the space quota limit for providing a maximum limit on total storage for the first directory, the space quota limit associated with a pre-defined expected future time for providing a maximum amount of time for expecting use of resources of the first directory; determining, in real-time, based on the actual past storage data, projected storage usage data of the first directory, by inputting the actual past storage data into a trained machine learning model for determining a storage usage trend of the first directory, the projected storage usage data representing a future storage usage for the first directory over a future time period from the current time; wherein determining the projected storage usage data of the first directory further comprises using the actual past storage usage data to: define an interpolated curve representing a function of the actual past storage usage data extending from the current time to the past time; and determine the projected storage usage data of the first directory as a function of a first derivative of the interpolated curve reflecting a rate of change of the projected usage over time and a first derivative of a moving average of the interpolated curve; obtaining an aggregated correction coefficient providing an indication of aggregated projected storage usage needs of all other remaining distributed file system directories from the plurality of directories, relative to the projected storage usage data of the first directory, by characterizing a total aggregate 
influence of current and forecasted trends and behaviours of said all other distributed file system directories from the plurality of directories on a single directory from the directories comprising the first directory; in response to determining an expected value of a projected storage usage data at the expected future time is inconsistent with the space quota limit, adjusting the space quota limit to a new quota limit for the first directory based on the expected value weighted by the aggregated correction coefficient; and applying the new quota limit to the first directory from the current time.
2. The method of claim 1, wherein determining the new quota limit further comprises increasing the space quota limit to the new quota limit when the space quota limit is insufficient based on the projected storage usage data indicating that the space quota limit will be reached prior to the expected future time.
3. The method of claim 1, further comprising decreasing the space quota limit to the new quota limit when the projected storage usage data at the expected future time has a value below the space quota limit by at least a pre-defined amount.
4. The method of claim 1, wherein the weighting by the aggregated coefficient is further based upon an obtained value for total disk storage availability of a cluster defined by the plurality of directories of the distributed file system, the total disk storage availability indicating total amount of disk storage currently available for use by the plurality of directories and indicative of degree of possible change between the space quota limit and the new quota limit.
5. The method of claim 4, wherein the aggregated correction coefficient is further based upon: projecting respective storage needs of each of the plurality of directories using the trained machine learning model to determine a respective projected storage usage data for each of said directories and thereby a respective expected storage usage amount at the expected future time; and, determining the aggregated correction coefficient for each of the plurality of directories indicating a ratio of possible increase or decrease of respective space quota limit for each of the directories based upon the respective expected storage usage amount and the total disk storage availability for all of the plurality of directories.
6. The method of claim 1, wherein the space quota limit and the new quota limit provide different values for restricting a maximum number of bytes of disk space allowed to be used by files under a tree rooted at the first directory for respectively the current time and the expected future time, further to said adjusting.
7. The method of claim 1, wherein prior to determining the projected storage usage data of the first directory: training a machine learning model, to provide the trained machine learning model, using space usage training data representative of space usage of the first directory and each of the remaining other directories pre-defined as being related to the first directory.
8. The method of claim 1, wherein: the first derivative of the moving average of the interpolated curve is calculated to define a second slope indicating an average rate of change of the projected storage usage over time.
9. The method of claim 8, wherein the calculated first derivative of the interpolated curve is used to project a first expected storage usage amount at the expected future time in the future time period and the calculated first derivative of the moving average is used to project a second expected storage usage amount at the expected future time in the future time period, and the new quota limit is an average of the first and the second expected storage usage amount weighted by the aggregated correction coefficient.
10. The method of claim 8, wherein the new quota limit is calculated as Q.sub.1x such that:
11. The method of claim 1, wherein prior to obtaining actual past storage usage data receiving a trigger from a scheduler, indicating a scheduled scan of each of the plurality of directories for respective actual storage usage up to the current time used for obtaining the actual past storage usage data.
12. The method of claim 1, wherein the trained machine learning model comprises multiple machine trained machine learning models, each one configured for one of the directories of the plurality of directories.
13. The method of claim 1, further comprising repeating the method of claim 1 for each additional directory of the plurality of directories to result in determining a corresponding new quota limit for said each additional directory and thereby applying the corresponding new quota limit for said each additional directory.
14. A computer implemented method for creating a predictive machine learning engine for predicting data storage usage for managing storage resources of a distributed file system, the method comprising the steps of: determining trends and behaviour in received electronic past storage usage data for each directory of a plurality of directories of the distributed file system, the data extending from a current time to a past time using a first machine learning algorithm; obtaining a space quota limit for each said directory for imposing a maximum limit on total storage used by files in each said directory, the space quota limit having an associated expected future time such that expected use of the storage resources of each said directory is limited to prior to the expected future time; using the determined trends and behaviour to predict a projected storage usage data for each said directory extending for a future time period from the current time to a future time, including an expected storage usage amount for each said directory at the expected future time; wherein determining the projected storage usage data for each said directory further comprises: defining an interpolated curve representing a function of the past storage usage data extending from the current time to the past time; and determining the projected storage usage data for each said directory as a function of a first derivative of the interpolated curve reflecting a rate of change of the projected storage usage over time and a first derivative of a moving average of the interpolated curve; comparing the expected storage usage amount to the space quota limit for each said directory to determine whether a projected need for increase or decrease of the space quota limit exists; when a pre-defined difference exists in the comparison, then: calculating a correction coefficient for each said directory which determines a weighting for the projected need based on aggregated projected need for each said 
directory relative to a total disk storage availability of the plurality of directories, by characterizing a total aggregate influence of the current and forecasted trends and behaviours of the plurality of directories on each single directory from the directories; adjusting the space quota limit to a new space quota limit for each said directory to reduce the pre-defined difference based on the weighting applied to the expected storage usage amount; and, generating the prediction machine learning engine for use in applying the new space quota limit to each said directory in the distributed file system having the pre-defined difference indicating the projected need for the increase or decrease.
15. A computer device for automatically managing storage resources of a distributed file system, the device comprising: a storage device storing instructions; a communications interface; at least one processor in communication with the storage device and the communications interface, the at least one processor configured to execute the instructions for: obtaining actual past storage usage data of a first directory from a plurality of directories of the distributed file system across the communications interface, the actual past storage usage data representative of storage usage at the first directory over a defined time period extending from a past time to a current time; detecting a space quota limit for the first directory, the space quota limit for providing a maximum limit on total storage for the first directory, the space quota limit associated with a pre-defined expected future time for providing a maximum amount of time for expecting use of the resources of the first directory; determining, in real-time, based on the actual past storage data, projected storage usage data of the first directory, by inputting the actual past storage data into a trained machine learning model for determining a storage usage trend of the first directory, a projected storage usage data representing a future storage usage for the first directory over a future time period from the current time; wherein determining the projected storage usage data of the first directory comprises using the actual past storage usage data to: define an interpolated curve representing a function of the actual past storage usage data extending from the current time to the past time; and determine the projected storage usage data of the first directory as a function of a first derivative of the interpolated curve reflecting a rate of change of the projected usage over time and a first derivative of a moving average of the interpolated curve; obtaining an aggregated correction coefficient providing an 
indication of aggregated projected storage usage needs of all other remaining distributed file system directories from the plurality of directories, relative to the projected storage usage data of the first directory, by characterizing a total aggregate influence of the current and forecasted trends and behaviours of said all other distributed file system directories from the plurality of directories on a single directory from the directories comprising the first directory; in response to determining an expected value of a projected storage usage data at the expected future time is inconsistent with the space quota limit, adjusting the space quota limit to a new quota limit for the first directory based on the expected value weighted by the aggregated correction coefficient; and applying the new quota limit to the first directory from the current time.
16. The computer device of claim 15, wherein the at least one processor is further configured for: determining the new quota limit further comprises increasing the space quota limit to the new quota limit when the space quota limit is insufficient based on the projected storage usage data indicating that the space quota limit will be reached prior to the expected future time.
17. The computer device of claim 15, wherein the at least one processor is further configured for: decreasing the space quota limit to the new quota limit when the projected storage usage data at the expected future time has a value below the space quota limit by at least a pre-defined amount.
18. The computer device of claim 15, wherein the at least one processor is further configured for: weighting by the aggregated correction coefficient further based upon an obtained value for total disk storage availability of a cluster defined by the plurality of directories of the distributed file system, the total disk storage availability indicating total amount of disk storage currently available for use by the plurality of directories and indicative of degree of possible change between the space quota limit and the new quota limit.
19. The computer device of claim 18, wherein the aggregated correction coefficient is further based upon: projecting respective storage needs of each of the plurality of directories using the trained machine learning model to determine the respective projected storage usage data for each of said directories and thereby a respective expected storage usage amount at the expected future time; and, determining the aggregated correction coefficient for each of the plurality of directories indicating a ratio of possible increase or decrease of respective space quota limit for each of the directories based upon the respective expected storage usage amount and the total disk storage availability for all of the plurality of directories.
20. The computer device of claim 15, wherein the space quota limit and the new quota limit provide different values for restricting a maximum number of bytes of disk space allowed to be used by files under a tree rooted at the first directory for respectively the current time and the expected future time, further to said adjusting.
21. The computer device of claim 15, wherein prior to determining the projected storage usage data of the first directory: training a machine learning model, to provide the trained machine learning model, using space usage training data representative of space usage of the first directory and each of the remaining other directories pre-defined as being related to the first directory.
22. The computer device of claim 15, wherein: the first derivative of the interpolated curve is calculated via the computer device; the first derivative of the moving average of the interpolated curve is calculated via the computer device for defining a second slope indicating an average rate of change of the projected storage usage over time.
23. The computer device of claim 22, wherein the new quota limit is calculated as Q.sub.1x such that:
24. The computer device of claim 22, wherein the calculated first derivative of the interpolated curve is used to project a first expected storage usage amount at the expected future time in the future time period and the calculated first derivative of the moving average is used to project a second expected storage usage amount at the expected future time in the future time period, and the new quota limit is an average of the first and the second expected storage usage amount weighted by the aggregated correction coefficient.
25. The computer device of claim 24, wherein prior to obtaining actual past storage usage data receiving a trigger from a scheduler, indicating a scheduled scan of each of the plurality of directories for respective actual storage usage up to the current time used for obtaining the actual past storage usage data.
26. The computer device of claim 15, wherein the trained machine learning model comprises multiple machine trained machine learning models, each one configured for one of the directories of the plurality of directories.
27. A non-transitory computer readable medium having stored thereon computer program code configured, when executed by one or more processors, to cause the one or more processors to perform a method for automatically managing storage resources of a distributed file system, the method comprising: obtaining, in real-time, actual past storage usage data of a first directory from a plurality of directories of the distributed file system, the actual past storage usage data representative of storage usage at the first directory over a defined time period extending from a past time to a current time; detecting a space quota limit for the first directory, the space quota limit for providing a maximum limit on total storage used by the first directory, the space quota limit associated with a pre-defined expected future time for providing a maximum amount of time for expecting use of the resources of the first directory; determining, in real-time, based on the actual past storage data, projected storage usage data of the first directory, by inputting the actual past storage data into a trained machine learning model for determining a storage usage trend of the first directory, the projected storage usage data representing a future storage usage for the first directory over a future time period from the current time; wherein determining projected storage usage data of the first directory comprises using the actual past storage usage data to: define an interpolated curve representing a function of the actual past storage usage data extending from the current time to the past time; and determine the projected storage usage data of the first directory as a function of a first derivative of the interpolated curve reflecting a rate of change of the projected usage over time and a first derivative of a moving average of the interpolated curve; obtaining an aggregated correction coefficient providing an indication of aggregated projected storage usage needs of all other remaining 
distributed file system directories from the plurality of directories, relative to the projected storage usage data of the first directory, by characterizing a total aggregate influence of the current and forecasted trends and behaviours of said all other distributed file system directories from the plurality of directories on a single directory from the directories comprising the first directory; in response to determining an expected value of the projected storage usage data at the expected future time is inconsistent with the space quota limit, adjusting the space quota limit to new quota limit for the first directory based on the expected value weighted by the aggregated correction coefficient; and applying the new quota limit to the first directory from the current time.
28. The computer readable medium of claim 27 further comprising: repeating the method steps of claim 27 for each additional directory of the plurality of directories to result in determining a corresponding new quota limit for said each additional directory and applying the corresponding new quota limit for said each additional directory.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
(9) The present disclosure provides methods and systems for managing storage demand(s) of big data platforms having distributed file management. While various embodiments of the disclosure are described below, the disclosure is not limited to these embodiments, and variations of these embodiments may well fall within the scope of the disclosure.
(10) Reference will now be made in detail to embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
(11) As used in the present disclosure, the term “computer” or “computer device” is intended to encompass any suitable computerized processing device. For example, this may include any computer or processing device such as a blade server, general-purpose personal computer (PC), Mac®, workstation, UNIX-based workstation, or any other suitable device. Moreover, although
(12) As used herein, the following terms expressly include, but are not to be limited to:
(13) “Metadata” means information about a file, such as its size, location, creation time, modification time, access permissions, redundancy parameters, and the like. Metadata is all forms of data that describe a file, as opposed to being the data in the file itself. In general, the size of this information is much smaller than the size of the file itself.
(14) “Data” means the actual content of a file, as opposed to file metadata.
(15) “File system” refers to a component of an operating system responsible for managing files.
(16) “Distributed file system” is a file system which runs on more than one computer (each also referred to as a host). A distributed file system may be a client/server-based application that allows clients to access and process data stored on a server of the distributed file system as if the data were stored on a local machine.
(17) A storage management computing device may refer to a device for managing data storage on one or more data devices.
(18) As disclosed herein, the use of “or” means “and/or” unless stated otherwise. Furthermore, the use of the term “including,” as well as other forms such as “includes” and “included,” is not limiting. In addition, terms such as “element” or “module” or “component” encompass both elements and components comprising one unit, and elements and components that comprise more than one subunit, unless specifically stated otherwise. Additionally, any section headings used herein are for organizational purposes only, and are not to be construed as limiting the subject matter described.
(19) Generally, the present disclosure provides computer implemented methods and systems for managing storage demands for big data platforms (e.g. distributed file systems).
(20) While various embodiments of the disclosure are described below, the disclosure is not limited to these embodiments, and variations of these embodiments may well fall within the scope of the disclosure which is to be limited only by the appended claims.
(21) Referring to
(22) Further, as illustrated in
(23) The distributed file system 104 comprises a plurality of project storage directories 105-1, 105-2, . . . , 105-N (collectively project storage directories 105), also referred to as projects or directories herein. Typically, every organization unit (e.g. accounting department, human resources department, or engineering team) has a dedicated directory or node on a distributed file system (e.g. HDFS) and an associated group of users (e.g. user groups 154-1, 154-2, . . . 154-N, collectively user group 154) that has access to a particular organization unit (e.g. human resources department).
(24) For example, as illustrated, a first project storage directory 105-1 may be accessed by a first user group 154-1, while a second project storage directory 105-2 may be accessed separately by a second user group 154-2 and project storage directory 105-N may be accessed by the Nth user group 154-N.
(25) In one example, there may be a number of teams of developers, each one being associated with one or more projects, represented as User Group 1 to User Group N. Each project may be assigned a particular workspace in the distributed file system 104. The workspaces shown as project storage directories 105-1 . . . 105-N may be HDFS directories, each one serving as a master place where all of the development and the associated data of that project is stored.
(26) For example, each HDFS directory shown as project storage directory 105 is the data storage workspace assigned to a project, and where the storage limits for each directory are allocated, monitored and enforced by the environment 100.
(27) Generally, each project storage directory 105 may have an associated maximum storage amount (e.g. number of bytes allowable for storage of files within that directory). As described above, in typical distributed file systems once the maximum storage amount for a particular directory is reached, further activity for that particular directory is restricted (e.g. no further files may be stored on the particular directory and in some cases, further access to the directory may also be limited).
(28) In the present disclosure and as illustrated in
(29) The storage management device 102 (also further illustrated in
(30) The storage management device 102 comprises one or more processors 134 communicative with one or more tangible, non-transitory memories (e.g. data repository 106) that store data and/or software program instructions. Accordingly, the processors 134 execute computer program instructions (e.g. an operating system and/or modules 120, 122, 124, 126, 128, 130, and 132) to perform any of the methods described herein.
(31) In operation, the disk usage monitoring module 120 is configured to monitor and receive metadata from the data repository 106 which comprises directory details 114, storage history data 116, and system settings data 118 for the directories 105. The storage history data 116 comprises actual historical data of storage use (e.g. total number of bytes used for project storage directories 105-1 . . . 105-N over a prior time period) for each of the storage directories 105. The system settings data 118 comprises actual total capacity of the distributed file system 104 for storing files, and the maximum allowable storage amount attributed to each of the project storage directories 105.
(32) The data repository 106 stores detailed configurations and directory/file metadata such as naming conventions of each directory (e.g. directory details 114), including the allocated storage limits (e.g. system settings data 118). The data repository 106 also stores the actual measured storage usage history of each directory (e.g. storage history data 116).
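The three kinds of repository data described above can be pictured as a small in-memory structure. The following is a minimal sketch only; the class names, field names, and time unit are hypothetical and are not taken from the disclosure, which does not specify a storage schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class UsageSample:
    """One measured point of storage history data 116 (hypothetical layout)."""
    timestamp: float  # assumed unit: seconds since an arbitrary epoch
    bytes_used: int


@dataclass
class DirectoryRecord:
    """Per-directory entry combining directory details and settings."""
    path: str                  # directory details 114, e.g. the directory name
    quota_bytes: int           # per-directory limit from system settings 118
    expected_end_time: float   # the pre-defined expected future time
    history: List[UsageSample] = field(default_factory=list)


@dataclass
class Repository:
    """Stand-in for data repository 106; not the patented implementation."""
    total_capacity_bytes: int
    directories: Dict[str, DirectoryRecord] = field(default_factory=dict)

    def record_scan(self, path: str, timestamp: float, bytes_used: int) -> None:
        # Append one measurement produced by a scheduled scan.
        self.directories[path].history.append(UsageSample(timestamp, bytes_used))

    def actual_demand(self, path: str) -> List[Tuple[float, int]]:
        # Return (time, bytes) pairs ordered from the past time to the current time.
        record = self.directories[path]
        return sorted((s.timestamp, s.bytes_used) for s in record.history)
```

Keeping the quota, the expected future time, and the measured history on the same record means a forecasting step can read everything it needs for one directory in a single lookup.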
(33) The trends and behaviour module 126 is configured to communicate with the disk usage monitoring module 120 and/or data repository 106 and generate forecasted storage demand data that characterizes an expected demand for one or more of the project storage directories 105 (e.g. 105-1) during a future time interval, such as a future time period from the current time up to an expected future time, e.g. the expected time duration of use of each of the storage directories 105. The trends and behaviour module 126 includes a trained machine learning model that is used for forecasting a future storage demand (i.e. projected or expected storage usage) at each directory 105-1 . . . 105-N, as described in further detail below.
(34) In at least one aspect, the trained machine learning model provided by the trends and behaviour module 126 comprises multiple trained machine learning models, each one configured specifically for a respective one of the directories of the plurality of directories.
(35) For example, user group 154-1 is expected to use the first storage directory 105-1 for a specific project X that is estimated to last until the expected future time (e.g. the project duration). Further and as described herein, the machine learning engine 124, and particularly the trends and behaviour module 126, may compute the expected storage demand from actual demand data 121 provided by the disk usage monitoring module 120 (based on portions of data from the data repository 106) for the one or more project storage directories 105 during a prior time interval, e.g. extending from a past time (P) to the current time. For example, the past time (P) may correspond to the first use of the storage directory 105-1. Thus, the actual demand data 121 indicates, from the storage history data 116, actual storage usage values (e.g. number of bytes as a function of time) from the past time up to the present time.
(36) Further, in at least some aspects as described herein, the trends and behaviour module 126 may compute the expected or forecasted storage demand for each of the directories 105 by first establishing a curve (e.g. a best fitting curve) of the actual demand data 121. Based on the established curve, the forecasted storage demand may further be defined as a function of at least one of: a computed first derivative of the curve projected to at least the expected future time (e.g. as given by metadata characterizing the future time until which the project for the particular directory is expected to last, and thereby the period over which storage resources of the particular directory are expected to be used) and a computed first derivative of a moving average of the curve projected to the expected future time. In some aspects, the forecasted storage demand data for a directory may be based on a relationship (e.g. an average or a median curve) of the computed first derivative of the curve projected to at least the expected future time and the computed first derivative of the moving average of the curve projected to the expected future time.
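The two-slope projection described above can be illustrated with a short sketch. It makes several assumptions that are not in the disclosure: the curve is taken to be a piecewise-linear interpolation of the samples, the moving average is a trailing window over a fixed number of samples, both projections are linear extrapolations from the latest measurement, and the two results are combined by a plain average. The function names are hypothetical.

```python
def slope(p, q):
    """Rate of change between two (time, bytes) points."""
    (t0, y0), (t1, y1) = p, q
    return (y1 - y0) / (t1 - t0)


def moving_average(samples, window):
    """Trailing moving average of the usage values, one point per full window."""
    smoothed = []
    for i in range(window - 1, len(samples)):
        chunk = samples[i - window + 1 : i + 1]
        smoothed.append((samples[i][0], sum(y for _, y in chunk) / window))
    return smoothed


def project_usage(samples, t_future, window=3):
    """Project usage at t_future from samples ordered past -> current.

    Returns (first, second, combined):
      first    - extrapolation using the latest slope of the curve
      second   - extrapolation using the latest slope of its moving average
      combined - plain average of the two (an assumed combination rule)
    """
    if len(samples) < window + 1:
        raise ValueError("need at least window + 1 samples")
    t_now, y_now = samples[-1]
    horizon = t_future - t_now

    instantaneous = slope(samples[-2], samples[-1])
    smoothed = moving_average(samples, window)
    averaged = slope(smoothed[-2], smoothed[-1])

    first = y_now + instantaneous * horizon
    second = y_now + averaged * horizon
    return first, second, (first + second) / 2
```

With perfectly linear history the two projections coincide. After a sudden spike in the last sample, the moving-average slope is smaller than the instantaneous slope, so the combined value sits between an aggressive and a conservative estimate.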
(37) As will be described, the projected expected storage demand for a directory (e.g. 105-1) may be used by the decision manager module 128 to compare to the maximum allowable storage amount for the directory (e.g. 105-1), which may be provided by the system settings data 118.
(38) Additionally, the coordinator and optimizer module 130 is configured to track and monitor the expected projected demands of each of the directories (e.g. 105-1 . . . 105-N) as computed by the trends and behaviour module 126. The module 130 compares the forecasted demands to the total allocated storage capacity for the entire shared distributed file system 104 (e.g. total storage amount allocated for the directories 105-1 . . . 105-N) and to the forecasted demands of the plurality of directories 105 considered as a whole. It subsequently determines a forecasted demand weighting for each of the directories (also referred to as an aggregated correction coefficient), based on the correlation of predictions of all project storage directories 105 and total cluster disk availability. For example, as will be described, the coordinator and optimizer module 130 takes into account whether, based on forecasted demands, each of the storage directories 105 requires an increased quantity in allocated storage amount, a decreased quantity in allocated storage amount, or no change in the allocated storage amount. The coordinator and optimizer module 130 then determines the forecasted demand weighting for each of the storage directories 105, as a function of the expected forecasted demands of all of the remaining other directories and the total allocated storage capacity for the entire distributed file system 104.
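The disclosure does not give a closed-form rule for the weighting at this point, so the following is only one possible reading, offered as an illustration. It assumes a proportional-sharing rule: space released by directories forecast to shrink is pooled with the unallocated cluster space, and directories forecast to grow share that pool in proportion to their requested increase. The function name, parameter names, and the rule itself are assumptions.

```python
def correction_coefficients(quotas, expected, free_bytes):
    """Return a per-directory weight to apply to its expected usage.

    quotas     - {directory: current quota in bytes}
    expected   - {directory: forecast usage at the expected future time}
    free_bytes - cluster space not yet allocated to any quota (assumed meaning
                 of "total disk storage availability")
    """
    needs = {d: expected[d] - quotas[d] for d in quotas}
    requested = sum(n for n in needs.values() if n > 0)   # total growth asked for
    released = sum(-n for n in needs.values() if n < 0)   # space freed by shrinking
    available = free_bytes + released

    # Fraction of each requested increase that the cluster can actually grant.
    scale = 1.0 if requested <= available else available / requested

    coefficients = {}
    for directory, need in needs.items():
        if need > 0:
            granted = quotas[directory] + need * scale
            coefficients[directory] = granted / expected[directory]
        else:
            # Shrinking or unchanged directories keep their full forecast.
            coefficients[directory] = 1.0
    return coefficients
```

Under this rule the coefficient for one directory depends on the forecasts of every other directory, because both `requested` and `released` are sums over the whole cluster. When the cluster has room for every requested increase, every coefficient is 1.0 and each directory simply receives its forecast.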
(39) The decision manager module 128 is then configured to communicate with at least the modules 126 and 130 to determine an adjusted maximum storage amount (e.g. new space quota limit) for each directory 105-1, . . . 105-N and provide same to the disk quota setting module 122 which is configured to provide the adjusted storage amount to the data repository 106 for updating the system settings 118 and to apply the new space quota limit for each directory 105 having its maximum storage amount adjusted. In this way, the adjusted maximum storage amount for each directory 105 as computed by the decision manager module 128 accounts for both forecasted storage demands (e.g. as provided by module 126) and the relative forecasted storage needs of all directories within a cluster of the distributed file system 104 (e.g. as provided by module 130).
(40) For example, as described herein, the trends and behaviour module 126 may establish based on the projected storage demands, a desired adjusted storage amount for each directory. In one example, the desired adjusted storage amount for a particular directory (e.g. 105-1) is equal to forecasted storage demand data at the pre-defined expected future time, as generated according to the methods described herein.
(41) As also described herein, the coordinator and optimizer module 130 establishes the forecasted demand weighting factor as input to the decision manager module 128 and thus the decision manager module 128 is configured to apply the forecasted demand weighting factor to the desired adjusted storage amount (or forecasted storage demand data) to obtain the new space quota limit for each said directory 105 and apply same, via the disk quota setting module 122, to the distributed file system 104 for subsequent enforcement thereof.
(42) The disclosed embodiments are not limited to these examples of actual or forecasted storage demand data.
(43) Referring again to
(44) Referring again to
(45) Referring again to
(46) Referring again to
(47) User interface 146 may also support user interactions with the distributed file system 104 such as initial configuration (e.g. storage size settings) of project storage directories 105. The user interface 146 also presents updates and receives feedback for details of an existing project, different parameters of system settings (e.g. relating to system settings data 118), assigns projects to user groups 154, and supports other administrative tasks requiring administrator or user input. The content delivery system 108 and the client device 110 including the user interface 146 could be local or web-based, served from a web server, an application server, or a cloud container. In one aspect, the user interface 146 provides a graphical interface via the display unit 148 for presentation to a user, e.g. an administrator of the computing environment 100 such as to configure various system settings.
(48) Referring now to
(49) As illustrated, the storage management device 102 comprises one or more processors 134, and one or more input devices 156. Input devices may be a keyboard, a key pad, buttons, pointing device, microphone, a camera or an IR sensor (receiver). The storage management device 102 further comprises one or more output devices 158 as well as at least one optical output device. Output devices may include a speaker, light, bell, vibratory device, etc. An optical output device may be a display screen, or an IR transmitter or a projector. The storage management device 102 may have more than one display screen. It is understood that a display screen used in the storage management device 102 may be configured as an input device as well, for example, a gesture based device for receiving touch inputs according to various known technologies (e.g. in relation to input capabilities: resistive touchscreen, a surface acoustic wave touchscreen, a capacitive touchscreen, a projective capacitance touchscreen, a pressure-sensitive screen, an acoustic pulse recognition touchscreen, or another presence-sensitive screen technology; and in relation to output capabilities: a liquid crystal display (LCD), light emitting diode (LED) display, organic light-emitting diode (OLED) display, dot matrix display, e-ink, or similar monochrome or color display).
(50) The storage management device 102 further comprises one or more communications units 136 (e.g. antenna, induction coil, external buses (e.g. USB, etc.)) for communicating via one or more communication networks to one or more other computing devices, e.g. 104, 108, and 110.
(51) The storage management device 102 further comprises one or more storage devices 160. The one or more storage devices 160 may store instructions and/or data for processing during operation of the storage management device 102. The one or more storage devices 160 may take different forms and/or configurations, for example, as short-term memory or long-term memory. Storage devices 160 may be configured for short-term storage of information as volatile memory, which does not retain stored contents when power is removed. Volatile memory examples include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), etc. Storage devices 160, in some examples, also include one or more computer-readable storage media, for example, to store larger amounts of information than volatile memory and/or to store such information for long term, retaining information when power is removed. Non-volatile memory examples include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory.
(52) The storage devices 160 store instructions and/or data for the storage management device 102, said instructions when executed by the one or more processors 134 configure the storage management device 102 to perform various operations and methods as described herein.
(53) Instructions may be stored as modules such as the scheduler 132 for triggering performing forecasting of storage demand data, the machine learning engine 124 for performing the forecasting of expected storage demands (e.g. via the trends and behaviour module 126) of each of the directories 105 of the distributed file system 104 of
(54) Instructions may further be stored for the coordinator and optimizer module 130 configured for determining a forecasted demand weighting factor (also referred to as the aggregate correction coefficient) for each one of the directories (e.g. 105-1) based on the forecasted storage demands of the remaining other directories (e.g. 105-2 . . . 105-N). Instructions may further be stored for the decision manager module 128 which utilizes the forecasted storage demand data as provided by the trends and behaviour module 126 to determine a desired storage amount for a particular directory (e.g. 105-1) and then apply the forecasted demand weighting factor thereto to generate the new space quota limit for the particular directory. Instructions may further be stored as the disk quota setting module 122 configured to retrieve the new space quota limit for the particular directory (e.g. 105-1) and apply it to the data repository 106 and/or directly to the distributed file system 104 for enforcement of storage amounts of 105-1 to be limited to a maximum defined by the new space quota limit.
(55) Other modules are not shown such as an operating system, software applications, etc.
(56) Communication channels 162 may couple each of the components 134, 136, 156, 158, and 160 for inter-component communications, whether communicatively, physically and/or operatively. In some examples, communication channels 162 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data.
(57) The storage management device 102 may be a tablet computer, a personal digital assistant (PDA), a laptop computer, a tabletop computer, a portable media player, an e-book reader, a watch, a personal computer or a workstation, or a computer system or computer platform including one or more cloud computing or virtual machine(s) or computing container(s) running on such a computer system or platform, or another type of computing device. In at least some aspects, the data repository 106 may include structured or unstructured data records identifying and characterizing one or more project storage directories 105-1 . . . 105-N, and associated user groups 154-1 . . . 154-N.
(58)
(59) At step 202, the storage management device 102 communicates with the distributed file system 104 and specifically, the project storage directories 105 to obtain actual past storage usage data of a particular directory, such as a first directory 105-1 of the distributed file system 104. This information may also be continuously stored and updated within storage history data 116. In one aspect, step 202 may be triggered by a scheduler 132 or by receiving an instruction from a user of the storage management device 102 (e.g. via the input device(s) 156) to initiate a storage demand forecast.
(60) For simplicity, one or more embodiments of the present disclosure describe tracking past storage usage and forecasting future storage usage of “a first directory”, by way of example of a particular directory. The first directory 105-1 is a non-limiting example and the present disclosure is not limited to these embodiments. For example, it would be understood by a person skilled in the art that the systems and methods described herein may be similarly applied to any other particular directory (e.g. 105-2, 105-3 . . . 105-N) of the project storage directories 105 configured to operate as described herein.
(61) At step 204, the storage management device 102 further communicates with the distributed file system 104 to detect a space quota limit characterizing a current allowable storage capacity for the first directory 105-1. The space quota limit defines a maximum limit on total storage for the first directory 105-1 (e.g. total number of bytes used by files within the first directory). Additionally, the space quota limit for the first directory 105-1 is associated with a pre-defined expected future time which characterizes a maximum amount of time for expecting use of the resources of the first directory 105-1. For example, the expected future time, may characterize a project timeframe for which user group 154-1 is expected to access and/or store resources of the directory 105-1.
(62) At step 206, the storage management device 102 utilizes a trained machine learning model (e.g. as provided by the trends and behaviour module 126) to determine, in real-time, projected storage usage data of the first directory representing a future storage usage for the first directory over a future time period from the current time. The actual past storage data is input into a trained machine learning model (e.g. trends and behaviour module 126) for determining a storage usage trend of the first directory and projecting same to determine a forecasted or projected storage usage demand data up to at least the expected future time.
(63) In at least one aspect, training a machine learning model to provide the trained machine learning model for the trends and behaviour module 126 includes using past space or storage usage training data representative of space usage data points of the first directory and selected ones of the remaining other directories pre-defined as being related to the first directory over a pre-defined past time period.
(64) At step 208, the storage management device 102 determines an aggregated correction coefficient (e.g. via the coordinator and optimizer module 130) characterizing a forecasted demand weighting factor providing an indication of aggregated projected storage usage needs of remaining other directories (e.g. 105-2 . . . 105-N) of the plurality of directories relative to the first directory (e.g. 105-1). For example, such a weighting factor may indicate that several of the directories (e.g. 105-2, 105-3, and 105-4) also have increased forecasted storage demands as compared to the currently allowable storage amount for said directories (e.g. 105-2, 105-3, and 105-4) and therefore, since there is limited available overall storage capacity in the entire distributed file system 104 then each of the directories 105-1, 105-2, 105-3, and 105-4 may only be increased up to a portion of the forecasted storage usage demand data at the expected value (e.g. aggregate correction coefficient of 0.8 assigned to each of 105-1 . . . 105-4).
(65) In one aspect, the weighting by the aggregated correction coefficient is further based upon an obtained value for total disk storage availability of a cluster defined by the plurality of directories of the distributed file system. For example, the total disk storage availability indicates the total amount of disk storage currently available for use by the plurality of directories 105 and is indicative of the degree of possible change between the space quota limit and the new quota limit.
(66) In a further aspect, the aggregated correction coefficient in step 208 is further calculated from first projecting respective storage needs of each of the plurality of directories (e.g. 105-1 . . . 105-N) using the trained machine learning model of the trends and behaviour module 126 to determine a respective projected storage usage data for each of said directories and thereby a respective expected storage usage amount at the expected future time. Subsequently, the aggregated correction coefficient for each of the plurality of directories is determined and indicates a ratio of possible increase or decrease of respective space quota limit for each of the directories 105 based upon the respective expected storage usage amount and the total disk storage availability for all of the plurality of directories 105.
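By way of a non-limiting illustration, the determination of an aggregated correction coefficient from per-directory forecasts and total disk availability can be sketched in Python as follows. The formula used here (total capacity divided by total forecasted demand, capped at 1) is an assumption chosen for illustration and is not stated in the present disclosure; the function names and directory labels are likewise hypothetical.

```python
def aggregated_correction_coefficient(expected_usage, total_capacity):
    """Return one weighting factor shared by all directories.

    expected_usage: dict mapping a directory label to its forecasted
        storage amount at the expected future time.
    total_capacity: total disk storage available to all directories.
    Assumption: the coefficient is capacity / total demand, capped at 1.
    """
    total_demand = sum(expected_usage.values())
    if total_demand <= 0:
        return 1.0  # nothing is forecast, so no scaling is needed
    return min(1.0, total_capacity / total_demand)


def weighted_quotas(expected_usage, total_capacity):
    """Scale each directory's forecasted demand by the shared coefficient."""
    k = aggregated_correction_coefficient(expected_usage, total_capacity)
    return {name: demand * k for name, demand in expected_usage.items()}
```

Under these assumptions, four directories each forecasting 250 units against a total capacity of 800 units receive a coefficient of 0.8, so each is granted a portion (200 units) of its forecasted demand.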
(67) At step 210, the storage management device 102 is configured to determine whether to adjust the space quota limit to a new quota limit in response to an inconsistency (e.g. a difference beyond a defined value) between an expected value of the projected storage usage data at the expected future time and the space quota limit, e.g. as currently allocated to the first directory 105-1. The new space quota limit is calculated for the first directory 105-1 (e.g. via the decision manager module 128) as a function of the expected value (e.g. indicative of forecasted storage demand at the end of the project timeframe requiring use of the first directory 105-1) and weighted by the aggregated correction coefficient. In some aspects, only when the difference between the expected value of the projected storage usage data at the expected future time relative to the space quota limit exceeds a pre-defined threshold amount then the new quota limit is calculated.
(68) For example, when the inconsistency indicates that the space quota limit is insufficient based on the projected storage usage data predicting that the space quota limit will be reached prior to the expected future time needed for accessing the directory, then the machine learning engine 124 increases the space quota limit to the new quota limit.
(69) In yet another example, when the projected storage usage data at the expected future time has a value below the space quota limit by at least a pre-defined amount, then the machine learning engine 124 decreases the space quota limit to the new quota limit. Conveniently, this increases the space available for another directory within the cluster.
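The decision of step 210 may be illustrated by the following minimal Python sketch. It assumes the inconsistency test is a simple absolute difference against a pre-defined threshold and that the new quota limit is the expected value multiplied by the aggregated correction coefficient; the function name, parameter names and returned labels are hypothetical.

```python
def decide_new_quota(expected_value, quota_limit, coefficient, threshold):
    """Return (quota, action) for one directory.

    expected_value: projected storage usage at the expected future time.
    quota_limit: the currently applied space quota limit.
    coefficient: aggregated correction coefficient (assumed multiplicative).
    threshold: minimum difference that counts as an inconsistency.
    """
    if abs(expected_value - quota_limit) <= threshold:
        return quota_limit, "unchanged"  # consistent, keep the current limit
    new_quota = expected_value * coefficient
    if new_quota == quota_limit:
        return quota_limit, "unchanged"
    action = "increase" if new_quota > quota_limit else "decrease"
    return new_quota, action
```

For instance, under these assumptions an expected value of 150 against a limit of 100, with a coefficient of 0.8 and a threshold of 10, leads to an increased limit of about 120, whereas an expected value of 105 leaves the limit unchanged.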
(70) At step 212, the storage management device 102 is configured to apply, e.g. via the disk quota setting module 122, the new quota limit as determined in step 210 to the first directory 105-1 (e.g. as applied to the data repository 106 for subsequent access by the distributed file system 104 or applied in real-time directly to the storage directory 105-1 for immediate enforcement).
(71) In one or more embodiments, operations performed by the storage management device 102 of
(72) Referring to
(73) At step 302, the module 126, operating as a trained machine learning model, obtains actual past storage demand data (e.g. actual demand data 121) for a particular directory (e.g. first directory 105-1). The actual past storage demand data comprises past storage usage data representing storage usage at the first directory 105-1 over a defined time period extending from a past time to the current time. The past storage usage data may be obtained for example by periodically obtaining electronic storage readings (e.g. determining total number of bytes used up by each of the files in the first directory) from the first directory 105-1. In addition, in one aspect, the past storage demand data comprises data indicating a maximum storage amount for the first directory 105-1, the expected future time of the first directory and/or total available storage size for the entire cluster (e.g. all of the directories 105). Additionally, the actual demand data 121 may include other directory 105-1 details as provided by data such as directory details 114, storage history data 116 and/or system settings data 118, including but not limited to: directory name and identification information (e.g. as provided by directory details 114), and date and time information representing the particular future time period desired to forecast (e.g. as provided by system settings data 118).
(74) Preferably, the past storage demand data is continually obtained (e.g. in real-time) such that as more storage usage data points occur for the first directory, they are used as input to the machine learning model of the trends and behaviour module 126, thereby improving the accuracy of the forecasted storage usage demand data.
(75) At step 304, the module 126 calculates an interpolated curve representing a function of the actual past storage usage data extending from the current time to the past time (e.g. using a regression model and/or other machine learning prediction algorithm). In one aspect, the interpolated curve is a smoothing curve calculated using a moving average of a small number of data points and represents the dynamic trend of the disk usage for the first directory.
(76) At step 306, the module 126 calculates a moving average curve of the interpolated curve of step 304. The moving average curve is calculated using a time window with a pre-specified number of points and based on the interpolated curve. The moving average curve provides a smoothing operation such that the greater the number of points the smoother the curve.
(77) At step 308, two slopes (or rates of change) are calculated from the interpolated curve and the moving average of the interpolated curve respectively. A first derivative of the interpolated curve (dU/dt) is calculated, defining a first slope indicating a rate of change of the projected storage usage over time, where U is the disk usage. Additionally, a first derivative of the moving average of the interpolated curve (dUavg/dt) is calculated, defining a second slope indicating an average rate of change of the projected storage usage over time.
(78) At step 310, the trends and behaviour module 126 determines the projected storage usage data of the first directory (e.g. at the expected future time defining the duration of the expected use of the first directory) as a function of the first derivative of the interpolated curve and the first derivative of the moving average.
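Steps 304 to 310 may be illustrated by the following simplified Python sketch. It is not the trained machine learning model of the disclosure: it assumes a short trailing moving average as the interpolated curve, a longer trailing moving average as the moving average curve, finite differences as the first derivatives, and linear extension from the most recent usage reading. All names and window sizes are hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; early points use the samples seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def latest_slope(times, values):
    """Finite-difference first derivative at the most recent sample.

    Requires at least two samples with distinct times.
    """
    return (values[-1] - values[-2]) / (times[-1] - times[-2])


def project_usage(times, usage, future_time, short_window=3, long_window=7):
    """Return (first, second, average) projected usage at future_time."""
    # Step 304: lightly smoothed ("interpolated") curve of actual usage.
    interpolated = moving_average(usage, short_window)
    # Step 306: moving average of the interpolated curve.
    averaged = moving_average(interpolated, long_window)
    # Step 308: the two rates of change, dU/dt and dUavg/dt.
    du_dt = latest_slope(times, interpolated)
    duavg_dt = latest_slope(times, averaged)
    # Step 310: extend each slope from the latest reading (an assumption).
    horizon = future_time - times[-1]
    first = usage[-1] + du_dt * horizon
    second = usage[-1] + duavg_dt * horizon
    return first, second, (first + second) / 2
```

The third returned value corresponds to the aspect in which the two projected amounts are averaged; any weighting by the aggregated correction coefficient would be applied afterwards.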
(79) At step 312, the trends and behaviour module 126 determines whether the projected storage usage data at the expected future time is inconsistent with the allocated maximum storage amount for the first directory. If inconsistent, i.e. the projection of dU/dt reaches a pre-defined threshold (e.g. the allocated maximum storage amount for the first directory) before the end of the project timeframe, the trends and behaviour module 126 raises a flag as feedback to the decision manager module 128 to instruct an adjustment of the size of the allocated storage for the first directory.
(80) If the projected amounts from step 310 do not reach the pre-defined threshold (e.g. the allocated maximum storage amount for the first directory) before the end of the project timeframe defining the maximum timeframe, then no flag is raised and the maximum allowable storage amount (e.g. space quota limit) is not adjusted. In one aspect, the prediction indicates increasing the space quota limit to the new quota limit when the space quota limit is insufficient based on the projected storage usage data indicating that the space quota limit will be reached prior to the expected future time.
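The flagging condition of step 312 may be sketched in Python as follows, assuming usage continues at a constant rate of change from the current reading. The helper names are hypothetical, and the constant-slope extrapolation is an illustrative simplification rather than the method of the disclosure.

```python
def time_to_reach(current_usage, slope, limit):
    """Time until usage reaches the limit at a constant slope.

    Returns 0.0 if the limit is already reached, and None if usage is
    flat or shrinking and will therefore never reach the limit.
    """
    if current_usage >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current_usage) / slope


def should_flag(current_usage, slope, limit, time_remaining):
    """True when the limit would be reached before the timeframe ends."""
    eta = time_to_reach(current_usage, slope, limit)
    return eta is not None and eta < time_remaining
```

For example, with 60 units used, a limit of 100 units and 10 time units remaining, a slope of 5 units per time unit raises the flag (the limit would be reached after 8 time units), whereas a slope of 2 does not.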
(81) In one aspect, the calculated first derivative of the interpolated curve (step 308) is used to project a first expected storage usage amount at the expected future time in the future time period and the calculated first derivative of the moving average (step 308) is used to project a second expected storage usage amount at the expected future time in the future time period, and the new quota limit provided in step 312 is an average of the first and the second expected storage usage amount further weighted by the aggregated correction coefficient (see step 210 of
(82) In at least one aspect, the space quota limit and the new quota limit provide different values for restricting a maximum number of bytes of disk space allowed to be used by files under a tree rooted at the first directory 105-1 for respectively the current time and the expected future time (e.g. future time 30 in
(83) Any suitable machine learning model may be used for the purposes described herein (e.g. for one or more modules of the machine learning engine 124), including any existing machine learning models known to those skilled in the relevant arts or any suitable yet to be developed machine learning model. In some embodiments, the machine learning model is a supervised regression model such as a support vector regression (SVR) model. In other embodiments, the machine learning model is a neural network (NN) architecture such as a convolutional neural network (CNN), or recurrent neural network (RNN) including for example, a long short-term memory (LSTM) model.
(84) In one exemplary aspect, the interpolated curve of the actual past storage usage data extending from a current time to a past time as provided by the trends and behaviour module 126 is modelled as a polynomial regression. Furthermore, in one aspect, linear regression of a first derivative of the interpolated curve (e.g. calculating the first derivative of the interpolated curve and/or the first derivative of the moving average of the interpolated curve of the first directory) is performed by the trends and behaviour module 126 and utilized to predict an estimated projected storage usage data of the first directory.
(85) Additionally, in at least one aspect, the coordinator and optimizer module 130 utilizes a supervised neural network to model the expected project demands of the directories 105 such as to determine the aggregated correction coefficient, as described herein.
(86) Further example flowcharts of the various operation of the machine learning engine 124 of
(87) Referring to
(88) At block 408, the trends and behaviour module 126, computes a rate of change of storage usage 23 (see
(89) For further detection of the trend, at block 410, a moving average curve 24 is determined from the interpolated function 22 of the curve. The moving average curve 24 further smooths the trend of the actual disk usage, and also provides a base of reference to compare it against the most recent rate of change dU/dt.
(90) Based on the moving average of the interpolated curve 24, at block 410, the trends and behaviour module 126 computes a rate of change of the moving average curve 25, or dUavg/dt. This rate of change provides a base of reference of the general trend of the disk usage.
(91) The next step performed in block 410 by the trends and behaviour module 126 is to compare the rate of change of the moving average dUavg/dt 25, with a recent or instantaneous rate of change of the interpolated curve dU/dt—also known as rate of change of storage usage 23 and with the existing space allocated quotas (e.g. space quota limit 27 and/or space quota threshold 26). In at least some aspects, the new space quota limit 33 may represent the adjusted version of one or both of the space quota limit 27 and the space quota threshold 26. In at least some aspects, the space quota threshold 26 in
(92) At block 410, the comparison also includes additional consideration to the expected time allocated to a project for the particular directory e.g. project storage directory 105 of
(93) In one example, at block 412, there may be no need for the trends and behaviour module 126 to emit any event or alarm, or to take any further action. The module 126 will at block 414 simply wait for the next trigger to execute the operation again at block 406.
(94) Referring again to
(95) For a base reference, the moving average of the curve 24 is computed and also the rate of change of the moving average 25 (e.g. slope) is computed by the trends and behaviour module 126 at block 410.
(96) The trends and behaviour module 126 analyzes all of the computed values (e.g. perform operations 406, 408 and 410) discussed above. If the predicted storage usage demand defining a predicted quota is higher or lower than the currently set pre-defined quota limit as computed in block 416, then the decision manager module and the coordinator and optimizer module are flagged at block 418. For example, while the dUavg/dt value has a smaller increase compared with the first example described previously and illustrated by
(97) Notably, in
(98) In the example depicted in
(99) Based on the newly adjusted and recommended value of the space quota (e.g. depicted as new space quota limit 33), the decision manager module 128 will adjust the quota to the new space quota limit 33 for the particular directory 105-1.
(100) The exact value of the new space quota limit 33 characterizing the adjustment can be a function of, e.g. a value between, the predicted first time 28 value (predicted by the forecasted trend of the rate of change of storage usage curve 23 calculated as dU/dt) and the predicted second time 29 value (predicted by the forecasted trend of the rate of change of moving average of storage usage curve 25, dUavg/dt). This depends on the initial settings of the system and, to a certain degree, the self-learning of the system.
(101) For example, referring to
(102) In the case of
(103) The decision manager module 128 will make a decision of whether to automatically adjust (or not) space quota limits 27 of the workspace, as triggered by the trends and behaviour module 126. If there are no other constraints, e.g. total availability of the disk space at the cluster level, or other workspaces having spike increases in the disk usage at the same time and competing for the shared resources of the cluster, then the decision manager module 128 will adjust the quota to a new value (e.g. the new space quota limit 33).
(104) Preferably, the weight coefficient (also referred to as the k-factor) of the new value of the quota (e.g. for a particular directory 105-1) is either preset in the data repository 106 of
(105) The decision manager module 128, in one embodiment, and as depicted in block 422 of
(106) In the embodiment of
(107) Conveniently, in at least some aspects, this approach is advantageous in distributed file systems as the number of managed workspaces is large, for example in a large enterprise, or when the development dynamics of each workspace changes fast, and thus the method of computing and applying new storage quota limits (e.g. new space quota limit 33 as applied in block 424 of
(108) Further conveniently, another exemplary advantage of the currently disclosed systems and methods, in at least one aspect, is preventing projects associated with one or more directories of the distributed file system from unexpectedly reaching their allocated storage quotas, which would halt further storage operations on those directories and could disrupt the business process of that project.
(109) Referring to block 426 of
(110)
(111) Subsequently, at block 514, the coordinator and optimizer module 130 computes a new weight coefficient (also referred to as an aggregated correction coefficient) to be applied to the proposed limit at block 516, e.g. the new space quota limit depending on the current and predicted state of all other distributed file system directories 105 relative to the available storage.
(112) Based on the new space quota limit adjusted by the aggregated correction coefficient, the decision manager module 128 applies, in block 518, the new space quota limit on the affected directory (e.g. 105-1, 105-2, . . . ), and then at block 520, it notifies the emailer module 142 and the messenger module 144 to inform the directory's user group 154 and/or client device 110 about the change.
(113) One feature of the modules of the machine learning engine 124 shown in
(114) Referring again to
(115) As shown in block 522, a number of messages may be sent to the user group 154. If no acknowledgement is received back from the respective user group 154 before an alarm count threshold is triggered at block 524, the decision manager module 128 removes the applied quota on the affected directory 105 and releases the potentially locked space back to the cluster.
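The message and alarm-count behaviour of blocks 522 and 524 may be illustrated by the following Python sketch. The callables `send_message` and `acknowledged`, the returned labels, and the choice to check for an acknowledgement after each message are assumptions made for illustration only.

```python
def notify_until_acknowledged(send_message, acknowledged, alarm_count_threshold):
    """Send up to alarm_count_threshold messages, stopping on acknowledgement.

    send_message: hypothetical callable taking the attempt number.
    acknowledged: hypothetical callable returning True once the user group
        has acknowledged the change.
    Returns "acknowledged", or "remove_quota" when the threshold is reached
    without any acknowledgement.
    """
    for attempt in range(1, alarm_count_threshold + 1):
        send_message(attempt)
        if acknowledged():
            return "acknowledged"
    # Threshold reached: the caller would remove the applied quota and
    # release the space back to the cluster.
    return "remove_quota"
```

In this sketch the caller, standing in for the decision manager module 128, acts on the returned label; the messaging channel itself is left abstract.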
(116) In at least some aspects, the coordinator and optimizer module 130 may have the role of correlating the findings of the trends and behaviour module 126 across all the workspaces in the project storage directories 105 configured in the cluster, correlating also with the total amount of available storage resources of the cluster, and correlating any action suggested by the trends and behaviour module 126 with the current and forecasted status of all other workspaces (e.g. storage directories 105) of the cluster, before the decision of an action is taken by the decision manager module 128. Conveniently, in this aspect, the decision to adjust the quota on a particular workspace or directory 105 is not taken independently, based only on the stats and trends of that particular workspace, but in conjunction with the forecasted storage trends and current status of all workspaces and the cluster itself. As a result, the forecasted new space quota limit accurately reflects upcoming demands for the distributed file system 104 as a whole.
(117) In at least some aspects, the value of the computed aggregated correction coefficient as calculated by the coordinator and optimizer module is based upon the total aggregated trend of the space usage of the whole cluster (e.g. distributed file system 104) and it can be interpreted as a prediction factor for the trend and behaviour of the whole cluster provided by the distributed file system 104. Put another way, the aggregated correction coefficient may characterize the total aggregated influence of the current and forecasted trends and behaviours of all the other configured directories (e.g. 105-2 . . . 105-N) on a single directory (e.g. 105-1), as computed by the machine learning engine 124. In yet a further aspect, this aggregated correction coefficient characterizes an adjustment factor, before the decision manager module 128 sets the new space quota limit 33, by adjusting the trend and behavior of each individual directory as computed by the module 126, with this aggregated correction coefficient factor.
(118) The aggregated correction coefficient as applied by the decision manager module 128 to a proposed new space quota limit (e.g. as computed by the trends and behaviour module 126) provides an individual adjustment for a particular directory (e.g. 105-1) and ensures the overall trend and behaviour of the whole cluster (e.g. all of the remaining directories 105-2 . . . 105-N) is taken into consideration, and thereby preferably avoids an unexpected or premature reaching of the quota limits that would otherwise happen if only individual decisions were taken.
(119) In one embodiment, the storage size of each workspace associated with each directory (105-1 . . . 105-N) is characterized as a function of time, e.g. ƒ.sub.1(t), based on the current quota usage (e.g. actual past storage usage data) and the rate of change of quota usage (see equation (1) below). The coordinator and optimizer module 130 is configured to track and monitor the expected project demands of each of the directories 105, considered individually and as a whole. It determines the aggregated correction coefficient (also referred to as a global coefficient of adjustment, used to adjust the projected storage amount) as well as the adjustment coefficients for the rate of change of directory usage and for the rate of change of the moving average of the interpolated curve (also referred to as k-factors or weight coefficients).
(120) The coordinator and optimizer module 130 calculates a storage size as a function of time, ƒ(t), for every project in directories 105-1, 105-2, . . . , 105-N. The storage size as a function of time can be approximated as a linear function. An example of computing the storage size ƒ(t) as a function of time for a first directory (e.g. 105-1) is shown in equation (1).
ƒ.sub.1(t)=A.sub.1t+B.sub.1 (1)
(121) Where
(122)
and k.sub.1+k.sub.2=1. k.sub.1 and k.sub.2 are the adjustment coefficients for the rate of change curves (e.g. to locate a median curve between the curves calculated for
(123)
are respectively the first derivative of the interpolated curve and the first derivative of the moving average of the interpolated curve of the first directory 105-1. The storage size ƒ(t) curve may be representative of the adjusted curve 34 shown in
(124) In one example, the adjustment coefficients, k.sub.1 and k.sub.2, may each be set to a value of 0.5. The decision manager module 128 may also be configured to adjust the adjustment coefficients further depending on the volatility of the first directory.
(125) For example, the decision manager module 128 may set k.sub.1 to a value close to 1 when there is a volatility in the storage demand of the project whereas k.sub.2 may be set close to 1 when there is indication of stability and steadiness in the storage operation of the first directory.
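By way of a non-limiting illustration, the role of the adjustment coefficients can be sketched in Python. The expression for A.sub.1 is not reproduced in this text, so the sketch assumes that A.sub.1 is the k-weighted blend of the two first derivatives named above, and it estimates each derivative by a finite difference over the last two samples. The function names, the trailing-window moving average, and the sample values are hypothetical and are not taken from the specification.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early points average over the samples available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def blended_slope(times: Sequence[float], usage: Sequence[float],
                  k1: float = 0.5, window: int = 3) -> float:
    """Assumed form: A = k1 * d(curve)/dt + k2 * d(moving average)/dt, with k1 + k2 = 1."""
    if len(times) != len(usage) or len(times) < 2:
        raise ValueError("need at least two aligned samples")
    if not 0.0 <= k1 <= 1.0:
        raise ValueError("k1 must lie in [0, 1]")
    k2 = 1.0 - k1
    dt = times[-1] - times[-2]
    avg = moving_average(usage, window)
    d_curve = (usage[-1] - usage[-2]) / dt   # finite-difference slope of the usage curve
    d_avg = (avg[-1] - avg[-2]) / dt         # finite-difference slope of its moving average
    return k1 * d_curve + k2 * d_avg


def project_usage(slope: float, current_usage: float, horizon: float) -> float:
    """Linear projection in the form of equation (1): f(t) = A*t + B."""
    return slope * horizon + current_usage
```

Under these assumptions, setting k1 close to 1 lets the most recent change dominate the projected slope, while setting it close to 0 (k.sub.2 close to 1) follows the smoother moving average.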
(126) As mentioned previously, the system settings data 118 stores each project's storage quota (for example shown as space quota limit 27 n
(127)
(128) Where G.sub.ka is the global adjustment coefficient (also referred to as the aggregate correction coefficient) provided by the coordinator and optimizer module 130 used to adjust a predicted space quota limit to define the new space quota limit. Also, T.sub.D1 is the project timeframe (e.g. expected future time 30 shown in
(129) However, Q is frequently monitored and calculated through the disk usage monitoring module 120, which is triggered by instructions from the scheduler 132; therefore, equation (2) is updated to keep a record of time as shown in equation (3).
(130)
(131) The coordinator and optimizer module 130, communicating with modules 126 and 128, computes Q.sub.1x, which represents the predicted storage quota of directory 105-1. The value of Q.sub.1x can be greater than Q.sub.1, equal to Q.sub.1, or less than Q.sub.1, where Q.sub.1 is the storage quota allocated to the workspace in directory 105-1 and stored in system settings data 118.
(132) The decision manager module 128 compares the value of Q.sub.1x to the existing storage quota Q.sub.1 (also referred to as space quota limit 27). If Q.sub.1x is equal to Q.sub.1, the decision manager module 128 takes no action. However, if Q.sub.1x is smaller than Q.sub.1, it reduces the storage quota Q.sub.1 and releases the freed-up quota to the cluster pool. When Q.sub.1x is greater than Q.sub.1, the decision manager module 128 allocates more storage quota to Q.sub.1 from the free cluster pool (e.g. allocated to all directories 105 as a whole). The decision manager module 128 also updates the system settings data 118 accordingly via the disk quota setting module 122 when there are changes.
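The three-way comparison described above may be sketched as follows. This is a minimal illustration only: the `QuotaDecision` type and the `decide_quota` function are hypothetical names, and capping an increase at the size of the free cluster pool is an assumption that the specification does not state.

```python
from dataclasses import dataclass


@dataclass
class QuotaDecision:
    action: str        # "none", "reduce", or "increase"
    new_quota: float   # quota to record in the system settings
    pool_delta: float  # positive: returned to the pool; negative: drawn from it


def decide_quota(current_quota: float, predicted_quota: float,
                 free_pool: float) -> QuotaDecision:
    """Compare the predicted quota (Q_1x) with the existing quota (Q_1)."""
    if predicted_quota == current_quota:
        return QuotaDecision("none", current_quota, 0.0)
    if predicted_quota < current_quota:
        # Shrink the quota and release the difference to the cluster pool.
        return QuotaDecision("reduce", predicted_quota,
                             current_quota - predicted_quota)
    # Assumption: an increase cannot exceed what the free pool holds.
    granted = min(predicted_quota - current_quota, free_pool)
    return QuotaDecision("increase", current_quota + granted, -granted)
```

For example, with an existing quota of 100 units, a predicted quota of 130 units and 50 units free in the pool, the sketch raises the quota to 130 and draws 30 units from the pool.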
(133) As a result of the aforementioned quota adjustment, in at least one aspect, the distributed file system 104 is modelled in the trends and behaviour module 126 as a polymorphic matrix referred to as the system matrix and represented as:
(134)
(135) Throughout the operation of updating quotas, the machine learning engine 124, through communication with the disk usage monitoring module 120, calculates the free storage space of the system using equation (3) each time the disk usage monitoring module 120 is executed.
S.sub.x=S.sub.total−S.sub.margin−(Q.sub.1x+Q.sub.2x+ . . . +Q.sub.nx) (3)
(136) Where S.sub.total is the total storage capacity of the system 104, S.sub.margin is the minimum amount of storage for the system to operate properly, and Q.sub.1x+Q.sub.2x+ . . . +Q.sub.nx is the sum of all storage quotas of all projects 105-1, 105-2, . . . 105-N.
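As a simple worked illustration of the free-space calculation, the following Python sketch subtracts the operating margin and the sum of all project quotas from the total capacity. The function name and the numeric values are hypothetical.

```python
from typing import Iterable


def free_storage(total_capacity: float, margin: float,
                 quotas: Iterable[float]) -> float:
    """S_x = S_total - S_margin - (Q_1x + Q_2x + ... + Q_nx)."""
    return total_capacity - margin - sum(quotas)


# Hypothetical cluster: 1000 units of capacity, 100 units reserved as margin,
# and three projects holding quotas of 200, 300 and 150 units.
remaining = free_storage(1000.0, 100.0, [200.0, 300.0, 150.0])
```

With these example values, 250 units remain unallocated.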
(137) Since each project has its own project timeframe (e.g. expected future time 30 shown in
T.sub.D=[T.sub.Dix,T.sub.Djx,T.sub.Dkx, . . . ,T.sub.Drx,T.sub.Dsx].
(138) The trends and behaviour module 126 evaluates, at each trigger of disk usage monitoring module 120, the system matrix based on the most immediate timeframe T.sub.Dix for every function in the system matrix. There is a function for every project in the system. The machine learning engine 124 not only predicts the necessary quota allocation of project i, corresponding to T.sub.Dix, but also correlates the impact of all predictions on the whole system and adjusts the actual prediction for each individual project based on the overall dynamics of the system 100 as a whole.
(139) Therefore, the decision manager module 128 considers the following system matrix when updating the storage quota of the projects.
(140)
(141) The workspace of project i will be dropped at the end of the lifecycle T.sub.Di, and all its allocated storage Q.sub.i is released back to the cluster pool. The released space is not computed anymore, and its parameters are removed from the system matrix and from the sorted array of T.sub.D.
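The bookkeeping described above, namely keeping the project timeframes in sorted order and removing a project once its lifecycle ends, may be sketched as follows. The dictionary-based representation and the function names are assumptions made for illustration; the specification does not prescribe a data structure.

```python
from typing import Dict, List, Tuple


def sorted_timeframes(deadlines: Dict[str, float]) -> List[Tuple[float, str]]:
    """Return (timeframe, project) pairs ordered from most to least immediate."""
    return sorted((t, name) for name, t in deadlines.items())


def drop_expired(quotas: Dict[str, float], deadlines: Dict[str, float],
                 now: float) -> Tuple[Dict[str, float], Dict[str, float], float]:
    """Remove projects whose timeframe has ended and total their released quota."""
    expired = {name for name, t in deadlines.items() if t <= now}
    released = sum(quotas[name] for name in expired)
    remaining_quotas = {n: q for n, q in quotas.items() if n not in expired}
    remaining_deadlines = {n: t for n, t in deadlines.items() if n not in expired}
    return remaining_quotas, remaining_deadlines, released
```

After a call to `drop_expired`, the released amount can be added back to the cluster pool, and `sorted_timeframes` applied to the remaining deadlines yields the next most immediate timeframe.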
(142) In at least one aspect, once the decision manager module 128 issues instructions to release quota of project i to the cluster pool, the disk usage module 120 communicates with disk quota setting module 122 to update the available distributed file system 104 storage. Subsequently, in said aspects the trends and behaviour module 126 computes the system matrix using the next immediate timeframe, T.sub.Dix, as presented in the following matrix.
(143)
(144) It is understood that the results of the function g.sub.n(T.sub.Dx) could be very different at each iteration of the system matrix at any given time.
(145) The coordinator and optimizer module 130 also computes the storage quota at cluster level and is responsible for calculating the global coefficient of adjustment G.sub.ka (also referred to as the aggregated correction coefficient) used in computing storage quota in equation (3) above. The coordinator and optimizer module 130 first calculates the total storage allocation, Q.sub.total, at the cluster level, at a particular point in time, t.sub.x, whenever the disk usage module 120 is triggered by the scheduler 132. Q.sub.total is calculated as in equation (4).
Q.sub.total(tx)=(Q.sub.1x+Q.sub.2x+ . . . +Q.sub.nx) (4)
(146) The coordinator and optimizer module 130 also calculates the total storage usage, S.sub.total, at the cluster level at any particular point in time, t.sub.x, as in equation (5).
S.sub.total(tx)=(B.sub.1x+B.sub.2x+ . . . +B.sub.nx) (5)
(147) As a result, the maximum allocation possible in the cluster at any point in time, Q.sub.free_max(tx), is calculated as in equation (6).
Q.sub.free_max(tx)=S.sub.total−S.sub.margin (6)
(148) Afterwards, the coordinator and optimizer module 130 computes the storage quota prediction function at the cluster level using the weight coefficients, w, as in equation (7), knowing that Q.sub.predicted(T.sub.Djx)=F.sub.cluster(T.sub.Djx).
F.sub.cluster(T.sub.Djx)=(w.sub.1A.sub.1+w.sub.2A.sub.2+ . . . +w.sub.nA.sub.n)T.sub.Djx+S.sub.total(T.sub.
(149) Where w.sub.r is the weight coefficient of the rate of change for a particular workspace r and is calculated using allocated storage quotas as in equation (8).
(150)
(151) In another embodiment, the coordinator and optimizer module 130 calculates w.sub.r using actual storage usage as in equation (9).
(152)
(153) When the coordinator and optimizer module 130 finds Q.sub.predicted(T.sub.Djx)≤Q.sub.free_max(T.sub.Djx), it decides that no global coefficient of adjustment is necessary and assigns a value of 1 to G.sub.ka. Otherwise, the global coefficient of adjustment needs to compensate A.sub.r of each workspace of directory 105-r such that:
G.sub.ka(w.sub.1A.sub.1+w.sub.2A.sub.2+ . . . +w.sub.nA.sub.n)T.sub.Djx+S.sub.total_usage(T.sub.
(154) Based on equation (10), the coordinator and optimizer module 130 calculates the global coefficient of adjustment as in equation (11).
(155)
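One possible reading of the cluster-level calculation is sketched below in Python. Equations (8), (10) and (11) are not reproduced in this text, so the sketch rests on two labeled assumptions: each weight w.sub.r is taken to be the workspace's share of the total allocated quota, and the compensated prediction is taken to equal Q.sub.free_max, which gives G.sub.ka as the remaining headroom divided by the uncompensated growth. The function names and the clamping of the result at zero are likewise hypothetical.

```python
from typing import List, Sequence


def quota_weights(quotas: Sequence[float]) -> List[float]:
    """Assumed form of w_r: each workspace's share of the total allocated quota."""
    total = sum(quotas)
    if total <= 0:
        raise ValueError("total quota must be positive")
    return [q / total for q in quotas]


def global_adjustment(slopes: Sequence[float], weights: Sequence[float],
                      horizon: float, total_usage: float,
                      free_max: float) -> float:
    """Return G_ka: 1.0 when the prediction fits, otherwise a scaling factor."""
    growth = sum(w * a for w, a in zip(weights, slopes)) * horizon
    predicted = growth + total_usage
    if predicted <= free_max:
        return 1.0  # no global adjustment is necessary
    if growth <= 0:
        raise ValueError("usage already exceeds the limit; scaling growth cannot help")
    # Assumption: solve G_ka * growth + total_usage = free_max for G_ka,
    # clamped at zero when usage alone already exceeds the limit.
    return max(0.0, (free_max - total_usage) / growth)
```

Under these assumptions, a cluster whose weighted growth would overshoot the maximum allocation receives a coefficient below 1, which proportionally damps the rate of change A.sub.r of every workspace.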
(156) In another embodiment, the coordinator and optimizer module 130 calculates G.sub.ka from the overall dynamics of the cluster, A.sub.cluster, obtained by monitoring the Hadoop HDFS root, as in equation (12). The HDFS root can be configured just like any other workspace in directories 105-1, 105-2, . . . , 105-N.
F.sub.cluster(T.sub.Djx)=A.sub.clusterT.sub.Djx+S.sub.total_usage(T.sub.
(157) Similar to equation (10), when Q.sub.predicted(T.sub.
G.sub.kaA.sub.clusterT.sub.Djx+S.sub.total_usage(T.sub.
(158) The coordinator and optimizer module 130 calculates G.sub.ka as in equation (14).
(159)
(160) In one aspect, the aggregated correction coefficient G.sub.ka (also referred to as the global coefficient of adjustment) would be further adjusted by a workspace or project directory (e.g. 105) specific correction factor based on the dynamics of each workspace including for example, the volatility of the storage specific to each workspace or project directory (e.g. 105).
(161) The coordinator and optimizer module 130 can also adjust the global coefficient of adjustment (also referred to as the aggregate correction coefficient) based on the determined volatility of the workspace of each project 105, and/or by running regression scans on the storage usage history of each workspace by communicating with data repository 106.
(162) While this specification contains many specifics, these should not be construed as limitations, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
(163) Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
(164) Various embodiments have been described herein with reference to the accompanying drawings. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the broader scope of the disclosed embodiments as set forth in the claims that follow. Further, other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of one or more embodiments of the present disclosure. It is intended, therefore, that this disclosure and the examples herein be considered as exemplary only, with a true scope and spirit of the disclosed embodiments being indicated by the following listing of exemplary claims.