GENERAL DESCRIPTION LANGUAGE DATA SYSTEM FOR DIRECTED ACYCLIC GRAPH AUTOMATIC TASK FLOW
20230030393 · 2023-02-02
Assignee
Inventors
- Ziqi JIANG (Guangdong, CN)
- Shuhao Wen (Guangdong, CN)
- Liang TAN (Guangdong, CN)
- Yang LIU (Guangdong, CN)
- Jian Ma (Guangdong, CN)
- Shanshan FAN (Guangdong, CN)
- Lipeng Lai (Guangdong, CN)
CPC classification
G06Q10/06 (PHYSICS)
Abstract
The present invention provides a general description language data system for directed acyclic graph automatic task flow, including a Step definition layer, a Workflow definition layer and a Template definition layer. The Step definition layer is the description of a single task: the input and output declarations of each docker image or other executor, comprising name, type, file, and parameters. The Workflow definition layer is a workflow composed of one or more Steps; the dependency topology of these Steps needs to be defined, and shared parameters can also be defined. The Template definition layer is based on a Workflow definition layer; it pre-sets the parameters and supplies the descriptions, checkers, or data source definitions of the parameters. The data center of the present invention is used together with a task execution tool, and a programming language needs to be used to implement the corresponding tool.
Claims
1. A general description language data system for directed acyclic graph automatic task flow, comprising a Step definition layer, a Workflow definition layer and a Template definition layer; the Step definition layer is the description of a single task, namely the input and output declarations of each docker image or other executor, comprising name, type, file, and parameters; the Workflow definition layer is a workflow composed of one or more Steps, whose dependency topology needs to be defined and in which shared parameters can also be defined; the Template definition layer is based on a Workflow definition layer, pre-sets the parameters, and supplies the descriptions of the parameters, the checkers of the parameters or the data source definitions of the parameters.
2. The system according to claim 1, further comprising a TypeDef definition layer, which abstracts the definitions of general or complex composite types.
3. A parsing method using the general description language data system for directed acyclic graph automatic task flow according to claim 1, the method comprising the following steps: recursive analysis: pull the input files and all files that the input files depend on from the data center to the local machine; the parser then recursively traverses each value of each input file; if a value is an external link beginning with ^, download the corresponding file of the link through the data center Client-side and repeat this step for the new file until all dependent links are ready; syntax tree analysis: based on the priority and coverage relationship that the values of each definition layer have, construct and apply the coverage layer by layer from the bottom layer; parse the Template definition layer, traverse the specific variables and values in the Template definition layer, and index to a certain input or output value of a Step in the Workflow definition layer for the override operation; object loading: obtain an object tree in which the workflow object is the root node, the workflow object contains all the Step objects through the steps property, and the Step objects contain all the TypeDef objects through the inputs/outputs properties.
4. The method according to claim 3, further comprising the following steps: first step, parse the type definition files whose class is TypeDef; all objects of the TypeDef definition layer are constructed from the contents of the files and stored in memory as a K:V mapping; second step, construct the Step objects by parsing all files whose class is Step; the inputs/outputs properties of the Step objects contain several TypeDef objects; if a Step applies a variable of a custom type, take the loaded object from the TypeDef K:V mapping to replace the object in the Step, and perform the value override operation; third step, construct the Workflow object by parsing the file whose class is Workflow; the steps property of the Workflow object contains all the Steps involved in the workflow, stored as a StepName: StepObject mapping; the Workflow object fetches all its dependent Step objects from the Step definition layer, stores them in its own steps property, and overrides the values in the Step objects with the values in the Workflow definition layer according to the content of the file.
5. The method according to claim 4, further comprising the following steps: the parser topologically sorts the Workflow objects; the topological sorting algorithm comprises the following steps: Step A: build a FollowMap through the ref links marked in the inputs of each Step, the FollowMap being a mapping of <depended-on Step: list of dependent Steps>; Step B: after getting the FollowMap, invert the mapping of the FollowMap to get the LeaderMap, which is the mapping of <stepName: list of Steps on which this Step depends>; Step C: introduce the concept of Distance, abbreviated as Dis in the flowchart, which means the dependency distance before a Step can be run, the default being 1; Step D: traverse all Step objects; if a Step object has not been checked, traverse the LeaderSteps of the Step object; if a Step object does not have a Leader, the Step object has no dependencies, it is deemed to have been checked, and its Dis is set to 1; if a Step object does have a Leader, the Step object is dependent, and the Dis of this Step is increased by the Dis of its LeaderSteps.
6. The method according to claim 3, wherein the recursive analysis comprises the following steps: the type of the user input is obtained and checked for consistency with the type declared by the type keyword; when the input type is inconsistent, an attempt is made to cast the data input by the user to the declared type; if the conversion fails, a type error is thrown; the specific cases are: (1) if the declaration is the str string type and the user input data is 123 of the int integer type: a. check the type, int and str are inconsistent; b. attempt to cast: the integer 123 can be converted to the string "123"; c. the input check passes and a warning about the type conversion is thrown; (2) if the declaration is the int integer type and the user input data is 123 of the int integer type: a. check the type, int and int are consistent; b. the check passes; (3) if the declaration is the int integer type and the user input data is abc of the str string type: a. check the type, int and str are inconsistent; b. attempt to cast: the string abc cannot be converted to an integer; c. the check fails and an error is thrown: the type check fails.
7. The method according to claim 3, wherein the Step definition method comprises the following steps: data center: the data center is a simple C/S architecture service, in which the Server-side database manages the index and the file system manages the specific data content; the Client-side performs simple parsing, uploading, and downloading; upload workflow: the user submits a description language file to the Client-side, the Client-side reads the file content, obtains the specific type, name, and version parameters by analyzing the class, name, and version fields, then requests the Server-side with the file content; download workflow: the user accesses the Server-side carrying the type, name, and version parameters; the Server-side indexes the database with the corresponding parameters and returns a NotFound error if no result exists; if a result exists, it obtains the specific file address, accesses the file system through the file address, obtains the file content, and returns the result to the Client-side.
8. The method according to claim 7, wherein, in the upload workflow, the Server-side indexes the database with the corresponding parameters; if a file with the same type, name, and version already exists, a parameter check failure is returned; if it does not exist, a new file address is generated and the detailed information is added to the database; the Server-side then accesses the file system to store the file at the new file address and returns the result to the Client-side.
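The check-and-cast logic recited in claim 6 can be sketched in Python as follows. This is an illustrative sketch only: the helper name `check_and_cast`, the type-name table, and the `(value, warning)` return convention are assumptions, not part of the claimed system.

```python
def check_and_cast(declared_type, value):
    """Check a user input against the declared type; attempt a cast on mismatch.

    Returns (value, warning): warning is None when the types are consistent,
    or a conversion notice when a cast succeeded. Raises TypeError when the
    cast fails, mirroring cases (1)-(3) of claim 6.
    """
    py_types = {"int": int, "str": str, "double": float, "boolean": bool}
    target = py_types[declared_type]
    if isinstance(value, target):
        return value, None  # types consistent: check passes directly
    try:
        # types inconsistent: attempt to force the input to the declared type
        return target(value), f"cast {type(value).__name__} -> {declared_type}"
    except (TypeError, ValueError):
        raise TypeError(f"type check failed: cannot cast {value!r} to {declared_type}")
```

For example, `check_and_cast("str", 123)` succeeds with a warning, while `check_and_cast("int", "abc")` raises the type error described in case (3).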
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
Embodiment 1
[0071] The TypeDef definition layer is not necessary. If users need to use special custom types, they need to write the TypeDef definition layer. The TypeDef definition layer mainly abstracts the definitions of general or complex composite types for easy reference and management; the Step definition layer is the description of a single task, and for the input and output declarations of a docker image or other executor, it is necessary to specifically declare the name, type, file, and parameters of each input and output item; the Workflow definition layer is a workflow composed of one or more Steps. The dependency topology of these Steps needs to be defined, and shared parameters can also be defined; the Template definition layer is based on a Workflow definition layer. The Template definition layer pre-sets the parameters, and supplies the descriptions of the parameters, the checkers of the parameters or the data source definitions of the parameters. A Template definition layer must declare a unique Workflow definition layer; references among layers are implemented through urls, for example (type: ^typedef/common/version/1.0.0) indicates that the type of the variable introduces the TypeDef definition layer named common, version 1.0.0.
Embodiment 2
[0072] The xwlVersion describes the version of the description language, used to distinguish version iterations brought about by the continuous addition of functions; class describes the type of this file, of which there are four (TypeDef definition layer, Step definition layer, Workflow definition layer, Template definition layer); version describes the version of the definition; author describes the author's information; doc describes the annotations for the file; name describes the name of the file, and the author needs to keep the name unique when writing files of the same type.
[0073] A substructure named TypeAndValueDefine is defined in the description language, which contains the type, name, value, and several properties, which are used to define a variable in detail. The following are three representative examples of TypeAndValueDefine:
TABLE-US-00001
First example:
$name:
  type: int[]
  const: 1
  value: 1
  ref: $xxx/$xxx
  required: true
  doc: this is a type and value define demo
  symbols: [1, 2, 3]
Second example:
$oneDir:
  type: dir
  autoSyncInterval: 300  # automatic upload time interval (unit: s)
  autoSyncIgnore: ["run_charm_[0-9]*/", "calc[0-9]*/"]
Third example:
$oneObject:
  type: object
  serializer:  # an obj-type object needs to define a codec
    saveType: file
    fileExt: json
    encoder: ^py:json.dumps
    decoder: ^py:json.loads
[0074] The specific steps are:
[0075] first step: define the name of the substructure at the outermost layer; the outermost key is the name of the substructure, which follows the language principle of starting with $;
[0076] second step: define the general keywords needed to describe the properties of the substructure: the type keyword is the type description, supports int, double, boolean, string, array, file, dir, dict, record, obj, and can be identified as a list by adding the [] suffix; const and value are mutually exclusive keywords, indicating the value represented by the substructure definition, where value is a variable value and const is an immutable value; ref is a keyword mutually exclusive with value/const, identifying the source of the ref value as a reference to another TypeAndValueDefine substructure; the required keyword indicates whether the TypeAndValueDefine substructure must have a value, and the default is true; the doc keyword is the description; the symbols keyword is an enumerated value range, used when the TypeAndValueDefine substructure needs to limit the value range;
[0077] third step: define the special keywords that describe the substructure in a specific type.
[0078] In the second example, the substructure is a definition of type folder. The autoSyncInterval keyword can be defined as the time interval for automatic synchronization; the autoSyncIgnore keyword is a list of file names that are ignored by default, and supports regular syntax.
[0079] In the third example, the substructure is a definition with a type of custom object. The serializer keyword can be defined as the codec definition required by the object definition; the saveType keyword is the storage method, which can be file/string; the fileExt is the suffix name of the stored file, used when the saveType is file; the encoder keyword is the encoder url, and the linked encoder needs to be an executable method that accepts an object and returns string data; the decoder keyword is the decoder url, and the linked decoder needs to be an executable method that accepts string data and returns an object. The codec follows the external link guidelines, using the ^ prefix, and py: identifies it as a python method.
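The general-keyword rules of the second step above can be sketched as a small validator. This is an illustrative Python sketch; the function name `validate_tavd` and the error strings are assumptions, not part of the description language itself.

```python
def validate_tavd(name, props):
    """Validate the general keywords of a TypeAndValueDefine substructure.

    Checks the $ naming principle, the const/value/ref mutual exclusion,
    the required-defaults-to-true rule, and the symbols value range.
    Returns a list of error strings (empty when the substructure is valid).
    """
    errors = []
    if not name.startswith("$"):
        errors.append("substructure name must start with $")
    exclusive = [k for k in ("const", "value", "ref") if k in props]
    if len(exclusive) > 1:
        errors.append(f"mutually exclusive keywords used together: {exclusive}")
    props.setdefault("required", True)  # required defaults to true
    if "symbols" in props and "value" in props and props["value"] not in props["symbols"]:
        errors.append("value is outside the enumerated symbols range")
    return errors
```

Under these rules the first example above, which lists const, value, and ref side by side purely to enumerate the keywords, would be flagged; a real definition chooses only one of the three.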
Embodiment 3
[0080] The following is an embodiment of a TypeDef definition named common (the general information part will not be repeated):
TABLE-US-00002
xwlVersion: 1.0.0
class: TypeDef
doc: a structure type def
author: ziqi.jiang
name: common
version: 1.0.0
typeDefs:
  $jobArgs:
    doc: Contains some info about compute args
    type: record
    fields:
      :cores:
        type: int
      :memory:
        type: int
[0081] The specific steps are:
[0082] first step: define the typeDefs keyword at the outermost layer. The typeDefs keyword contains some TypeAndValueDefine substructures. For example, the definition declares a record-type data structure named jobArgs; fields is the subkey declaration of the record type, and it contains two properties, cores and memory;
[0083] second step: in the TypeAndValueDefine substructure that uses the type definition, the type is declared through a fixed-format link. The following is a TypeAndValueDefine substructure example that uses the typeDef:
[0084]
$use_typedef_demo:
  type: ^typedef/common/jobArgs/version/1.0.0
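Resolving such a fixed-format link against the loaded TypeDef K:V mapping can be sketched as follows. The link layout `^typedef/<file>/<defName>/version/<ver>` is taken from the embodiments; the mapping key format `"<file>/<defName>/<ver>"` is an illustrative choice, not mandated by the description language.

```python
def resolve_typedef(link, typedef_map):
    """Resolve a ^typedef/... link into the loaded TypeDef object.

    typedef_map is the in-memory K:V mapping built when parsing the
    TypeDef definition layer files.
    """
    if not link.startswith("^typedef/"):
        raise ValueError(f"not a typedef link: {link}")
    # split ^typedef/common/jobArgs/version/1.0.0 into its components
    file_name, def_name, _version_kw, version = link[len("^typedef/"):].split("/")
    key = f"{file_name}/{def_name}/{version}"
    if key not in typedef_map:
        raise KeyError(f"TypeDef not loaded: {key}")
    return typedef_map[key]
```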
Embodiment 4
[0086] The Step definition layer contains a specific description of a calculation task. The following is an embodiment of a Step definition layer (the general information part has been omitted)
TABLE-US-00003
entryPoint: ^py:/home/job/run/loader.py
jobArgs:
  type: ^typedef/common/jobArgs/version/1.0.0
  value:
    cores: 16
    memory: 24000
inputs:
  $in_arg1:
    type: file
outputs:
  $out_arg1:
    type: double
[0087] First step: define the four main keywords describing the Step definition layer properties: entryPoint, jobArgs, inputs, outputs. The entryPoint is the execution entry of the Step definition layer; in the embodiment it is the loader.py file located in the /home/job/run directory and executed with python. The jobArgs is the execution parameter of the Step definition layer; the referenced TypeDef definition layer is used in the embodiment, and a default value of 16 cores and 24000 MB of memory is given.
[0088] Second step is to define the input and output items: inputs/outputs are the input and output parameters list of the Step definition layer, and there are several TypeAndValueDefine substructures inside.
Embodiment 5
[0089] The Workflow definition layer contains several Step declarations and the parameter dependencies among Steps. Here is an embodiment of a Workflow definition layer (the general information part has been omitted):
TABLE-US-00004
vars:
  $share_arg1:
    type: string
steps:
  $step1:
    run: ^step/demo/version/1.0.0
    jobArgs:
      cores: 2
      memory: 3000
    in:
      $arg1:
        ref: {vars/$share_arg1}
    out: [$output_list1, $output2]
  $step2:
    run: ^step/scatter_gather_demo/version/1.0.0
    scatter:
      maxJob: 100
      minJob: 0
      zip:
        $scatter_in_arg1: {ref: $step1/$output_list1}
      jobIn:
        $in_arg1: {ref: zip/$item}
        $in_arg2: ~
    gather:
      failTol: 0.1
      retryLimit: 1
      jobOut: [$output]
      unzip:
        $gather_outputs: {ref: jobOut/$output}
outputs:
  $out_wf_arg1: {ref: $step2/gather_outputs}
[0090] The specific steps are:
[0091] first step: define the shared variable pool vars that needs to be reused at the outermost layer: the vars keyword defines the pool for the shared variables in the file; if multiple steps in the workflow need to share a group of inputs, they can be referenced by the ref keyword; the steps keyword holds the Step objects used in the workflow and their dependency topology, and the internal declaration is a key-value pair of step name and Step object;
[0092] second step: define the Steps used and their topological relationship.
[0093] Under the steps keyword, there are two step declarations named step1 and step2.
[0094] In the declaration of step1, run is the specific definition url of the step, represented by an external link starting with ^ that follows the guidelines; here it introduces the 1.0.0 version of the Step definition named demo. The jobArgs keyword maps to the jobArgs defined in the Step; a default value is assigned to it here. The in keyword declares the input parameters; here the value of a parameter named arg1 is declared to refer to the value of share_arg1 in the shared variables. The names under in need to be consistent with the names of the input items in the inputs of the Step definition layer. The out keyword lists the output parameters that are enabled in the workflow, and the names need to be consistent with the names of the output items in the outputs of the Step definition layer.
[0095] The declaration of step2 shows an automatic concurrency step declared using the scatter-gather primitive. The jobArgs can be omitted when no default value is assigned; the scatter keyword declares that this is a concurrent step.
[0096] The scatter keyword distributes each element in the received input list, through the zip mapping, to the same number of subtasks as there are elements in the input list. Under the scatter definition: the maxJob/minJob keywords are the concurrency range of the task; zip is the concurrent batch parameter mapping of this task, and there are several TypeAndValueDefine substructures under zip. Since the task definition is oriented to a single input, a parameter mapping needs to be defined to indicate how the multiple parameters received are mapped to the input items of the concurrent subtasks. For example, this embodiment declares an array type named scatter_in_arg1, which accepts the result of the step1 output named output_list1; the jobIn keyword is the original input of the step, with several TypeAndValueDefine substructures inside, and the names must be consistent with the input names of the Step definition layer inputs. For example, in_arg1 here declares that its value comes from scatter_in_arg1 in the zip map, meaning that each element in the list received by scatter_in_arg1 will be distributed at runtime to the in_arg1 item of one subjob.
[0097] The gather keyword aggregates the output results of multiple subtasks, through the unzip mapping, into an output list. Under the definition of gather: failTol is the failure tolerance rate of the subjobs, a decimal in the range 0-1; if the proportion of failed tasks is greater than this decimal, the step is considered to have failed and retrying is abandoned; retryLimit is the maximum number of failed retries allowed; if some subtasks fail and the proportion of failures is less than the fault tolerance rate, the number of retries will not exceed retryLimit; jobOut lists the output items of the original Step definition layer that are enabled, and the names need to be consistent with the output items in the Step definition layer; unzip is the mapping for parameter aggregation. For example, in this embodiment, unzip declares a definition named gather_outputs that aggregates the output items of all subtasks.
[0098] The outputs keyword at the outermost layer defines the final output of the workflow. For example, in this embodiment, an output named out_wf_arg1 is defined, and its value is derived from the aggregated result gather_outputs of step2.
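The scatter-gather behavior described above can be sketched in Python. This is an illustrative sketch only: `run_scatter_gather` and the callable `job` are hypothetical stand-ins for the executor, and the retry policy is a simple reading of the failTol/retryLimit rules rather than the system's actual scheduler.

```python
def run_scatter_gather(input_list, job, fail_tol=0.1, retry_limit=1):
    """Distribute each element of input_list to one subjob (scatter/zip),
    retry failures within retry_limit, and aggregate outputs (gather/unzip).

    Raises RuntimeError when the failure ratio exceeds fail_tol.
    """
    results, failed, attempts = {}, list(range(len(input_list))), 0
    while failed and attempts <= retry_limit:
        still_failed = []
        for i in failed:
            try:
                results[i] = job(input_list[i])  # scatter: one element per subjob
            except Exception:
                still_failed.append(i)
        failed, attempts = still_failed, attempts + 1
        if len(failed) / len(input_list) > fail_tol:
            raise RuntimeError("step failed: failure ratio exceeds failTol")
    return [results[i] for i in sorted(results)]  # gather/unzip: ordered output list
```

For instance, scattering `[1, 2, 3]` over a subjob that doubles its input gathers `[2, 4, 6]`.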
Embodiment 6
[0099] The Template definition layer is used to specify a set of preset values as a data template to be applied to the workflow. The following is an embodiment of a Template definition layer:
[0100]
workflow: ^workflow/some_workflow/version/1.0.0
values:
  vars/$share_arg1: {value: 233}
  $step2/$in_arg2: {const: 1.0}
[0104] The specific steps are:
[0105] First step: define the target Workflow definition layer keyword applied by the Template definition layer: workflow defines the url for the workflow to which the Template definition layer is applied;
[0106] Second step: define the pre-filled values for the workflow: the values keyword is used to fix some values that need to be filled, and only supports data in the form of value/const. As in the above embodiment, the value named share_arg1 in the shared variables vars is filled with the variable value 233, and the value named in_arg2 in the step2 inputs is set to the immutable value 1.0.
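Applying a Template's preset values onto a parsed workflow can be sketched as follows. The path layout (`vars/$x` and `$step/$arg`) mirrors the embodiment; the nested-dict tree and the helper name `apply_template` are illustrative assumptions.

```python
def apply_template(workflow, template_values):
    """Apply Template preset values to the workflow object tree.

    Each path indexes either a shared variable in vars or a step input;
    the value/const preset overrides the lower definition layer in place.
    """
    for path, preset in template_values.items():
        head, name = path.split("/")
        if head == "vars":
            slot = workflow["vars"][name]
        else:  # a step input path such as $step2/$in_arg2
            slot = workflow["steps"][head]["in"][name]
        slot.update(preset)  # the Template layer overrides the layer below
    return workflow
```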
Embodiment 7
[0109] The specific process of publishing Step definition layer is as follows.
[0110] Data center:
[0111] the data center is a simple C/S architecture service, which manages the index through the Server-side database and the file system manages specific data content; the Client-side performs simple parsing, uploading, and downloading.
[0112] Upload workflow:
[0113] the user submits a description language file to the Client-side. The Client-side reads the content of the file, obtains the specific type, name, and version parameters by analyzing the class, name, and version fields, then requests the Server-side with the file content; the Server-side indexes the database with the corresponding parameters; if a file with the same type, name, and version already exists, a parameter check failure is returned; if it does not exist, a new file address is generated, and the detailed information is added to the database. The Server-side then accesses the file system to store the file at the new file address, and returns the result to the Client-side.
[0114] Download workflow:
[0115] The user accesses the Server-side carrying the type, name, and version parameters; the Server-side indexes the database with the corresponding parameters and returns a NotFound error if no such file exists; if the file exists, it obtains the specific file address, accesses the file system through the file address, obtains the file content, and returns the result to the Client-side.
[0116] The advantage of the above scheme is that using the file system to store description language files, instead of storing them directly in the database, not only preserves the original granularity of the data but also ensures the integrity of the description language files. Using the file system to store larger description language files also improves the performance of the database. When requesting files in batches, addresses can be indexed faster and multi-threading can be used to speed up file reading.
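The upload and download workflows above can be sketched as an in-memory stand-in for the C/S data center. The class name, address scheme, and error types below are illustrative assumptions; a real deployment would put the index in a Server-side database and the content in a file system behind a network API.

```python
class DataCenter:
    """In-memory sketch of the data center: a database index of
    (type, name, version) -> file address, plus a file store."""

    def __init__(self):
        self.index = {}   # database index: (type, name, version) -> address
        self.files = {}   # file system: address -> file content
        self._next = 0

    def upload(self, cls, name, version, content):
        key = (cls, name, version)
        if key in self.index:  # same type, name, and version already exists
            raise ValueError("parameter check failure: file already exists")
        address = f"file://{self._next}"  # generate a new file address
        self._next += 1
        self.index[key] = address
        self.files[address] = content
        return address

    def download(self, cls, name, version):
        key = (cls, name, version)
        if key not in self.index:
            raise KeyError("NotFound")
        return self.files[self.index[key]]
```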
[0117] Parser:
[0118] The parser is an independent and offline data analysis tool, mainly through recursive analysis, syntax tree analysis, object loading, application linking, and application of values layer by layer and other steps to parse the complete definition.
[0119] Before parsing the content, first pull the input files and all files that the input files depend on from the data center to the local machine. The parser recursively traverses each value of the first input file. If a value is an external link beginning with ^, the corresponding file of the link is downloaded through the data center Client-side, and this step is repeated for the new file until all dependent links are ready.
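The recursive pull described above can be sketched as follows. Files are modeled as plain strings whose ^-prefixed links are found with a simple regex, and `fetch` is a hypothetical stand-in for the data center Client-side download; both are illustrative simplifications.

```python
import re

LINK = re.compile(r"\^\S+")  # matches external links beginning with ^

def pull_recursive(entry, fetch, cache=None):
    """Pull a file and, recursively, every file it links to via ^ urls.

    cache maps link -> content; a link already in the cache is local
    and is not fetched again, so the recursion terminates.
    """
    cache = {} if cache is None else cache
    if entry in cache:
        return cache
    content = fetch(entry)
    cache[entry] = content
    for link in LINK.findall(content):
        pull_recursive(link, fetch, cache)  # repeat until all links are ready
    return cache
```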
[0120] Since each layer of the description language has a priority and coverage relationship, in order to realize the data logic of the layer coverage, it is necessary to construct and apply the coverage layer by layer from the bottom layer. First step, parses the type definition file whose Class is TypeDef. All objects of TypeDef definition layer are constructed from the contents of the file and stored in memory as a K:V mapping.
[0121] Second step, constructs the Step objects and parses all files whose class is Step. The Step objects are constructed from the content of the files. The inputs/outputs properties of the Step objects contain several TypeDef objects. If the Step applies a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object in Step and the value override operation is performed.
[0122] Third step, construct the Workflow object by parsing the file whose class is Workflow. The steps property of the Workflow object contains all the Steps involved in the workflow, stored as a StepName: StepObject mapping. The Workflow object fetches all its dependent Step objects from the Step definition layer, stores them in its own steps property, and overrides the values in the Step objects with the values in the Workflow definition layer according to the content of the file.
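The layer-coverage logic used in the second and third steps can be sketched as a recursive merge, where the upper definition layer's values override the layer below. The helper name `override` and the nested-dict representation are illustrative assumptions.

```python
def override(base, upper):
    """Merge an upper definition layer onto a lower one.

    Scalar values from the upper layer replace the base values; nested
    dicts are merged recursively so untouched base keys survive.
    """
    merged = dict(base)
    for k, v in upper.items():
        if isinstance(v, dict) and isinstance(merged.get(k), dict):
            merged[k] = override(merged[k], v)  # recurse into sub-structures
        else:
            merged[k] = v  # upper layer covers the lower layer's value
    return merged
```

For example, a Workflow layer that sets jobArgs to 2 cores and 3000 MB covers the Step layer's default of 16 cores and 24000 MB while leaving the Step's input declarations intact.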
[0123] Finally, the Template definition layer is parsed, the specific variables and values in the Template definition layer are traversed, and a certain input and output value of a Step in the Workflow definition layer is indexed for override operation.
[0124] After the parsing is completed, an object tree is obtained; the workflow object is the root node. The workflow object contains all the Step objects through the steps property, and the Step objects contain all the TypeDef objects through the inputs/outputs properties.
[0125] Besides constructing a well-defined object tree and performing hierarchical assignment, the second important algorithm of the parser performs topological sorting on the Workflow objects. The user defines the dependencies among different Steps, and the topological sorting algorithm can derive the most efficient running order of the Steps.
[0126] Find a FollowMap through the ref links marked in the inputs of each Step; the FollowMap is a mapping of <depended-on Step: list of dependent Steps>.
[0127] After getting the FollowMap, invert the FollowMap mapping to get the LeaderMap, which is the mapping of <stepName: Step list on which this Step depends>.
[0128] Introduce the concept of Distance, which is abbreviated as Dis in the flowchart, which means the distance of dependence from being run. The default is 1 (can be run directly).
[0129] Traverse all Step objects; if a Step object has not been checked, traverse the LeaderSteps of the Step object; if a Step object does not have a Leader, the Step object has no dependencies, it is deemed to have been checked, and its Dis is set to 1. If a Step object does have a Leader, the Step object is dependent, and the Dis of the Step is increased by the Dis of its LeaderSteps, and so on.
[0130] The core of the recursive idea draws on the topological sorting algorithm from mathematics. FollowMap and LeaderMap are two forms of adjacency representation. The starting points are determined by the LeaderMap, and each starting Step's Dis is set to 1. The Dis of an intermediate Step is the sum of the Dis of the Steps on the path from the starting point to this point. By sorting by Dis, we can get the most efficient running sequence. And when a Step has been executed, we only need to recursively subtract its Dis from the subsequent nodes according to the FollowMap to update the running sequence of the current state.
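Steps A through D above can be sketched in Python. The function names and the FollowMap/LeaderMap dict shapes are illustrative assumptions; the Dis recursion follows the rule stated above (default 1 for a Step with no Leader, otherwise 1 plus the Dis of its LeaderSteps).

```python
def follow_to_leader(follow_map, steps):
    """Step B: invert the FollowMap (<depended-on step: dependent steps>)
    into the LeaderMap (<step: steps it depends on>)."""
    leader_map = {s: [] for s in steps}
    for leader, followers in follow_map.items():
        for f in followers:
            leader_map[f].append(leader)
    return leader_map

def compute_dis(steps, follow_map):
    """Steps C-D: dependency distance, default 1 for steps with no
    Leader, otherwise 1 plus the Dis of all LeaderSteps."""
    leader_map = follow_to_leader(follow_map, steps)
    dis = {}
    def rec(s):
        if s not in dis:
            dis[s] = 1 + sum(rec(l) for l in leader_map[s])
        return dis[s]
    for s in steps:
        rec(s)
    return dis

def run_order(steps, follow_map):
    """Sort steps by Dis to obtain an efficient running sequence."""
    dis = compute_dis(steps, follow_map)
    return sorted(steps, key=lambda s: dis[s])
```

For a chain where step a feeds step b and step b feeds step c, the Dis values come out 1, 2, 3 and the running order is a, b, c.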
[0131] Taking the above-described ideal embodiments of this application as enlightenment, relevant staff can, through the above description, make various changes and modifications without departing from the technical idea of this application. The technical scope of this application is not limited to the content in the description and must be determined according to the scope of the claims.