Parallel program generating method and parallelization compiling apparatus
10698670 ยท 2020-06-30
Assignee
Inventors
Cpc classification
International classification
Abstract
There is provided a parallel program generating method capable of generating a static scheduling enabled parallel program without undermining the possibility of extracting parallelism. The parallel program generating method executed by the parallelization compiling apparatus 100 includes a fusion step (FIG. 2/STEP026) of fusing, as a new task, a task group including a reference task as a task having a conditional branch, and subsequent tasks as tasks control dependent, extended-control dependent, or indirect control dependent on respective of all branch directions of the conditional branch included in the reference task.
Claims
1. A computer-implemented method for generating, from a sequential program, a parallel program executable in a system including a plurality of arithmetic processing units to perform arithmetic processing in parallel, the method comprising: dividing the sequential program into a plurality of tasks; first analyzing the plurality of tasks to determine data dependency and control dependency of each of the plurality of tasks; second analyzing an earliest executable condition of each of the plurality of tasks based on the data dependency between respective tasks and the control dependency of each task obtained from the first analyzing; and determining, based on results of the second analyzing, as a task group to be fused, a task group including, among the plurality of tasks, a reference task as a task having a conditional branch, and all subsequent tasks as tasks control dependent, extended-control dependent, or indirect control dependent on respective of all branch directions of the conditional branch included in the reference task, and fusing, as a new task, the task group to be fused, wherein the earliest executable conditions for an i-th task MTi are: the conditional branch of a i-th task MTi on which the i-th task MTi is control dependent branches to a path including the i-th task MTi; and a k-th task MTk (ki) on which the i-th task MTi is data dependent is fully completed, or non-execution of the k-th task MTk is determined.
2. The method according to claim 1, further comprising: scheduling to assign each of a plurality of tasks including the new task to each of the plurality of arithmetic processing units based on the data dependency; and generating the parallel program based on the scheduling results.
3. The method according to claim 1, wherein the determining includes identifying a task group including the reference task, and all first subsequent tasks as tasks control dependent or extended-control dependent on respective of all the branch directions of the conditional branch included in the reference task; adding, to the task group, all second subsequent tasks as tasks control dependent or extended-control dependent on respective of all branch directions of conditional branches included in the task group; repeating the adding until tasks control dependent or extended-control dependent on any of the branch directions of the conditional branches included in the task group run out; and determining the task group to be a task group to be fused.
4. The method according to claim 1, further comprising: determining whether a plurality of tasks control dependent, indirect control dependent, or extended-control dependent on one branch direction of the conditional branch included in the reference task included in the task group to be fused satisfy a predetermined condition including such a parallelly executable condition as to have no control dependency, indirect control dependency, extended-control dependency, and data dependency on one another; and when the predetermined condition is determined not to be satisfied, fusing the task group to be fused as the new task, or when the predetermined condition is determined to be satisfied, duplicating the conditional branch included in the reference task, making the plurality of tasks having no control dependency, indirect control dependency, extended-control dependency, and data dependency on one another follow respective of a plurality of conditional branches including the duplicated conditional branch, and combining each of the plurality of conditional branches with the plurality of tasks, each of which is made to follow each of the plurality of conditional branches to generate a plurality of task groups, determining the plurality of task groups as a new plurality of task groups to be fused, and fusing, as the new task, each of the plurality of tasks groups to be fused.
5. The method according to claim 1, wherein the second analyzing includes simplifying the earliest executable condition of each of the plurality of tasks by excluding a case that includes an earliest executable condition that is also included as the earliest executable condition of another case.
6. A parallelization compiling apparatus configured to generate, from a sequential program, a parallel program executable in a system including a plurality of arithmetic processing units to perform arithmetic processing in parallel, the parallelization compiling apparatus comprising at least one processor configured to function as: a task division element which divides the sequential program into a plurality of tasks, a dependency analysis element which analyzes the plurality of tasks divided by the task division element to determine data dependency and control dependency of each of the plurality of tasks; an earliest executable condition analysis element which analyzes an earliest executable condition of each of the plurality of tasks based on the data dependency between respective tasks and the control dependency of each task obtained from the dependency analysis element; and a fusion element which determines, based on results of the earliest executable condition analysis element, as a task group to be fused, a task group including, among the plurality of tasks, a reference task as a task having a conditional branch, and all subsequent tasks as tasks control dependent, extended-control dependent, or indirect control dependent on respective of all branch directions of the conditional branch included in the reference task, and fuses the task group to be fused as a new task, wherein the earliest executable conditions for an i-th task MTi are: the conditional branch of a i-th task MTi on which the i-th task MTi is control dependent branches to a path including the i-th task MTi; and a k-th task MTk (ki) on which the i-th task MTi is data dependent is fully completed, or non-execution of the k-th task MTk is determined.
7. A non-transitory computer-readable medium having stored thereon computer-readable instructions to cause a computer to execute a process to generate, from a sequential program, a parallel program executable in a system including a plurality of arithmetic processing units to perform arithmetic processing in parallel, the process comprising: dividing the sequential program into a plurality of tasks; first analyzing the plurality of tasks divided to determine data dependency and control dependency of each of the plurality of tasks; second analyzing an earliest executable condition of each of the plurality of tasks based on the data dependency between respective tasks and the control dependency of each task obtained from the first analyzing; and determining, based on results of the second analyzing, as a task group to be fused, a task group including, among the plurality of tasks, a reference task as a task having a conditional branch, and all subsequent tasks as tasks control dependent, extended-control dependent, or indirect control dependent on respective of all branch directions of the conditional branch included in the reference task, and fusing, as a new task, the task group to be fused, wherein the earliest executable conditions for an i-th task MTi are: the conditional branch of a i-th task MTi on which the i-th task MTi is control dependent branches to a path including the i-th task MTi; and a k-th task MTk (ki) on which the i-th task MTi is data dependent is fully completed, or non-execution of the k-th task MTk is determined.
8. The non-transitory computer-readable storage medium according to claim 7, the process further comprising: scheduling to assign each of a plurality of tasks including the new task to each of the plurality of arithmetic processing units based on the data dependency; and generating the parallel program based on the scheduling results.
9. The non-transitory computer readable storage medium according to claim 7, wherein the determining includes identifying a task group including the reference task, and all first subsequent tasks as tasks control dependent or extended-control dependent on respective of all the branch directions of the conditional branch included in the reference task; adding, to the task group, all second subsequent tasks as tasks control dependent or extended-control dependent on respective of all branch directions of conditional branches included in the task group; repeating the adding until tasks control dependent or extended-control dependent on any of the branch directions of the conditional branches included in the task group run out; and determining the task group to be a task group to be fused.
10. The non-transitory computer readable storage medium according to claim 7, the process further comprising: determining whether a plurality of tasks control dependent, indirect control dependent, or extended-control dependent on one branch direction of the conditional branch included in the reference task included in the task group to be fused satisfy a predetermined condition including such a parallelly executable condition as to have no control dependency, indirect control dependency, extended-control dependency, and data dependency on one another; and when the predetermined condition is determined not to be satisfied, fusing the task group to be fused as the new task, or when the predetermined condition is determined to be satisfied, duplicating the conditional branch included in the reference task, making the plurality of tasks having no control dependency, indirect control dependency, extended-control dependency, and data dependency on one another follow respective of a plurality of conditional branches including the duplicated conditional branch, and combining each of the plurality of conditional branches with the plurality of tasks, each of which is made to follow each of the plurality of conditional branches to generate a plurality of task groups, determining the plurality of task groups as a new plurality of task groups to be fused, and fusing, as the new task, each of the plurality of tasks groups to be fused.
11. The non-transitory computer-readable storage medium according to claim 7, wherein the second analyzing includes simplifying the earliest executable condition of each of the plurality of tasks by excluding a case that includes an earliest executable condition that is also included as the earliest executable condition of another case.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
DESCRIPTION OF THE PREFERRED EMBODIMENTS
(22) Referring to
(23) (Configuration of Compiling Apparatus)
(24) A parallelization compiling apparatus 100 is an apparatus configured to receive, as input, a sequential program P1 sequentially executable in a single processor system and a configuration file CF, and output binary code PB parallelly executable in a multiprocessor system 200. The parallelization compiling apparatus 100 corresponds to an example of a computer of the present invention.
(25) In order to implement this function, the parallelization compiling apparatus 100 includes an arithmetic processing element 110, a reading device 120, a writing device 130, and a storage device 140.
(26) The arithmetic processing element 110 is configured to include a central processing unit (CPU) or the like to read a parallelization compiler C from the storage device 140 such as a memory in order to function as a task division element 111, a dependency analysis element 112, a fusion element 113, a scheduling element 114, a generation element 115, a fusion-target task group determination element 116, and a condition determination element 117, which perform arithmetic processing for parallelization processing to be described later according to the parallelization compiler C. The arithmetic processing element 110 performs the arithmetic processing according to the parallelization compiler C to perform a series of processing from STEP001 to STEP032 in
(27) The reading device 120 is a device that reads information from an external storage medium, which is a CD disk, a DVD disk, or a Blu-ray disk. Instead of the reading device 120, for example, the sequential program P1 and the configuration file CF may be input externally through an input device, such as a keyboard, or a communication device, or the sequential program P1 and the configuration file CF may be read from an external storage medium (USB memory) or the like connected to the parallelization compiling apparatus 100.
(28) The sequential program P1 is source code written in a high-level language such as Fortran or C language.
(29) The configuration file CF is a file in which information necessary to generate a parallel program running on the multiprocessor system 200, such as information on the number of processing elements that constitute the multiprocessor system 200, the type of processor such as a CPU that makes up the processing elements, the memory capacity and access time of a local memory, the memory capacity and access time of a common storage area in the multiprocessor system 200, and the OS installed in the multiprocessor system 200.
(30) When the parallelization compiling apparatus 100 and the multiprocessor system 200 use a common storage device, the parallelization compiling apparatus 100 may refer to the storage device to acquire information recorded in the configuration file CF.
(31) The writing device 130 is a device that writes information to an external storage medium, which is a CD-R disk, a DVD-R disk, or a Blu-ray disk. Instead of the writing device 130, for example, the binary code PB may be output to the outside through a communication device, or the binary code PB may be written to an external storage medium (USB memory) or the like connected to the parallelization compiling apparatus 100.
(32) The binary code PB is an execution program executable by each of the first processing element PE1 to the n-th processing element PEn in the multiprocessor system 200. The first processing element PE1 to the n-th processing element PEn execute the binary code PB to obtain the same processing results as those in the case where (binary code) of the sequential program P1 is executed in the single processor system.
(33) The storage device 140 is composed of memory units (a main memory unit, an auxiliary memory unit, and the like), such as a ROM, a RAM, and an HDD, and an I/O circuit. The storage device 140 includes at least a non-volatile memory. The RAM is a readable and writable volatile memory, the ROM is a read-only nonvolatile memory, and the HDD is a readable and writable nonvolatile memory. In the ROM and the HDD, programs and the like read and executed by the arithmetic processing element 110 are prestored. The RAM is used as a storage area for temporarily storing a program when the arithmetic processing element 110 executes the program stored in the ROM and the HDD, or as a storage area for temporarily storing working data. In addition to or instead of the RAM, the HDD may be used as a storage area for temporarily storing the program, or as a storage area for temporarily storing working data.
(34) In the nonvolatile memory of the storage device 140, the pre-installed parallelization compiler C and the configuration file are stored.
(35) (Configuration of Multiprocessor System)
(36) The multiprocessor system 200 includes PE1 to PEn as n processing elements interconnected with one another by an interconnection network such as bus connection or cross-bus connection, a centralized shared memory 210, and an input/output device 220 for the multiprocessor system. Each of the processing elements corresponds to an example of each of arithmetic processing units of the present invention.
(37) The k-th processing element PEk (k=1, . . . , n) includes a central processing unit CPU, a local data memory LDM, a data transfer unit DTU, a distributed shared memory DSM, and a local program memory LPM.
(38) The configuration of the k-th processing element PEk (k=1, . . . , n) may be different from this configuration as long as the processing element can perform predetermined arithmetic processing. For example, the k-th processing element PEk (k=1, . . . , n) may include a cache memory in addition to or instead of the local data memory LDM and the local program memory LPM. The k-th processing element PEk (k=1, . . . , n) may also include a register for clock frequency or power-supply voltage control. Further, the k-th processing element PEk (k=1, . . . , n) may include an accelerator instead of or in addition to the central processing unit CPU. Conversely, for example, all or some of components (LDM, LPM, DSM, DTU) other than the CPU may be omitted. Further, the k-th processing element PEk (k=1, . . . , n) may have a configuration different from the others.
(39) The central processing unit CPU is a general-purpose processor.
(40) The local data memory LDM is a memory unit (composed of a RAM and the like) accessible only from the processing elements including the LDM.
(41) The data transfer unit DTU is a unit for managing data transfer between processing elements, between the k-th processing element PEk and the centralized shared memory 210, or between the k-th processing element PEk and the input/output device 220 of the multiprocessor system.
(42) The distributed shared memory DSM as one of the components of each processing element is a memory unit (composed of a RAM and the like) accessible from other processing elements, but the DSM may not be necessarily provided.
(43) The local program memory LPM stores a program to be executed by the k-th processing element PEk including the LPM (e.g., a program for part of the binary code PB assigned to the k-th processing element PEk).
(44) Some processing elements may include a signal processor (Digital signal processor, which is abbreviated as DSP) or a dynamically reconfigurable processor (abbreviated as DRP) instead of the CPU.
(45) The processing elements PE1 to PEn may be grouped into a processing group PG as a hierarchical group. The details of this grouping technique are disclosed in Reference Literature 1, M. Miyazawa, M. Okamoto, and H. Kasahara, A Subroutine Parallel Processing Scheme for Hierarchical Macro-dataflow Computation, Proc. 48.sup.th National Convention of IPSJ, 1994.
(46) Note that the central processing unit CPU, and each of the processing elements PE1 to PEn or the processing group PG correspond to an example of each of the arithmetic processing units of the present invention.
(47) The centralized shared memory 210 is composed of memory media (composed of a RAM, a ROM, an HDD, and the like) accessible from each of the processing elements PE1 to PEn.
(48) The input/output device 220 of the multiprocessor system may be a unit for reading information from an external storage medium, which is a CD drive, a DVD drive, or a Blu-ray drive. Instead of the input/output device 220 of the multiprocessor system, for example, the binary code PB may be externally input to the multiprocessor system 200 through a communication device, or the binary code PB may be input to the multiprocessor system 200 by being directly written to a memory unit (the centralized shared memory 210 or the local program memory LPM) of the multiprocessor system. Further, as will be understood, the input/output device 220 has the functions of reading data, on which arithmetic processing is performed in the multiprocessor system, and outputting the arithmetic processing results.
(49) Particularly, when the multiprocessor system 200 is used to control a control target such as a vehicle, the input/output device 220 of the multiprocessor system has the functions of reading information data (e.g., the number of revolutions and temperature of a vehicle engine, and the like), which indicate the states of a control target necessary for the control, as binary data in real time, and outputting control information data in real time to control the control target after being subjected to arithmetic processing by a parallel program.
(50) As described above, the multiprocessor system including the processing elements PE1 to PEn or the processing group PG, in which the processing elements are grouped, and further including a shared storage device and the input/output device corresponds to a system configured to execute a parallel program generated by a parallelization compiler of the present invention. Note that the system of the present invention is not limited to the multiprocessor system integrated on one semiconductor chip or a system equipped with plural arithmetic processing units in one housing. The system may also be a system configured by interconnecting plural computers as arithmetic processing units through communication.
(51) (Parallel Program Generation Processing)
(52) Referring next to a flowchart of
(53) First,
(54) When reading the sequential program P1 and the configuration file CF through the reading device 120, the arithmetic processing element 110 performs lexical analysis and syntax analysis on the sequential program P1 (
(55) Based on the results of the lexical analysis and syntax analysis, the task division element 111 divides the sequential program P1 into three kinds of coarse grain tasks (macro tasks), i.e., a basic block (BB) including an assignment statement and a conditional branch, a repetition block (RB) including repeated execution, and a subroutine block (SB) including a function (
(56) The task division element 111 analyzes the execution cost including the execution time of each task (
(57) For example, as illustrated in
(58) Further, pro3 (mp_prop_count(2); mp_prop_clock_start(2)) measures the processing start time of MT2 together with the number of processing times, pro4 (mp_prop_clock_end(2)) measures the processing end time of MT2, pro5 (mp_prop_count(3); mp_prop_clock_start(3)) measures the processing start time of MT3 together with the number of processing times, and pro6 (mp_prop_clock_end(3)) measures the processing end time of MT3. MT2 and MT3 are subsequent tasks branching from MT1, and the sum of the numbers of processing times of MT2 and MT3 coincides with the number of processing times of MT1. Further, the probability of branching from MT1 to MT2 can be calculated by dividing the number of processing times of MT2 by the number of processing times of MT1. The probability of branching to MT3 can be calculated in the same way. Further, like in the case of MT1, the execution times of MT2 and MT3 can be determined by subtracting the measurement value of each processing start time from the measurement value of each processing end time, respectively.
(59) Thus, the execution times (execution costs) of all the other tasks can be measured in the same way.
(60) If the execution time and probability of branching of a conditional branch in each branch direction can be measured, the execution costs of various task groups can be calculated based on the data. The calculation of the task execution costs is described in Reference Literature 2 (M. Miyazawa, M. Okamoto, and H. Kasahara, Hierarchical Parallelism Control Scheme for Multigrain Parallelization, Trans. of IPSJ, 2003).
(61) The execution cost may also include power consumption used to execute a task in addition to the number of executions and the execution time. The measured execution cost of each task can be used to calculate the execution cost of a task group (a group of plural tasks).
(62) After executing STEP006 for execution cost analysis, the task division element 111 performs inline expansion on a subroutine block SB as needed when the subroutine block SB includes a particularly large execution cost. This inline expansion is not illustrated in
(63) After completion of the execution cost analysis, the dependency analysis element 112 analyzes the control flow and data dependency of each task divided in the task division processing (
(64) An example of the MFG thus generated is illustrated in
(65) Each task is any one of the basic block (BB) (or the pseudo assignment statement block (BPA)), the repetition block (RB), and the subroutine block (SB). Each solid-line edge indicates data dependency from a post-processing task (a task to be post-executed in the sequential program) to a pre-processing task (a task to be pre-executed in the sequential program). Each broken-line edge indicates a control flow from the pre-processing task to the subsequent processing task. Note that a small circle in each node indicates a conditional branch.
(66) For example, in
(67) Further, in
(68) Although the arrow of each edge in the MFG of
(69) The dependency analysis element 112 analyzes the earliest executable conditions of tasks on the MFG (
(70) First, description is made in such a case that the control dependency and data dependency are analyzed on the MFG in
(71) Since it is also determined whether the basic block BB3 is executed or not according to the branch direction of a conditional branch included in the basic block BB2, the basic block BB3 is control dependent on one branch direction BB23 of the conditional branch included in the basic block BB2. In this case, the basic block BB3 is indirect control dependent on the one branch direction BB12 of the conditional branch included in the basic block BB1.
(72) Further, since the basic block BB5 is executed regardless of which of the directions is the branch direction of the conditional branch of the basic block BB1, the basic block BB5 is not control dependent on all the branch directions BB12 and BB15 of the conditional branch included in the basic block BB1.
(73) The basic block BB6 is data dependent on the basic block BB3. However, even when it is determined that the basic block BB2 is not executed (and hence the basic block BB3 is not executed) by determining the conditional branch of the basic block BB1 in one branch direction BB15, the basic block BB6 can be executed. Thus, BB6 is extended-control dependent on the one branch direction BB15 of the conditional branch included in the basic block BB1.
(74) Further, since the basic block BB6 can be executed even when it is determined that the basic block BB3 is not executed by determining the conditional branch of the basic block BB2 in one branch direction BB24, the basic block BB6 is extended-control dependent on the one branch direction BB24 of the conditional branch of the basic block BB2.
(75) The MFG represents the control flow and data dependency between tasks in the sequential program, but does not represent parallelism. In order to extract parallelism, the earliest executable conditions need to be analyzed based on the analysis results of the control dependency of each task and the data dependency between respective tasks described thus far. The earliest executable conditions for a task are conditions for making the task executable at the earliest time. Here, the following relationships between respective tasks are established (see Reference Literature 3 (D. Inaishi, K. Kimura, K. Fujimoto, W. Ogata, M. Okamoto, and H. Kasahara, A Cache Optimization with Earliest Executable Condition Analysis, Proc. 58.sup.th National Convention of IPSJ, 1999)).
(76) (1) When the i-th task MTi is control dependent on one branch direction of a conditional branch included in the j-th task MTj (ji), the i-th task MTi can be executed when the branch direction of conditional branch of the j-th task MTj is determined even when the execution of the j-th task MTj is not completed.
(77) (2) When the i-th task MTi is data dependent on the k-th task MTk (ki), the i-th task MTi cannot be executed until completion of the execution of the k-th task MTk.
(78) To organize this, it can be represented that the earliest executable conditions for the i-th task MTi are the following (3) and (4).
(79) (3) The conditional branch of the j-th task MTj on which the i-th task MTi is control dependent branches to a path including the i-th task MTi.
(80) (4) The k-th task MTk (ki) on which the i-th task MTi is data dependent is fully completed, or the non-execution of the k-th task MTk (ki) is determined.
(81) For example, the earliest executable conditions for the basic block BB6 (corresponding to the MTi) in the macro flow graph (MFG) of
(82) (5) The execution of the basic block BB1 (corresponding to the MTj) is determined (because the execution of the basic block BB6 is determined regardless of which of the directions is the branch direction of the basic block BB1).
(83) (6) The basic block BB3 (corresponding to the MTk) on which the basic block BB6 is data dependent is completed, or the basic block BB3 on which the basic block BB6 is data dependent is determined not to be executed.
(84) Here, in terms of the MFG of
(85) Then, the case where the branch direction of the conditional branch in the basic block BB2, on which the basic block BB3 is control dependent, is determined to be the branch direction BB24 to execute the basic block BB4 includes the case where the execution of the basic block BB1 is determined because of the assumption that the branch direction of the conditional branch of the basic block BB1 is determined to be the branch direction BB12 to execute the basic block BB2.
(86) Further, the case where the branch direction of the conditional branch in the basic block BB1, on which the basic block BB3 is indirect control dependent, is determined to be the branch direction BB15 to execute the basic block BB5 includes the case where the execution of the basic block BB1 is determined.
(87) Thus, the earliest executable conditions for the basic block BB6 illustrated in the MFG of
(88) The basic block BB3 is completed, or the branch direction of the conditional branch of the basic block BB1 is determined to be the branch direction BB15 to execute the basic block BB5, or the branch direction of the conditional branch of the basic block BB2 is determined to be the branch direction BB24 to execute the basic block BB4. Note that the earliest executable conditions may not be necessarily simplified in this way.
(89) As described above, when the same earliest executable condition analysis as that performed on the basic block BB6 is made on the other tasks, the earliest executable conditions for respective tasks are represented in a table illustrated in
(90) Based on the results of the earliest executable condition analysis in
(91) For example, the arithmetic processing element 110 generates an MTG illustrated in
(92) Like in the MFG; each node in the MTG indicates each task, a small circle in each node indicates a conditional branch in each task, each solid-line edge indicates data dependency, and each broken-line edge indicates control dependency or extended-control dependency. Further, as described in
(93) Further, there are two kinds of arcs that bundle respective edges. The solid-line arc indicates that respective edges bundled with the arc are in an AND relationship, i.e., that tasks respectively subsequent to two or more broken-line edges bundled with the solid-line arc are executable simultaneously in parallel with each other, and the broken-line arc indicates that respective edges bundled with the arc are in an OR relationship, i.e., that tasks respectively subsequent to two or more broken-line edges bundled with the broken-line arc are in a selective relationship at respective conditional branches.
(94) For example, in the MTG illustrated in
(95) For example, as can be seen from the MFG of
(96) The basic block BB12 in the MTG of
(97) Then, as can be seen from the MFG of
(98) As described above, since these edges are in the OR relationship, the edges are bundled with a broken-line arc. Note that the direction of each edge whose arrow is omitted in the MTG is set downward. Further, edges with arrows represent the original control flow.
(99) Further, for example, as can be seen from the MFG of
(100) Then, the fusion-target task group determination element 116 performs fusion-target task group determination processing for determining a task group as a fusion target from the MTG (
(101) Referring to
(102) The fusion-target task group determination element 116 refers to the MTG to identify, as a reference task, a task which is not data dependent on the other tasks, is not control dependent, extended-control dependent, and indirect control dependent on any of the branch directions of conditional branches included in the other tasks, and includes one conditional branch (
(103) As an example of identifying, as a reference task, a task which is not data dependent on the other tasks, is not control dependent, extended-control dependent, and indirect control dependent on any of the conditional branches included in the other tasks, and includes one conditional branch, the arithmetic processing element 110 can perform processing to refer to the MTG illustrated in
(104) Then, the fusion-target task group determination element 116 refers to the MTG to identify, as a task group, the reference task and first subsequent tasks as all tasks respectively control dependent or extended-control dependent on all branch directions of the conditional branch included in the reference task (
(105) In the example of
(106) Then, the fusion-target task group determination element 116 identifies the reference task and the first subsequent tasks as a task group (
(107) The fusion-target task group determination element 116 refers to the MTG to determine whether there is a task control dependent or extended-control dependent on any of the branch directions of the conditional branches of the tasks included in this identified task group (
(108) The fusion-target task group determination element 116 refers to the MTG illustrated in
(109) As described above, when the determination result in
(110) After
(111) For example, the fusion-target task group determination element 116 refers to the MTG illustrated in
(112) When the determination result in
(113) The above description is made on the fusion-target task group determination processing in
(114) Further, for example, instead of the processing STEP202 to STEP210 in
(115) Next, the fusion element 113 fuses the task group extracted in
(116) For example, the fusion element 113 refers to the MTG in
(117) As can be seen from the MTG in
(118) Next, the scheduling element 114 performs static scheduling processing to conform to the above-described configuration file CF (including information on the kinds and number of PEs, the grouping situation, the memory situation, and the like) together with the MTG generated via
(119) For example, when the number of PEs in the multiprocessor system is five in the configuration file CF, the scheduling element 114 can assign five tasks to the respective PEs. Further, if the number of PEs indicated in the configuration file CF is two, the arithmetic processing element 110 makes an assignment based on the execution costs of the five tasks to minimize the execution cost difference between the two PEs. For example, the scheduling element 114 can assign the block1 and BB5 to PE1 as a first PE, and the block2, BB11, and BB13 to PE2 as a second PE.
(120) In the example described above, the number of parallelly executable tasks is not large, which is three to five. However, when the number of PEs that constitute the multiprocessor system increases as the number of parallelly executable tasks increases, the scheduling processing is not so simple based on the number of PEs as mentioned above. In this case, there is a need to consider various conditions, and hence the processing is generally complicated.
(121) Here, as the scheduling method, the method disclosed in Patent Literature 1 using static scheduling to assign each task to any processing element PE or processing group PG depending on the task hierarchy can be adopted.
(122) Further, in the multiprocessor system, especially in a multiprocessor system formed on a semiconductor chip, a mechanism using software to make the operating voltage of each processing element or the like in the system variable is often provided. This is to optimize the operating voltage in the multiprocessor system according to the execution situation of each individual task in order to reduce the power consumption. The arithmetic processing element 110 may use the estimated power consumption as an execution cost to select the operating voltage of each of the processing elements and the like that constitute a multiprocessor system appropriate for the execution of each task based on this execution cost, and insert an instruction to operate the processing element PE or the processing group PG at the operating voltage. Note that the details of the selection of appropriate operating voltage are described in Japanese Patent No. 4082706.
(123) Further, the scheduling element 114 may perform cache optimization by trying global cache optimization between groups having dependency. Note that the global optimization is described in Japanese Patent No. 4177681.
(124) Here, the voltage control and cache optimization, and the like can be realized relatively easily by using a runtime library or the like according to an automatic parallelizing API standard interpretation system and the platform of the multiprocessor system 200 disclosed in Patent Literature 1.
(125) Based on the scheduling results, the generation element 115 generates a parallel program P2 (
(126) Based on the information described in the configuration file CF, the generation element 115 uses a back-end compiler that supports various PEs in the multiprocessor system to generate binary code PB from the parallel program (source code) P2 (
(127) According to the processing mentioned above, the parallel program P2 (and the binary code PB) parallelly executable by the multiprocessor system 200 is generated. Then, the arithmetic processing element 110 ends the series of parallelization processing in the flowchart of
(128) As described above, although it becomes apparent that the technique of the present invention can extract more parallelly executable tasks than those in the conventional technique, further more parallelly executable tasks can be extracted depending on the state of the original sequential program and the configuration of the multiprocessor system. The following describes another example.
(129) This example is to analyze whether there is a possibility that any further parallelly executable task exists in the task group to be fused, which is generated in STEP018 of
(130) In STEP020 to STEP024 of
(131) An example of a processing flow in such a case as illustrated in
(132) After completion of the fusion-target task group determination processing in
(133) When parallelly executable tasks or task groups do not exist in the above-described reference task and a task group in which subsequent tasks, such as the first subsequent tasks and the second subsequent tasks are added to the reference task, i.e., in a subsequent task group included in the task group to be fused as described thus far, the above-described task group including the reference task, the first subsequent tasks, and the second subsequent tasks is fused as one task so that the tasks having conditional branches can be hidden in one fused task. In other words, since the results in this case are the same as those of the processing flow in
(134) When such tasks or task groups exist, the fusion element 113 duplicates the conditional branch included in the reference task by an increment obtained by subtracting one from the number of parallelly executable tasks or task groups. For example, when the number of parallelly executable tasks or task groups is three, the fusion element 113 duplicates the conditional branch included in the reference task by an increment corresponding to 31=2 to make three conditional branches including the conditional branch included in the reference task exist. Then, the parallelly executable tasks are made to follow respective of the conditional branch included in the reference task and the duplicated conditional branches to fuse the respective conditional branches and the tasks subsequent to the conditional branches so that the conditional branches and the parallelly executable tasks can be executed by (plural) processing elements PE corresponding in number to those in the multiprocessor system, respectively, and hence the degree of parallelism can be increased.
(135) The details of duplication processing for conditional branches are described in Japanese Patent Application Laid-Open No. 2014-160453. However, in the present invention, more parallelly executable tasks can be extracted by combining this processing with the processing flow for generation of the parallel program in
(136) In the following, the conditional branch included in the reference task may be called a target conditional branch as appropriate.
(137) The condition determination element 117 determines whether predetermined conditions for duplication are satisfied to determine whether to fuse tasks after being duplicated or to fuse the tasks without being duplicated (
(138) The predetermined conditions include at least such a parallelly executable condition that plural tasks or task groups control dependent, indirect control dependent, or extended-control dependent on one branch direction among plural branch directions of the target conditional branch do not have data dependency on one another. Thus, when the plural tasks or task groups control dependent, indirect control dependent, or extended-control dependent on the one branch direction do not have data dependency on one another, the one branch direction is called a target branch direction below. Note that the term one branch direction here is an expression used in a state before the earliest executable condition analysis is performed, for example, in a state of being expressed in the MFG of
(139) As illustrated in
(140) Further, the basic block BB12 is extended-control dependent on a different branch direction (bundled with the broken-line arc) in the OR relationship with the branch direction of the conditional branch included in the basic block BB7 toward the basic block BB8. The branch direction of the conditional branch of the basic block BB8 toward the basic block BB9 is extended-control dependent on the different branch direction in the OR relationship.
(141) When the condition determination element 117 determines that the predetermined condition is satisfied in
(142) Here, the duplication of a conditional branch is described in more detail before specific description of the duplication of a conditional branch or a reference task.
(143) A reference task including a conditional branch to be duplicated generally includes, in addition to the conditional branch, a set of statements on which the conditional branch is data dependent, i.e., a set of statements for setting conditions to determine the branch directions of the conditional branch. In the present invention, such a set of statements is called a condition setting statement group. Further, a set of statements on which the conditional branch is not data dependent, i.e., a set of statements which can be potentially executed in parallel with the conditional branch may also be included. Similarly, such a set of statements is called a statement group with the potentiality of parallel execution.
(144) Then, when (only) the conditional branch is duplicated and assigned to a different PE together with subsequent tasks (groups) at the time of static scheduling, statements other than the conditional branch in the reference task (the above-described condition setting statement group and statement group with the potentiality of parallel execution) are executed by one PE, and the execution results (data) are transferred to the PE that process the duplicated conditional branch and subsequent tasks. Therefore, in this case, the time required to transfer the execution results is added to the processing time of the parallel program.
(145) Further, when the entire reference task is duplicated, since respective PEs perform processing for the condition setting statement group, data transfer between PEs is unnecessary, and hence the time for data transfer is not added to the processing time of the parallel program. However, since the duplicated reference tasks are executed by all PEs to which the duplicated reference tasks are assigned, this case has a slight disadvantage because power consumption is likely to increase. Further, when there is the statement group with the potentiality of parallel execution in the reference task, and there is no data dependency relationship between the conditional branch and the condition setting statement group, the statement group with the potentiality of parallel execution is executable in parallel with the conditional branch and the condition setting statement group. Therefore, if the statement group with the potentiality of parallel execution is assigned to a PE different from the PE that execute each duplicated reference task, the processing time of the parallel program can be reduced.
(146) Thus, instead of duplicating only the conditional branch, only the conditional branch and the condition setting statement group are set as a new task and duplicated. In this case, since each PE configures the condition settings of the conditional branch in a minimum of time without the need for data transfer, the processing time of the parallel program can be reduced compared with the case where the reference task is duplicated.
(147) In view of such circumstances, in addition to the simple duplication of only the conditional branch, the duplication of the conditional branch and condition setting statement group, and the duplication of the conditional branch caused accordingly by the duplication of the reference task (including the case where there is no statement group with the potentiality of parallel execution) correspond to examples of the duplication of a conditional branch in the present invention.
(148) Here, the description is returned to the description of the duplication of the conditional branch in
(149) For example, in the example illustrated in
(150) The fusion element 113 duplicates the conditional branch included in the basic block BB7 as the reference task illustrated in
(151) Then, the fusion element 113 makes each task or task group, which is control dependent, indirect control dependent, or extended-control dependent on the target branch direction of the target conditional branch, follow any one of the reference tasks (
(152) For example, in
(153) Thus, when there is one task (called a first task in this paragraph) control dependent, indirect control dependent, or extended-control dependent on both of two branch directions (these two branch directions are called a first branch direction and a second branch direction in this paragraph) of conditional branches, the arithmetic processing element 110 may duplicate the first task to generate a second task to make the first task follow the first branch direction of one conditional branch and the second task follow the second branch direction of the other conditional branch. Alternatively, the arithmetic processing element 110 may make the first task follow the first branch direction of one conditional branch and the second task follow the second branch direction of the one conditional branch. Specifically, in the above case, since the basic block BB12 is control dependent on both of the two branch directions, the arithmetic processing element 110 duplicates BB12 to generate one task identical to BB12 included in FTG2 to include the duplicated task in a task group FTG3 subsequent to the other branch direction. Then, the execution of STG1 and the execution of STG2 can be assigned to different processing elements in the multiprocessor system.
(154) This case is described in some detail. As illustrated in
(155) Then, when the branch direction of the conditional branch included in the reference task ST1 in the STG1 is determined to be a branch direction to execute FTG1, the branch direction of the conditional branch included in the reference task ST2 in STG2 is also determined to be a branch direction to execute FTG2, while when the branch direction of the conditional branch included in the reference task ST1 is determined to be another branch direction different from the branch direction to execute FTG1, the branch direction of the conditional branch included in the reference task ST2 is also determined to another branch direction in the same manner to execute FTG3.
(156) Although FTG3 is control dependent on a branch direction different from the target branch direction (the branch direction on which FTG2 is control dependent) of the conditional branch included in the reference task ST2, FTG3 can also be made to follow a branch direction different from the target branch direction (the branch direction on which FTG1 is control dependent) of the conditional branch included in the reference task ST1 of the task group STG1, and even this case does not run counter to the basic contents of the present invention.
(157) Thus, two (parallelly executable) task groups STG1 and STG2 as illustrated in
(158) Although the duplication of a conditional branch is described with reference to
(159) The fusion-target task group determination processing (
(160) After this, the arithmetic processing element 110 returns to the processing in
(161) For example, as described based on
(162) Next, like in the processing flow of
(163) While the present invention (the parallelization processing flows illustrated in
(164) In this case, the basic block BB5 is control dependent on the branch direction BB1.sub.5 of the conditional branch of the basic block BB1, and the basic block BB11 is control dependent on the branch direction BB7.sub.11 of the conditional branch of the basic block BB7.
(165) Like in the case of the MFG of
(166) Using the results of the earliest executable condition analysis, the dependency analysis element 112 generates an MTG (STEP016 in
(167) Here, STEP020 to STEP024 in
(168) As can be seen from the above description, among the four tasks groups FTG1 to FTG4 subsequent to BB7, since FTG1 and FTG2 have no data dependency relationship, both are parallelly executable. Similarly, since FTG3 and FTG4 have no data dependency relationship, both are also parallelly executable. Then, FTG1/FTG2 and FTG3/FTG4 are made to follow respective branch directions in an OR relationship of the conditional branch of BB7.
(169) Therefore, in
(170) Next, the arithmetic processing element 110 (the fusion-target task group determination element 116) returns the processing to
(171) Subsequently, the arithmetic processing element 110 (the condition determination element 117 and the fusion element 113) determines, in
(172) While the fusion-target task groups including the reference task BB7 are described above, the same processing can also be performed on the fusion-target task groups including the reference task BB1. In other words, the MTG in
(173) To summarize the above, the MTG in
(174) (Operational Advantages)
(175) According to the parallelization compiler C of the embodiment, the parallelization compiling apparatus 100 (computer) can apply, to sequential programs represented by the MFGs of
(176) Further, MTGs generated by applying the fusion technique of Patent Literature 1 as the conventional technique to the MFGs in
(177) First, in comparison among
(178) Further, like among
(179) As can be seen from the above summary, according to the parallelization compiler of the present invention, the possibility of extracting parallelly executable tasks is not undermined unlike the conventional technique.
(180) (Variations)
(181) In
(182) (Comparison with Patent Literature 1)
(183) Although the superiority of the present invention over Patent Literature 1 is fully described above, the superiority of the present invention over Patent Literature 1 is described concisely once again with reference to an MFG of a simple sequential program composed of four tasks as illustrated in
(184) Note first that the technique of Patent Literature 1 and the technique of the present invention both have the same purpose in terms of eliminating the need for scheduling processing upon program execution by fusing a conditional branch and all tasks subsequent to all branch directions of the conditional branch, respectively, into one task and assigning the task to one processing element in the multiprocessor system (static scheduling). As described above, since fusing such a task group including the conditional branch into one task makes the conditional branch in the task invisible by the fusion, this is called hiding of the conditional branch.
(185) However, there is a difference between Patent Literature 1 and the present invention in terms of the range of tasks to be fused. The latter has a big advantage of being easy to extract the task group to be fused and any other task having no dependency relationship as parallelly executable tasks. The fundamental principle is described based on the above-described examples.
(186) According to the technique of Patent Literature 1 (the technique described in Description of the Related Art in this specification), task BB101 having processing for branching to different tasks is identified as a start task in the MFG of
(187) Then, in the technique of Patent Literature 1, four tasks, i.e., the identified start task BB101, the identified end task SB104 in the processing using the start task as the start point, and all tasks SB102 and SB103 to be executed after the execution of the start task BB101 and before the execution of the end task SB104 are fused as new one task. In other words, tasks to be fused in Patent Literature 1 are a task group TG1 surrounded by the dot-and-dash line in
(188) On the other hand, according to the present invention, the earliest executable condition analysis in
(189) In the MFG illustrated in
(190) However, since the subroutine block SB104 is executed even though the branch direction of the conditional branch of the basic block BB101 is either the branch direction BB101.sub.102 or the branch direction BB101.sub.103, the subroutine block SB104 is not control dependent and indirect control dependent on the branch directions BB101.sub.102 and BB101.sub.103 of the conditional branch of the basic block BB101. Further, since the subroutine block SB104 is not data dependent on the subroutine blocks SB102 and SB103, the subroutine block SB104 is extended-control dependent on neither of the branch directions BB101.sub.102 and BB101.sub.103 of the conditional branch of the basic block BB101.
(191) Based on the earliest executable condition analysis, the MTG illustrated in
(192) Then, in STEP202 of
(193) On the other hand, as described above, the subroutine block SB104 is not control dependent, indirect control dependent, and extended-control dependent on any of the branch directions of the conditional branch of the basic block BB101. Therefore, the subroutine block SB104 is not included in the task group to be fused in determining the task group to be fused in
(194) Then, in
(195) In comparison between the MTGs after the fusion in
EXPLANATION OF NUMERAL REFERENCES
(196) 100 . . . compiling apparatus, 110 . . . arithmetic processing element, 120 . . . reading device, 130 . . . writing device, 140 . . . storage device, 200 . . . multiprocessor system, 210 . . . centralized shared memory, 220 . . . input/output device, C . . . parallelization compiler, PE1 . . . first processing element, PE2 . . . second processing element, PEn . . . n-th processing element, P1 . . . sequential program, CF . . . configuration file, P2 . . . parallel program, PB . . . binary code.