METHOD, DEVICE, AND SYSTEM FOR CREATING A MASSIVELY PARALLILIZED EXECUTABLE OBJECT

20180095738 · 2018-04-05

    Inventors

    Cpc classification

    International classification

    Abstract

    The present invention provides a method, system and device for optimizing machine code to be executed on a device that comprises one or more busses and a plurality of processing elements. The machine code is configured to execute a task on the device comprising a plurality of subtasks. The method includes the steps of identifying for at least one subtask one or more processing elements from the plurality of processing elements that are capable of processing the subtask, identifying one or more paths for communicating with the one or more identified processing elements, predicting a cycle length for one or more of the identified processing elements and/or the identified paths, selecting a preferred processing element from the identified processing elements and/or selecting a preferred path from the identified paths, and generating the machine code sequence that comprises instructions that cause the device to communicate with the preferred processing element over the preferred path and/or to execute the subtask on the preferred processing element.

    Claims

    1. A method for optimizing machine code to be executed on a device that comprises one or more busses and a plurality of processing elements, wherein the machine code is configured to execute a task on the device comprising a plurality of subtasks, wherein the method comprises the steps of: identifying for at least one subtask one or more processing elements from the plurality of processing elements that are capable of processing the subtask; identifying one or more paths for communicating with the one or more identified processing elements; predicting a cycle length for one or more of the identified processing elements or the identified paths; selecting a preferred processing element from the identified processing elements or selecting a preferred path from the identified paths; and generating a machine code sequence that comprises instructions that cause the device to communicate with the preferred processing element over the preferred path or to execute the subtask on the preferred processing element.

    2. The method of claim 1, wherein the step of identifying one or more processing elements includes the step of: dividing the task into subtasks; identifying logical dependencies of the subtasks; and identifying one or more processing elements from the plurality of processing elements that are capable of processing the subtasks based on the dependencies.

    3. The method of claim 2, wherein for independent subtasks a corresponding number of processing elements are identified, which are capable of processing said subtasks in parallel.

    4. The method of claim 2, wherein subtasks having conditional relationships to each other are converted into parallel sub-tasks constituting single parallel cases to each other.

    5. The method of claim 1, wherein the cycle length for an identified processing element or an identified path is predicted based on at least one of: a branch prediction method, in particular based on former predictions or selections of preferred paths; or a brute force method, wherein the cycle length for each identified path is evaluated.

    6. The method of claim 1, wherein selecting the preferred processing element or selecting the preferred path is based on at least one of: a priority of the task, wherein for small subtasks processing elements or paths with a short cycle length are selected; or a dependency of a subtask, wherein processing elements or paths are selected such that independent or semi-independent subtasks can be carried out in parallel on several processing elements.

    7. The method of claim 1, wherein after the step of generating the machine code or during executing the machine code, the method comprises the further steps of: identifying for at least one other subtask comprised by the task one or more processing elements from the plurality of processing elements that are capable of processing the subtask; identifying one or more paths for communicating with the one or more identified processing elements; predicting a cycle length for one or more of the identified processing elements or the identified paths; and selecting a preferred processing element from the identified processing elements or selecting a preferred path from the identified paths.

    8. The method of claim 1, wherein the cycle length for an identified processing element or an identified path is predicted based on at least one of: a predicted forward transfer time for transferring an instruction and input data to the identified processing element on the identified path; a predicted return transfer time for transferring output data from the identified processing element on the identified path; or a predicted processing time for processing a subtask on the identified processing element.

    9. The method of claim 8, wherein the predicted cycle length is a sum of the predicted forward transfer time, the predicted return transfer time and the predicted processing time.

    10. The method claim 1, wherein predicting the cycle length is based on at least one of: a current availability or utilization of the one or more busses; or the current availability or utilization of the one or more identified processing elements.

    11. The method of claim 1, wherein the method further comprises the steps of: beginning processing of a subtask on the selected processing element; updating the predicted cycle length of the subtask to obtain a predicted remaining cycle length of the subtask; cancelling the processing of the subtask on the selected processing element when it is determined that the predicted remaining cycle length is higher than a predicted cycle length for processing the subtask in a different processing element; and assigning the subtask to said different processing element.

    12. The method of claim 1, wherein the method further comprises the steps of: determining a threshold time for the processing of a subtask; beginning processing of the subtask on the selected processing element; checking whether the actual processing time for the subtask is higher than the threshold time; cancelling the processing of the subtask if the actual processing time is higher than the threshold time; and assigning the subtask to a different processing element.

    13. A device, comprising: one or more busses; one or more control elements; a plurality of processing elements; wherein at least one of the control elements is configured to generate a machine code configured to execute a task on the plurality of processing elements in parallel; a first multicore processor comprising one or more first processing elements and at least one control element; at least one second multicore processor comprising one or more second processing elements, wherein the first and second multicore processors are located on a first board and being connected to each other by a point to point cable or a board to board connection; and at least one third multicore processor comprising one or more third processing elements being connected to the first multicore processor via a Field Programmable Gate Array (FPGA).

    14. The device of claim 13, wherein the device further comprises one or more fourth processing elements being connected to the Field Programmable Gate Array (FPGA) via a network.

    15. The device of claim 13, wherein the Field Programmable Gate Array (FPGA) is configured to realize a communication between the at least one third multicore processor and the first multicore processor.

    16. The device of claim 13, wherein each of the first, second and third multicore processors is connected to the Field Programmable Gate Array (FPGA) via at least one respective XIO link.

    17. The device of claim 13, wherein the multicore processors comprise each a ring bus.

    18. The device of claim 13, wherein the task comprises a plurality of subtasks, and wherein the device for generating the machine code is configured to: identify for at least one subtask one or more processing elements from the plurality of processing elements that are capable of processing the subtask; identify one or more paths for communicating with the one or more identified processing elements; predict a cycle length for one or more of the identified processing elements or the identified paths; select a preferred processing element from the identified processing elements or selecting a preferred path from the identified paths; and generate the machine code sequence that comprises instructions that cause the device to communicate with the preferred processing element over the preferred path or to execute the subtask on the preferred processing element.

    19. The device of claim 13, wherein at least one of the control elements is configured to predict a cycle length based on at least one of: a predicted forward transfer time for transferring an instruction and input data to the processing element; a predicted return transfer time for transferring output data from the processing element; or a predicted processing time for processing the subtask in a processing element.

    20. The device of claim 13, wherein at least one of the control elements is configured to: begin execution of the subtask on the selected processing element; update a predicted cycle length of the subtask to obtain a predicted remaining cycle length of the subtask; cancel the processing of the subtask on the selected processing element when it is determined that the predicted remaining cycle length is higher than a predicted cycle length for processing the subtask in a different processing element; and re-assign the subtask to said different processing element.

    21. The device of claim 13, wherein the device further comprises one or more busy tables comprising information about capabilities or current availability or utilization of the plurality of processing elements, wherein at least one of the control elements is configured to regularly update the information in the one or more busy tables.

    22. The device of claim 13, further comprising a prediction module that is configured to predict future subtasks based on previously processed subtasks.

    23. The device of claim 22, wherein the device is configured to cancel one or more predicted future subtasks in favor of executing current subtasks if one or more new subtasks arrive after beginning execution of one or more predicted future subtasks.

    24. The device of claim 13, wherein the one or more busses, the one or more control elements, and at least some of the plurality of processing elements are located inside the same chip housing.

    25. The device of claim 13, wherein the device is disposed within a server system.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0118] FIG. 1 shows a schematic device;

    [0119] FIG. 2 shows a schematic representation of a bus system with ring structure;

    [0120] FIG. 3a shows a schematic representation of a bus system with two ring busses according to the invention;

    [0121] FIG. 3b shows a further schematic representation of a bus system according to the invention;

    [0122] FIG. 4 shows a schematic representation of a further bus system, with indicated pointers to current and future active elements;

    [0123] FIG. 5 shows a schematic representation of a further bus system;

    [0124] FIG. 6 shows a schematic representation of a bus system with TDMA structure that operates bi-directional;

    [0125] FIG. 7 shows a schematic representation of a bus system with TDMA structure with branches and that operates bi-directional;

    [0126] FIG. 7a shows a schematic representation of the bus system of FIG. 7, with a global token in a primary branch;

    [0127] FIG. 7b shows a schematic representation of the bus system of FIG. 7, with a global token in a secondary branch and optionally a local token in a different secondary branch;

    [0128] FIG. 8 shows a schematic representation of a bus system with TDMA structure that operates bi-directional in which not all but some elements share the same busses; and

    [0129] FIG. 9 shows a schematic representation of a further bus system with ring structure.

    DETAILED DESCRIPTION

    [0130] FIG. 1 shows a schematic representation of a device which comprises a bus system. The bus system comprises a plurality of multicore processors 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142. At least one of these processors, e.g. the processor 120 which is indicated as CPU 1 Master comprises a control element (not shown). This control element is adapted to generate a machine code which is configured to be executed on a plurality of further processing elements. The control element is one processor core of a multicore processor 120. Consequently, the further processor cores (not shown), e.g. 8 further cores, constitute processing elements. The cores of the multicore processor 120 are connected by a ring bus.

    [0131] The multicore processor 120 is located on the same PCB board 140 as the further multicore processor 122. The multicore processors 120 and 122 are directly communicating with each other. In order to do so, they are connected by copper wires located on the same PCB 140 board like the processors 120 and 122.

    [0132] The processors 120 and 122 are connected each to a Fabric 110. The Fabric 110 comprises an FPGA located on a PCB which is preferably separate to the PCB 140. The connections between the FPGA and the processors 120, 122 are XIO links. Such an XIO link can be a packet-based, high-performance computer bus. Preferably, a specific protocol is running on the XIO link, which is configured to support the (e.g. which supports the parallelisation). The XIO links comprise serialized General-purpose inputs/outputs (GPIOs). The FPGA is configured to deserialize these GPIOs. Accordingly the FPGA has a Serializer/Deserializer (SerDes) function. The Serializer/Deserializer (SerDes) function can comprise a pair of functional blocks used for high-speed communications to compensate for limited input/output. These blocks can convert data between serial data and parallel interfaces in each direction. The SerDes advantageously provides data transmission over a single/differential line, in order to minimize the number of I/O pins and interconnects.

    [0133] The connection 150 between the processors 120, 122 and the Fabric 110 comprises one or more copper cables. Moreover, there is usually arranged an additional PCB connector element (not shown) in the connection 150, i.e. between the PCB, on which the processors are located, and the Fabric 110. The PCB connector element has the function to combine the connections (i.e. the cables) of the processors 120 and 122.

    [0134] The processors 120 and 122 form a massively parallel processor array (MPPA). The method is applied on this MPPA. Hence, the MPPA constitutes a device (i.e. a bus system). The bus system comprising the two processors 120 and 122 can correspond to that one shown in FIG. 3a.

    [0135] Said MPPA is furthermore connected via the Fabric 110 to further two or more MPPAs comprising the multicore processors 124, 126, 128 and 130. Each of the processors 124, 126, 128 and 130 is connected to the FPGA by a XIO link, as described above in context of processors 120, 122.

    [0136] Each of the MPPAs can carry out the method according to the invention, e.g. independently from each other. However, it is also possible that the method is carried out on the combination of the three MPPAs in a unified manner. Hence, the combination of the MPPAs can also constitute a device (i.e. a bus system) according to the invention. In this case the multicore processor 120 can be the master CPU, i.e. it comprises one core which is the control element according to the invention.

    [0137] The processors 120 to 130 can be Cell processors. Of course, it is also possible, that such a combination of MPPAs comprises different processors. Hence, the different performance of the different processors and the respectively different processor cores (i.e. processing elements) and/or the different costs of transport due to different connections between the processors can be considered when optimizing the machine code according to the invention.

    [0138] Furthermore, the combination of MPPAs is connected via a network 120, e.g. a 40 Gbit optical fiber, to further Fabrics, as e.g. Fabric 160. These Fabrics are again connected to further MPPAs, which correspond to those explained above. However, also different MPPAs or computing entities could be connected via the network 120. It is possible that the method is carried on the network lever, i.e. on the combination of all MPPAs connected to the network in a unified manner. The complete entity of elements shown in FIG. 1 preferably forms a blade system.

    [0139] FIG. 2 shows a schematic representation of a bus system 210 with a ring topology. The multicore processors 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142 of FIG. 1 can comprise each such a bus system. The bus system 210 comprises a first ring bus 212 which is adapted to transport instructions and data in a counter-clockwise direction and a second bus 214 which is adapted to transport instructions and data in a clockwise direction. Attached to the busses 212 and 214 is a processing core 220, which acts as a control element. Furthermore, there is a plurality of elements of various functionality 222-234 connected to the busses 212, 214. The elements 222-234 comprise a random access memory (RAM) 222, a flash memory 224, a mass storage controller 226, a network interface controller 228, an I2C bus 230, a Peripheral Component Interconnect Express bus (PCIe) 232 and further miscellaneous devices 234.

    [0140] The ring busses 212, 214 are set up as direct connections between the connected elements, operated in a time-shifted manner. For the system of FIG. 2, the elements 220-234 are connected to both busses 212, 214. There are, however, no direct connections between the busses. Similarly, the systems shown in FIG. 5 and FIG. 9 do not comprise any direct connections between the busses. In other examples of the invention, the busses can comprise direct connections.

    [0141] Successively, the connected elements are allowed to write, i.e., the active status is passed from one element to the next and read or write operations can only be performed by the element that is active at a given point in time. In some examples, more than one subtask can be transported in one clock cycle. Also, more than one dataset can be attached to one subtask (SIMD). Depending on the number of bus rings, the number of connected elements and the starting position and direction of the pointer, it can happen that more than one ring addresses the same element at one point in time. For this case, a FIFO buffer can be provided that absorbs the additional instructions and data. In FIG. 2, the FIFO buffer 235 is shown only for the other miscellaneous element 234, but in a similar way, FIFO buffers can be provided for all processing elements.

    [0142] FIG. 3a shows a schematic representation of a bus system 310 comprising two rings 312, 314. Each of the rings is provided by a multicore processor, e.g. a cell processor. The multicore processor comprising the bus system 312 contains the control element according to the invention. In this example, the two rings 312, 314 are connected via a FIFO buffer 320. Hence, the multicore processors comprising the respective rings 312, 314 can communicate directly with each other. Alternatively the two rings 312, 314 are connected via an FPGA 320 and can communicate indirectly via this FPGA. The bus system 310 can be adapted to carry out the method according to the invention.

    [0143] FIG. 3b shows a schematic representation of a bus system 340 comprising four ring busses 312, 314, 352 and 354. Accordingly, the bus system 340 comprises the bus system 310, as shown in FIG. 3a. Each ring bus 312, 314, 352 and 354 constitutes a multicore processor, e.g. a cell processor. A plurality of different elements can be connected to the ring busses, as e.g. Inter-Integrated Circuits (I2C), Network interface controllers (NIC), Random-access memories (RAMs), Storages, buffers, as e.g. First In, First Out buffers (FIFO, etc.). The elements of the ring busses (e.g. the RAMs, storages, NICs, I2C, etc.) are only examples. Hence there can also be more, less or different elements connected to the ring busses and/or comprised by the multicore processors.

    [0144] The bus system 340 comprises two sets, each comprising two multicore processors, i.e. the set of processors 312, 314 (as also shown in FIG. 3a) and the set of processors 352, 354. The processors of one set are directly communicating with each other, as explained above, over respective FIFO buffers 312 and 362. The sets are connected to each other by (and preferably communicating via) an FPGA 361. In order to do so, each processor is connected by a respective XIO link 363, 364, 365 and 366 to the FPGA 361. However, the each link 363, 364, 365 and 366 may also comprise a plurality of XIO links, preferably in parallel. The interface between each of the processors and the respective XIO link 363, 364, 365 and 366 is provided by FIFO buffers of the processors.

    [0145] One of the processors, e.g. the processor 312 can comprise the control element according to the invention. The two sets together, i.e. the four processors 312, 314, 362, 364, can be adapted to carry out the method. In other words, when carrying out the method according to the invention, each processing element (i.e. processor core) comprised by the four processors 312, 314, 362, 364 can be used in parallel.

    [0146] However, the bus system 340 preferably comprises still a third set (not shown), wherein the third set corresponds to the first and second set shown in FIG. 3b and explained above. Accordingly, the three sets together, i.e. the six processors comprised by the three sets, can be adapted to carry out the method. Of course, in the same way also four, five, six or even more sets can be connected together, wherein also each set can comprise more than two processors, in order to carry out together the method according to the invention.

    [0147] The FPGA 361 additionally comprises at least one transceiver and/or General Purpose Input/Output (GPIO) which can provide a network interface 367. The interface 367 can be a small form-factor pluggable (SFP) or an enhanced small form-factor pluggable (SFP+), e.g. a Quad Small Form-factor Pluggable (QSFP or QSFP+), or a plurality of unkeyed 8P8C modular connectors (e.g. RJ45). The interface 367 might also comprise a combination of enhanced small form-factor pluggable (SFP+) and unkeyed 8P8C modular connectors (e.g. RJ45). Alternatively, the FPGA can also be connected to at least one further FPGA (not shown) which comprises the network interface as described above.

    [0148] FIG. 4 shows a schematic representation of a ring bus 412, wherein the pointer to the current active element is indicated as P0 and pointers to the next active elements are indicated as P1 to P7. In this example, a processing core 420, which acts as a control element, a RAM 422, a Flash 424, a storage 426, an NIC 428, an I2C bus 430, a PCIe 432 and other elements 434 are connected to the ring bus 412, wherein the other elements 434 are connected to the ring bus 412 via a FIFO buffer 435. The ring bus 412 is configured to transport data in a clock-wise direction and also the pointer passes through the ring in a clockwise direction. In the shown example, the elements are separated by the distance of one clock cycle. Other examples may provide that the pointer position passes through the ring with different time increments which may or may not be equal in length. The forwarding of the pointer can be decided e.g. based on static priorities that are assigned to the different elements.

    [0149] FIG. 5 shows a schematic representation of a further bus system 510 in accordance with the present invention.

    [0150] The operation of the invention shall be illustrated with the following example: Assuming that primary processing element 520a acts as a control element and sends a subtask that can be processed on one of the secondary processing elements 536-550: According to a prior art processing method, based on a previous successful result stored in one of the lookup tables, the subtask would be sent to secondary processing element 540 using first ring 512, which requires 14 clock cycles. After processing in the secondary processing element 540, which requires 4 clock cycles, the output data would be returned to primary processing element 520a on the first ring 512, which takes another 3 clock cycles. It takes a further 13 clock cycles before the active slot is returned to the primary processing element 520a. This yields a total cycle time of 14+4+13+3=34 clock cycles. According to the present invention, ideally it would be determined that the predicted cycle time is only 3+4+0+3=10 clock cycles if the subtask is sent to the secondary processing element 540 via the second ring 514, and returned to the primary processing element 520a via the first ring 512 without any bus waiting time because by set-up the ring 514 may have an exact matching offset to ring 512. In this example, the method according to the present invention leads to a reduction of the cycle time to less than a third of the cycle time according to the prior art approach.

    [0151] The n connected elements correspond to n different pointer positions.

    [0152] FIG. 6 shows a schematic representation of a further bus system 610 in accordance with the present invention. The bus system 610 is set-up with two bi-directional busses 612, 614 with a linear topology and a time division multiple access scheme. FIG. 6 shows three elements 620, 622, 640 that are connected to both linear busses 612, 614. In general, there can be a number of n such elements connected to both busses. Several of these elements can act as control elements, with the other elements acting as processing elements, controlled by the control elements. In addition to the control and processing elements other elements e.g. a RAM controller could be connected to the busses 612, 614, too.

    [0153] Alternatively, the bus system 610 can also be set up using a token passing scheme where the token is passed from one station to the next, wherein the next station is defined based on the addresses of the bus interfaces of the elements connected to the bus.

    [0154] In a further example of the invention, the pointer can be pushed or pulled by a connected control element to receive or send data to or from any other connected element.

    [0155] FIG. 7 shows a schematic representation of a non-exclusive bus system 710 that comprises three linear parts 712a, 712b, 712c that are connected through a branch 713. Connected to the bus system 710 are: two control elements 720a, 720b and a RAM 722 that are connected to the first linear part 712a, two processing elements 730, 732 that are connected to the second linear part 712b and two processing elements 740, 742 that are connected to the third linear part 712c of the bus system 710. In addition to second and third linear part 712b, 712c shown in FIG. 7, there can be any number of additional linear parts which are also connected to the first linear part 712a. These additional linear parts can comprise the same number of connected elements.

    [0156] For example the RAM component 722 has a total of three physical neighbors: control element 720b, processing element 730 of the second part 712b and processing element 740 of the third part 712c. Therefore, access to this bus system 710 should be managed with a token passing scheme where the neighbor relations are defined based on the addresses of the connected elements. It should be noted that linear parts 712b and 712c can be active at the same time. Temporary or second-level tokens are used to assign the active slot within one linear part. Knowledge about the current state and the predicted future availability of the linear parts can be used by the cycle prediction method and by the decision which processing elements the subtasks are assigned to.

    [0157] In a preferred example, to allow for the use of more than one token per bus 712a,b,c there is a primary branch part and a plurality of secondary branch parts. This is illustrated in FIGS. 7a and 7b, where the first linear part 712a forms a primary branch and second and third linear part 712b, 712c form a secondary branch part.

    [0158] To avoid conflicts, there can only be one global token 750 which always has traversing priorities. The global token 750 is indicated in FIGS. 7a and 7b as a big star, the local token 752 as a small star. If the global token 750 is present on the primary branch part, as shown in FIG. 7a, there cannot be any local tokens on any of the secondary branch parts. However, if the global token 750 is present on one of the secondary branch parts, as shown in FIG. 7b, it is possible to allow for local tokens 752 in all or some of the other secondary branch parts which cannot leave their individual secondary branch parts.

    [0159] FIG. 8 shows a schematic representation of a non-exclusive bus system 810 comprising two bi-directional busses 812, 814. A first control element 820a, a second control element 820b and a RAM 822 are connected to both the first bus 812 and the second bus 814. A number of n processing elements 830, 832 are connected only to the second bus 814 and a number of n processing elements 840, 842 are connected only to the first bus 812. This set-up can be repeated n times such that there is a total of m*n processing elements connected to the bus system. The set-up shown in FIG. 8 has the advantage that for example, the communication between the control elements 820a, 820b and RAM 822 can occur both through the first bus 812 and the second bus 814. This enables a total bandwidth that is twice as high compared to the bandwidth for communicating with the processing elements 830, 832, 840, 842, which may be accessed less often than the RAM. In this way, the architecture is adapted to the typical load scenarios. Another benefit is that communication with more than one SPE can occur simultaneously.

    [0160] Access to the busses 812, 814 can be implemented with a simple time division multiple access scheme. Alternatively, for example a token passing scheme or a combination of the two can be used.

    [0161] With regard to the examples explained above, it has to be noted that said examples may be combined with each other. Furthermore, it is understood, that the bus systems shown in the drawings can comprise further elements and further busses that are not shown in the drawings. In particular, branches as shown in FIG. 7 could also connect ring busses with linear parts. Furthermore, different busses that are connected via a bridge or share at least one element could use different access schemes.

    [0162] FIG. 9 shows a schematic representation of a non-exclusive bus system 910 comprising ring busses 912, 214 with a processing core 920 and a RAM 922 connected to them. Furthermore, the processing core 920 and the RAM 922 are connected through a direct connection 921. Further elements can be connected to the ring busses 912, 914, but are not shown in FIG. 9.

    [0163] It should be noted that in other examples of the invention, the ring busses shown in FIGS. 1 and 9 can also be implemented with access protocols where the active time slot is passed from a first element to the next element when the first element is finished with accessing the bus. This can be implemented e.g. as a token ring access scheme, where an element passes the token to the next element when it has finished accessing the bus.