PROCESSING DEVICE FOR A PARALLEL COMPUTING SYSTEM AND METHOD FOR PERFORMING COLLECTIVE OPERATIONS
20230054136 · 2023-02-23
Inventors
- Solal AMOUYAL (Hod Hasharon, IL)
- Alex MARGOLIN (Hod Hasharon, IL)
- Ben-Shahar Belkar (Hod Hasharon, IL)
CPC classification
G06F9/5066
International classification
PHYSICS
Abstract
The disclosure relates to a parallel computing system comprising a plurality of processing devices for performing an application. Each processing device is configured to obtain a local result, wherein a global result of a collective operation depends on the local results of the plurality of processing devices, and to distribute its local result to one or more of the other processing devices in response to determining that the global result is based only on the local result of the processing device, that a likelihood that the global result is based only on the local result of the processing device is greater than a likelihood threshold value, or that the global result is based only on the local result of the processing device and a further local result of a further processing device of the plurality of processing devices.
Claims
1. A processing device for a parallel computing system, wherein the parallel computing system comprises a plurality of processing devices for performing an application, including one or more collective operations, wherein the processing device is configured to: obtain a local processing result, wherein a global processing result of a collective operation depends on the local processing results of the plurality of processing devices; and distribute the local processing result of the processing device to one or more of the other processing devices in response to determining that: (a) the global processing result of the collective operation is based only on the local processing result of the processing device; (b) a likelihood that the global processing result of the collective operation is based only on the local processing result of the processing device is greater than a likelihood threshold value; or (c) the global processing result of the collective operation is based only on the local processing result of the processing device and a further local processing result of a further processing device of the plurality of processing devices.
2. The processing device of claim 1, further configured to broadcast the local processing result of the processing device to the other processing devices in response to determining that the global processing result of the collective operation is based on the local processing result of the processing device.
3. The processing device of claim 2, wherein the collective operation is a logical or bitwise “AND” operation or a logical or bitwise “OR” operation.
4. The processing device of claim 1, wherein the collective operation is a logical or bitwise “XOR” operation, and wherein the processing device is further configured to broadcast the local processing result of the processing device to the other processing devices in response to determining that the global processing result of the collective operation is based only on the local processing result of the processing device and the further local processing result of the further processing device.
5. The processing device of claim 4, further configured to receive the further local processing result from the further processing device and to perform the logical or bitwise “XOR” operation based on the local processing result of the processing device and the further local processing result of the further processing device.
6. The processing device of claim 1, further configured to distribute the local processing result of the processing device to a selected subset of the other processing devices for performing the collective operation with the selected subset of the other processing devices, wherein, for each processing device of the selected subset of the other processing devices, a likelihood that the global processing result of the collective operation is based only on the local processing result of the processing device is greater than a likelihood threshold value.
7. The processing device of claim 6, wherein the parallel computing system is configured to adjust the selected subset during a run-time of the application.
8. The processing device of claim 6, further configured to store, for each collective operation of the application, at least one of the global processing result of the collective operation or an identifier of the processing device providing the global processing result of the collective operation.
9. The processing device of claim 8, further configured to determine the likelihood that the global processing result of the collective operation is based only on the local processing result of the processing device based on a comparison between the local processing result and one or more global processing results stored for one or more preceding collective operations of the application.
10. The processing device of claim 6, wherein the collective operation is a maximum operation for obtaining a maximum of the local processing results of the plurality of processing devices or a minimum operation for obtaining a minimum of the local processing results of the plurality of processing devices.
11. The processing device of claim 6, further configured to execute an iterative loop and to terminate the iterative loop based on a conditional statement depending on the global processing result of the collective operation.
12. The processing device of claim 11, further configured to broadcast the local processing result of the processing device to the other processing devices in response to determining that the global processing result of the collective operation is based only on the local processing result of the processing device and the global processing result of the collective operation triggers the processing device to terminate the iterative loop.
13. The processing device of claim 11, further configured to store, for the iterative loop, at least one of a number of iterations before terminating the iterative loop or a threshold value defined by the conditional statement for terminating the iterative loop.
14. The processing device of claim 13, further configured to: determine a likelihood that the conditional statement of the iterative loop is fulfilled in a further iteration of the iterative loop; broadcast the local processing result of the processing device to the other processing devices; and terminate the iterative loop in response to determining that the likelihood that the conditional statement is fulfilled in a further iteration of the iterative loop is greater than a likelihood threshold value.
15. The processing device of claim 14, further configured to determine the likelihood that the conditional statement of the iterative loop is fulfilled in a further iteration of the iterative loop based on the stored number of iterations for terminating one or more preceding iterative loops.
16. The processing device of claim 14, further configured, in case the conditional statement of the iterative loop is not fulfilled in a further iteration of the iterative loop, to continue executing the iterative loop in response to determining that the likelihood that the conditional statement is fulfilled in a further iteration of the iterative loop is greater than a likelihood threshold value.
17. The processing device of claim 11, wherein the collective operation is a sum operation of the plurality of local processing results.
18. A parallel computing system comprising a plurality of processing devices, wherein each processing device in the plurality of processing devices is configured to: obtain a local processing result, wherein a global processing result of a collective operation depends on the local processing results of the plurality of processing devices; and distribute the local processing result of the processing device to one or more of the other processing devices of the plurality of processing devices in response to determining that: (a) the global processing result of the collective operation is based only on the local processing result of the processing device; (b) a likelihood that the global processing result of the collective operation is based only on the local processing result of the processing device is greater than a likelihood threshold value; or (c) the global processing result of the collective operation is based only on the local processing result of the processing device and a further local processing result of a further processing device of the plurality of processing devices.
19. The parallel computing system of claim 18, wherein the plurality of processing devices are configured to define a tree topology.
20. A method for performing an application, including one or more collective operations, in a parallel computing system having a plurality of processing devices, wherein, for each processing device, the method comprises: obtaining a local processing result, wherein a global processing result of a collective operation depends on one or more of the local processing results of the plurality of processing devices; and distributing the local processing result of the processing device to one or more of the other processing devices in response to determining that: (a) the global processing result of the collective operation is based only on the local processing result of the processing device; (b) a likelihood that the global processing result of the collective operation is based only on the local processing result of the processing device is greater than a likelihood threshold value; or (c) the global processing result of the collective operation is based only on the local processing result of the processing device and a further local processing result of a further processing device of the plurality of processing devices.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings.
[0042] In the following, identical reference signs refer to identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF EMBODIMENTS
[0043] In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
[0044] For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
[0046] As illustrated in
[0047] A collective operation is a concept in parallel computing, according to which data is simultaneously sent to or received from many processing nodes, such as the processing devices 201 of the parallel computing system 200. A “broadcast operation” is an example of a collective operation for moving data among the plurality of processing devices 201. A “reduce operation” is an example of a collective operation that executes arithmetic or logical functions on data distributed among the plurality of processing devices 201. In an embodiment, the parallel computing system 200 may implement the Message Passing Interface (MPI) framework, i.e. a well-known parallel communications framework for providing data communications between the plurality of processing devices 201 of the parallel computing system 200. Although in the following MPI terminology may be used for ease of explanation, MPI as such is not a requirement or limitation of the various embodiments disclosed herein.
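The broadcast and reduce semantics described above can be sketched with a purely illustrative toy model (no actual MPI library is used; the function names `toy_reduce` and `toy_broadcast` are assumptions for illustration and do not appear in the disclosure):

```python
from functools import reduce
import operator

def toy_reduce(local_results, op):
    """Toy model of a reduce operation: combine the local processing
    results of all processing devices into one global processing result."""
    return reduce(op, local_results)

def toy_broadcast(root_value, num_devices):
    """Toy model of a broadcast operation: the root's value is
    delivered to every processing device."""
    return [root_value] * num_devices

# A logical "AND" reduce over the local results of four devices:
global_result = toy_reduce([True, True, False, True], operator.and_)
# Every device then receives the global result via a broadcast:
received = toy_broadcast(global_result, 4)
```

In a real MPI deployment these two steps would correspond to reduce and broadcast collectives; the toy model only mirrors their data flow.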
[0048] As will be described in more detail further below, generally, a processing device 201 of the parallel computing system 200 is configured to obtain a local processing result, wherein a global processing result of the collective operation depends on the local processing results of the plurality of processing devices 201. Thus, as used herein, the global processing result is the final result of the collective operation, whereas a local processing result is a result initially known to, i.e. available at, the respective processing device 201 only. For instance, each processing device 201 may be configured to perform a local data processing operation for obtaining the local processing result. A local processing result may be, for instance, an integer value, a real value, a logical “TRUE” or “FALSE” value or the like.
[0049] The processing device 201 of the parallel computing system 200 is further configured to distribute the local processing result of the processing device 201 to one or more of the other processing devices 201, if one of the following conditions is met: (a) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by, the local processing result of the processing device 201 irrespective of the local processing results of the other processing devices 201; (b) a likelihood that the global processing result of the collective operation is based only on, i.e. uniquely determined by, the local processing result of the processing device 201 is larger than a likelihood threshold value; or (c) it is certain that the global processing result of the collective operation is based only on, i.e. uniquely determined by, the local processing result of the processing device 201 and a further local processing result of a further processing device of the plurality of processing devices 201, irrespective of the local processing results of the other processing devices 201. The processing device 201 is accordingly configured to check whether one of these conditions (a), (b) or (c) is met.
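Condition (a) can be made concrete for operations that have an absorbing element, such as the logical “AND” and “OR” operations mentioned in claim 3. The following minimal sketch is an assumption for illustration (the names `ABSORBING` and `determines_global` are not from the disclosure):

```python
# Absorbing (dominating) elements: a single local result equal to this
# value uniquely determines the global result, regardless of all other
# local processing results.
ABSORBING = {
    "AND": False,   # False AND x == False for any x
    "OR": True,     # True OR x == True for any x
}

def determines_global(op, local_result):
    """Check condition (a): does this local result alone fix the
    global processing result of the collective operation?"""
    return op in ABSORBING and local_result == ABSORBING[op]
```

Operations such as “XOR”, sum, maximum or minimum have no absorbing element, so for them condition (a) never holds and the likelihood-based condition (b) or the pairwise condition (c) applies instead.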
[0050] In the embodiment shown in
[0051] Instead of distributing the local processing results, e.g. the “TRUE” or “FALSE” values, each processing device, i.e. node 201 of the parallel computing system 200 of
[0052] In case the processing device P3 knows that the global processing result of the collective operation is based only on, i.e. uniquely determined by its local processing result, the processing device P3, as illustrated in
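The early-broadcast scheme for a logical “AND” reduce, as described in the preceding paragraphs, can be simulated as follows. This is a hedged sketch under the assumption that one broadcast to all other devices replaces the full reduce-then-broadcast exchange; the function name `early_and_reduce` and the message accounting are illustrative, not part of the disclosure:

```python
def early_and_reduce(local_results):
    """Simulate the early-distribution scheme for a logical "AND" reduce:
    a device whose local result is False knows the global result is False
    and broadcasts it immediately. Returns (global_result, deciding_rank,
    messages_sent) so the saved communication can be compared."""
    n = len(local_results)
    for rank, value in enumerate(local_results):
        if value is False:
            # One broadcast from the deciding device to the n-1 others.
            return False, rank, n - 1
    # No device could decide alone: fall back to a conventional
    # gather-then-broadcast reduce (roughly 2*(n-1) messages).
    return all(local_results), None, 2 * (n - 1)
```

In the four-device example above, the device holding the “FALSE” value decides the global result with three messages instead of the roughly six of a conventional reduce and broadcast.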
[0053] Thus, the embodiment shown in
[0054] In another embodiment, the plurality of processing devices 201 of the parallel computing system 200 are configured to perform a reduce or all-reduce operation in the form of a logical or bitwise “XOR” operation, such as the MPI_LXOR and the MPI_BXOR operation. In this embodiment, the processing device 201 is configured to broadcast its local processing result, such as a “TRUE” or “FALSE” value, to the other processing devices 201, if it is certain that the global processing result is based only on, i.e. uniquely determined by, the local processing result of the processing device and a further local processing result of a further processing device 201. For instance, in the example shown in
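Because “XOR” is associative and commutative, the pairwise combination of claims 4 and 5 can be folded until one global result remains. The following sketch assumes a simple pairing scheme (the name `pairwise_xor_reduce` is illustrative and not from the disclosure):

```python
def pairwise_xor_reduce(local_results):
    """Sketch of a pairwise "XOR" reduction: each device combines its
    local result with the further result of one other device (claim 5),
    and the partial results are folded until one global result remains.
    XOR is associative, so the pairing order does not affect the result."""
    values = list(local_results)
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(values[i] ^ values[i + 1])
        if len(values) % 2:           # an odd device carries its value forward
            paired.append(values[-1])
        values = paired
    return values[0]
```

The same code works for the logical case (booleans) and the bitwise case (integers), mirroring the MPI_LXOR/MPI_BXOR distinction.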
[0056] As can be taken from
[0057] In case a processing device 201, such as, by way of example, the processing device P4 illustrated in
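For a maximum operation, the likelihood of condition (b) can be estimated from the global processing results stored for preceding collective operations (claims 8 and 9). The sketch below uses the fraction of stored global maxima that the current local result meets or exceeds as a simple likelihood estimate; this estimator, the threshold value 0.8, and the function names are assumptions for illustration, not prescribed by the disclosure:

```python
def likelihood_local_is_global_max(local_result, past_global_results):
    """Estimate the likelihood that this device's local result will again
    be the global maximum, based on the global processing results stored
    for preceding collective operations of the application."""
    if not past_global_results:
        return 0.0
    hits = sum(1 for g in past_global_results if local_result >= g)
    return hits / len(past_global_results)

def should_distribute(local_result, past_global_results, threshold=0.8):
    """Condition (b): distribute early only when the estimated
    likelihood exceeds the likelihood threshold value."""
    return likelihood_local_is_global_max(local_result, past_global_results) > threshold
```

A real system could replace this frequency count with richer statistics; the point is only that stored past global results drive the early-distribution decision.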
[0059] In many parallel computing applications, iterative schemes form the core of the overall algorithm. During the lifetime of a specific parallel computing application, the number of iterations varies only mildly. Therefore, if a specific parallel computing application requires on average about 100 iterations at each step, it is very unlikely to require fewer than 80 at a specific step. Conventionally, however, a blocking reduction operation is used at every iteration nonetheless, even though the outcome of the convergence test can be predicted with near-perfect certainty for the first 80 iterations. Furthermore, the local error value of a single process can be high enough to invalidate the global convergence test. This case is illustrated in
[0061] As can be taken from
[0062] Instead of distributing the local processing results, e.g. the real values, each processing device 201 of the parallel computing system 200 of
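The iterative scheme with a sum reduction and a convergence test (claims 11 and 17) can be sketched as follows. The error-halving step stands in for an unspecified real computation, and all names and the tolerance value are illustrative assumptions:

```python
def run_iterative_solver(initial_errors, decay=0.5, tolerance=1e-3):
    """Sketch of the iterative loop: each iteration every device updates
    its local error (here simply multiplied by a decay factor as a
    stand-in for real computation), a sum-reduce yields the global error,
    and the loop terminates once the conditional statement
    global_error < tolerance is fulfilled."""
    errors = list(initial_errors)
    iterations = 0
    while sum(errors) >= tolerance:      # blocking sum-reduce per iteration
        errors = [e * decay for e in errors]
        iterations += 1
    return iterations, sum(errors)
```

In this conventional form, every iteration pays for a blocking reduce; the embodiments below aim to avoid exactly this cost for the many iterations whose outcome is predictable.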
[0065] In an embodiment, each processing device 201 is configured to store for the iterative loop the number of iterations before terminating the iterative loop and/or a threshold value defined by the conditional statement for terminating the iterative loop. Moreover, in an embodiment, each processing device 201 may be configured to determine the likelihood that the conditional statement of the iterative loop is fulfilled in a further iteration of the iterative loop, wherein the processing device 201 is configured to broadcast the local processing result of the processing device 201 to the other processing devices 201 and to terminate the iterative loop, if the likelihood that the conditional statement is fulfilled in a further iteration of the iterative loop is larger than a likelihood threshold value. Thus, in an embodiment, the parallel computing system 200 including the plurality of processing devices 201 may implement a branch prediction algorithm. In an embodiment, each processing device 201 is configured to determine the likelihood that the conditional statement of the iterative loop is fulfilled in a further iteration of the iterative loop based on the number of iterations recorded for terminating one or more preceding iterative loops. In an embodiment, in case the conditional statement of the iterative loop is not fulfilled in a further iteration of the iterative loop, each processing device 201 may be further configured to continue executing the iterative loop, if the likelihood that the conditional statement is fulfilled in a further iteration of the iterative loop is larger than the likelihood threshold value.
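The likelihood that the conditional statement is fulfilled in a further iteration (claims 13 to 15) can be estimated from the stored iteration counts of preceding iterative loops. The following minimal sketch assumes an empirical-frequency estimator; the function name and estimator choice are illustrative, not mandated by the disclosure:

```python
def predict_termination_likelihood(current_iteration, past_iteration_counts):
    """Estimate the likelihood that the loop's conditional statement is
    fulfilled in the next iteration, using the number of iterations
    stored for preceding iterative loops: the fraction of past loops
    that had terminated by the next iteration serves as the estimate."""
    if not past_iteration_counts:
        return 0.0
    done = sum(1 for n in past_iteration_counts if n <= current_iteration + 1)
    return done / len(past_iteration_counts)
```

Comparing this estimate against the likelihood threshold value implements the branch-prediction-style decision described above: early in the loop the estimate is near zero and the blocking reduce can be skipped or deferred; near the historical iteration count it rises, prompting the early broadcast.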
[0066] Thus, in an embodiment, the parallel computing system 200 is configured to automatically recognize the blocking collective reduction call followed by a conditional statement on the returned reduced value as one global operation. This concatenation makes it possible to take advantage of previously unused information to reduce the communication time. Moreover, as already described above, statistics may be gathered on the number of iterations required for the conditional statement to be activated. If the number of iterations is high enough, the blocking collective may be transformed into a non-blocking collective call, which enables an overlap of communication and computation. Furthermore, branch prediction may be applied to the overall global collective reduction operation, i.e. the outcome of the conditional statement is assumed to be false, and the processing device 201 may resume with the computation of the next iteration while the communication is performed simultaneously. If the conditional statement is found to be false, as initially assumed, the processing device 201 is not interrupted and continues with its computation. In the other case, where the conditional statement is true, the processing device 201 may retrace its steps and exit the iterative scheme.
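The speculative, non-blocking pattern of the preceding paragraph can be sketched as follows. The non-blocking collective is simulated by a callable handle; all names (`speculative_iteration`, `start_reduce`, `compute_step`) are illustrative assumptions rather than an API from the disclosure:

```python
def speculative_iteration(compute_step, start_reduce, state):
    """Sketch of the branch-predicted, non-blocking pattern: the device
    assumes the convergence test will fail, starts the reduce, and
    computes the next iteration while the communication is in flight.
    If the test turns out true, the speculative work is discarded
    (the device "retraces its steps"); otherwise the new state is kept."""
    pending = start_reduce(state)            # non-blocking collective (simulated)
    speculative_state = compute_step(state)  # overlap: next iteration's work
    converged = pending()                    # wait for the reduce to complete
    if converged:
        return state, True                   # roll back to the pre-speculation state
    return speculative_state, False
```

With a real MPI implementation, `start_reduce` would correspond to initiating a non-blocking all-reduce and `pending()` to waiting on its request handle.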
[0068] The person skilled in the art will understand that the “blocks” (“units”) of the various figures (method and apparatus) represent or describe functionalities of embodiments of the disclosure (rather than necessarily individual “units” in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).
[0069] In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely a logical function division and may be a different division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
[0070] The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
[0071] In addition, functional units in the embodiments of the disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.