MULTI-PROCESSING UNIT INTERCONNECTED ACCELERATOR SYSTEMS AND CONFIGURATION TECHNIQUES
20220308890 · 2022-09-29
Inventors
CPC classification
G06F15/17318
PHYSICS
G06F15/17343
PHYSICS
International classification
G06F9/38
PHYSICS
G06F9/50
PHYSICS
Abstract
A compute system providing hierarchical scaling can include one or more sets of parallel processing units. The parallel processing units in a set can be organized into subsets of parallel processing units. Each parallel processing unit can be configurably couplable to two nearest neighbor parallel processing units in a same subset by two communication links, and each parallel processing unit can be configurably couplable to a farthest neighbor parallel processing unit in the same subset by one communication link. Furthermore, each parallel processing unit can be configurably couplable to a corresponding parallel processing unit in the other subset by two communication links. The compute system can be configured by configuring the communication links of a set of parallel processing units into one or more compute clusters including a corresponding number of communication rings based on a specified compute parameter. Input data for computing on a given compute cluster can be divided and loaded onto respective parallel processing units of the given compute cluster. A function can be computed on the loaded input data by the given compute cluster using a parallel communication ring algorithm of the function.
Claims
1. A compute system comprising: one or more sets of parallel processing units, wherein the parallel processing units in a set are organized into subsets of parallel processing units, each parallel processing unit is configurably couplable to two nearest neighbor parallel processing units in a same subset by two communication links, each parallel processing unit is configurably couplable to a farthest neighbor parallel processing unit in the same subset by one communication link, and each parallel processing unit is configurably couplable to a corresponding parallel processing unit in the other subset by two communication links.
2. The compute system of claim 1, wherein the communication links comprise bi-directional communication links.
3. The compute system of claim 1, wherein the communication links of a given set of parallel processing units are configured into one or more compute clusters including a corresponding number of communication rings based on a specified compute parameter.
4. The compute system of claim 3, wherein each of the one or more compute clusters are configured to compute a corresponding Reduce or All_Reduce function on corresponding input data using a parallel ring Reduce or All_Reduce algorithm.
5. The compute system of claim 3, wherein the specified compute parameter comprises a number of parallel processing units of a given compute cluster.
6. The compute system of claim 3, wherein the specified compute parameter comprises an amount of compute processing bandwidth.
7. The compute system of claim 1, wherein the one or more sets of parallel processing units comprises one or more sets of eight parallel processing units, wherein the parallel processing units in a set are organized in two subsets of four parallel processing units, each parallel processing unit is configurably couplable to two nearest neighbor parallel processing units in a same subset by two communication links, each parallel processing unit is configurably couplable to a farthest neighbor parallel processing unit in the same subset by one communication link, and each parallel processing unit is configurably couplable to a corresponding parallel processing unit in the other subset by two communication links.
8. The compute system of claim 7, wherein a given set of eight parallel processing units are configured into one compute cluster wherein the eight parallel processing units are coupled by three communication rings.
9. The compute system of claim 7, wherein a given set of eight parallel processing units are configured into two compute clusters of four parallel processing units, and the four parallel processing units of each compute cluster are coupled by two communication rings.
10. The compute system of claim 7, wherein a given set of eight parallel processing units are configured into four compute clusters of two parallel processing units, and the two parallel processing units of each compute cluster are coupled together by one communication ring.
11. The compute system of claim 7, wherein a given set of eight parallel processing units are configured into one compute cluster of four parallel processing units and two compute clusters of two parallel processing units, the four parallel processing units of the compute cluster of four parallel processing units are coupled by two communication rings, and the two parallel processing units of each of the compute clusters of two parallel processing units are coupled together by a respective communication ring.
12. A compute method comprising: configuring communication links of a set of parallel processing units into one or more compute clusters including a corresponding number of communication rings based on a specified compute parameter; and computing a function on input data by one of the compute clusters using a parallel communication ring algorithm.
13. The compute method according to claim 12, wherein the specified compute parameter comprises a number of parallel processing units of a given compute cluster.
14. The compute method according to claim 12, wherein the specified compute parameter comprises an amount of compute processing bandwidth.
15. The compute method according to claim 12, wherein the set of parallel processing units comprises eight parallel processing units organized in two subsets of four parallel processing units including two bi-directional communication links between each set of nearest neighbors of parallel processing units in each subset, one bi-directional communication link between each set of farthest neighbors of parallel processing units in each subset, and two bi-directional communication links between corresponding parallel processing units of the two subsets of parallel processing units.
16. The compute method according to claim 15, wherein configuring communication links of the set of parallel processing units into one or more compute clusters comprises configuring the bi-directional communication links into three parallel communication rings coupling the eight parallel processing units as one compute cluster.
17. The compute method according to claim 16, further comprising dividing the input data into six groups and loading respective pairs of the groups of data into respective parallel processing units.
18. The compute method according to claim 15, wherein configuring communication links of the set of parallel processing units into one or more compute clusters comprises configuring the bi-directional communication links into sets of two parallel communication rings coupling each respective subset of four parallel processing units as one or more respective compute clusters.
19. The compute method according to claim 18, further comprising dividing the input data into four groups and loading respective pairs of the groups of data into respective parallel processing units of a given compute cluster of four parallel processing units.
20. The compute method according to claim 15, wherein configuring communication links of the set of parallel processing units into one or more compute clusters comprises configuring the bi-directional communication links into at least one set of one communication ring coupling each respective subset of two parallel processing units as one or more respective compute clusters.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Embodiments of the present technology are illustrated by way of example and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
DETAILED DESCRIPTION OF THE INVENTION
[0026] Reference will now be made in detail to the embodiments of the present technology, examples of which are illustrated in the accompanying drawings. While the present technology will be described in conjunction with these embodiments, it will be understood that they are not intended to limit the technology to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present technology, numerous specific details are set forth in order to provide a thorough understanding of the present technology. However, it is understood that the present technology may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the present technology.
[0027] Some embodiments of the present technology which follow are presented in terms of routines, modules, logic blocks, and other symbolic representations of operations on data within one or more electronic devices. The descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. A routine, module, logic block and/or the like, is herein, and generally, conceived to be a self-consistent sequence of processes or instructions leading to a desired result. The processes are those including physical manipulations of physical quantities. Usually, though not necessarily, these physical manipulations take the form of electric or magnetic signals capable of being stored, transferred, compared and otherwise manipulated in an electronic device. For reasons of convenience, and with reference to common usage, these signals are referred to as data, bits, values, elements, symbols, characters, terms, numbers, strings, and/or the like with reference to embodiments of the present technology.
[0028] It should be borne in mind, however, that these terms are to be interpreted as referencing physical manipulations and quantities and are merely convenient labels and are to be interpreted further in view of terms commonly used in the art. Unless specifically stated otherwise as apparent from the following discussion, it is understood that through discussions of the present technology, discussions utilizing the terms such as “receiving,” and/or the like, refer to the actions and processes of an electronic device such as an electronic computing device that manipulates and transforms data. The data is represented as physical (e.g., electronic) quantities within the electronic device's logic circuits, registers, memories and/or the like, and is transformed into other data similarly represented as physical quantities within the electronic device.
[0029] In this application, the use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality. In particular, a reference to “the” object or “a” object is intended to denote also one of a possible plurality of such objects. The use of the terms “comprises,” “comprising,” “includes,” “including” and the like specify the presence of stated elements, but do not preclude the presence or addition of one or more other elements and/or groups thereof. It is also to be understood that although the terms first, second, etc. may be used herein to describe various elements, such elements should not be limited by these terms. These terms are used herein to distinguish one element from another. For example, a first element could be termed a second element, and similarly a second element could be termed a first element, without departing from the scope of embodiments. It is also to be understood that when an element is referred to as being “coupled” to another element, it may be directly or indirectly connected to the other element, or an intervening element may be present. In contrast, when an element is referred to as being “directly connected” to another element, there are no intervening elements present. It is also to be understood that the term “and/or” includes any and all combinations of one or more of the associated elements. It is also to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.
[0030] Referring now to
[0031] The hierarchical scaling of the PPUs will be further explained with reference to
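The link arrangement recited in the claims, with two links between nearest neighbors within a subset, one link between farthest neighbors, and two links between corresponding PPUs of the two subsets, can be tallied with a short script. This is an illustrative sketch only: the PPU indices and the ring layout within a subset are assumptions, but the link counts follow from the claim language, and each PPU ends up with exactly seven inter-chip links, matching the seven ICLs described below.

```python
from collections import Counter

def build_links():
    """Enumerate the configurable links of one set of eight PPUs.

    PPUs 0-3 form one subset and PPUs 4-7 the other; within a subset
    the PPUs are treated as a ring 0-1-2-3-0, so each PPU has two
    nearest neighbors and one farthest neighbor. Indices are
    illustrative, not taken from the patent.
    """
    links = []
    for base in (0, 4):
        ids = [base + i for i in range(4)]
        for i in range(4):  # nearest neighbors: two links per pair
            a, b = ids[i], ids[(i + 1) % 4]
            links += [(a, b), (a, b)]
        # farthest neighbors: one link per pair
        links += [(ids[0], ids[2]), (ids[1], ids[3])]
    for i in range(4):  # corresponding PPUs across subsets: two links
        links += [(i, i + 4), (i, i + 4)]
    return links

links = build_links()
degree = Counter()
for a, b in links:
    degree[a] += 1
    degree[b] += 1

assert len(links) == 28               # 28 links total in one set
assert all(degree[p] == 7 for p in range(8))  # seven ICLs per PPU
```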
[0032] Accordingly, in another implementation, the eight PPUs can be configured as two compute clusters 905, 910 of four PPUs 305-320, 325-340 each, as illustrated in
[0033] In yet other implementations, the eight PPUs can be configured as four compute clusters 1005, 1010, 1015, 1020 of two PPUs 305-310, 315-320, 325-330, 335-340, as illustrated in
[0034] In yet other implementations, the eight PPUs can be configured as a combination of one compute cluster 1105 of four PPUs 305-320, and two compute clusters 1110, 1115 of two PPUs 325-330, 335-340, as illustrated in
[0035] Referring again to
[0036] At 830, the Reduce, All_Reduce or similar function can be computed on the input data by the given compute cluster using a parallel ring Reduce, All_Reduce or similar parallel ring algorithm. In a parallel ring algorithm, each of the plurality of PPUs (e.g., N nodes) communicates with its two nearest neighbor PPUs 2*(N−1) times, exchanging a respective group on a respective ring in a respective direction. In the first N−1 iterations, a given PPU sends respective values on respective rings to its nearest neighbors. In the first N−1 iterations, the given PPU also receives respective values on respective rings from its nearest neighbors, and adds the received values to respective values in the given PPU's buffer. In the second N−1 iterations, the given PPU sends respective values on respective rings to its nearest neighbors. In the second N−1 iterations, the given PPU also receives respective values on respective rings from its nearest neighbors, and replaces the respective values in the given PPU's buffer with the respective received values.
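The iteration pattern described above is the standard ring All_Reduce: a reduce-scatter pass of N−1 steps in which received values are added, followed by an all-gather pass of N−1 steps in which received values replace local ones. A minimal single-ring, single-direction sketch is given below; it simulates PPU buffers as Python lists rather than exercising any interconnect, and the compute clusters described here would run several such rings in parallel over their multiple links.

```python
def ring_all_reduce(buffers):
    """Simulate a single-ring All_Reduce over N nodes.

    Each node's buffer is divided into N equal chunks. In the first
    N-1 iterations (reduce-scatter) each node sends one chunk to its
    next neighbor, which adds the received values into its buffer.
    In the second N-1 iterations (all-gather) received chunks replace
    the local values instead of being added.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must divide evenly into N chunks"
    c = size // n

    def chunk(k):
        return slice(k * c, (k + 1) * c)

    # Reduce-scatter: after n-1 steps, node i holds the full sum of
    # chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(node, (node - step) % n) for node in range(n)]
        data = [list(buffers[node][chunk(k)]) for node, k in sends]
        for (node, k), values in zip(sends, data):
            s = chunk(k)
            dst = buffers[(node + 1) % n]
            dst[s] = [a + b for a, b in zip(dst[s], values)]

    # All-gather: each node circulates its fully reduced chunk.
    for step in range(n - 1):
        sends = [(node, (node + 1 - step) % n) for node in range(n)]
        data = [list(buffers[node][chunk(k)]) for node, k in sends]
        for (node, k), values in zip(sends, data):
            buffers[(node + 1) % n][chunk(k)] = values
    return buffers

# Usage: four simulated PPUs, each holding a distinct 8-element buffer.
out = ring_all_reduce([[node * 10 + j for j in range(8)] for node in range(4)])
assert all(buf == [60 + 4 * j for j in range(8)] for buf in out)
```

Each node sends and receives 2*(N−1) messages in total, matching the communication count given above.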
[0037] Referring now to
[0038] Referring now to
[0039] The PPU 1300 can also include one or more joint test action group (JTAG) engines 1375, one or more inter-integrated circuit (I²C) interfaces and/or serial peripheral interfaces (SPI) 1380, one or more peripheral component interface express (PCIe) interfaces 1385, one or more codecs (CoDec) 1390, and the like. In one implementation, the plurality of compute cores 1305, 1310, the plurality of inter-chip links (ICL) 1315, 1320, one or more high-bandwidth memory interfaces (HBM I/F) 1325, 1330, one or more communication processors 1335, one or more direct memory access (DMA) controllers 1340, 1345, one or more command processors (CP) 1350, one or more networks-on-chips (NoCs) 1355, shared memory 1360, one or more high-bandwidth memory (HBM) 1365, 1370, one or more joint test action group (JTAG) engines 1375, one or more inter-integrated circuit (I²C) interfaces and/or serial peripheral interfaces (SPI) 1380, one or more peripheral component interface express (PCIe) interfaces 1385, one or more codecs (CoDec) 1390, and the like can be fabricated in one monolithic integrated circuit (IC).
[0040] The ICLs 1315, 1320 can be configured for chip-to-chip communication between a plurality of PPUs. In one implementation, the PPU 1300 can include seven ICLs 1315, 1320. The communication processor 1335 and direct memory access engines 1340, 1345 can be configured to coordinate data sent and received through the ICLs 1315, 1320. The network-on-chip (NoC) 1355 can be configured to coordinate data movement between the compute cores 1305, 1310 and the shared memory 1360. The communication processor 1335, direct memory access engines 1340, 1345, network-on-chip 1355 and high-bandwidth memory interfaces (HBM I/F) 1325, 1330 can be configured to coordinate movement of data between the high-bandwidth memory 1365, 1370, the shared memory 1360 and the ICLs 1315, 1320. The command processor 1350 can be configured to serve as an interface between the PPU 1300 and one or more host processing units. The plurality of the PPUs 1300 can advantageously be employed to compute a Reduce, All_Reduce or other similar function as described above with reference to
[0041] In accordance with aspects of the present technology, hierarchical scaling enables a plurality of PPUs to be configured as one or more compute clusters coupled by a corresponding number of parallel communication rings. Hierarchically scaling the plurality of PPUs can be advantageous when an application requires only a portion of the computational resources of the plurality of PPUs, a portion that can be serviced by a compute cluster of a subset of the plurality of PPUs. Likewise, hierarchical scaling can be advantageously employed in a cloud computing platform to readily enable clients to purchase the computing bandwidth of a cluster of the PPUs instead of the entire plurality of PPUs.
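The cluster configurations enumerated in claims 8 through 11 (one cluster of eight PPUs with three rings, clusters of four PPUs with two rings, and clusters of two PPUs with one ring each) suggest a simple mapping from the specified compute parameter to a configuration. The sketch below illustrates that mapping under stated assumptions: the function name, the dictionary, and the contiguous PPU numbering are illustrative and not taken from the patent, but the ring counts per cluster size come from the claims.

```python
# Rings per cluster size, per claims 8-11: 8 PPUs -> 3 rings,
# 4 PPUs -> 2 rings, 2 PPUs -> 1 ring.
RINGS_PER_CLUSTER = {8: 3, 4: 2, 2: 1}

def configure_clusters(sizes):
    """Partition one set of eight PPUs into compute clusters.

    `sizes` lists the requested cluster sizes, e.g. [8], [4, 4],
    [2, 2, 2, 2] or [4, 2, 2]; each cluster is returned with the
    number of parallel communication rings coupling its PPUs.
    PPU numbering is illustrative.
    """
    if sum(sizes) != 8 or any(s not in RINGS_PER_CLUSTER for s in sizes):
        raise ValueError("cluster sizes must be 8, 4 or 2 and total 8")
    clusters, next_ppu = [], 0
    for s in sizes:
        clusters.append({
            "ppus": list(range(next_ppu, next_ppu + s)),
            "rings": RINGS_PER_CLUSTER[s],
        })
        next_ppu += s
    return clusters

# e.g. the mixed configuration of claim 11: one cluster of four PPUs
# plus two clusters of two PPUs.
cfg = configure_clusters([4, 2, 2])
assert [c["rings"] for c in cfg] == [2, 1, 1]
```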
[0042] The foregoing descriptions of specific embodiments of the present technology have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the present technology to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the present technology and its practical application, to thereby enable others skilled in the art to best utilize the present technology and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.