Method and apparatus for performing machine learning operations in parallel on machine learning hardware

11995448 ยท 2024-05-28

Assignee

Inventors

Cpc classification

International classification

Abstract

A method includes receiving a first set of data. The method also includes receiving an instruction to determine a largest value within the first set of data. The first set of data is divided into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements. The first plurality of data portions is mapped to the first plurality of processing elements. Each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of processing elements. Each data portion of the first plurality of data portions is processed by its respective processing element to identify a largest value from each data portion of the first plurality of data portions, wherein the processing forms a first output data comprising the largest value from the each data portion of the first plurality of data portions.

Claims

1. A computer implemented method, comprising: receiving a first set of data; receiving an instruction to determine a largest value within the first set of data; dividing the first set of data into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements, wherein a size of data portions of the first plurality of data portions changes depending on availability of processing elements of the first plurality of processing elements; mapping the first plurality of data portions to the first plurality of processing elements, wherein each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of processing elements; and processing each data portion of the first plurality of data portions by its respective processing element by executing the instruction to identify a largest value from each data portion of the first plurality of data portions, wherein the processing forms a first output data comprising the largest value from the each data portion of the first plurality of data portions.

2. The computer implemented method of claim 1 further comprising tracking a global index associated with each element of the first plurality of data portions.

3. The computer implemented method of claim 2, wherein each processing element of the first plurality of processing elements outputs a local index and value associated with a largest value.

4. The computer implemented method of claim 3, wherein the tracking comprises using an offset value associated with each data portion of the first plurality of data portions being mapped to respective processing element of the first plurality of processing elements.

5. The computer implemented method of claim 1 further comprising compiling the instruction to at least one or more instruction set architecture (ISA) format.

6. The computer implemented method of claim 1, wherein the mapping is further based on availability of processing elements.

7. The computer implemented method of claim 1, wherein the first plurality of processing elements is a subset of a second plurality of processing elements of the hardware.

8. The computer implemented method of claim 1, wherein the hardware architecture is an inference engine of a machine learning hardware.

9. The computer implemented method of claim 1 further comprising outputting the first output data to a processing element of the hardware to determine a largest value of the first output data, wherein the largest value of the first output data is a same as the largest value of the first set of data.

10. The computer implemented method of claim 9, wherein the processing element that processes the first output data is a processing element selected from the first plurality of processing elements.

11. The computer implemented method of claim 9, wherein the processing element that processes the first output data is a processing element that is different from processing elements of the first plurality of processing elements.

12. The computer implemented method of claim 1, wherein the first set of data is a tensor.

13. The computer implemented method of claim 1, wherein the first plurality of processing elements operate on their respective data portion of the first plurality of data portions in parallel.

14. The computer implemented method of claim 1 further comprising: receiving a second set of data; receiving an instruction to determine a largest value within the second set of data; dividing the second set of data into a second plurality of data portions based on the hardware architecture of a second plurality of processing elements; mapping the second plurality of data portions to the second plurality of processing elements, wherein each data portion of the second plurality of data portions is mapped exclusively to a processing element of the second plurality of processing elements; and processing each data portion of the second plurality of data portions by its respective processing element to identify a largest value from each data portion of the second plurality of data portions, wherein the processing forms a second output data comprising the largest value from the each data portion of the second plurality of data portions.

15. The computer implemented method of claim 14, wherein the first plurality of processing elements process the first plurality of data portions and wherein the second plurality of processing elements process the second plurality of data portions in parallel.

16. The computer implemented method of claim 14, wherein the first set of data and the second set of data are parts of a tensor data.

17. The computer implemented method of claim 14 further comprising outputting the second output data to a processing element of the hardware to determine a largest value of the second output data, wherein the largest value of the second output data is a same as the largest value of the second set of data.

18. A system comprising: a compiler configured to receive an instruction to determine a largest value within a first set of data, and wherein the compiler is further configured to divide the first set of data into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements, wherein a size of data portions of the first plurality of data portions changes depending on availability of processing elements of the first plurality of processing elements, and wherein the compiler is further configured to map the first plurality of data portions to the first plurality of processing elements, wherein each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of processing elements; and a second plurality of processing elements comprising the first plurality of processing elements, wherein each processing element of the first plurality of processing elements is configured to process its respective data portion of the first plurality of data portions to identify a largest value from that data portion, wherein the first plurality of processing elements is configured to form a first output data comprising the largest value from the each data portion of the first plurality of data portions.

19. The system of claim 18, wherein each processing element has an on-chip memory (OCM) associated therewith, and wherein a global index associated with each element of the first plurality of data portions is tracked in its respective OCM.

20. The system of claim 19, wherein each processing element of the first plurality of processing elements outputs a local index associated with a largest value.

21. The system of claim 20, wherein the global index is tracked using an offset value associated with each data portion of the first plurality of data portions being mapped to respective processing element of the first plurality of processing elements.

22. The system of claim 18, wherein the compiler is configured to compile the instruction to at least one or more instruction set architecture (ISA) format.

23. The system of claim 18, wherein the compiler is further configured to map based on availability of processing elements.

24. The system of claim 18, wherein the hardware architecture is an inference engine of a machine learning hardware.

25. The system of claim 18, wherein the first plurality of processing elements is configured to output the first output data to a processing element of the hardware to determine a largest value of the first output data, wherein the largest value of the first output data is a same as the largest value of the first set of data.

26. The system of claim 25, wherein the processing element that processes the first output data is a processing element selected from the first plurality of processing elements.

27. The system of claim 25, wherein the processing element that processes the first output data is a processing element that is different from processing elements of the first plurality of processing elements.

28. The system of claim 18, wherein the first set of data is a tensor.

29. The system of claim 18, wherein the first plurality of processing elements operate on their respective data portion of the first plurality of data portions in parallel.

30. The system of claim 18, wherein: the compiler is further configured to receive an instruction to determine a largest value within a second set of data, the compiler is further configured to divide the second set of data into a second plurality of data portions based on the hardware architecture of a third plurality of processing elements that is a subset of processing elements within the second plurality of processing elements; the compiler is further configured to map the second plurality of data portions to the third plurality of processing elements, wherein each data portion of the second plurality of data portions is mapped exclusively to a processing element of the third plurality of processing elements; and each data portion of the second plurality of data portions is processed by its respective processing element of the third plurality of processing elements to identify a largest value from each data portion of the second plurality of data portions, wherein the processing forms a second output data comprising the largest value from the each data portion of the second plurality of data portions.

31. The system of claim 30, wherein the first plurality of processing elements process the first plurality of data portions and wherein the third plurality of processing elements process the second plurality of data portions in parallel.

32. The system of claim 30, wherein the first set of data and the second set of data are parts of a tensor data.

33. The system of claim 30, wherein the third plurality of processing elements is configured to output the second output data to a processing element of the hardware to determine a largest value of the second output data, wherein the largest value of the second output data is a same as the largest value of the second set of data.

34. The system of claim 18 further comprising a memory configured to store a first set of data.

35. A system comprising: a means for receiving a first set of data; a means for receiving an instruction to determine a largest value within the first set of data; a means for dividing the first set of data into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements, wherein a size of data portions of the first plurality of data portions changes depending on availability of processing elements of the first plurality of processing elements; a means for mapping the first plurality of data portions to the first plurality of processing elements, wherein each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of processing elements; and a means for processing each data portion of the first plurality of data portions by its respective processing element to identify a largest value from each data portion of the first plurality of data portions, wherein the means for processing forms a first output data comprising the largest value from the each data portion of the first plurality of data portions.

36. A computer implemented method, comprising: receiving a first set of data; receiving an instruction to determine a largest value within the first set of data; dividing the first set of data into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements, wherein a size or a number components associated with at least one processing element of the first plurality of processing elements is different from another processing element of the first plurality of processing elements; mapping the first plurality of data portions to the first plurality of processing elements, wherein each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of processing elements; and processing each data portion of the first plurality of data portions by its respective processing element by executing the instruction to identify a largest value from each data portion of the first plurality of data portions, wherein the processing forms a first output data comprising the largest value from the each data portion of the first plurality of data portions.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

(2) FIG. 1A-1D depict examples of parallel processing of a TopK operation according to one aspect of the present embodiments.

(3) FIG. 2 depicts a flow diagram of parallel processing of a TopK operation according to one aspect of the present embodiments.

DETAILED DESCRIPTION

(4) The following disclosure provides many different embodiments, or examples, for implementing different features of the subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.

(5) Before various embodiments are described in greater detail, it should be understood that the embodiments are not limiting, as elements in such embodiments may vary. It should likewise be understood that a particular embodiment described and/or illustrated herein has elements which may be readily separated from the particular embodiment and optionally combined with any of several other embodiments or substituted for elements in any of several other embodiments described herein. It should also be understood that the terminology used herein is for the purpose of describing the certain concepts, and the terminology is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood in the art to which the embodiments pertain.

(6) As discussed, the TopK operation (and its special case (i.e., ArgMax) where K has a value 1 to identify the largest value and its index) has become prevalent for ranking results in various applications, such as ML applications. Unfortunately, TopK operation has traditionally been implemented in an inefficient and wasteful manner using a single processing element.

(7) Accordingly, a need has arisen to reduce processing time by leveraging a multi-core architecture of an ML hardware with multiple processing elements. It is appreciated that the embodiments are described with respect to the special case where K is equal to one (i.e., ArgMax) to identify the largest value and its index value for illustration purposes and should not be construed as limiting the scope of the embodiments. As such, it is appreciated that a similar process may be used to identify the top K values and their respective index values. It is further appreciated that an instruction set architecture (ISA) associated with the ML hardware may be used by the compiler to more efficiently implement the ArgMax operation on the ML hardware. It is appreciated that the compiler may generate a lower level code, e.g., in binary format, executable by the ML hardware from a higher level code.

(8) In general, an ArgMax operation identifies the largest value of a vector and its index. For illustrative purposes, a vector data may be V=[100, 2, 101, 53, 33, 53, 67, 94] and it may have eight elements. An ArgMax operation identifies index 2 associated with vector element value 101 as the largest element within the vector. Accordingly, an ArgMax operation may return index 2. It is appreciated that if two elements of the vector data have the same largest value (e.g., if another element within the vector had a value 101) then the index of the first occurrence of the element is taken.

(9) The proposed approach leverages the architecture of a ML hardware-based system that includes a plurality of processing elements to parallel process the ArgMax operation, reducing the amount of processing time. For example, data represented in a vector with n-elements may be divided into m-number of sub-vectors where each sub-vector has k-number (where k is smaller than n) of elements and where each sub-vector is processed by one processing element of the ML hardware. Accordingly, multiple sections (i.e., sub-vector) of the data vector may be simultaneously processed, thereby reducing the processing time. It is appreciated that the global index associated with each element in various sub-vectors are tracked in order, ultimately identify the largest value and its corresponding index in the vector data. It is appreciated that a compiler compiles a higher level code to a lower level code (or executable code on ML hardware). Having knowledge of the ML hardware architecture, the compiler may divide the vector data into sub-vectors and map the sub-vectors to their respective processing elements based on the knowledge of the architecture in order to efficiently execute the instructions.

(10) It is appreciated that ML hardware architecture may include a host, a memory, a core, a data streaming engine, an instruction-streaming engine, and an interference engine. The inference engine includes a plurality of processing elements. The core is configured to interpret a plurality of ML commands/instructions for an ML operation, e.g., TopK, ArgMax, etc., and/or data received from the host, e.g., vector data, and coordinate activities of the streaming and the inference engines based on the data in the received ML commands. The ML operation, e.g., ArgMax, may be compiled into one or more ISA instruction and sent along with its associated data, e.g., sub-vectors, to one or more processing elements of the inference engine for processing.

(11) In some embodiments, the inference engine may include a dense operation engine and an irregular operation engine. The dense operation engine is an engine that is optimized to efficiently process dense data with regular operations, e.g., matrix operations such as multiplication, matrix manipulation, etc. On the other hand the irregular operation engine is an engine that is optimized to efficiently process sporadic data with irregular operations, e.g., memory transpose, addition operation, operations on irregular data structures (such as trees, graphs, and priority queues). In some embodiments, Tanh and Sigmoid operation may be activated by a processing element and executed as a dense operation or irregular operation.

(12) According to some embodiments, the core may coordinate some of the instructions, e.g., TopK, ArgMax, etc., received from the host to be processed. In some embodiments, the core may be a general processor, e.g., a CPU, etc.

(13) Specifically, the core is configured to divide the plurality of ML commands between the core and the inference engine for efficient execution thereof. The ML commands, e.g., TopK, ArgMax, etc., are compiled by the compiler into ISA instructions and the relevant data associated with the ISA instructions are transmitted for execution to the inference engine from the core and the memory to the instruction-streaming engine and the data streaming engine for efficient streaming to the inference engine. The data and instruction steaming engines are configured to send one or more data streams, e.g., data sub-vectors to be operated on by the plurality of processing elements, and ML commands that are compiled, e.g., ISA instructions corresponding to TopK or ArgMax, to the inference engine in response to the received programming instructions from the core.

(14) It is appreciated that, in some embodiments, the ML commands being transmitted from the core to the data/instruction-streaming engines is in a function call format, therefore enabling different processors with different instruction set architectures to be programmed using one type of instruction set architecture. To the core, the operation being performed is a write operation into a memory component, but in reality the operation being done is passing on specific instructions along with their associated data via a function call to the streaming engines for transmission to the inference engine where they can be executed. The inference engine is configured to process the instruction/data streams received from the data/instruction stream engines for the ML operation according to the programming instructions received from the instruction/data streaming engines.

(15) For a non-limiting example, the inference engine may include 64 processing elements (each processing element may further includes a plurality of smaller processing elements PE and POD that are described in application Ser. No. 16/226,508, filed Dec. 19, 2018 that is incorporated herein by reference in its entirety). Each of those processing elements is configured to receive a sub-vector and an instruction (i.e., compiled ArgMax instruction). As such, multiple sub-vectors may be operated on simultaneously, thereby reducing the processing time. For illustrative purposes, it is assumed that the vector data has 128 elements and that there are 64 processing elements where each processing element is configured to process 32 elements. Accordingly, 4 processing elements may receive a sub-vector (each 32 elements as an example) to process a vector data of size 128 elements in parallel while the other 60 processing elements of the inference engine may operate on a different vector or perform a different ML operation altogether. Accordingly, the index associated with the vector with the largest value can be identified.

(16) The proposed ML hardware architecture is highly efficient, flexible and optimized for high-efficiency ML computing while programmable to adapt to the changing environment, usage, applications and algorithms for ML with reduced overhead. By providing hardware support to streamline data/instruction flow, the proposed ML hardware architecture improves system-level performance by significantly reducing the hardware overhead involved in moving data and/or instruction in existing computing architectures. Moreover, the programming instruction set reduces the number of instructions required to perform certain tasks, e.g., processing, moving data, loading data, etc. The proposed ML hardware architecture works well with existing software frameworks and code and may be applied to a wide variety of ML algorithms and neural networks including but not limited to convolution neural network (CNN), recurrent neural network (RNN), gradient boosting machine (GBM), generative adversarial neural network, decision trees, random forest, support vector machine (SVM), clustering, Markov random field (MRF), etc.

(17) FIGS. 1A-1D depict non-limiting examples of parallel processing for a ML operation according to one aspect of the present embodiments. It is appreciated that the described embodiments, as shown in FIGS. 1A-1D, are for illustrative purposes only and should not be construed as limiting the scope of the embodiments. A special case of TopK operation, where K is equal to 1 is illustrated by FIGS. 1A-1D. In other words, the illustrated examples show an ArgMax operation. In this example, data [a.sub.0, a.sub.1, . . . , a.sub.511] are stored in a memory 103 component. An ArgMax instruction is compiled by a compiler and one or more ISA instructions are transmitted to a plurality of processing elements 102. Processing elements 102 may include several processing elements, e.g., processing elements 102A, 102B, . . . , 102P. It is appreciated that 16 processing elements are shown for illustrative purposes and should not be construed as limiting the scope of the embodiments; for example, in some embodiments, 64 processing elements may be used.

(18) In the ML hardware system 100, the processing elements 102 may be an inference engine and are described in application Ser. No. 16/226,508, filed Dec. 19, 2018, which are incorporated herein by reference in its entirety. When the ArgMax operation (i.e., instruction) is compiled by a compiler, one or more ISA instructions along with the associated data are efficiently transmitted to processing elements 102 in order to perform parallel processing.

(19) In this nonlimiting example of system 100, data [a.sub.0, a.sub.1, . . . , a.sub.511] is divided into a plurality of sub-vectors, e.g., [a.sub.0, a.sub.1, . . . , a.sub.31], [a.sub.32, a.sub.33, . . . , a.sub.63], . . . , [a.sub.480, a.sub.481, . . . , a.sub.511], where each sub-vector can be operated on by its corresponding processing element. It is appreciated that the vector may be divided into sub-vectors in an efficient manner by the compiler having knowledge of the architecture, i.e., architecture of the processing elements 102, for efficient processing thereof. For example, the compiler having knowledge of the ML hardware may determine that each sub-vector should include 32 elements as opposed to 64 elements or 16 elements, etc., such that a localized ArgMax operation can be performed on each sub-vector by its corresponding processing element in an efficient manner to reduce the processing time. As such, vector data [a.sub.0, a.sub.1, . . . , a.sub.511] is divided into 16 sub-vectors and each sub-vector is mapped to a particular processing element of the processing elements 102. It is appreciated that in this example, 16 sub-vectors are generated and mapped to 16 processing elements, thereby enabling parallel processing (simultaneous processing) on all elements at the same time, reducing the processing time. In other words, the 16 processing elements process data simultaneously, each finding the index associated with its largest value, thereby reducing the amount of processing time.

(20) In this nonlimiting example, the first sub-vector [a.sub.0, a.sub.1, . . . , a.sub.31] may be mapped to processing element 102A, the second sub-vector [a.sub.32, a.sub.33, . . . , a.sub.63] may be mapped to processing element 102B, . . . , and the last sub-vector [a.sub.480, a.sub.481, . . . , a.sub.511] may be mapped to processing element 102P. Each processing element (architecture of which is described in application Ser. No. 16/226,508, filed Dec. 19, 2018 that is incorporated herein by reference in its entirety) may perform an ArgMax operation on its corresponding sub-vector. It is appreciated that the processing elements 102A-102P may process data simultaneously, thereby reducing the processing time. In this example, each processing element returns the largest value (or index associated with the largest value), e.g., 16 values. The output from each processing element may form another vector (in this example 16 elements) and it may be fed back into a processing element to find the largest value or the index associated with the largest value among the 16 elements. It is appreciated that as a result, the ArgMax operation for this illustrative example can be parallel processed in two iterations to return the largest value or its index. Moreover, it is appreciated that if more than one processing element is needed in subsequent iteration (e.g., if one processing element is unable to process all the data in one iteration) then more than one processing element may be used to facilitate parallel processing of the data, reducing the processing time.

(21) It is noteworthy that since the original vector [a.sub.0, a.sub.1, . . . , a.sub.511] was subdivided into a number of sub-vectors and distributed among a plurality of processing elements for parallel processing, the global index associated with each element should be preserved such that the correct index associated with the largest value can be determined as an output. In this example, an offset value may be tracked to preserve the global index. For example, if the first sub-vector [a.sub.0, a.sub.1, . . . , a.sub.31] is mapped to processing element 102A then its offset value is 0 where as the second sub-vector [a.sub.32, a.sub.33, . . . , a.sub.63] mapped to the processing element 102B has an offset value of 32. Accordingly, when the processing element 102B performs an ArgMax operation on the second sub-vector [a.sub.32, a.sub.33, . . . , a.sub.63], it generates an index associated with the largest value within that sub-vector where the index is its local index. In order to convert that local index to global index, the output index from the processing element 102B which is its local index is added to the offset value of 32. Similarly, an offset value 64 is tracked for the third sub-vector [a.sub.64, a.sub.65, . . . , a.sub.95] when it is mapped to processing element 102C. It is appreciated that the offset associated with each sub-vector may be tracked by the processing element it is assigned to. In some embodiments, the offset value for each processing element may be stored within a memory component of the processing element, e.g., on-chip memory (OCM). It is appreciated that in the first iteration one offset value may be stored for each sub-vector, whereas in the second iteration each element may have its own offset value because each element in the second iteration is an output from a different processing element (in an ArgMax operation) and as such a different offset value for each element is tracked.

(22) It is appreciated that the example of FIG. 1A illustrates the data vector being mapped according to the number of processing elements. However, it is appreciated that in some embodiments, the data vector may be mapped according to the number of available processing elements. For example, in the illustrated example above, if 4 out of 16 processing elements are busy with other operations, then the data vector may be divided and mapped to the available processing elements, i.e., the remaining 12 processing elements. As such, the size of the sub-vectors may also be adjusted based on availability of the processing. In this example, if 12 processing elements are available, then the size of the sub-vector may be adjusted from 32 to 16 for example and the number of iterations to perform the ArgMax by the 12 processing elements may be increased.

(23) Referring now to FIG. 1B, a similar example to that of FIG. 1A is shown except that the data vector has only 128 elements. The vector data [a.sub.0, a.sub.1, . . . , a.sub.127] is divided into a plurality of sub-vectors based on the number of processing elements and/or available processing elements. In this example, the compiler determines that each sub-vector should be 32 elements, and as such 4 sub-vectors are generated. Each sub-vector is mapped to a particular processing element. In this nonlimiting example, the first sub-vector [a.sub.0, a.sub.1, . . . , a.sub.31] is mapped to processing element 102A, the second sub-vector [a.sub.32, a.sub.33, . . . , a.sub.63] is mapped to processing 102B, the third sub-vector [a.sub.64, a.sub.65, . . . , a.sub.95] is mapped to processing 102E, and the fourth sub-vector [a.sub.96, a.sub.97, . . . , a.sub.127] is mapped to processing 102F. As described above, the offset associated with each sub-vector is tracked in order to track the indices globally. Each processing element performs an ArgMax operation on its respective sub-vector. It is appreciated that the processing element 102A may determine that the index associated with element a.sub.27 has the largest value within the first sub-vector, the processing element 102B determines that the index associated with element a.sub.34 has the largest value within the second sub-vector, the processing element 120E determines that the index associated with element a.sub.94 has the largest value within the third sub-vector, and the processing element 120F determines that the index associated with element a.sub.102 has the largest value within the fourth sub-vector. The operation of the four processing elements 102A, 102B, 102E, and 102F are tracked and synchronized. Once the indices from each processing element are output, another vector is formed, e.g., [a.sub.27, a.sub.34, a.sub.94, a.sub.102]. The newly formed vector may be input to one of the four processing elements, 102A, 102B, 102E, 102F to perform an ArgMax operation. However, it is appreciated that another processing element different from the original 4 processing elements, e.g., processing element 102K, may be used. In this example, the processing element 102A is reused to perform an ArgMax operation on the vector [a.sub.27, a.sub.34, a.sub.94, a.sub.102]. As such, the largest value and its index can be determined which is the largest value and index associated with the original vector data [a.sub.0, a.sub.1, . . . , a.sub.127].

(24) FIG. 1C is similar to FIG. 1A except that the vector data is large, e.g., [a.sub.0, a.sub.1, . . . , a.sub.1023]. As such, in this nonlimiting example, the sub-vectors may be the same size as the one in FIG. 1A. Thus, 32 sub-vectors are generated, e.g., [a.sub.0, a.sub.1, . . . , a.sub.31], [a.sub.32, a.sub.33, . . . , a.sub.63], . . . , [a.sub.991, a.sub.992, . . . , a.sub.1023]. The first 16 sub-vectors, i.e., [a.sub.0, a.sub.1, . . . , a.sub.31], [a.sub.32, a.sub.33, . . . , a.sub.63], . . . , [a.sub.480, a.sub.481, . . . , a.sub.511], are mapped to the available processing elements (e.g., processing elements 102A-120P) and an ArgMax operation is performed on them, as described in FIG. 1A. The output from each processing element is tracked, i.e., 16 outputs corresponding to the largest value or index of the largest value within each sub-vector. However, since the number of processing elements (or available processing elements) in the processing elements 102 could not accommodate parallel processing on all sub-vectors simultaneously, then the second set of 16 sub-vectors, i.e., [a.sub.512, . . . , a.sub.544], . . . [a.sub.991, a.sub.992, . . . , a.sub.1023], are mapped to the available processing elements (e.g., processing elements 102A-120P once they are done performing an ArgMax operation on the first 16 sub-vectors). The output from each processing element is tracked, i.e., 16 outputs corresponding to the largest value or index of the largest value within each sub-vector. It is appreciated that the global index is also tracked using an offset as described above. Accordingly, 32 outputs where each correspond to the largest value or index associated with the largest value for each sub-vector are output and forms a new vector. The new vector of 32 elements is then mapped to an available processing element, e.g., processing element 102G, to perform an ArgMax operation. The output from the processing element 102G is the largest value or index associated with the largest value in the vector data [a.sub.0, a.sub.1, . . . , a.sub.1023].

(25) FIG. 1D is similar to that of FIG. 1B except that the data is a tensor, which is a data structure for ML operations, with m (e.g., 4) vectors each of size n (e.g., 128). In this nonlimiting example, each vector of size n is divided into sub-vectors and the sub-vectors are mapped similar to that of FIG. 1B, e.g., each vector is mapped to 4 processing elements. For example, vector [a.sub.0, a.sub.1, . . . , a.sub.127] may be subdivided into four sub-vectors [a.sub.0, a.sub.1, . . . , a.sub.31], [a.sub.32, a.sub.33, . . . , a.sub.63], . . . , [a.sub.96, a.sub.97, . . . , a.sub.127] and mapped to processing elements 102A, 102B, 102E, and 102F respectively. Similarly, vector [b.sub.0, b.sub.1, . . . , b.sub.127] may be subdivided into four sub-vectors [b.sub.0, b.sub.1, . . . , b.sub.31], [b.sub.32, b.sub.33, . . . , b.sub.63], . . . , [b.sub.96, b.sub.97, . . . , b.sub.127] and mapped to processing elements 102C, 102D, 102G, and 102H respectively. Vector [c.sub.0, c.sub.1, . . . , c.sub.127] may be subdivided into four sub-vectors [c.sub.0, c.sub.1, . . . , c.sub.31], [c.sub.32, c.sub.33, . . . , c.sub.63], . . . , [c.sub.96, c.sub.97, . . . , c.sub.127] and mapped to processing elements 102I, 102J, 102M, and 102N respectively. Vector [d.sub.0, d.sub.1, . . . , d.sub.127] may be subdivided into four sub-vectors [d.sub.0, d.sub.1, . . . , d.sub.31], [d.sub.32, d.sub.33, . . . , d.sub.63], . . . , [d.sub.96, d.sub.97, . . . , d.sub.127] and mapped to processing elements 102K, 102L, 1020, and 102P respectively. It is appreciated that as described in FIG. 1, each processing element performs an ArgMax operation on its sub-vector while tracking the offset to preserve the global indices associated with each element.

(26) In some embodiments, the processing element 102A performs an ArgMax operation on sub-vector [a.sub.0, a.sub.1, . . . , a.sub.31], the processing element 102B performs an ArgMax operation on sub-vector [a.sub.32, a.sub.33, . . . , a.sub.63], the processing element 102E performs an ArgMax operation on sub-vector [a.sub.64, a.sub.65, . . . , a.sub.95], the processing element 102F performs an ArgMax operation on sub-vector [a.sub.96, a.sub.97, . . . , a.sub.127]. Accordingly, the processing element 102A outputs the largest value or an index associated with the largest value, e.g., index for element a.sub.11, the processing element 102B outputs the largest value or an index associated with the largest value, e.g., index for element a.sub.45, the processing element 102E outputs the largest value or an index associated with the largest value, e.g., index for element a.sub.79, and the processing element 102F outputs the largest value or an index associated with the largest value, e.g., index for element a.sub.126. It is appreciated that since each processing element outputs a local index associated with the largest element, an offset value can be used to track the indices globally, as presented above. A new vector containing the largest values or indices associated with the largest values from each of the processing elements 102A, 102B, 102E, and 102F is formed, e.g., [a.sub.11, a.sub.45, a.sub.79, a.sub.126]. The newly formed vector may be used by an available processing element, e.g., processing element 102B, in a subsequent iteration to perform an ArgMax operation on the newly formed vector [a.sub.11, a.sub.45, a.sub.79, a.sub.126]. The largest value or index associated with the largest value is therefore determined. It is appreciated that during the second iteration, the offset value of 0 associated with element a.sub.11 is used, while offset value of 32 is used for a.sub.45, 64 is used for a.sub.79, and 96 is used for a.sub.126 to track the global indices associated with the elements of the newly formed vector [a.sub.11, a.sub.45, a.sub.79, a.sub.126].

(27) It is appreciated that other vectors of the tensor may be processed similarly, as described above. For example, the processing element 102C performs an ArgMax operation on sub-vector [b.sub.0, b.sub.1, . . . , b.sub.31], the processing element 102D performs an ArgMax operation on sub-vector [b.sub.32, b.sub.33, . . . , b.sub.63], the processing element 102G performs an ArgMax operation on sub-vector [b.sub.64, b.sub.65, . . . , b.sub.95], the processing element 102H performs an ArgMax operation on sub-vector [b.sub.96, b.sub.97, . . . , b.sub.127]. Accordingly, the processing element 102C outputs the largest value or an index associated with the largest value, e.g., index for element b.sub.3, the processing element 102D outputs the largest value or an index associated with the largest value, e.g., index for element b.sub.60, the processing element 102G outputs the largest value or an index associated with the largest value, e.g., index for element b.sub.83, and the processing element 102H outputs the largest value or an index associated with the largest value, e.g., index for element b.sub.101. It is appreciated that since each processing element outputs a local index associated with the largest element, an offset value can be used to track the indices globally, as presented above. A new vector containing the largest values or indices associated with the largest values from each of the processing elements 102C, 102D, 102G, and 102H is formed, e.g., [b.sub.3, b.sub.60, b.sub.83, b.sub.101]. The newly formed vector may be used by an available processing element, e.g., processing element 102H, in a subsequent iteration to perform an ArgMax operation on the newly formed vector [b.sub.3, b.sub.60, b.sub.83, b.sub.101]. The largest value or index associated with the largest value is therefore determined. It is appreciated that during the second iteration, the offset value of 0 associated with element b.sub.3 is used, while offset value of 32 is used for b.sub.60, 64 is used for b.sub.83, and 96 is used for b.sub.101 to track the global indices associated with the elements of the newly formed vector [b.sub.3, b.sub.60, b.sub.83, b.sub.101].

(28) The processing element 102I performs an ArgMax operation on sub-vector [c.sub.0, c.sub.1, . . . , c.sub.31], the processing element 102J performs an ArgMax operation on sub-vector [c.sub.32, c.sub.33, . . . , c.sub.63], the processing element 102M performs an ArgMax operation on sub-vector [c.sub.64, c.sub.65, . . . , c.sub.95], the processing element 102M performs an ArgMax operation on sub-vector [c.sub.96, c.sub.97, . . . , c.sub.127]. Accordingly, the processing element 102I outputs the largest value or an index associated with the largest value, e.g., index for element c.sub.23, the processing element 102J outputs the largest value or an index associated with the largest value, e.g., index for element c.sub.59, the processing element 102M outputs the largest value or an index associated with the largest value, e.g., index for element c.sub.71, and the processing element 102N outputs the largest value or an index associated with the largest value, e.g., index for element c.sub.117. It is appreciated that each processing element outputs a local index associated with the largest element, an offset value can be used to track the indices globally, as presented above. A new vector containing the largest values or indices associated with the largest values from each of the processing elements 102I, 102J, 102M, and 102N is formed, e.g., [c.sub.23, c.sub.59, c.sub.71, c.sub.117]. The newly formed vector may be used by an available processing element, e.g., processing element 102N, in a subsequent iteration to perform an ArgMax operation on the newly formed vector [c.sub.23, c.sub.59, c.sub.71, c.sub.117]. The largest value or index associated with the largest value is therefore determined. It is appreciated that during the second iteration, the offset value of 0 associated with element c.sub.23 is used, while offset value of 32 is used for c.sub.59, 64 is used for c.sub.71, and 96 is used for c.sub.117 to track the global indices associated with the elements of the newly formed vector [c.sub.23, c.sub.59, c.sub.71, c.sub.117].

(29) The processing element 102K performs an ArgMax operation on sub-vector [d.sub.0, d.sub.1, . . . , d.sub.31], the processing element 102L performs an ArgMax operation on sub-vector [d.sub.32, d.sub.33, . . . , d.sub.63], the processing element 102O performs an ArgMax operation on sub-vector [d.sub.64, d.sub.65, . . . , d.sub.95], the processing element 102P performs an ArgMax operation on sub-vector [d.sub.96, d.sub.97, . . . , d.sub.127]. Accordingly, the processing element 102K outputs the largest value or an index associated with the largest value, e.g., index for element d.sub.29, the processing element 102L outputs the largest value or an index associated with the largest value, e.g., index for element d.sub.47, the processing element 102O outputs the largest value or an index associated with the largest value, e.g., index for element d.sub.93, and the processing element 102P outputs the largest value or an index associated with the largest value, e.g., index for element d.sub.123. It is appreciated that since each processing element outputs a local index associated with the largest element, an offset value can be used to track the indices globally, as presented above. A new vector containing the largest values or indices associated with the largest values from each of the processing elements 102K, 102L, 1020, and 102P is formed, e.g., [d.sub.29, d.sub.47, d.sub.93, d.sub.123]. The newly formed vector may be used by an available processing element, e.g., processing element 102O, in a subsequent iteration to perform an ArgMax operation on the newly formed vector [d.sub.29, d.sub.47, d.sub.93, d.sub.123]. The largest value or index associated with the largest value is therefore determined. It is appreciated that during the second iteration, the offset value of 0 associated with element d.sub.29 is used, while offset value of 32 is used for d.sub.47, 64 is used for d.sub.93, and 96 is used for d.sub.123 to track the global indices associated with the elements of the newly formed vector [d.sub.29, d.sub.47, d.sub.93, d.sub.123].

(30) It is appreciated that while the above examples are provided within the context of ArgMax operation, the embodiments and the process is not limited thereto. For example, a TopK operation may similarly be performed similar to that described above except that instead of determining/tracking the largest value and/or its index, the top K (i.e., top largest K elements) values and/or their respective indices are identified and tracked. For example, in a TopK operation each processing element outputs the TopK elements of its respective sub-vector. For illustrative purposes, it is presumed that a sub-vector for a first processing element is [10, 8, 6, 2] and a sub-vector for a second processing element is [9, 7, 3, 1] in a Top2 operation. Each processing element outputs its top 2 elements, e.g., [10, 8] from the first processing element and [9, 7] from the second processing element during its first iteration and the Top2 operation during a second iteration outputs [10, 9] which is the correct output.

(31) An illustrative example of an ML library high level code for data exchange after performing the ArgMax operation is illustrated below. In the example below is all to one gather operation for the values and indices while other examples may be all to all data exchange (gather/scatter).

(32) TABLE-US-00001 int src = 0, dst = 0; // Source and destination base addresses int K = 8;// K=8 for Top8 SyncTask_t* sync_task = (SyncTask_t*) malloc(sizeof(SyncTask_t)); sync_task->tilemask = 0xffffffffffffffffull; sync_task->syncbits = 0; sync_task->set_tag = 0; sync_task->ins_sync_tag = 0; MonitorTask_t* monitor_task = (MonitorTask_t*) malloc(sizeof(MonitorTask_t)); monitor_task->startTilePerfCnt = 0; monitor_task->endPilePerfCnt = 0; monitor_task->startDODPerfCnt = 0; monitor_task->endDODPerfCnt = 0; size = mllib_CopyRemoteAbs(inst_buffer, sync_task, monitor_task, src, dat, K, 0, 0, 1, 0, 0, 1, 0, 0, K, 64}; size += mllib_CopyRemoteAbs(inst_buffertsize, sync_task, monitor_task, src+K, dst+64*K, 2*K, 0, 0, 1, 0, 0, 1, 0, 0, 2*K, 64); sync_task->syncbits = 3; size += mllib_Sync(inst_buffertaize, sync_task, monitor_task);

(33) It is appreciated that a low level ISA code may be generated from the ML library high level code corresponding to the all to one data exchange operation above, and is shown below as an example.

(34) TABLE-US-00002 //1 PETaskBest 4 0xffffffffffffffff 0 0 0 0 0 0 0 TileTaskParamUpdUsingTileID 4 3 2 8 0 0 0 0 0 0 64 PELoadStream1 0x0 8 0 1 0 1 PEMove 0x1f 0x1e 8 0 0 1 1 1 PEStoreStream 0x0 8 0 1 0 1 2 0 0 //2 PETaskBest 4 0xffffffffffffffff 0 0 0 0 0 0 0 TileTaskParamUpdUsingTileID 4 3 2 16 0 0 0 0 0 0 64 PELoadStream1 0x8 16 0 1 0 1 PEMove 0x1f 0x1e 16 0 0 1 1 1 PEStoreStream 0x200 16 0 1 0 1 2 0 0 //3 PETaskBest 1 0xffffffffffffffff 3 0 0 0 0 0 0 PESync

(35) It is appreciated that utilizing ISA instruction in ML hardware unlike application specific integrated circuit (ASIC), is not only programmable but also leverages ahead of time compilation to efficiently divide the data (e.g., vector data) to be operated on, e.g., TopK, ArgMax, etc., via processing elements in parallel.

(36) FIG. 2 depicts a flow diagram of parallel processing of a ML operation according to one aspect of the present embodiments. At step 210, a first set of data is received, as described in FIGS. 1A-1D. At step 220, an instruction to determine a largest value with the first set of data is received, as described above. At step 230, the first set of data is divided into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements, as discussed in FIGS. 1A-1D. At step 240, the first plurality of data portions is mapped to the first plurality of processing elements, as described above. Each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of elements. At step 250, each data portion of the first plurality of data portions is processed by its respective processing element to identify a largest value from each data portion, as illustrated in FIGS. 1A-1D. The processing forms a first output data that includes the largest value from each data portion of the first plurality of data portions. It is appreciated that the first output data may be mapped to a processing element in a second iteration of the processing to identify the largest value within the first output data. It is appreciated that a global index associated with each element may be tracked, as described above. It is appreciated that the process of dividing the data into portions and mapping them to the processing elements may be repeated until all data elements of the vector data is processed. The output of the processing elements form new vector data and processed repeatedly as described above until the largest value and/or index associated with the largest value within the vector data is identified.

(37) The foregoing description of various embodiments of the claimed subject matter has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. Embodiments were chosen and described in order to best describe the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the claimed subject matter, the various embodiments and the various modifications that are suited to the particular use contemplated.