SYSTEM AND METHOD FOR COMPACTING TEST DATA IN MANY-CORE PROCESSORS
20210199717 · 2021-07-01
Inventors
Cpc classification
G01R31/31703
PHYSICS
H03K19/21
ELECTRICITY
H03K19/20
ELECTRICITY
International classification
Abstract
A method for testing a many-core processor comprises grouping a plurality of cores in the processor into a plurality of super cores, wherein each super core comprises one or more scan chains that propagate through a respective super core. Further, the method comprises grouping the plurality of super cores into a plurality of clusters. The method also comprises comparing one or more scan chain outputs of respective super cores in each cluster using a network of XOR and OR gates to generate a single bit fault signature for each scan chain in a respective cluster and compacting the single bit fault signatures for each scan chain using a hybrid of spatial and temporal compactors to generate a single bit fault signature for each cluster. The method also comprises method of using a cost function to obtain hierarchical parameters to achieve optimized ATPG effort, area overhead and test time.
Claims
1. A system for performing a method of testing a many-core processor, the system comprising: a processing device communicatively coupled with a memory comprising: a plurality of cores comprised within the many-core processor grouped into a plurality of super cores, wherein each super core comprises one or more scan chains that propagates there through, wherein each super core comprises a respective subset of the plurality of cores, and wherein the plurality of super cores are grouped into a plurality of first level clusters, wherein the plurality of first level clusters comprise clusters at a first level of a hierarchical cluster configuration, wherein each first level cluster of the plurality of first level clusters comprises a respective subset of the plurality of super cores, wherein the plurality of first level clusters are grouped into a plurality of second level clusters, wherein the plurality of second level of clusters comprise clusters at a second level of the hierarchical cluster configuration of clusters; a network of combinational logic, wherein one or more scan chain outputs of respective super cores in each first level cluster are compared using the network of combinational logic to generate a single bit fault signature for each scan chain in a respective first level cluster for each clock cycle, wherein a respective scan chain traversing each super core in a first level cluster is identical to a respective scan chain traversing other super cores in a same first level cluster; and a first compactor, wherein respective single bit fault signatures for each scan chain in each first level cluster are compacted using a first compactor to generate a single bit fault signature for a respective first level cluster for each clock cycle.
2. The system of claim 1, wherein the first compactor comprises one or more OR gates.
3. The system of claim 1, wherein the first compactor comprises a multiple input shift register (MISR).
4. The system of claim 1, wherein the network of combinational logic comprises XOR and OR gates.
5. The system of claim 1, wherein each second level cluster of the plurality of second level clusters comprises a respective subset of the plurality of first level clusters, and wherein the system further comprises: a second compactor operable to compact respective single bit fault signatures for each first level cluster in a second level cluster to generate a single bit fault signature for each second level cluster.
6. The system of claim 1, wherein a number of levels in the hierarchical cluster configuration for grouping the plurality of cores in the many-core processor are determined in accordance with a cost function.
7. The system of claim 6, wherein a type of compactor to be used for the first compactor and the second compactor is determined in accordance with the cost function.
8. The system of claim 7, wherein the type of compactor is selected from a group consisting of: OR-based compactor and a multiple input shift register (MISR).
9. The system of claim 8, wherein the cost analysis is performed in accordance with a cost function.
10. The system of claim 8, wherein the cost analysis is performed in accordance with a cost matrix.
11. The system of claim 8, wherein the cost analysis is performed by considering a plurality of factors selected from a group consisting of: area overhead, test time, aliasing probability, maximum debug overhead and debuggability.
12. A computer-implemented method of testing a many-core processor, the method comprising: comparing one or more scan chain outputs of respective super cores in each first level cluster of a plurality of first level clusters using a network of combinational logic to generate a single bit fault signature for each scan chain in a respective first level cluster for each clock cycle, wherein a plurality of cores comprised within the many-core processor are grouped into a plurality of super cores, wherein each super core comprises one or more scan chains that propagate there through, wherein each super core comprises a respective subset of the plurality of cores, wherein the plurality of super cores are grouped into the plurality of first level clusters, wherein the plurality of first level clusters comprise clusters at a first level of a hierarchical cluster configuration, wherein each first level cluster of the plurality of first level clusters comprises a respective subset of the plurality of super cores, and wherein a respective scan chain traversing each super core in a first level cluster is identical to respective scan chains traversing other super cores in a same first level cluster, wherein the plurality of first level clusters are grouped into a plurality of second level clusters, wherein the plurality of second level of clusters comprise clusters at a second level of the hierarchical cluster configuration of clusters; and compacting respective single bit fault signatures for each scan chain in each first level cluster using a first compactor to generate a single bit fault signature for a respective first level cluster for each clock cycle.
13. The computer-implemented method of claim 12, wherein the first compactor comprises one or more OR gates.
14. The computer-implemented method of claim 12, wherein the first compactor comprises a multiple input shift register (MISR).
15. The computer-implemented method of claim 12, wherein the network of combinational logic comprises XOR and OR gates.
16. The computer-implemented method of claim 12, wherein each second level cluster of the plurality of second level clusters comprises a respective subset of the plurality of first level clusters, and wherein the method further comprises: compacting respective single bit fault signatures for each first level cluster in a second level cluster using a second compactor to generate a single bit fault signature for each second level cluster.
17. A non-transitory computer-readable storage medium having stored thereon, computer executable instructions that, if executed by a computer system cause the computer system to perform a method of testing a many-core processor, the method comprising: comparing a scan chain output of respective super cores in each first level cluster of a plurality of first level clusters using a network of combinational logic to generate a single bit fault signature for each first level cluster, wherein a plurality of cores comprised within the many-core processor is grouped into a plurality of super cores, wherein each super core comprises a scan chain that propagates there through, wherein each super core comprises a respective subset of the plurality of cores, wherein the plurality of super cores are grouped into the plurality of first level clusters, wherein the plurality of first level clusters comprise clusters at a first level of a hierarchical cluster configuration, wherein the plurality of first level clusters are grouped into a plurality of second level clusters, wherein the plurality of second level of clusters comprise clusters at a second level of the hierarchical cluster configuration of clusters, wherein each first level cluster of the plurality of first level clusters comprises a respective subset of the plurality of super cores, wherein each second level cluster of the plurality of second level clusters comprises a respective subset of the plurality of first level clusters, and wherein a respective scan chain traversing each super core in a first level cluster is identical to a respective scan chain traversing other super cores in a same first level cluster; and compacting respective single bit fault signatures of each first level cluster using a first compactor to generate a single bit fault signature for the plurality of first level clusters.
18. The non-transitory computer-readable storage medium of claim 17, wherein the first compactor comprises one or more OR gates.
19. The non-transitory computer-readable storage medium of claim 17, wherein the first compactor comprises a multiple input shift register (MISR).
20. The non-transitory computer-readable storage medium of claim 17, wherein the network of combinational logic comprises XOR and OR gates.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements.
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024] In the figures, elements having the same designation have the same or similar function.
DETAILED DESCRIPTION OF THE INVENTION
[0025] Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. While the embodiments will be described in conjunction with the drawings, it will be understood that they are not intended to limit the embodiments. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents. Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a thorough understanding. However, it will be recognized by one of ordinary skill in the art that the embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the embodiments.
Notation and Nomenclature Section
[0026] Some regions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present application, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system.
[0027] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing the terms such as “grouping,” “comparing,” “compacting,” “generating,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[0028] The description below provides a discussion of computers and other devices that may include one or more modules. As used herein, the term “module” or “block” may be understood to refer to software, firmware, hardware, and/or various combinations thereof. It is noted that the blocks and modules are exemplary. The blocks or modules may be combined, integrated, separated, and/or duplicated to support various applications. Also, a function described herein as being performed at a particular module or block may be performed at one or more other modules or blocks and/or by one or more other devices instead of or in addition to the function performed at the described particular module or block. Further, the modules or blocks may be implemented across multiple devices and/or other components local or remote to one another. Additionally, the modules or blocks may be moved from one device and added to another device, and/or may be included in both devices. Any software implementations of the present invention may be tangibly embodied in one or more storage media, such as, for example, a memory device, a floppy disk, a compact disk (CD), a digital versatile disk (DVD), or other devices that may store computer code.
[0029] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present invention. As used throughout this disclosure, the singular forms “a,” “an,” and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, a reference to “a module” includes a plurality of such modules, as well as a single module, and equivalents thereof known to those skilled in the art.
[0030] System and Method for Compacting Test Data in Many-Core Processors
[0031] Typically, many-core processors (e.g., large neural network accelerators) require excessive test pattern generation efforts including many hours of run time, a significant amount of disk space and several man hours. Further, the large test patterns and long test times result in higher processor cost. Conventional test methods treat the entire homogeneous processing array of a many-core architecture (comprising identical or similar cores) as a single piece of random logic; this results in an unnecessary expenditure of run time, disk-space, test resources and man hours.
[0032] Embodiments of the present invention enable a hierarchical test solution to take advantage of the uniform architecture of many-core processors (e.g. processors with multiple identical cores) and significantly reduce automatic test pattern generation (ATPG) time, memory requirements and test execution time. Many-core processors may include but are not limited to neural network processors, AI accelerators, Digital Signal Processors (DSP), and RISC-V processors. By comparing the outputs of the several cores before transmitting the test results to the tester, embodiments of the present invention are able to take advantages of the similarities between the cores and in the test data generated by the similar cores to compact the test data directly on the processor chip. In this way, fewer bits need to be transmitted to the tester for analysis, which drastically reduces test time, storage requirements, and cost.
[0033] Embodiments of the present invention further relate to a hierarchical test compaction methodology that utilizes a hybrid combination of both spatial and temporal compactors. By comparing data from similar cores of a many-core processor instead of downloading data from all the cores to the tester, embodiments of the present invention advantageously save test time and cost with low hardware overhead. Further, embodiments of the present invention advantageously provide a compaction methodology that is scalable, flexible and makes easier debug implementation and capability.
[0034]
[0035] Each of the clusters may comprise one or more super processing engines (SPEs). The embodiment of
[0036] Each SPE may in turn comprise 4 or more processing engines (PEs) or cores, e.g., cores 101, 102, 104 and 105 as shown in
[0037] It is noted that
[0038] Each core may comprise one or more stitched scan chains that propagate through it. For example, scan chain 131 propagates through the cores of SPE 110, scan chain 132 propagates through the cores of SPE 111, scan chain 133 propagates through the cores of SPE 112 and scan chain 134 propagates through the cores of SPE 113. Note that there can be more than one scan chain that traverses a single SPE.
[0039] Each scan chain is a single stitched chain comprising chains in multiple PEs inside an SPE. It is appreciated, however, that the scan chain does not need to be a chain that is connected serially through all the PEs in an SPE. For example, one or more scan chains in an SPE may be intermingled through all the PEs in the SPE, where the PEs are collectively treated as one single piece of logic (this configuration will be discussed in more detail in connection with
[0040] Note that the scan chains in each cluster may be identical. For example, in the embodiment of
[0041] Stated differently, for multiple scan chains inserted in 2 SPEs, e.g., (Xa in SPE 1, X′a in SPE 2), (Xb in SPE 1, X′b in SPE 2), associated chains of SPEs in the same L.sub.0 cluster are identical (e.g., Xa=Xb, X′a=X′b) and are XOR-ed as shown in
[0042] If none of the cores in either SPE 110 or 111 return a faulty response for a given clock cycle, both inputs to XOR gate 136 will be a logical ‘0’, which will result in a logical ‘0’ at the output of the XOR gate 136. However, if either one or more of the cores in SPE 110 or one or more of the cores in SPE 111 return a faulty response in any given clock cycle, one of the inputs to XOR gate 136 will be a logical ‘1’ which will result in a logical ‘1’ at the output of the XOR gate 136.
[0043] The output of the XOR gate represents a single-bit fault signature for the respective scan chains in a cluster (e.g., a L.sub.0 cluster). For example, a logical ‘0’ at the output an XOR gate 136 will indicate a fault-free signature while a logical ‘1’ at the output of the XOR gate 136 will indicate a faulty response for the associated scan chains 131 and 132 in cluster 150.
[0044] It will be appreciated that in rare circumstances aliasing may occur. Aliasing occurs only if all the scan cells carry faults on the same cycle. For example, if both scan chains 131 and 132 produce a fault on the same clock cycle, a logical ‘1’ will be presented at both inputs of XOR gate 136 resulting in a logical ‘0’ at the output of the XOR gate. This is a non-representative edge (or boundary) case that is unlikely to occur for all clock cycles and can, therefore, be accounted for and corrected statistically when analyzing the test results collectively.
[0045] Thereafter, the outputs of the XOR gates (e.g. outputs 145 and 146) may then be OR-ed together to produce a one-bit fault signature for two or more clusters, where a logical ‘0’ represents a fault-free signature and a logical ‘1’ represents a fault in at least one cluster at the base-level (e.g., at the L.sub.0 level).
[0046]
[0047] As discussed in connection with
[0048] Referencing
[0049] Note that the XOR-OR comparison scheme shown in
[0050]
[0051] The cluster 201 comprises 4 SPEs. Each SPE, e.g., SPE 210 within cluster 201 comprises 4 scan chains.
[0052]
[0053] As discussed in connection with
[0054] It should be noted that while the OR-compaction network 348 in
[0055] In one embodiment, OR-compaction network 348 may be comprised of an OR-gate tree with “m” inputs and (m−1) 2-input OR-gates.
[0056] It is appreciated that the output 370 of the L.sub.1 cluster 311 if there are no errors within any of the PEs (or scan chains) will be a logical ‘0’ for that clock cycle indicating a fault-free L.sub.1 cluster 311. However, a logical ‘1’ at the output 370 of the L.sub.1 cluster 311 indicates that a fault was found in at least one of the scan chains comprised within the one or more L.sub.0 clusters.
[0057]
[0058] Instead of an OR-compactor, however,
[0059] Output 451 of the L.sub.1 cluster 449 if there are no errors within any of the PEs (or scan chains) will be a logical ‘0’ indicating a fault-free L.sub.1 cluster 449, with ignorable aliasing probability. However, a logical ‘1’ at the output 451 of the L.sub.1 cluster 449 indicates that a fault was found in at least one of the scan chains comprised within the one or more L.sub.0 clusters.
[0060] It is appreciated that an OR-gate comprises combinational logic and, accordingly, it is considered a spatial compactor. On the other hand, a MISR is a sequential element comprising sequential circuits, e.g., flip-flops and, accordingly, is considered a temporal compactor. Note that the temporal compactors, e.g., MISRs are used for both compaction and for re-timing. Compared with a spatial compactor, a temporal compactor is more complicated and costly with respect to hardware overhead.
[0061] In many instances, when compacting the outputs of potentially thousands of L.sub.0 clusters at the base-level, if only combinational logic (e.g., OR gates) is used, the 1-bit signals from a cluster may need to be routed over long distances to the next stage of the circuitry, which consumes time and may slow down the speed of the chip. The delay from the combinational circuitry may accumulate on the test paths and the test clock will typically need to be slowed down. Accordingly, sequential logic may be needed to break or split the long routes so that the longest timing paths in the chip are shorter. In one embodiment, a re-timing register may be used to shorten the test paths and increase the test clock speed. However, in the embodiment shown in
[0062]
[0063] As noted above, embodiments of the present invention relate to a hierarchical test compaction methodology that utilizes a hybrid combination of both spatial and temporal compactors. The compaction of the test results from the various clusters can take place using varying hierarchical levels. For example, certain many-core processors with fewer clusters may require fewer levels of hierarchy to compact all the test results while many-core processors with a higher number of clusters would in general require many levels of hierarchy. As will be discussed below, the number of levels in the hierarchical may be determined based on a cost analysis and function.
[0064] As shown in
[0065] The resulting compacted signal 511 at the L.sub.i+1 level of hierarchy (outputted from L.sub.i+1 cluster 551) can be further compacted down along with the outputs of the other L.sub.i+1 clusters (within the L.sub.i+2 cluster 552) using compactor 504 to yield a single bit fault signature 512 at the L.sub.i+3 level. The output 512 of L.sub.i+2 compactor 552 can be compacted with the output of other L.sub.i+2 clusters using compactor 505 to yield a single bit fault signature at the output of the L.sub.i+3 cluster 553 (which, for the example shown in
[0066] In this way, all the L.sub.0 clusters at the base-level of the hierarchy can be organized hierarchically and the results can be compacted using a combination of spatial and temporal compactors to produce a single bit fault signature for the entire chip. The compaction scheme relates to a pass/fail test, so each cluster produces a single bit signature that can be further compacted at higher levels of the hierarchy by combining the result with other clusters (using OR-gates or MISRs). A user (or a test program programmed to receive results) may only need to check if a ‘1’ appears at the output of a compactor, e.g., compactor 505 in
[0067] It should be noted that the compactors at any level of the hierarchy, e.g., compactor 502, 504, 505, may either be an OR-compactor or a MISR. Embodiments of the present invention use a cost function to determine the optimal hierarchical compaction configuration including to determine whether a compactor at any particular level will comprise an OR-compactor or a MISR. The cost model guides users to select the specific compaction architecture (as will be discussed further below).
[0068]
[0069] As seen in
[0070]
[0071] As shown in
[0072] As noted above, in an embodiment, a cost function determines the optimal hierarchical compaction configuration for a many-core processor based on several factors. In another embodiment, the cost function may be a cost matrix that is used to compute the optimal configuration.
[0073] In one embodiment, the output of the cost function comprises the compaction type at each level, L.sub.i. Additionally another output of the cost function is the resulting number of clusters at each level. In other words, the cost function (or cost matrix) helps determine how many clusters, at each level, the cores from the many-core processor will be grouped into. Furthermore, the cost function also determines whether the compactor at each level of hierarchy will be an OR-based compactor or a MISR-based compactor.
[0074] Embodiments of the present invention when used along with an appropriate small core-based ATPG method, can advantageously reduce ATPG time and memory requirements by 10x and test time by 2x.
[0075]
[0076] As shown in
[0077] The inputs of the cost function or “cost matrix” needs to receive certain inputs to determine the optimal hierarchical compaction configuration. For instance, the inputs to the cost function comprise: a) the number of identical PEs (N) in the many-core processor; b) the number of flip-flops per PE (which provides the size of the core); c) the number of output channels post-compaction (l.sub.f.sup.out) 845. The number of output channels is determined by the design limit. Given these inputs, the cost function attempts to find an optimized configuration including determining the number of hierarchies, the grouping policies, and the type of compaction.
[0078] More specifically, the outputs of the cost function comprise: a) scan chain count “m” inside an SPE; b) n.sub.PE, the number of PEs inside each SPE at the base level of the hierarchy; c) n.sub.SPE, the total number of SPEs in the design at the base level of the hierarchy (n.sub.SPE=N/n.sub.PE, where N is the total PE count in the design and is provided by the user as an input to the cost function as indicated above); d) n.sup.0.sub.SPE, the number of SPEs per L.sub.0 cluster (n.sup.0.sub.SPE=n.sub.SPE/l.sub.0.sup.out, where l.sub.0.sup.out is the number of L.sub.0 clusters); e) number of clusters at each level including L.sub.0 level clusters (l.sub.0.sup.out) 815, L.sub.1 level clusters (l.sub.1.sup.out) 825, L.sub.2 level clusters (l.sub.2.sup.out) 835, etc.; and e) the compaction type at each level of hierarchy.
[0079] In one embodiment, the cost function or matrix comprises test cost parameters. Test cost parameters may include: a) area overhead (for flops, XOR, and OR gates and how much hardware overhead is needed); b) test time (critical path delay, test cycle count); c) aliasing probability; d) maximum debug overhead (flop count); and e) debuggability. All these various parameters are taken into account in the cost function to produce the outputs discussed above. Embodiments of the present invention may provide user-defined constraint envelopes for one or more test cost parameters to prune design space and recommend optimal set of compaction configurations to the user. Embodiments of the present invention also enable optimization of test-cost (aliasing, test area overhead, and test time) to determine the optimal hierarchical compaction configuration.
[0080]
[0081] At step 902, a plurality of cores comprised within a many-core processor are grouped into a plurality of super cores. Each super core comprises one or more scan chains that propagate through a respective super core, wherein each super core comprises a subset of the plurality of cores.
[0082] At step 904, the plurality of super cores are grouped into a plurality of first level clusters. The plurality of first level clusters comprise clusters at a first level of a hierarchical configuration of clusters (e.g., cluster L.sub.0 310 in
[0083] At step 906, one or more scan chain outputs of respective super cores in each cluster are compared using a network of XOR and OR gates to generate a single bit fault signature for each scan chain in a respective cluster.
[0084] Finally, at step 908, respective single bit fault signatures for each scan chain in each cluster are compacted using a compactor to generate a single bit fault signature for a respective cluster. As noted previously, the compactor can either be a spatial compactor (e.g. a network of OR-gates comprising combinational logic) or a temporal compactor (e.g. a MISR comprising sequential logic).
[0085] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as may be suited to the particular use contemplated.