Simultaneous multi-processor apparatus applicable to achieving exascale performance for algorithms and program systems
11669418 · 2023-06-06
Assignee
Inventors
CPC classification
G06F11/10 (PHYSICS)
G06F15/161 (PHYSICS)
International classification
G06F11/14 (PHYSICS)
G06F11/10 (PHYSICS)
G06F11/16 (PHYSICS)
G06F11/20 (PHYSICS)
G06F15/16 (PHYSICS)
Abstract
Apparatus adapted for exascale computers are disclosed. The apparatus includes, but is not limited to, at least one of: a system, data processor chip (DPC), Landing Module (LM), chips including LM, anticipator chips, simultaneous multi-processor (SMP) cores, SMP channel (SMPC) cores, channels, bundles of channels, printed circuit boards (PCB) including bundles, floating point adders, accumulation managers, QUAD Link Anticipating Memory (QUADLAM), communication networks extended by coupling links of QUADLAM, log2 calculators, exp2 calculators, logALU, Non-Linear Accelerator (NLA), and stairways. Methods of algorithm and program development, verification and debugging are also disclosed. Collectively, embodiments of these elements disclose a class of supercomputers that obsoletes Amdahl's Law, providing cabinets of petaflop performance and systems that may meet or exceed an exaflop of performance for Block LU Decomposition (Linpack).
Claims
1. An apparatus, comprising:
A. an anticipator chip adapted to respond to a system performance requirement of Kpref exaflop by a system for a sustained runtime for an algorithm and an incremental state of said algorithm received by said anticipator;
B. said anticipator chip is adapted to respond to said incremental state by creating an anticipated requirement; and
C. said anticipator chip is adapted to respond to said anticipated requirement by directing at least part of said system to achieve said system performance requirement;
D. wherein said anticipator further includes a state table adapted for configuration to integrate said incremental states of said algorithm to update said state table to account for said anticipated requirement; and said anticipator responds to a successor incremental state based upon said state table in order to generate a successor anticipated requirement;
E. wherein said state table is adapted to integrate said incremental states of said algorithm to update said state table to account for said anticipated requirement, for each of said incremental states;
F. wherein said algorithm includes a form of Block LU Decomposition with partial pivoting of a matrix A including at least N rows and at least N columns of double precision floating point numbers, where said N is a member of a second group consisting of said number multiplied by K of said rows and said columns resulting in said Kpref sustained performance, where said K is 1024;
G. wherein said number is a member of the second group consisting of 1/4 to 16;
H. wherein said sustained runtime is at least one and no more than 8 hours;
I. wherein said incremental state includes a pivot decision for one of said columns of said matrix A; and
J. wherein said Kpref exaflop is a member of a first group consisting of ¼ exaflop to 1 exaflop.
2. The apparatus of claim 1, wherein said anticipated requirement includes A. an anticipated future memory transfer requirement of at least one memory unit array as an associated large memory to said anticipator chip; and B. an anticipated future transfer requirement of at least one Landing Module (LM) chip as at least one associated communication node chip to said anticipator chip.
3. The apparatus of claim 2, wherein said anticipator adapted to respond to said anticipated requirement includes said anticipator configured to perform A. said anticipator scheduling memory transfers of said associated memory unit array to fulfill said anticipated future memory transfer requirement; B. said anticipator configuring at least one of said associated communication node chips to fulfill said anticipated future transfer requirement.
4. The apparatus of claim 2, wherein said anticipated requirement further includes: A. an anticipated internal transfer requirement for a Data Processor Chip (DPC) as an associated DPC to said anticipator chip; and B. said anticipator configuring at most one of said associated DPC to respond to said anticipated internal transfer requirement of said associated DPC with any coupled said associated communication node chips so that said performance requirement is met in the average over said sustained runtime.
5. The apparatus of claim 2, wherein said system performance requirement includes said system performing said Kpref exaflop for said sustained runtime directed by said algorithm.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE DRAWINGS
(10) Systems for which algorithm implementations may be proven to have exaflop or greater performance require that all of the above-summarized problems be solved. Otherwise, the above basic systems analytic performance parameters for the system do not exist, and accurate performance proofs are impossible without them. Doing this requires a description of a system that accurately describes the hardware in terms of its systems analytic parameters, with the minimum detail needed by the algorithm developers.
(11) A supercomputing system is a system including sub-systems known as cabinets. Each cabinet includes sub-systems known as rows of Printed Circuit Boards (PCBs). Each of the rows of PCBs includes sub-systems referred to as a backplane PCB, at least one data memory node PCB, and/or at least one communicating memory PCB.
(13) Unless otherwise noted, multipliers and multiplication refer to floating point multiplication, in particular double precision floating point multiplication. Non-additive term generation refers to the result of some combination of logarithm base 2, logarithmic-domain addition, logarithmic multiplication, and exponentiation base 2.
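In miniature, non-additive term generation replaces a multiplication with a log2, a log-domain addition, and an exp2. A sketch for positive inputs only (the function name is ours, not the patent's):

```python
import math

def log_domain_multiply(a: float, b: float) -> float:
    """Carry out a*b entirely through log2, log-domain addition, and exp2.
    Sketch for positive inputs; the hardware works on fixed point logs."""
    return 2.0 ** (math.log2(a) + math.log2(b))
```

The point of the detour through the log domain is that the expensive multiplier array is replaced by a fixed point adder between the log2 and exp2 stages.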
(17) The system of
(18) All, or almost all, components are controlled by, and respond to, their local stimuli and control state, implementing simultaneous communications and processing throughout the system. This document discloses and provides the basis for claiming that all exascale systems will include a version of the example system implementing simultaneous communications and processing throughout that example system. While various legacy computers, possibly supporting von Neumann architectures, superscalar instruction processing (whether or not multi-threaded), and possibly supporting caches, may be found scattered through such systems, they cannot be in the critical path of data processing and communications required for algorithms such as Block LU Decomposition (Linpack) to operate at an exaflop for at least 8 hours of runtime.
(22) The anticipated requirement may include: an anticipated future memory transfer requirement of at least one of the memory unit arrays as an associated large memory to the anticipator chip; an anticipated future transfer requirement of at least one of the LM chips as at least one associated communication node chip to the anticipator chip; and an anticipated internal transfer requirement for at most one of the DPCs as an associated DPC to the anticipator chip.
(23) The anticipator may be adapted to respond to the anticipated requirement by being configured to: schedule memory transfers of the associated memory unit array to fulfill the anticipated future memory transfer requirement; and configure at least one of the associated communication node chips to fulfill the anticipated future transfer requirement.
(24) The DPCs collectively create multiple computing floor windows into a data space of the algorithm. The anticipated future memory transfer requirement may include an anticipated computing floor window input requirement from the associated memory unit array and an anticipated computing floor window output requirement to the associated memory unit array. The anticipated future transfer requirement of the associated communication node chip may include anticipated future transfer requirements across the computing floor window and an anticipated future transfer requirement for a subsequent computing floor window. The anticipated internal transfer requirement for the associated DPC with the anticipator chip may include an anticipated loading requirement into the DPC of the computing floor window and an anticipated storing requirement from the DPC of the computing floor window. The system performance requirement may include the system performing at least ¼ of a billion billion floating point operations per second (¼ exaflop) for a sustained runtime directed by the algorithm, and may include the system performing at least one exaflop for the sustained runtime directed by the algorithm.
(25) The computing floor window may include at least two columns of blocks of r rows and r columns of the matrix A, traversing all of the N rows, where r is at least 16. The incremental state may include a pivot of a column from a diagonal row to the Nth row of the matrix A. Also, at least one of the memory unit arrays may include at least one Dynamic RAM (DRAM).
(26) From a different perspective, the apparatus of this invention includes an anticipator adapted to respond to a system performance requirement by a system for an algorithm and an incremental state of the algorithm received by the anticipator. The anticipator is adapted to respond to the incremental state by creating an anticipated requirement. The anticipator is adapted to respond to the anticipated requirement by directing the system to achieve the system performance requirement. In many implementations the anticipator may well be a chip, and to simplify this discussion, but not to limit the scope of the claims, anticipators will be referred to as anticipator chips. The anticipated requirement may include an anticipated future memory transfer requirement of at least one memory unit array as an associated large memory to the anticipator chip, an anticipated future transfer requirement of at least one Landing Module (LM) chip as at least one associated communication node chip to the anticipator chip, and an anticipated internal transfer requirement for at most one Data Processor Chip (DPC) as an associated DPC to the anticipator chip.
(27) The anticipator chip (AC), adapted to respond to the anticipated requirement, includes the anticipator configured to: schedule memory transfers of the associated memory unit array to fulfill the anticipated future memory transfer requirement; configure at least one of the associated communication node chips to fulfill the anticipated future transfer requirement; and configure at most one of the associated DPCs to respond to the anticipated internal transfer requirement of the associated DPC with any coupled associated communication node chips, so that the performance requirement is met on average over the sustained runtime.
(28) The anticipator may further include a state table adapted for configuration to integrate the incremental states of the algorithm, updating the state table to account for the anticipated requirement; the anticipator then responds to a successor incremental state based upon the state table in order to generate a successor anticipated requirement. The state table may be adapted to integrate the incremental states of the algorithm, updating the state table to account for the anticipated requirement, for each of the incremental states. The incremental state may include a pivot decision for one of the columns of the matrix A.
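As a loose illustration of the anticipator's state-table loop described above (all names, the table contents, and the requirement format are invented for illustration; nothing here is taken from the patent's figures):

```python
class Anticipator:
    """Toy model: integrate each incremental state (e.g. a pivot decision)
    into a state table, then emit an anticipated requirement that directs
    the system ahead of the successor state."""

    def __init__(self):
        self.state_table = {"pivots_seen": 0}

    def on_incremental_state(self, pivot_row: int):
        # Integrate the incremental state into the state table.
        self.state_table["pivots_seen"] += 1
        self.state_table["last_pivot"] = pivot_row
        return self.anticipated_requirement()

    def anticipated_requirement(self):
        # E.g. which block column should be resident before the next step.
        return {"prefetch_block_column": self.state_table["pivots_seen"]}
```

The essential property claimed is that each requirement is generated from the integrated table, not recomputed from scratch, so the successor requirement is available before the successor state arrives.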
(30) This capability to encapsulate both the data and the program changes the nature of programming these computers. Assuming for the moment that one core may keep its multiplier and possibly a non-additive term generator busy at least 90% of the time, and that the other resources of the core may keep up, the core, in processing a 128 by 128 LU Decomposition, is busy for a minimum of about 300K clock cycles, during which time there has been no load on the surrounding resources nor on the external communications network. Also, anything not actively used has been turned off, no longer consuming power whenever it is not being used. Note that if all the resources of the PEM, containing 4 cores, are put to the task of calculating the LU Decomposition, the results may be achieved 4 times faster: the performance improvement is linear because, again, the multiplications and non-additive term generation do not stall and everything else keeps up.
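As a rough sanity check on that cycle count (our arithmetic, not the patent's): LU decomposition of an n by n matrix performs about n³/3 multiplications and divisions, so a core retiring roughly two such operations per cycle (multiplier plus non-additive term generator) at high utilization lands in the few-hundred-thousand-cycle range for n = 128:

```python
def lu_multiply_count(n: int) -> int:
    """Multiplications + divisions in LU decomposition of an n x n matrix:
    elimination step k performs (n-k-1) divisions and (n-k-1)**2
    multiply-subtract updates; the total is n*(n**2 - 1)/3."""
    return sum((n - k - 1) + (n - k - 1) ** 2 for k in range(n - 1))

ops = lu_multiply_count(128)     # 699_008 operations
# Two operations retired per cycle at ~90% utilization:
cycles = ops / (2 * 0.9)         # roughly 390K cycles, the same order
                                 # as the ~300K cycles cited above
```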
(31) Returning to
(32) Today's computer architectures stem from the von Neumann architecture, and from three primary devices building on that architecture. The von Neumann architecture implements a central processing unit (CPU) using a program counter to access a location in a memory to fetch an instruction. The CPU responds to the fetched instruction by translating it into some sequence of states, generally referred to as executing the instruction. The program counter may be altered, and the CPU repeats the process of fetching and executing instructions. The three primary devices are the IBM 360 with its use of caching, the VAX-11 with its multi-tasking and virtual memory environment, and the Pentium as representative of superscalar microprocessors. The IBM 360 introduced caches as a way to interface slow, but large, memories to the CPU. The VAX-11 successfully ran a multitude of different programs on the same CPU during a small time interval, where each program could pretend that it ran in a huge memory space. The superscalar microprocessor interprets an intermediate language of a simpler architecture, such as the 80486 or PowerPC, into smaller (pico) instructions. The pico-instructions are scheduled into streams that simultaneously operate data processing resources, such as floating point arithmetic units, at a far higher rate than the intermediate language made apparent. All of these innovations made for better general purpose computers. The extension of multithreading to superscalar microprocessors is discussed later.
(33) These legacy architectural components do not address the needs of high performance computers (HPC), the power requirements for Digital Signal Processing (DSP) circuits, nor the requirements for System On a Chip (SOC) components today. The following research results are applicable to DSP and embedded cores for SOC, but our focus here is on HPC. Each HPC program saturates the resources of its execution engine. Rather than running many programs on one computer at the same time, only one program is running on the many computers in the HPC system at the same time.
(35) In the SMP core, each simultaneous process separately owns instructed resources of the core. These owned resources, combined with the state index issued by the owning process's state calculator, form the processor embodying the process. Each owned instructed resource includes its own local instruction processor that simultaneously responds to the process state of its owning process to generate a local instruction that instructs the instructed resource as part of the owning process. Instruction processing is local to each data processing resource. These data processing resources, such as a data memory port, an adder, and so on, are called instructed resources. Each process owns separate instructed resources so that the Parallel Part (PP) and the Sequential Part (SP) need not stall each other. Owning a resource means that one, and only one, process within a task stimulates its instruction processing with its process state. A program defines the resources owned by the specific simultaneous processes of a task. A process state calculator issues a process index for each of the simultaneous processes. Local resources performing data processing, memory access, I/O and feedback are each owned by specific instruction processors, or are not used at all by that task. Ownership may vary between tasks, but within one task it is fixed. Each simultaneous process may own some of the instructed resources, which it exclusively uses and controls. For each of the simultaneous processes, the local instruction processor uses the process index for these owned resources to create a local instruction for the resource. This local instruction directs the execution of the simultaneous process through this resource.
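The ownership rules above can be caricatured in a few lines (a loose illustrative model; the resource names, tables, and "nop" default are invented, and real local instruction processors are hardware, not dictionaries):

```python
# Each instructed resource is owned by exactly one simultaneous process and
# derives its local instruction from that process's state index alone, so
# the Parallel Part (PP) and Sequential Part (SP) never stall each other.
OWNER = {"multiplier": "PP", "rd_port0": "PP", "adder": "SP"}

LOCAL_PROGRAM = {
    # (resource, owning process's state index) -> local instruction
    ("multiplier", 0): "mul a, b",
    ("rd_port0", 0): "read x[i]",
    ("adder", 1): "acc += t",
}

def local_instruction(resource, state_index):
    owner = OWNER[resource]      # ownership is fixed within a task
    return LOCAL_PROGRAM.get((resource, state_index[owner]), "nop")
```

Note that each resource consults only its owner's state index, which is the sense in which instruction processing is local rather than driven by a shared fetched instruction stream.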
(36) These basic decisions bring substantial benefits: The SMP core simultaneously performs both processes PP and SP as shown in
(38) The SMP core is shown executing two simultaneous processes by generating two process indexes that each drive instruction processing for the instructed resources owned by one of these processes. Each instructed resource is instructed by a local instruction generated in response to the process index of the owning simultaneous process. Both the parallelizable and sequential parts may be implemented as simultaneous processes that do not stall each other to execute. Locally generated instructions, selected from multiple process indexes, ensure operational diversity in controlling the resources while minimizing instruction redundancy. Matrix inversion by Gaussian elimination requires fewer than 24 local instructions.
(39) This combination of the process state calculators and the execution wave front renders both large external VLIW memories and instruction caches obsolete. Also, the typical first level data cache containing 32K bytes is replaced by four instances of high speed static RAMs, each containing 1K (1,024) double precision floating point numbers, all of which is now completely under the control of the program. All of this greatly improves energy efficiency.
(40) The execution waves are generated on each clock cycle by continuously calculating the process indexes in the instruction pipe 0 to support a simple flat time execution model. This not only simplifies the programming, but also optimizes task switching. The data entering the instruction pipe with the execution wave front generates the data results coming out of the instruction pipe. Further simplicity results from requiring the inputs of each instruction pipe to come from the outputs of the previous instruction pipe. The execution wave front as implemented in arithmetic units, such as floating point adders, forbids feedback paths internal to these units.
(41) The SMP core may be adapted to respond to a clock signal oscillating through successive clock cycles at approximately a clock period. The process state calculator is adapted to calculate the state indexes of the simultaneous processes on every clock cycle. The instruction pipe stages each include at least one, and often more than one, instructed resource, each owned by no more than one of the simultaneous processes. The process state calculator also generates a usage vector for each of the simultaneous processes, which designates which of the instructed resources are used in the execution wave front to perform the operations of the process. The process state calculator also generates a use vector summarizing which instructed resources are used for the execution wave front across all the simultaneous processes.
(42) As the execution wave front approaches the next instruction pipe stage, the use vector component for each of the instructed resources of the next stage is used to gate the power to the instructed resource, generating the gated power to that instructed resource. As a consequence, if no instructed resources are used in the execution wave front, the instructed resources are essentially turned off during the execution wave front's traversal of the instruction pipe stages.
(43) For example, a floating point adder operating at 200 MHz is unlikely to have the same pipe stages as one operating at 1 GHz. Instead of internal feedback, each feedback path is made external to the arithmetic units and partitioned into separate instructed resources. One receives input, Fin, and the others provide output ports, Fout, for feedback path queues. Simultaneous processes, like the parallelizable and sequential processes of matrix inversion, may now communicate through the separately owned input and output ports of the feedback paths in a core. Data memory is shown as including 4 RAM blocks, each with a read port with two output queues (RD 0 Q0 and Q1, for instance) and a write port (WR 0).
(44) The execution wave replaces a traditional bus and provides substantial benefits. The output of each feedback path is organized as multiple queues that stimulate the calculation of process indexes and/or the local instruction processing as the data becomes available for use within the owning process. Multiple queues in a single feedback output port enable a hierarchical response to data availability, allowing a single adder to act like a cascading adder network for accumulation in Finite Impulse Response (FIR) filters and dot products, as well as pivot entry calculation in matrix inversion and LU decomposition. All of these algorithms, as well as matrix algorithms and vector products, may now be implemented so that the multiplications do not stall, and the other core circuitry keeps up with the multiplications, providing maximum performance at the least energy cost for the required operations. This is independent of core clock frequency, or of the number of pipe stages in the arithmetic circuits.
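The hierarchical-queue accumulation idea (one adder emulating a cascading adder tree) can be illustrated roughly as follows; the queue discipline here is our own sketch, not the patent's circuit:

```python
from collections import deque

def accumulate(values):
    """One adder emulating a cascading adder tree via per-level queues:
    whenever a new partial sum finds a queued partner at its level, the
    adder combines them into a single sum one level up."""
    queues = {}                      # level -> deque of pending partial sums
    for v in values:
        level, s = 0, v
        while queues.get(level):     # partner available at this level?
            s += queues[level].popleft()
            level += 1
        queues.setdefault(level, deque()).append(s)
    # drain whatever partial sums remain at the end of the stream
    return sum(s for q in queues.values() for s in q)
```

The benefit of the tree shape is that additions pair values of similar provenance, exactly what a physical cascading adder network does for FIR filters and dot products.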
(45) As used herein, the SMP core of
(46) When data processing involves integers, the core may be referred to as an SMP integer core. When the integers range over an N bit field, the core may be referred to as an SMP Int N bit core. For example, N may be 32, 48, 64, and/or 128 bits, and/or other bit lengths. The use of and/or in the previous sentence is an acknowledgement that multiple integer lengths may be efficiently performed using the execution wave front through the resources of the SMP integer core. One skilled in the art will recognize that integers may be used in arithmetic as signed and/or unsigned numbers, possibly representing fixed point numbers. Addition may also be supplemented by logic operations on corresponding bits of integer operands, possibly after one or more of those operands have been shifted.
(47) When data processing involves Floating Point (FP) numbers, the core may be referred to as an SMP FP core. The FP numbers are compatible with a floating point standard denoted as single precision (SP) with k guard bits (SP+k G), double precision (DP) with k guard bits (DP+k G), or extended precision (EP) with k guard bits (EP+k G). For example, the core may be referred to as an SMP (DP) core when the floating point standard is DP. By way of example, k may be an integer such as 0 to 6 in some implementations; in other implementations k may be larger. The number of guard bits k will be assumed to be one unless otherwise stated.
(48) Basic data cores refer to SMP data cores involving numbers operated upon by multiplication and/or addition, and possibly also logic operations such as Boolean operations, table lookups, and various shift-based operations.
(49) In several situations, some basic non-linear operations, such as reciprocal and/or reciprocal square root may be required. For the moment, to simplify the discussion, consider these operations to be provided for floating point numbers, for example, single precision (SP) numbers or double precision (DP) numbers. These operations can be provided by basic Non-Linear Accelerators (NLA), first shown in
(50) Gaussian elimination or LU decomposition. The basic NLA may also include a range clamp that can be configured to respond to a received FP number by generating a small integer output and a range limited (or clamped) fractional number, whose absolute value is less than or equal to 1.0. The small integer output can be used to direct a simultaneous process to calculate a range limited approximation of a non-linear function such as sine or cosine, logarithm or exponential, to name some examples. The Basic NLA core may in some implementations, have no inherent processes associated with it, acting instead as instructed resources arranged in the instruction pipes as shown and owned by one or more simultaneous processes associated with a SMP data core.
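The range clamp behavior described for the basic NLA can be sketched for the familiar case of trig argument reduction (an illustrative model with invented names; the patent's clamp circuit is not reproduced here):

```python
import math

def range_clamp(x):
    """Illustrative range clamp: split x into a small integer n and a
    clamped fraction f with |f| <= 1.0, such that x = (n + f) * (pi/2).
    The small integer selects which range-limited approximation (e.g.
    sin or cos, with sign) a simultaneous process applies to f."""
    t = x / (math.pi / 2)
    n = round(t)
    return n, t - n                  # |t - n| <= 0.5 <= 1.0
```

The same pattern (integer selector plus clamped fraction) serves logarithm and exponential approximations, with the reduction constant swapped accordingly.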
(51) There is, however, a problem with the basic NLA. Polynomial approximations can often require twice as many multiplications as non-additive terms actually used in the polynomial calculation. The inventors have developed log-based NLA cores, specific to single precision and to double precision floating point, as shown first in
(52) There is a second problem. Consider for the moment an SMP core that can accumulate a condition vector of operational conditions, resulting from a succession of comparison operations of a c-adder or a range clamp, into a bit vector of 64 to 128 bits in length. Such a condition vector may summarize answers to a collection of questions about database entries, such as a person's age, weight, time of birth and so on, as a first step in data mining a database of such information. What is also needed is a mechanism to simultaneously match the condition vector against multiple patterns, looking for outliers and/or how many of the vectors match a given pattern. The Pattern Recognizer (PR) core serves that purpose, and is adapted to receive the condition vector and simultaneously match it to a collection of pattern templates to generate and/or update a collection of tallies, or to flag outlier condition vectors, as an execution wave front. In
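The Pattern Recognizer's template matching can be modeled compactly (the mask/value template encoding and the function name are our own illustration, not the PR core's actual format):

```python
def match_patterns(condition_vector, templates):
    """Illustrative Pattern Recognizer step: each template is a (mask,
    value) pair; the condition vector matches when its masked bits equal
    value. Returns a 0/1 tally contribution per template for this vector."""
    return [int((condition_vector & mask) == value)
            for mask, value in templates]
```

In the hardware, every template is compared in the same execution wave front, so one pass over the condition vectors updates all tallies at once.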
(55) A very interesting simplifying assumption can be implemented in some embodiments. Assume that no simultaneous process owns resources involving more than one type of number, so that an integer SMP core's processes only own instructed resources in one or more integer SMP cores, and an FP SMP core's processes only own instructed resources in one or more FP SMP cores. In some situations, a SP core's processes may not own instructed resources in a DP core.
(56) Two circuits provide interfaces between the integer and floating point SMP cores. The float to int circuit converts a floating point number into an integer, and the int to float circuit converts an integer to a floating point number. These circuits straddle the two cores in terms of process ownership: the int core interface components may be owned by one of the int SMP core processes, while the FP core interface components may be owned by one of the FP SMP core processes. This is shown in the example of
(57) Summarizing, the apparatus may include a Simultaneous Multi-Processor (SMP) core including a process state calculator adapted to generate a state index for each of at least two simultaneous processes; and an instruction pipeline of at least two successive instruction pipe stages adapted to execute the state index for each of the simultaneous processes, collectively performed by an execution wave front through the successive instruction pipe stages, with use of an owned instructed resource by one of the simultaneous processes determining whether power is supplied to the instructed resource.
(58) Further, in some embodiments, a core module as shown in
(60) The log2 input queue feeds the log2 calculator, which responds to the availability of data in the log2 input queue by generating the LgCalc Output (Out), which is a log domain formatted number, shown in some detail in
(61) The log ALU is shown receiving log domain inputs to feed 4 input queues that generate the log domain numbers used inside the logALU. These log numbers are added as fixed point numbers with indicators, which may include, but are not limited to, Neg(ative) Number, Not-a-Number (NaN), Neg Infinity (NegInf) and Pos Infinity. NegInf results from taking the log2 of the FP number 0.0. In the logALU, adding a log number with NegInf asserted results in a log result with NegInf asserted. The exp2 of a log number with NegInf asserted is FP 0.0. This ensures that 0*x=0 for all normal and denormal FP numbers x.
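A loose software model of the log-domain number format and the 0*x = 0 rule described above (the field names are invented, and the hardware uses fixed point logs rather than Python floats):

```python
from dataclasses import dataclass

@dataclass
class LogNum:
    """Illustrative log domain number: a log2 magnitude plus indicators."""
    log2: float
    neg: bool = False        # sign of the original FP number
    neg_inf: bool = False    # log2(0.0); absorbing under log-domain addition
    nan: bool = False

def log_mul(a, b):
    """Multiplication in the log domain is addition of the logs; NegInf
    propagates so that exp2 of the result is FP 0.0, ensuring 0*x == 0."""
    return LogNum(log2=a.log2 + b.log2,
                  neg=a.neg != b.neg,
                  neg_inf=a.neg_inf or b.neg_inf,
                  nan=a.nan or b.nan)
```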
(62) The FP2L, the Log2FP and the LogMul circuits are well enough understood that implementations of these circuits compatible with double precision floating point do not represent any substantial feasibility problems. This leaves the log2 calculator, exp2 calculator and the logALU, which will now be considered in turn.
(63) For the NLA to be feasible and testable, it is necessary to derive and analyze the log2 circuit. Several implementations of the log2 calculators shown in
(65) Once 1+x is generated, an initial selection y0 is made as the most significant bits of x. Assume for the moment that y0 ranges from 0 to 7, the top 3 most significant bits of x. The Cur_product 0 is generated as 1+y0/8. The indicators may include, but are not limited to, Neg(ative) Number, Not-a-Number (NaN), Neg Infinity and Pos Infinity. Note that if FPN=0.0, NegInfinity is asserted; otherwise NegInfinity is not asserted. If FPN=∞, Neg Number is asserted and Pos Infinity is asserted. The exponent_value may be calculated based upon the double precision format as defined by “IEEE 754 Standard for Binary Floating-Point Arithmetic” (ANSI/IEEE, 1985) and/or subsequent standards. In this and the following version of the log2 calculator, the step calculators and possibly the log table calculator may have the execution wave front gated off when the indicators indicate that the mantissa is not needed to generate lgCalc Out. The execution wave front may also be gated off when the calculated log2 is not needed.
(67) In both Figures, the yj stimulates a log table j to generate Lg j. The cur_product j stimulates the next step calculator j+1, until step calculator J, which does not generate cur_product J+1 nor stimulate a subsequent step calculator. The critical path for the step calculators may be seen as the path to generate the next cur_product and in the last step calculator, the path to generate yJ.
(68) In both Figures, the log table calculator receives the y0 to yJ indexes, which in many implementations are 3 bit numbers, used to access corresponding tables of fixed point numbers to generate Lg 0 to Lg J. The 0 indexed entry represents zero and the other entries are non-zero in at least some of their bits.
(69) Simplistically, in log table 0, the non-zero entries are filled across all the bits. However, subsequent log tables have their top bits zeroed: log table 1 has its top 3 bits zeroed, log table 2 its top 6 bits, log table 3 its top 9 bits, and so on. Also note that in both Figures, the log table calculator may not be pipe stage aligned with the step calculators.
(70) Formal Verification of the Log2 Calculator: Assume that the mantissa of the floating point input is correctly generated with regard to zero, denormals, Not-a-Number (NaN), and infinities. One note: negative infinity in the log domain corresponds to zero in the floating point domain, and adding a log number with negative infinity asserted to another log number generates a log result with negative infinity set, to ensure that 0*x=0 is true in the corresponding floating point operations. Two definitions are used in what follows. First, in performing additions of two binary integers, the result requires the bit level carry propagation to traverse every bit cell formed to calculate the result for corresponding bits of the two numbers. This carry propagation is expensive in circuitry and in propagation time. Second, an alternative known as a carry save adder, invented by von Neumann, generates a local sum and a local carry output in each of the bit cells. Corresponding bits of three integers can be summed with the basic circuit cell. Define these bits as a, b, c and define
Local_sum=a.Math.b.Math.c=(a&
Local_carry=(a&b)∪(a&c)∪(b&c)
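The carry-save bit cell can be sketched directly in software. The following Python model is illustrative only; the word-level helper and the 8-bit default are assumptions, not part of the circuit description:

```python
def csa_cell(a, b, c):
    # von Neumann carry-save adder bit cell: compresses three input bits into
    # a local sum and a local carry, with no carry propagation between cells
    local_sum = a ^ b ^ c                      # 1 when one or all three inputs are 1
    local_carry = (a & b) | (a & c) | (b & c)  # 1 when two or more inputs are 1
    return local_sum, local_carry

def csa_word(A, B, C):
    # apply the cell bitwise to three integers; the pair (sum, carry << 1)
    # adds up to A+B+C, deferring the expensive carry propagation to one
    # final conventional add
    s = A ^ B ^ C
    carry = (A & B) | (A & C) | (B & C)
    return s, carry << 1
```

The sum/carry pair is the usual 3:2 compression: `s + (carry << 1)` always equals `A + B + C`.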
(71) The local_sum is 1 when only one of a, b, and c is one or when all three of them are 1. The local_carry is 1 when two or more of a, b, and c are 1, assuming a bit notation of 1 and 0. Every logic technology used to build computers is likely to have circuit cells capable of implementing this, or some variant of this, circuitry. Assume that the input X has an exponent Xe and a mantissa
(72) 1+x = 1 + Σ_{j=1}^{18} x_j·8^(-j), with 0 ≤ x < 1 and each x_j an octal digit,
which includes the guard bit. The mantissa is factored into the following product:
(73) 1+x ≈ Π_{j=1}^{19} (1 + y_j·8^(-j))
(74) Because the error is less than ¼ of the guard bit, this factorization is accurate enough to represent the mantissa. Consider log_2(Π_{j=1}^{J}(1+y_j·8^(-j))). Since the logarithm of a product is the sum of the logarithms of the product's terms:
(75) log_2(Π_{j=1}^{J}(1+y_j·8^(-j))) = Σ_{j=1}^{J} log_2(1+y_j·8^(-j))
(76) So putting these pieces together
(77) log_2(X) = Xe + log_2(1+x) ≈ Xe + Σ_{j=1}^{19} log_2(1+y_j·8^(-j))
(78) This is a sum of an entry from each of J+1=19 tables, each table having 7 non-zero entries. The table entries are fixed point and as accurate as needed to insure the log fraction is as accurate as required. The problem to be solved is that, given
(79) 1+x, where 0 ≤ x < 1,
we need to find the best fit of
(80) Π_{j=1}^{19} (1 + y_j·8^(-j))
By best fit, we mean that each product term (1+y_j·8^(-j)) has a non-negative remainder that is the smallest positive remainder of the choices for the factors (1+y_j·8^(-j)), with y_j ranging from 0 to 7. Once found
(81) log_2(1+x) ≈ Σ_{j=1}^{19} log_2(1+y_j·8^(-j))
Steps to the solution include initialization, preparing for a subsequent factoring step, performing the factoring step and calculating the logarithm after the last factoring step.
(82) Initialization: Assume we have already calculated log_2_table as having 19 by 8 entries, with log_2_table(j,0)=0, for each j from 1 to 19. Further assume these table entries are accurate to M fixed point bits and that all additions are performed in an M+1 bit unsigned integer adder structure, so that overflow is the topmost bit.
(83) We are about to calculate a vector [y_1 y_2 . . . y_19] representing the best fit product terms
(84) Π_{j=1}^{19} (1 + y_j·8^(-j))
Let y_1 = x_1. The remainder is
(85) (1+x) − (1 + x_1·8^(-1)) = x − x_1·8^(-1)
(86) which is non-negative. Observe that this is the best fit for the first product term. Consider choosing a different value Y for y_1.
(87) If Y > x_1 then it is not a valid choice since the remainder would be negative.
(88) If x_1 > 0 and Y < x_1, then the remainder
(89) x − Y·8^(-1)
is greater than
(90) x − x_1·8^(-1)
(91) If x_1 = 0, then there is no smaller acceptable Y.
(92) Preparing for each subsequent factoring step: Initialization uses an implicit term, a ScalingFactor = 8^(-1). Assume that the previous step had an existing value for the ScalingFactor. Update ScalingFactor = ScalingFactor·8^(-1).
(93) In many implementations, there is some counter j whose value is incremented. After initialization, j=2.
(94) For subsequent factoring steps, j = j+1. Initialization generates a first best fit product, which is 1 + x_1·8^(-1).
(95) Assume for subsequent steps that the best fit product is denoted as Prev_product. Subsequent factorization steps calculate a vector Cur_product_k = Prev_product·(1 + k·ScalingFactor) for k = 1, . . . , 7. Calculate a second vector Remainder_k = (1+x) − Cur_product_k, again for k = 1, . . . , 7. If Remainder_1 ≥ 0 then select y_j = max {k such that Remainder_k ≥ 0}. Otherwise, y_j = 0.
(96) Hypothesis: y_j is the best fit with a non-negative remainder.
(97) Proof: if y_j = 0 then there was no non-negative remainder among Remainder_k, k = 1, . . . , 7, from the Cur_product_k, k = 1, . . . , 7, vector.
(98) Otherwise, since the Remainder_k, k = 1, . . . , 7, vector declines for each successive k, picking the largest k with a non-negative Remainder_k insures that this choice has the smallest non-negative remainder. After the last factorization step, the vector [y_1 y_2 . . . y_19] has been calculated.
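The initialization and factoring steps above can be sketched in Python. This is an illustrative model only: exact `log2` calls stand in for the fixed point log_2_table entries, and double precision floats stand in for the circuit's fixed point arithmetic:

```python
from math import log2

def log2_mantissa(x, J=19):
    # best-fit factoring of 1+x (0 <= x < 1) into prod(1 + y_j * 8**-j),
    # following the initialization and subsequent factoring steps above;
    # the first pass (j = 1) reproduces the initialization y_1 = x_1
    ys = []
    prev_product = 1.0
    scaling = 1.0
    for j in range(1, J + 1):
        scaling /= 8.0                           # ScalingFactor = 8**-j
        y_j = 0
        for k in range(7, 0, -1):                # largest k with non-negative remainder
            cur_product = prev_product * (1.0 + k * scaling)
            if (1.0 + x) - cur_product >= 0.0:
                y_j = k
                break
        ys.append(y_j)
        prev_product *= (1.0 + y_j * scaling)
    # the log fraction is the sum of table entries log2(1 + y_j * 8**-j)
    log_fraction = sum(log2(1.0 + y * 8.0 ** -j) for j, y in enumerate(ys, 1))
    return ys, prev_product, log_fraction
```

For x = 0.5 the first factor is y_1 = 4 (the top octal digit), the remaining factors are 0, and the log fraction equals log2(1.5).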
(99) The one remaining concern is the difference of remainders for the last step, denoted here as Diff_19. Now, if y_19 > 0 then Diff_19 = Cur_product_1 − Prev_product, which can be calculated as
(100) Diff_19 = Prev_product·8^(-19)
(101) And
(102) Diff_19 < 2·8^(-19) = 2^(-56)
(103) Otherwise
(104) (1+x) − Prev_product < Prev_product·8^(-19) < 2^(-56)
(105) The generated product satisfies |(1+x) − Π_{j=1}^{J+1}(1+y_j·8^(-j))| < 2·2^(−19·3) = 2^(−56), which is within ¼ of the guard bit. Recall that 1 ≤ 1+x < 2. Note that |(1+x) − Π_{j=1}^{J}(1+y_j·8^(-j))| < 2·2^(−18·3) = 2^(−53), which is twice the guard bit, indicating that the loop terminates at or before J+1.
(106)
(107) Implementation of the Log2 calculator:
(108)
(109) Calculating the logarithm as the sum of product terms: For the moment, consider 1+x to be exactly this product.
(110) Calculating
(111) log_2(1+x) = Σ_{j=1}^{19} log_2(1+y_j·8^(-j))
can be done as accurately as the product term table is calculated, since the accumulation of rounding errors can be controlled by using accurate enough estimates in the log2_table. The sum of 19 numbers with ½ LSB errors has a rounding error estimate of Rounding_error = log_2(19) ≈ 4.25 bits, so if a log domain calculator is to be accurate to ½ the guard bit in calculating X^64, then the table entries need to be accurate to 54+6+4.25 bits, or 64¼ bits. Note that the circuitry being described can just as readily implement log2 calculators for single or quad floating point precision. What changes is the number of step calculators, the precision of the arithmetic being performed in those calculators, the amount of precision (and number of tables) in the Log Tables, the number of Lg0 to LgJ log estimates, the precision of the log_fraction, and the specifics of the log domain packager. At this time, the focus of scientific and engineering calculations projected to run on exascale computers and high performance computers is double precision. Also note, the initialization step may use any number of bits to calculate y0, from 1 to N < 55. However, table sizes favor N < 12 and preferably N ≤ 10, given contemporary memory technologies. The future could well be different and preferences for N may change. The above discussion used N=3 to simplify the derivation, not to limit it to just that value of N.
(112) Summarizing, the apparatus may include a log2 calculator adapted to receive a floating point operand and to generate a log domain operand corresponding to the floating point operand with a floating point standard, comprising a component extractor adapted to respond to the floating point operand by generating an exponent, an indicator collection, a mantissa representing 1+x, where x is greater than or equal to 0 and x is less than 1, an initial product estimate Cur_product 0, and an initial factor estimate y[0]. The log2 calculator may include at least one step calculator adapted to determine a subsequent product estimate Cur_product j+1 and a subsequent factor estimate y[j+1] in response to receiving the mantissa 1+x and the Cur_product j, for j ranging from 0 up to J−1, wherein the J is at least 7; a log table calculator adapted to respond to receiving the y[0] to y[J] by generating a log fraction as the sum of log2 table entries accessed by y[k], for k ranging from 0 to the J; and a domain packager responding to the exponent, the indicator collection, and the log fraction to generate the log domain operand. The log2 calculator may include from one to 19 instances of the step calculator. The initial factor estimate may include at least the top L bits of x, wherein the L is a member of the group consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12.
(113) An Exp2 Circuit Implementation:
(114) 2^f = Π_{k=1}^{6} 2^(f_k·2^(−9k))
where
(115) f = Σ_{k=1}^{6} f_k·2^(−9k), with each f_k a 9 bit field of the log fraction.
(116) Define 6 numbers,
(117) X_k = 2^(f_k·2^(−9k)) − 1, for k = 1, . . . , 6,
for each of the fractional part outputs X_k of these 6 tables. So
RawMantissa = Π_{k=1}^{6}(1+X_k) = (1+X_1)(1+X_2)(1+X_3)(1+X_4)(1+X_5)(1+X_6)
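This decomposition can be modeled in software. The sketch below is illustrative: the 9-bit field width and six tables follow the text, but exact double precision arithmetic stands in for the fixed point table entries of a real implementation:

```python
def build_exp2_tables(fields=6, bits=9):
    # table k holds X_k = 2**(f_k * 2**(-9k)) - 1 for every 9-bit field value f_k
    tables = []
    for k in range(1, fields + 1):
        weight = 2.0 ** (-bits * k)
        tables.append([2.0 ** (f * weight) - 1.0 for f in range(2 ** bits)])
    return tables

def exp2_fraction(frac, tables, bits=9):
    # split the log fraction into 9-bit fields f_1 (most significant) .. f_6
    # and form RawMantissa = prod(1 + X_k)
    raw_mantissa = 1.0
    x = frac
    for table in tables:
        x *= 2 ** bits
        f_k = int(x)                  # extract the next 9-bit field
        x -= f_k
        raw_mantissa *= 1.0 + table[f_k]
    return raw_mantissa
```

Six table lookups and five multiplications recover 2^frac; the circuit described below replaces the multiplications with a truncated sum of partial products.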
(118) First, let's make the following definitions
(119) S_0 = Σ_{k=1}^{6} X_k
(120) S_p = Σ_{k=p+1}^{6} X_k, for p = 1, . . . , 6, so that S_5 = X_6 and S_6 = 0
RawMantissa = (1+X_1)(1+X_2)(1+X_3)(1+X_4)(1+X_5)(1+X_6) = 1 + S_0 + X_1·S_1 + X_2·S_2 + X_3·S_3 + X_4·S_4 + X_5·S_5 + X_6·S_6 + X_1·(X_2·S_2 + X_3·S_3 + X_4·S_4 + X_5·S_5 + X_6·S_6) + X_2·(X_3·S_3 + X_4·S_4 + X_5·S_5 + X_6·S_6) + X_3·(X_4·S_4 + X_5·S_5 + X_6·S_6) + X_4·(X_5·S_5 + X_6·S_6) + X_5·X_6·S_6 + X_1·X_2·(X_3·S_3 + X_4·S_4 + X_5·S_5 + X_6·S_6) + X_2·X_3·(X_4·S_4 + X_5·S_5 + X_6·S_6) + X_3·X_4·(X_5·S_5 + X_6·S_6) + X_4·X_5·(X_6·S_6) + X_1·X_2·X_3·(X_4·S_4 + X_5·S_5 + X_6·S_6) + X_2·X_3·X_4·(X_5·S_5 + X_6·S_6) + X_3·X_4·X_5·X_6·S_6 + X_1·X_2·X_3·X_4·(X_5·S_5 + X_6·S_6) + X_2·X_3·X_4·X_5·X_6·S_6 + X_1·X_2·X_3·X_4·X_5·X_6·S_6
(121) Now consider how to calculate the raw mantissa with the least logic in a formally verifiable manner, because the input space cannot be exhaustively examined, verified or tested. This implementation is derived from the above formula and from analysis of the facts below. The maximum table entries for the six tables were calculated using Wolfram Alpha and Mathematica as a starting point for analysis of the required circuitry.
(122) TABLE-US-00001
k | No of bits | Max bit from top | Max X_k | Max value (hex) | Max S_k | Remarks
1 | 59 | 0 | <2^0 | f f4ea ca43 91b5 da.33 | <2^−8 | Largest very close to 1.0
2 | 50 | 9 | <2^−9 | 5 89c0 1bd1 29f8.da | <2^−17 |
3 | 41 | 18 | <2^−18 | 2 c465 b8f1 67.05 | <2^−26 |
4 | 32 | 27 | <2^−27 | 1 6232 bde6.fc | <2^−35 |
5 | 23 | 36 | <2^−36 | b 1195e.eb | <2^−44 |
6 | 14 | 45 | <2^−45 | 588c.af | 0 | Largest entry in smallest magnitude table
(123) Observe table row 6: since S_6 = 0, any product involving S_6 has no effect on the RawMantissa.
(124) This simplifies the above formula as follows
(125)
(126) Collecting the terms involving X_5·S_5 from the RawMantissa formula, we have the following multiplied by X_5·S_5:
1 + X_1 + X_2 + X_3 + X_4 + X_1·X_2 + X_2·X_3 + X_3·X_4 + X_1·X_2·X_3 + X_2·X_3·X_4 + X_1·X_2·X_3·X_4
(127) Observe table row 5: S_5 = X_6 < 2^(−44) and Max(X_5·S_5) < 2^(−36)·2^(−44) = 2^(−80). Recall that 0 ≤ X_j ≤ max(X_j) < 1 for j = 1:4. So if we count up all terms multiplied by X_5·S_5, the above sum is less than 11, which means that these terms have no significance on the total for the RawMantissa. This simplifies the above formula as follows:
(128)
(129) Collecting the terms involving X_4·S_4 from the RawMantissa formula, we have the following multiplied by X_4·S_4: 1 + X_1 + X_2 + X_3 + X_1·X_2 + X_2·X_3 + X_1·X_2·X_3.
(130) Observe row 4: Max(X_4·S_4) < 2^(−27)·2^(−35) = 2^(−62). Again applying the insight that we have 7 terms, each of which is less than 1, this sum of terms can have no effect because 7·2^(−62) < 2^(−62+3) = 2^(−59), which makes the sum of these product terms less than numbers 5 bits below the guard bit of the RawMantissa. Based upon this insight the formula becomes
(131)
(132) Collecting the terms involving X_3·S_3 from the RawMantissa formula, we have the following multiplied by X_3·S_3: 1 + X_1 + X_2 + X_1·X_2. Observe row 3: Max(X_3·S_3) < 2^(−18)·2^(−26) = 2^(−44). The sum of these terms can affect the raw mantissa. However, it overlaps with the other terms only in the bottom 10 bits of the raw mantissa, leaving aside the issues of carry propagation, which are performed in the exp carry propagate adder. Observe row 2: Max(X_2·S_2) < 2^(−9)·2^(−17) = 2^(−26). This means that all the product terms involving X_2·S_2 affect the raw mantissa. However, they overlap with the other terms only in the bottom 28 bits of the raw mantissa. Observe row 1: Max(X_1·S_1) < 2^0·2^(−8) = 2^(−8). This means that all the product terms involving X_1·S_1 affect the raw mantissa. However, they overlap with the other terms only in the bottom 46 bits of the raw mantissa.
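The term-dropping argument above can be checked numerically. The sketch below compares the truncated sum of partial products against the full six-factor product; the specific X values are assumptions chosen to respect the table's magnitude bounds:

```python
def raw_mantissa_truncated(X1, X2, X3, X4, X5, X6):
    # S_p = sum of X_j for j > p; the S_4, S_5 and S_6 product terms were
    # shown above to fall below the guard bit, so only S_0..S_3 terms survive
    S1 = X2 + X3 + X4 + X5 + X6
    S2 = X3 + X4 + X5 + X6
    S3 = X4 + X5 + X6
    S0 = X1 + S1
    return (1.0 + S0
            + X1 * S1 + X2 * S2 + X3 * S3    # surviving first-order terms
            + X1 * (X2 * S2 + X3 * S3)       # surviving X1 cross terms
            + X2 * (X3 * S3)                 # surviving X2 cross term
            + X1 * X2 * (X3 * S3))           # surviving triple term

def raw_mantissa_full(X1, X2, X3, X4, X5, X6):
    # reference: the exact six-factor product
    return (1 + X1) * (1 + X2) * (1 + X3) * (1 + X4) * (1 + X5) * (1 + X6)
```

With inputs bounded as in the table, the dropped terms are below 2^(−59), so the truncated form matches the full product to well past the guard bit.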
(133)
(134)
(135) Summarizing, the apparatus may include an exp2 calculator adapted to receive at least one of the log domain operand and the logALU result as an exp2 input and generate an exp2 approximation accurately in the floating point standard. The exp2 input includes a log fraction part represented as a sum of f_p·2^(−9·p), with p ranging from 1 to 6. The exp2 approximation may include a mantissa calculation including at least part of the product of exp2(f_p/2^(9·p)) where p ranges from 1 to 6. The exp2 approximation may be an accurate representation of the multiplication of two floating point operands in a matrix multiplication used in Block LU Decomposition of a matrix, also known as Linpack.
(136)
(137)
(138)
(139) Summarizing, the apparatus of the invention may include a logALU adapted to respond to the log domain operand to generate a log domain parameter based upon a parameter instruction. The log domain operand includes a log number. And the parameter instruction directs generating the log domain parameter as a log domain sum including the log number shifted by at least one bit position. The parameter instruction may direct further generating the log domain parameter as a log domain sum further including the log number shifted by a second bit position. The logALU may be further adapted to respond to receiving a second log domain operand by log-adding the log domain parameter to the second log domain operand to generate a logALU result.
(140)
(141)
(142)
(143) One skilled in the art of non-linear function evaluation will recognize that the logALU can be extended not only to process multiple parameters on each execution wave front, but also to generate a succession of logALU outputs to form the non-additive components of a non-linear function, which may either be further processed in the log domain or transferred into the FP domain through operation of the log2FP converter and/or the exp2 calculator. The NLA, or more specifically the logALU, may further signal any combination of the following: which non-linear function is being calculated, the start of function evaluation, the term count of the logALU result, and the end of the function evaluation. All of the discussed extensions are within the scope of the invention and may be claimed now or in a subsequent patent application, possibly as a divisional of this patent application.
(144) There are several demands regarding a high performance numerical computer; in particular, exascale computing and data mining may require runtime condition vector generation. Efficient runtime testing is now required. Determining the differences of a vector from some standard is essential for efficient testing of cores, to insure that every component capable of performing a calculation gives exactly the same answer. Any difference points to a problem that needs to be solved. Since a package includes 128 bits of data, up to 128 element vectors can be loaded and then analyzed into one package, where everything that matches is set to a one hot code or a one cold code. Something needs to build this package bit vector. Additionally, data mining of a database may require assembling a collection of test results for a database entry to determine whether the entry is an example of one or more patterns.
(145) Addressing these requirements may be done by extending the C-adders to include a condition accumulator operating on a small bit window, possibly of no more than 8 bits, although possibly as large as 64 to 128 bits. When each condition accumulation is completed, the small bit window is sent to one or more of the following: the process state calculator, which may be configured to use it as a condition state to further direct the process state calculations based upon a parameter location and/or a next state location in the simulation node; or a package accumulator associated with the C-adder, which appends the small bit window. Once the package accumulator has enough data, the accumulated package is sent out of the core to report the runtime test state.
(146)
(147)
(148)
(149) There is a persistent problem with floating point addition of three numbers. Suppose there are three floating point numbers A, B and C, with A = −B·(1 − 2^(−20)) and C = B·2^(−50). If A and C are added first, after rounding, only the top 3-4 bits of C have an effect on the sum. Whereas if A and B are added first, the result is B·2^(−20). Then adding that result to C shows the effect of C much more thoroughly. A straightforward implementation of a three operand floating point adder aligns the two smaller mantissas simultaneously to the largest, adds these aligned mantissas to the largest magnitude mantissa, and then rounds and finalizes the result. Given the example A, B and C, most of the significance of C is lost, even though it should be part of the result. Addressing this weakness of floating point addition requires improving the addition circuitry. This improvement is about the same size/complexity as the straightforward implementation.
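The two alignment strategies can be contrasted with a small software model. This is an illustrative sketch, not the circuit: alignment is modeled by truncating bits below a 53-bit window, and C is given a full-length mantissa (an assumption beyond the text's example) so that the lost bits are visible:

```python
import math

def align_truncate(value, exponent, precision=53):
    # model of mantissa alignment: keep only the bits of `value` at or above
    # 2**(exponent - precision), discarding everything below the window
    scale = 2.0 ** (exponent - precision)
    return math.floor(value / scale) * scale

def add3_naive(a, b, c):
    # straightforward 3-operand adder: align both smaller operands to the
    # largest magnitude operand in a single step, then add
    big, mid, small = sorted((a, b, c), key=abs, reverse=True)
    e = math.frexp(big)[1]
    return big + align_truncate(mid, e) + align_truncate(small, e)

def add3_two_phase(a, b, c):
    # improved adder: add the two largest magnitudes first, renormalize,
    # then align and add the smallest operand
    big, mid, small = sorted((a, b, c), key=abs, reverse=True)
    partial = big + mid
    e = math.frexp(partial)[1]
    return partial + align_truncate(small, e)
```

With A and B nearly cancelling, the two-phase adder renormalizes to B·2^(−20) before aligning C, so far more of C's bits land inside the window.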
(150)
(151) Tests reveal that it may be advantageous to automate accumulation of additive results for vectors whose lengths may vary at run-time. For example, FIR calculations may change the number of taps whose corresponding products need to be accumulated, and matrix inversion by Gaussian elimination involves finding the maximum magnitude of the next column starting from the diagonal to determine the next pivot element, which varies from N to 2 entries as the algorithm progresses. A finite state machine, referred to as an accumulation manager, can automate accumulating varying numbers of FP operands and/or packages for a simultaneous process. The accumulation manager may be configured to manage several queues, and a C-adder to complete the varying sums and/or comparisons requiring accumulation for the owning simultaneous process. Preferably, in many implementations, the number of pipe stages in the adders must not affect the ordering of the operands, nor the adder operations. Further delineation of the structure and functions of the accumulation manager is enabled for one of ordinary skill in computer logic design from this document's disclosure. Certain implementations of the SMP data cores may include one or more instances of the accumulation manager.
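A behavioral sketch of such a finite state machine is given below. The queue discipline, single-operand steps, and status signaling are assumptions for illustration; a real accumulation manager drives feedback ports and a pipelined C-adder:

```python
from collections import deque

class AccumulationManager:
    # behavioral model: accumulate exactly `desired_count` operands from a
    # queue for the owning simultaneous process, then report completion
    def __init__(self, desired_count):
        self.desired_count = desired_count   # run-time vector length
        self.queue = deque()
        self.count = 0
        self.total = 0.0

    def feed(self, operand):
        self.queue.append(operand)

    def step(self):
        # one step: consume at most one queued operand, preserving operand
        # order regardless of how many pipe stages a real adder would have
        if self.queue and self.count < self.desired_count:
            self.total += self.queue.popleft()
            self.count += 1
        return self.count == self.desired_count   # accumulation status
```

Once the desired count is reached, further steps leave the total unchanged, matching a vector whose length was fixed at run-time.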
(152)
(153) Summarizing: the apparatus may include a floating point adder adapted to receive an operand collection and generate a floating point add result from the operand collection, where the operand collection includes a first operand, a second operand and a third operand, comprising: an operand comparator adapted to compare exponents of the operand collection to determine a magnitude ordering of the operand collection, wherein the magnitude ordering determines a largest operand, a second largest operand and at least one smaller operand; a first adder phase adapted to perform a mantissa aligned addition of the largest operand and the second largest operand to generate a resulting operand including a resulting exponent and a resulting mantissa; and a second adder phase adapted to perform a second mantissa aligned addition of the resulting operand and the smaller operand to at least partly generate the floating point add result. Each member of the operand collection represents at least one member of the group consisting of half precision floating point numbers, half precision floating point numbers with at least one guard bit, single precision floating point numbers, single precision floating point numbers with at least one guard bit, double precision floating point numbers, double precision floating point numbers with at least one guard bit, extended precision floating point numbers, and extended precision floating point numbers with at least one guard bit.
(154) The apparatus may include, but is not limited to, an accumulation management circuit adapted to respond to at least two feedback output ports and a desired accumulation count. The accumulation management circuit is adapted to control a floating point adder to generate a floating point add result. The accumulation management circuit comprises an accumulation status indicator. And the accumulation management circuit is adapted to respond to the desired accumulation count and the accumulation status by operating the feedback output ports and the floating point adder to generate the floating point add result implementing the desired accumulation count of floating point operands.
(155) Any or all of the DPC, AC, LM, in particular, the accumulation management circuit and/or the floating point adder is implemented with at least one of a Field Programmable Gate Array (FPGA), a semiconductor standard cell library, and a molecular gate network. The semiconductor standard cell library may implement a semiconductor process including at least one semiconductor device. The device may include at least one of a transistor, a memristor, and a photonic switch.
(156) The accumulation manager circuit may be adapted to control a comparison circuit including the floating point adder to further generate a floating point comparison result. And the accumulation manager circuit is adapted to respond to the desired accumulation count and the accumulation status by operating the feedback output ports and the floating point adder to generate the floating point comparison result implementing the desired accumulation count of the floating point operands.
(157) Energy Management in the SMP cores and PEM: Each of these PEM, and each of their SMP cores, is guaranteed to use minimal energy by the usage vectors.
(158) The SMP data core typically contains two adders, one may be owned by the parallel part, and the other owned by the sequential part. The parallel part may also own the multiplier as in FIR, dot products, FFTs and matrix inversion. To simplify programming, both adders can perform the same operations. These include an inline comparison that may be chained, without branching, to calculate the pivot for matrix inversion, or the maximum, or minimum, of a vector or matrix. These inline comparisons do not require flushing the arithmetic pipes before branching can be determined, which significantly reduces energy consumption. Other components shown include but are not limited to a reciprocal and reciprocal square root calculator, which constitutes an introduction to the basic data cores.
(159) The core architecture presented here does not require energy inefficient caches. Instruction caches are replaced by a simultaneous process mechanism providing huge virtual VLIW instruction spaces at each instruction pipe. The core also supports optimal resource sequencing and use, which replaces superscalar instruction interpreters. In one sense, a multithreaded processor can be seen as simultaneously executing the parallel part PP and the sequential part SP, as shown in
(160) A process state calculator, adapted to generate one state index and the associated loop outputs, can be implemented in about 10K gates, which is discussed next. Many core implementations may include three or four process state calculators, completely replacing the instruction caches, superscalar instruction interpreter, and multithread controller of a contemporary parallel processor core. These complex legacy mechanisms are no longer needed where one program dominates all the resources. This architecture's co-design utilizes software to take over what was previously done with hardware. At compile time, dependency analysis determines what needs to be done in the presence of available data. This compile time analysis directs code generation to create the process ownership, the process state calculator's configuration, loop controls, and the local instruction memory contents for the SMP core. In SMP cores, only the resources actually performing the computations consume power. A data processor chip that includes between 500 and 600 of these cores is feasible to manufacture with existing technologies. While much remains to be done, this points the way to a new class of data processing cores that can meet the challenge of exascale and beyond.
(161)
(162) The process trigger generator receives the next process state, the core/PEM status signals, and the next loop state, from which the highest to lowest non-null process triggers are generated. In the initial implementations, these process trigger signals will probably be one-hot, although they could also be one-cold. Initially, the process states may be 6 bits, so there are 63 = 2^6−1 process trigger signals. In an implementation with process states of 8 bits, there are 255 = 2^8−1 process trigger signals.
(163) The prioritizer responds to the process triggers by generating the next process state, the no operation signal, and the loop commands. The next process state is the number of the highest priority process trigger that is hot. The no operation signal is hot when all of the process trigger signals are cold. In the initial implementation, if the highest process trigger is hot, then the next process state is 63. If only the lowest non-null process trigger is hot, then the next process state is 1. If none of the process triggers are hot, the next process state is 0 and the no operation signal is hot.
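The prioritizer's behavior can be stated compactly in software. A sketch assuming 6-bit process states (63 triggers) and hot = 1, which are the initial-implementation choices named in the text:

```python
def prioritize(triggers):
    # triggers[k] carries process trigger k+1; the next process state is the
    # number of the highest-priority hot trigger, or 0 with no-op if all cold
    next_state = 0
    for k in range(len(triggers) - 1, -1, -1):   # scan highest priority first
        if triggers[k]:
            next_state = k + 1
            break
    no_operation = (next_state == 0)
    return next_state, no_operation
```

With 63 triggers: a hot highest trigger yields next state 63, only the lowest non-null trigger hot yields 1, and all-cold yields state 0 with the no operation signal hot.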
(164) The loop commands may be generated as follows: The highest priority process trigger that is hot selects the state loop command for that priority signal to generate the loop commands. If no process trigger is hot, the loop commands are straight zeros, indicating no loop operation is performed by any of the loop calculators in the process.
(165) The loop calculator block responds to the loop command by performing its next state calculation. This calculation generates the next loop state and the loop index outputs. The next loop state vector is sent to the process trigger generator for use in generating the process triggers. The loop index outputs are sent to the execution wave front for use in memory addressing and other integer related operations.
(166)
(167)
(168)
(169) The stimulus enable signal j is the logical product of the appropriate combination of the signals S3, S2, S1, S0, each consisting of four 1-hot signals decoding the corresponding bit pair of the selector number. Two nand gates are shown receiving the stimulus enable j and the stimulus pair j to generate the negatively gated stimulus pair j. These negatively gated stimulus pairs are presented to the fixed dual OR plane with negative inputs to generate the stimuli pair. In some cases, the negative gates and negative-input OR planes may be implemented with positive gates and positive-input OR planes.
(170) TABLE-US-00002
Bit pair | S(0) | S(1) | S(2) | S(3) | Value
00 | Hot | Cold | Cold | Cold | 0
01 | Cold | Hot | Cold | Cold | 1
10 | Cold | Cold | Hot | Cold | 2
11 | Cold | Cold | Cold | Hot | 3
(171) The above table shows logic values in terms of hot and cold, which may vary from one implementation to another. Here are two interpretations of hot and cold that may be used: First: Hot=1, Cold=0; Second: Hot=0, Cold=1
(172)
(173)
(174) Part(I,J,K) = OR(Sdec(I,J,L) and C(I,J,K,L) at (task id·16 + Prog_one), for L = 0, . . . , 3)
(175) Stimuli(K) = OR(AND(Part(I,J,K), for I = 0, . . . , 15), for J = 0, . . . , 3)
(176)
(177) The loop command generator of
(178) In some implementations the table accessed by the task ID and the program zone may only be accessed by the task ID. In others, the task ID and program zone are both used to address the table. The task No Op stimulus may be used in various ways, depending upon the implementation. In some implementations, there is no masking of the raw next state and the raw loop commands, and the No Op is transmitted directly where needed. In other implementations, the next process state and/or the raw loop commands are masked if the task No Op stimulus indicates a No Op. In these implementations, the No Op may, or may not, be transmitted as part of the execution wave front.
(179)
(180) Embodiments of the invention separate the loop counting from the loop index and its output. Each process loop calculator responds to a separate loop command generated within the process state calculator to generate the following: a loop index, a loop index output and a next loop state. The loop index output may be used in accessing memories and creating the operand packages. Loop counting always counts down, making zero detection the determination of a loop's end. The initial condition is indexed from a table of several loop initial states, allowing the reuse of the loop calculator as the process state progresses. These entries act to constrain the looping into smaller loop components, thereby removing the need for conditional execution of ranges of loop indexes. Compilation of Fortran loops has to account for conditional execution of the loop body based upon index conditions. To make this efficient in any LSM, each loop initialization table is given 16 entries. The loop index output calculator may add, or subtract, integer increments other than 1, supporting stopping at a boundary, rolling over, and sign reversing at the loop index boundary. Each loop calculator of
(181) Each loop calculator responds to the loop command, which may be a 2 bit loop command code from the process state calculator: 00 inactive, 01 next loop step, 10 next initial state, and 11 force loop end. Branching becomes a matter of changing the process state, which alters what instructions are fetched locally for the owned resources of the process as the execution wave front moves through them. Looping requirements for Fortran are satisfied in the loop calculators of the process state calculator. Assuming 32 bit down counters, the four loop calculators of one of these processes may be cascaded to provide 2^128 iterations. Rather than branch on an index condition, each loop calculator may have up to 16 sub loops and the process state calculator may respond to the ending of each sub loop iteration differently. This provides a good target for conditional processing of loops by Fortran compilers.
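A behavioral sketch of one loop calculator follows. The 2-bit command codes, the 16-entry initialization table, and the count-down/zero-detect discipline follow the text; the specific index-stepping policy shown is an assumed option among those the text allows:

```python
class LoopCalculator:
    # 2-bit loop commands from the process state calculator
    INACTIVE, NEXT_STEP, NEXT_INITIAL_STATE, FORCE_LOOP_END = 0, 1, 2, 3

    def __init__(self, initial_states, increment=1):
        self.initial_states = list(initial_states)[:16]  # loop initialization table
        self.next_entry = 0
        self.count = 0          # always counts down; zero detects loop end
        self.index = 0          # loop index output, used for memory addressing
        self.increment = increment

    def command(self, cmd):
        if cmd == self.NEXT_STEP and self.count > 0:
            self.count -= 1
            self.index += self.increment   # increments other than 1 are allowed
        elif cmd == self.NEXT_INITIAL_STATE:
            self.count = self.initial_states[self.next_entry]
            self.next_entry = (self.next_entry + 1) % len(self.initial_states)
            self.index = 0
        elif cmd == self.FORCE_LOOP_END:
            self.count = 0
        return self.count == 0             # loop-end signal in the next loop state
```

Stepping through successive initialization entries models the constrained sub loops; forcing loop end models the 11 command.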
(182) Summarizing the Basic Features of Each Process State Calculator: It automatically responds to changes in task ID, program zone and/or task command. It is efficiently implemented with FPGA emulators and with CMOS standard cell libraries. All the power for the next execution wave front is gated off with the no operation signal. The process state and the index output change independently. One adder driven by 1 process state calculator may respond to 16 queue status pairs to add 3^16 = 43,046,721 > 2^25 numbers. Vector dot products may be summed with just part of one PEM from product results originating anywhere in the EASM. The use of the queues to stimulate process state change removes the need for multiple chip synchronization. Every data process acts based upon the availability of data to its process and its ability to handle the results. With 32 bit down counters, the four loop calculators of one of these processes may be cascaded to provide 2^128 iterations. Rather than branch on an index condition, each loop calculator has up to 16 sub loops and the process state calculator may respond to the ending of each sub loop iteration differently.
(183) Consider an extension of the core architecture that supports local recursive processes. Assume that the local feed queues are configurable as either queues or as stacks on a task-by-task basis. The functional distinction between a queue and a stack is that the queue is a First In-First Out (FIFO) structure whereas a stack is a Last In-First Out (LIFO) structure. To properly handle the arithmetic requirements, both require the ability to remove their top 3 entries, and both may operate successfully adding one entry at a time.
(184) As a first step into communication, note that all the processes within a PEM are able to communicate with any other process within the PEM using the local feedback mechanism of the PEM.
(185) Summarizing, PEMs of the DPC are adapted to implement a local North East West South (NEWS) local feed network adapted to stimulate and respond to the cores within the PEMs. The NEWS local feed network may be adapted to wrap around from top to bottom within the DPC, wrap around with a twist from top to bottom within the DPC, or wrap around with an offset from top to bottom within the DPC. The DPC may include a configuration state retained over time to configure the NEWS local feed network to operate as one of wrap around from top to bottom within the DPC, wrap around with a twist from top to bottom within the DPC, and wrap around with an offset from top to bottom within the DPC.
(186) Several problems are inherent in the communication systems of existing supercomputers: 1: standard message-based communication protocols stall both transmission and reception of messages, so that transmission and delivery occur over multiple clock cycles. 2: standard message formats support variable-length data payloads that add substantial complexity to message transfers and processing. 3: the routers that move messages across standard communication networks provide no certainty about the latency to traverse the router from message input to output. 4: communication failures into, within and out of routers are very difficult to handle and almost inevitably require the intervention of more system components to roll back to the last point of known-good transfers; in a number of cases this is not possible, instead causing large-scale crashes of the system. 5: many communication systems grow in complexity faster than the number of clients for the system, causing the communications manufacturing cost, as well as energy consumption, to grow more than linearly with the number of data processors.
(187) These problems must be solved to achieve exascale performance on even the first benchmark program, Linpack, as some implementation of Block LU Decomposition. To simplify this discussion, all the communication nodes, sources, and destinations of all the messages in these supercomputers satisfy the following requirements. Note that in some implementations there may be other messaging protocols used to provide additional non-critical-path communication, say to provide cabinet status across an Ethernet channel.
(188)
(189)
(190)
(191)
(192)
(193) The Data Processor Chip (DPC) may include an interface, an internal network, and at least Npem Programmable Execution Modules (PEMs). The interface is adapted to transfer a signal bundle into and out of the DPC at a data bandwidth of two numbers for each of Nchannels on each local clock cycle with a clock period of at most 2 ns, where the Nchannels is at least 8, and the number is at least 32 bits. The internal network couples to the interface and is adapted to communicate across the interface without stalling the data bandwidth. The internal network may include a binary graph of internal nodes (landing modules), each of the landing modules adapted to communicate across up to three links, each adapted to bi-directionally transfer the data bandwidth. Each of the PEMs may include at least Ncore-per-module cores and a module communication interface (stairway) adapted to support communication into and out of the internal network at the data bandwidth, where the Npem is at least 64, where Ncore-per-module is at least one. Each of the cores may be adapted to operate at least two simultaneous and independent processes owning separate instructed resources of the core, configured to locally implement part of the Block LU Decomposition as a block processor of a block of Nblock rows and Nblock columns of numbers, adapted to respond to channel receptions of at least one of the channels at the module communication interface, where Nblock is at least 8.
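The minimum interface bandwidth implied by these figures can be checked with a short sketch (illustrative arithmetic only, using the stated minimums):

```python
# Lower-bound DPC interface bandwidth from the figures in (193).
n_channels   = 8      # Nchannels is at least 8
bits_per_num = 32     # the number is at least 32 bits
clock_ns     = 2      # clock period of at most 2 ns

bits_per_cycle = n_channels * 2 * bits_per_num   # two numbers per channel
cycles_per_sec = 1e9 / clock_ns                  # 500 MHz clock floor
bandwidth_gbit = bits_per_cycle * cycles_per_sec / 1e9
assert bits_per_cycle == 512
assert bandwidth_gbit == 256.0   # at least 256 Gbit/s each direction
```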
(194) The DPC may be adapted to create the system configured to execute a version of Block LU Decomposition with partial pivoting of a matrix A with at least N rows and at least N columns of the number by performing at least ¼ exaflop for a sustained run time of at least 8 hours by using at least NDPC of DPC, wherein the number implements double precision floating point. Wherein the N is at least 16 K*K, wherein the K is 1024, and the NDPC is at least ¼ K*K.
(195) Each of the cores may be adapted to perform at least one exaflop divided by the product of NDPC multiplied by Npem multiplied by Ncore-per-module per the clock period. The internal network may be adapted for simultaneous communication across each of the internal nodes and the links for simultaneous data bandwidth delivery to and from the module communication interface of each of the core modules. And Npem is at least 32 and Ncore-per-module is at least 1.
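A back-of-envelope version of this division, using the minimums stated in this paragraph and in (194), gives a sense of the per-core workload (illustrative only; the specification's own minimums vary between passages):

```python
# Per-core throughput implied by 1 exaflop spread over the stated
# minimum counts of chips, modules, and cores.
K = 1024
n_dpc   = (K * K) // 4    # NDPC is at least 1/4 K*K  (paragraph 194)
n_pem   = 32              # Npem is at least 32       (this paragraph)
n_core  = 1               # Ncore-per-module is at least 1
clock_s = 2e-9            # clock period of at most 2 ns

flops_per_core = 1e18 / (n_dpc * n_pem * n_core)
ops_per_clock  = flops_per_core * clock_s
assert round(flops_per_core) == 119_209_289_551  # ~119 Gflop/s per core
assert round(ops_per_clock) == 238               # ~238 ops per 2 ns clock
```

The per-clock figure is spread across a core's at least two simultaneous and independent processes.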
(196)
(197) The two SMPC cores are labeled SMPC core 1 situated above the second instance labeled SMPC core 2. On the left side, the OMP 2 of the SMPC core 2 is aligned with the IMP 1 of SMPC Core 1 to communicate in a first direction through the channel labeled as channel direction 1. On the right side, the OMP1 of the SMPC core 1 is aligned with the IMP 2 of the SMPC core 2 to communicate in a second direction through the channel labeled as channel direction 2.
(198) The operations of the left side begin with the outgoing payload 2 being presented to OMP 2, which responds by generating transmitted message 1, which is transported in channel direction 1 to create the received message 1 presented to IMP 1. The IMP 1 responds to the received message 1 by generating a first ERror In (ERI 1), a good payload 1 and destination controls 1, for at least two first destinations, labeled as 1st In dest 1, 1st In dest . . . , and 1st In dest InDn1, where InDn1 is at least two. The good data payload 1 may be sent and/or presented to one or more of the first destinations based upon the destination controls 1.
(199) The operations of the right side begin with the outgoing payload 1 being presented to OMP 1, which responds by generating transmitted message 2, which is transported in channel direction 2 to create the received message 2 presented to IMP 2. The IMP 2 responds to the received message 2 by generating a second ERror In (ERI 2), a second good payload 2 and destination controls 2 for at least two second destinations, labeled as 2nd In dest 1, 2nd In dest . . . , and 2nd In dest InDn2, where InDn2 is at least two. The good data payload 2 may be sent and/or presented to one or more of the second destinations based upon the destination controls 2.
(200) Each of the transmitted messages 1 and 2 have the same structure. Transmitted message k, for k=1 to 2, includes an ECC k for the data payload k and the context k. Each of the received messages 1 and 2 have the same structure. Received message k, for k=1 to 2, includes an ECC k for the data payload k and the context k.
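This fixed message structure can be sketched as follows. Here `zlib.crc32` stands in for the real error control code, which the excerpt does not specify (a CRC detects but does not correct errors), and all names are hypothetical:

```python
import zlib
from dataclasses import dataclass

@dataclass
class Message:
    """Sketch of the message structure in (200): a data payload, a
    context, and an ECC covering both. Illustrative only."""
    payload: bytes
    context: bytes
    ecc: int

def make_message(payload: bytes, context: bytes) -> Message:
    # Sender side (OMP): generate the ECC over payload + context.
    return Message(payload, context, zlib.crc32(payload + context))

def check_message(m: Message) -> bool:
    # Receiver side (IMP): recompute and compare; a mismatch would
    # correspond to asserting the ERI signal.
    return zlib.crc32(m.payload + m.context) == m.ecc
```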
(201) Note that the activities and structure of the left side of
(202)
(203)
(204) The Incoming Message Processor (IMP) 1, of SMPC core 1, includes an Incoming Message Frontend 1 (IMF 1), and an incoming routing pipe 1. The IMF 1 includes a message incoming interface 1 and an Error Detecting and/or Correcting (EDC) pipe 1.
(205) The Outgoing Message Processor 2 (OMP 2), of SMPC core 2, includes an Outgoing Message Backend 2 (OMB 2) and an outgoing context generator 2. The OMB 2 includes the outgoing Error Control Code (ECC) generator 2 and a message outgoing interface 2.
(206) The spare SMPC core includes a Spare Incoming Message Processor (SIMP). The SIMP includes a replacement for the message incoming interface 1, and a replacement for the incoming EDC pipe 1. Note that the SIMP may not include a replacement for the incoming routing pipe 1, which may differ from one channel to the next.
(207) The spare SMPC core includes a Spare Outgoing Message Processor (SOMP) that can replace the outgoing ECC generator 2 and the message outgoing interface 2. Note that the SOMP may not include a replacement context generator 2, which may differ from one channel to the next.
(208) Assume that no errors have been reported by the IMP 1 asserting ERI 1. In this situation, the left hand side indicates the components operated for this communication activity. Starting from the bottom, in SMPC core 2, the outgoing context generator 2 responds to an outgoing payload 2 and possibly an outgoing process state and also possibly, loop outputs of the simultaneous process to generate the outgoing data payload and context. The outgoing ECC generator 2 responds to the outgoing data payload and the context by generating the message to transmit, which includes the outgoing data payload, the context and the ECC for the payload and context. The message outgoing interface 2 responds to the message to transmit by generating the transmitted message 1 traversing the channel in channel direction 1 to create the received message 1.
(209) At the SMPC core 1, the message incoming interface 1 responds to the received message by generating the received raw message, including an ECC, a data payload and a context. The incoming EDC pipe 1 responds to the received raw message by generating the ERI 1, and a corrected message that includes a good data payload and a good context. The incoming routing pipe 1 responds to the ERI 1, and the corrected message as follows. If the ERI 1 is asserted, the corrected message is not delivered into the destinations. If the ERI 1 is not asserted, the corrected message is used to generate the good data payload 1 and the destination controls 1, which are then used to deliver the good data payload 1 to the first input destinations as shown in
(210) However, over time this channel direction 1 may be in error, or about to begin to be in error, and the ERI 1 signal may be asserted. When the ERI 1 signal is asserted, OMP 2 responds a short time later by setting Destination Error 2 (DestEr2). After DestEr2 is set, the right side shows the SIMP, the Spare channel direction 1 and the SOMP replacing the IMF 1, the Channel direction 1 and the OMB 2 on the left side. DestEr2 may be the state of a memory. The memory may retain its contents until reset or written, and may persist in retaining its content with or without power being provided.
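The persistent-latch behavior of DestEr2 can be sketched as a small state machine (hypothetical names; the actual circuit is not code, and the persistence across power loss described above is a hardware property this sketch only models):

```python
class ChannelFailover:
    """Sketch of the DestEr2 failover latch in (210): once ERI is
    asserted, the spare path replaces the faulty one, and the latch
    retains its state until explicitly reset. Illustrative only."""

    def __init__(self):
        self.dest_err = False          # persistent memory state

    def on_eri(self, eri_asserted: bool):
        if eri_asserted:
            self.dest_err = True       # set once; persists afterward

    def active_path(self):
        return "spare" if self.dest_err else "primary"

    def reset(self):
        self.dest_err = False
```

Note that a later cycle with ERI deasserted does not clear the latch; only an explicit reset or write does.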
(211)
(212) Over time, the channel direction 1 may be in error, or about to begin to be in error, and the ERI 1 signal may be asserted. When the ERI 1 signal is asserted, IMP 1 sets SrcER1 and OMP 2 responds a short time later by setting Destination Error 2 (DestEr2). This triggers the fault resilient mode of operation, using the right side components to replace the left side components of
(213) One skilled in the art may recognize that a specific program may not allocate for use all of the data channels or each of the channel directions in at least some of the bundles. Extensions of the circuitry shown in
(214) Power to unused components of the Input Message Processors (IMPs) and the Output Message Processors (OMPs) may preferably be gated off in a manner similar to the discussion of gating off power in an SMP core found in
(215)
(216) The stairway of
(217)
(218) For each stairway in, labeled Bnd 0:2 stairway in, for each of the incoming message processors IMP shown in
(219)
(220)
(221)
(222)
(223) The invention includes at least one channel including Ndata optical fibers (fibers) and Nedc Error Detection and/or Correction (EDC) fibers, wherein the Ndata is at least 8 and the Nedc is at least one. Nchannels may be at least 16. Ndata may be at least 16. Nedc may be at least two. Nedc may be at least four. At least one channel for control and/or status may include a control channel and a status channel. The apparatus may further include the two channels for control and/or status including a task control and/or status channel and a transfer control and/or status channel. The apparatus may further include the bundle coupled to a first harness coupling, and the bundle coupled to a second harness coupling opposite the first harness coupling, each of the first harness coupling and the second harness coupling adapted to optically transfer all of the fibers included in the bundle. A printed circuit board (PCB) may include at least one of the bundles adapted to present the first harness coupling on one side of the PCB. The PCB may include at least two of the bundles.
(224) The Landing Module (LM) may include a local clock cycle with a local clock period and at least three link interfaces, each adapted to communicate with a link simultaneously sending and/or receiving each of Nchannels of data payloads sufficient to transfer two double precision numbers (referred to hereafter as numbers) per local clock cycle, where the Nchannels is at least 8. Each of the link interfaces includes a link input interface and a link output interface, at least one spare link input interface, at least one spare link output interface and a fault recovery circuit. The fault recovery circuit is adapted to control the link interfaces to respond to at least one output channel fault and/or at least one input channel fault in the link interface by using a spare channel within the link interface and resending a recent history of an output channel associated with the output channel fault, and/or using the spare channel within the link interface to repeat reception of the recent history of an input channel associated with the input channel fault.
(225) Each of the link input interfaces responds to receiving messages as synchronized input messages to the local clock cycle, and further may include an error correction and detection pipeline adapted to receive the synchronized input messages and generate error corrected output messages and an error detection signal, and a message routing pipeline adapted to successively respond to each of the error corrected output messages to generate a routing decision for each of the error corrected output messages. Each of the link input interfaces further includes a link synchronizer adapted to receive the messages and generate the synchronized input messages to the local clock cycle in response to receiving the messages. Each of the link output interfaces may include a message fault generator adapted to respond to at least one of the error detection signals of the link interface for transmission from the link interface by asserting an output channel fault, and an output message prioritizer configured to respond to each of the routing decisions of the error corrected messages of each of the link input interfaces to perform generating an output message for transmission by the link interface, and/or queuing the output message in a link output queue. At least one of the output message prioritizers may be further configured to respond to each of the routing decisions of the error corrected messages of each of the link input interfaces to further perform possibly queuing a second of the output messages for later transmission.
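The "resend a recent history" recovery behavior of (224) can be sketched with a bounded replay buffer (the buffer depth and all names here are assumptions for illustration, not the patent's design):

```python
from collections import deque

class LinkOutputChannel:
    """Sketch of fault recovery on a link output channel: a bounded
    recent-history buffer lets the spare channel resend the messages
    sent around the time of the fault. Illustrative only."""

    def __init__(self, history_depth=16):
        self.history = deque(maxlen=history_depth)
        self.faulted = False

    def send(self, msg):
        self.history.append(msg)
        channel = "spare" if self.faulted else "primary"
        return (channel, msg)

    def on_output_fault(self):
        """Switch to the spare channel and replay the recent history."""
        self.faulted = True
        return [("spare", m) for m in self.history]
```

The input-side recovery is symmetric: the spare input channel repeats reception of the same recent history rather than retransmitting it.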
(226) A chip may include at least one LM. The DPC may be such a chip. An integrated landing module may be the chip, referred to as the LM chip, or simply as an LM. A module stack may include at least one LM chip. A node stack may include at least one of the LM chips.
(227)
(228)
(229)
(230)
(231)
(232)
(233)
(234)
(235) When parallel processing became something other than a computer research activity, there was a common memory model of a main memory and possibly localized, smaller memories. A location for memory contents usually meant where it lived in the big system. That perspective has several problems today. First, assume that a big memory is a unit of 64 Gbytes or more, but the system memory capacity is roughly a million times that size. There is no single main memory, because its access would be forever bottlenecked. Instead, consider the term intermediate memory. Intermediate memory is always part way to whatever is most local and to whatever else can be reached by the communication networks of the cabinet and the system as a whole. Intermediate memory frequently needs to perform two very important operations: first, sequester data to be later used in subsequent transfers; second, perform intermediate calculations to locally determine which of several paths will later be needed in transfer operations.
(236) The apparatus addressing these needs includes a QUAd Link Anticipating Memory node (QUADLAM), comprising: a first External Link Coupling (ELC), a second ELC, a third ELC, and a fourth ELC, as members of an external link group, each adapted for optical transmission; a first, second and third Landing Module (LM); the first ELC and the second ELC communicatively interfaced to the first LM; the third ELC and the fourth ELC communicatively interfaced to the second LM; both a third link coupling of the first LM and a fourth link coupling of the second LM communicatively interfaced to the third LM as link couplings; an anticipator including an anticipator link coupling, communicatively interfaced to the third LM; and a Memory Unit Array (MUA) communicatively coupled to the anticipator and adapted for memory access by the anticipator. The memory unit array includes at least one Dynamic RAM accessible by the anticipator.
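The QUADLAM wiring just enumerated can be summarized as an adjacency list (a sketch for orientation only; the dataclass and link names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Quadlam:
    """Sketch of the QUADLAM topology in (236): ELC1/ELC2 attach to
    LM1, ELC3/ELC4 attach to LM2, LM1 and LM2 both attach to LM3, and
    LM3 also attaches to the anticipator, which accesses the memory
    unit array (MUA). Illustrative only."""

    def links(self):
        return [
            ("ELC1", "LM1"), ("ELC2", "LM1"),   # first pair of external links
            ("ELC3", "LM2"), ("ELC4", "LM2"),   # second pair of external links
            ("LM1", "LM3"), ("LM2", "LM3"),     # internal link couplings
            ("Anticipator", "LM3"),             # anticipator link coupling
            ("Anticipator", "MUA"),             # memory access path
        ]
```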
(237) The anticipator may be adapted to respond to a system performance requirement by a system for an algorithm, with the system including the QuadLam, and the anticipator may be configured to receive an incremental state of the algorithm from at least one member of the external link group. The anticipator may be configured to respond to the incremental state by creating an anticipated requirement. And the anticipator is configured to respond to the anticipated requirement by directing the system to achieve the system performance requirement. The anticipator may be further configured to respond to the anticipated requirement by at least one of the anticipator configuring the first landing module; the anticipator configuring the second landing module; the anticipator configuring the third landing module; and the anticipator configuring the memory access to the MUA.
(238) At least one of the first, second and third LM includes the following: A local clock cycle with a local clock period. At least three link interfaces, each adapted to communicate with a link simultaneously sending and/or receiving each of Nchannels of data payloads sufficient to transfer two double precision numbers (referred to hereafter as numbers) per local clock cycle, where the Nchannels is at least 8.
(239) Each of the link interfaces includes a link input interface and a link output interface, at least one spare link input interface, at least one spare link output interface and a fault recovery circuit. The fault recovery circuit is adapted to control the link interfaces to respond to at least one output channel fault and/or at least one input channel fault in the link interface by using a spare channel within the link interface and resending a recent history of an output channel associated with the output channel fault, and/or using the spare channel within the link interface to repeat reception of the recent history of an input channel associated with the input channel fault.
(240) The apparatus of the invention includes, but is not limited to, a system including multiple system components and a communication network communicatively coupling the multiple system components. Each of the system components is coupled by a QUADLAM to create at least part of the communication network. The communication network includes a binary tree formed by the QUADLAM to the system components using three members of the external link group of the QUADLAM. And the QUADLAM is distinct for distinct pairs of the system components.
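If the QUADLAMs form a full binary tree over the system components, standard tree counting suggests how many are needed (this count is an inference from the three-links-up-and-down arrangement described above, not a figure stated in the specification):

```python
def quadlams_for_binary_tree(n_components):
    """Sketch: a full binary tree over n_components leaves has
    n_components - 1 internal nodes, i.e. QUADLAMs, each using three
    members of its external link group (two downward, one upward).
    The fourth member of each external link group remains free.
    Illustrative counting only."""
    assert n_components >= 2 and n_components & (n_components - 1) == 0
    return n_components - 1
```

For example, eight system components would require seven QUADLAMs under this counting.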
(241) The system may be configured to achieve a second system performance by a second algorithm by coupling the fourth, unused members of the external link groups of at least two of the QUADLAMs to each other. The communication network includes the coupling of the fourth, unused members of the external link groups of the at least two QUADLAMs to each other.
(242) The system may be further configured to achieve a third system performance by a third algorithm by a bidirectional switch adapted to select another coupling of the fourth, unused members of the external link groups of at least two of the QUADLAMs to each other. The system may include the bidirectional switch.
(243) At least one of the system components is included in at least one of a printed circuit board (PCB), a row of the PCB, a shelf of at least one of the row, a rack of at least one of the shelves, a cabinet of at least one of the racks.
(244) As used herein, the following design rules specify how a system may achieve exascale performance. These design rules are referred to as the Exascale Design Rules. 1: Multiplications must not stall, and everything else must keep up. 2: Every hardware element must be simple and regular, with as few exceptions and special cases as possible. 3: Exascale architectures must enable the programmer to succeed at every level of integration. There may be no hidden programmed states. The architecture must be organized to make debugging the program, at every level, only about the inputs, instructions, and outputs of each instructed resource. 4: The program, and therefore the programmer, must be in control not only of the data processes, but also of the communication network structures, memory access and task management, at every level of integration. 5: Feedback must be separate hardware from the arithmetic units, and must be configured and controlled by the programmer. 6: Fault detection must be in every exascale program's task management, and fault recovery must also be part of every exascale program's task management. 7: The Exascale Algorithm State Machine (EASM) must partition into many local state machines simultaneously responding to task commands, the local process state, and the local availability of data. Instruction processing must be in terms of the process state of the local state machine, which is part of the EASM. 8: System state snapshots must minimize system overhead and support run-time rollback within each data processor chip. 9: Given a tradeoff between a small increase in complexity in a component and the opportunity for resilience to flaws in that component, resilience wins, particularly if there was no resilience before. While these design rules are necessary for exascale, when implemented they also greatly reduce the size and increase the efficiency of petascale computers.