METHOD OF INTERLEAVED PROCESSING ON A GENERAL-PURPOSE COMPUTING CORE
20230367604 · 2023-11-16
Inventors
CPC classification
G06F9/3885 (PHYSICS) · G06F9/3895 (PHYSICS)
International classification
Abstract
A method of “interleaved processing” (IP) is proposed which generalizes the functional principle of memory interleaving by extending the interleaved memory system into the processor chip and prepending each write access to one of the extended interleaved memory banks by a data transforming operation. The method opens a new dimension of large scale software parallelization and is implemented in autonomous processing units called “parallel processing channels” (PPC) that integrate processor and memory at a very low machine balance—which solves the memory wall problem— and execute on-chip machine transactions at a 1 Flop/cycle throughput. IP computing systems are linearly performance scalable and capable of pipelined execution of very large and complex HPC workloads. They have unique performance advantages in strided vector, tensor, and data set operations; for relevant HPC workload types, up to 10×-100× per-Watt single-processor performance gains compared to today's technologies are expected.
Claims
1. A programmable data processing apparatus (70) comprising: a combined data processing and storage unit (72) comprising at least one arithmetic logic unit (41) capable of executing data-transforming operations specified by an instruction set architecture of the apparatus (70), such as arithmetic, logic, or other foundational data-transforming operations on data of all data types supported by the apparatus (70) as per the instruction set architecture of the apparatus (70), and at least one memory system (42) for storing the data and code pertaining to said operations, and a programmable control unit (71) for executing further operations specified by the instruction set architecture other than the data-transforming operations, for example control-related operations such as pointer dereferencing, data fetch, data store, comparison, instruction fetch, decode, dispatch, or branch, the control unit (71) being adapted to, under control by code fetched from the processing and storage unit (72) by data transport means (76), obtain operand data from the processing and storage unit (72) by operand data transport means (77, 78), and to control the processing and storage unit (72), by a set of mutually independent parallel instruction transport means (75), using transaction control words specifying the data-transforming operations to be performed, the operand data to be processed by said data-transforming operations, and destination locations or addresses for results of said data-transforming operations, characterized in that the apparatus (70) is configured to, in at least one operational mode, write the results of the data-transforming operations directly to a destination address within the memory system (42), and in that, in any setup of the processing and storage unit (72) with a set (73) of mutually independent arithmetic logic units (41) for executing the data-transforming operations in parallel, with a set (74) of mutually independent memory systems (42) for storing the results of said parallel data-transforming operations in parallel, and with a set (79) of mutually independent parallel data transport means (47) for directly transferring the results of said parallel data-transforming operations from the set (73) of mutually independent arithmetic logic units (41) to the set (74) of mutually independent memory systems (42), the apparatus (70) is adapted, in the at least one operational mode, to execute at least one very long transaction word comprising a set of parallel transactions issued by the control unit (71) via the set of instruction transport means (75) to the set (73) of mutually independent arithmetic logic units (41) of the processing and storage unit (72), said transaction word comprising at least one transaction which differs from at least one second transaction within the same transaction word with respect to its instruction type.
2. The apparatus (70) of claim 1, wherein the at least one transaction word comprises at least one transaction which causes the arithmetic logic unit (41) to write the value of one operand directly to one memory system (42) of the set (74) of memory systems.
3. The apparatus (70) of claim 1, wherein each arithmetic logic unit (41) of the set (73) of mutually independent arithmetic logic units is connected to one instruction transport means (43) of the set (75) of mutually independent parallel instruction transport means (75) and said instruction transport means (43) is connected exclusively to said arithmetic logic unit (41).
4. The apparatus (70) of claim 1, wherein each memory system (42) of the set (74) of memory systems is connected to one arithmetic logic unit (41) of the set (73) of mutually independent arithmetic logic units via one data transport means (47) of the set (79) of mutually independent parallel data transport means, and said arithmetic logic unit (41) is connected exclusively to said memory system (42) via said data transport means (47), the apparatus (70) preferably comprising a prime number of arithmetic logic units (41).
5. The apparatus (70) of claim 1, wherein at least one of the memory systems (42) is partitioned into a set of mutually independent memory subsystems (53) interconnected by data transport means (50, 51).
6. The apparatus (70) of claim 5, wherein at least one of the memory subsystems (53) has further independent data transport means (44, 45) for exchanging data with units of the apparatus (70) other than the connected arithmetic logic unit (41), such as a further control unit (71), or I/O interfaces such as to a further apparatus (70), file system, processor interconnect, user console, or other external real-time data source or sink.
7. The apparatus (70) of claim 1, wherein the control unit (71) is further adapted to read the data and code from the at least one memory system (42) of the set (74) of memory systems via a set of mutually independent parallel data transport means (44, 45).
8. The apparatus (70) of claim 4, wherein the control unit (71), by means of an interleaved memory abstraction layer (190), is adapted to access the mutually independent memory systems (42) of the set (74) of memory systems as one single unified block of interleaved logical memory such that any logical memory address uniquely specifies the memory system (42) of the set (74) of memory systems and the destination address within said memory system (42) physically holding the data.
9. The apparatus (70) of claim 8, wherein the control unit (71) comprises an address decoder implementing the interleaved memory abstraction layer (190).
10. A method of using the programmable data processing apparatus (70) of claim 1, wherein, by means of a pipelined memory abstraction layer (193), a virtual data flow is generated, said virtual data flow being controlled by an index and an address transformation defined within a coherent memory block forming a pipeline (194) such that said index is either subtracted from or added to a relative address located within the pipeline (194), the transformation yielding a latch address.
11. The method of claim 10, wherein by a further address transformation, for example, by means of a modulo transformation, the pipeline (194) is organized as a ring buffer.
12. The method of claim 11, wherein latch addresses are used, as per the instruction set architecture of the apparatus (70), as instruction source and destination addresses, such as by means of address decoder hardware adapted to, depending on the index associated with the pipeline (194), translate the latch addresses into logical addresses.
13. The method of claim 12, wherein the pipeline (194) is associated with at least one software loop for either incrementing or decrementing the index associated with the pipeline (194).
14. The method of claim 13, wherein the software loop is executed by a dedicated thread.
15. The method of claim 11, wherein pipelines (194) communicate, preferably via channels.
16. The method of claim 15, wherein the communicating pipelines (194) form a network.
17. The method of claim 11, wherein control flow information is embedded into data streams processed by or communicated between pipelines.
18. The method of claim 17, wherein the control flow information is embedded into a data stream by augmenting each data word (222) of said stream by a control tag (221), the augmented data word (220) being stored, processed and communicated as a whole, and the control tag (221) comprising a combination of bits (223, 224, 225) for controlling execution, such as at least one bit (223) controlling loop termination, one bit (224) controlling conditional execution of operations on the data word (222), or one bit (225) indicating that the data word (222) contains further control information.
19. The method of claim 10, wherein, such as by means of a stack memory abstraction layer (191), memory blocks (192) are allocated as an addressable stack of data referenced relative to a top of the stack.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0092] The invention is presented by way of example, and not by way of limitation, in the figures of the accompanying drawings:
DESCRIPTION OF EMBODIMENTS
[0126] With reference to the known method of memory interleaving—by which memory is organized as a set of parallel and independent memory banks on a DRAM module—each such interleaved DRAM bank (56) is extended as illustrated in
[0127] Each memory system (42) thus is a stand-alone unit which interacts with other units of the processor chip only through independent ports (44-47).
[0128] As drafted in
[0129] The assembly (40) is a stand-alone distributed processing element which interacts with its on-chip environment only via independent ports (43-46). Such an element (40) is called a “parallel processing channel” (PPC).
[0130] When creating a processor-memory system with 3D chip packaging and differential serial processor-to-memory interconnects (52), low-MB PPCs can be built. This is exemplified in
[0131] The method of interleaved processing (IP) is drafted in
[0132] Furthermore, high bandwidth parallel memory mapped I/O can be directly integrated in the memory system (74) which makes the design well-suited for HPC signal processing and future high-speed HPC interconnects.
[0133] The linear performance scalability of IP is illustrated in
[0134] A second dimension of performance scaling is associated to the choice of the number n of bit lanes and transmission rate of the interconnect (52) which shifts the RP horizontally as indicated by the circle raster in
[0135] IP cores (70) use memory interleaving: each logical memory address adr accessed by software is decomposed by the CU (71) into a quotient a=adr/m and a remainder b=adr mod m. The remainder b selects the PPC (40) from which to read data via its local ports (44-46) or to which to issue instructions via local port (43); the quotient a selects the local memory address in the memory (42) of said selected PPC (40) where the data is physically located. This is illustrated by
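The address decomposition performed by the CU can be sketched in a few lines (illustrative only; the function name and the PPC count m=31 are not taken from the patent):

```python
# Minimal sketch of the interleaved address decomposition described above.
def decompose(adr: int, m: int) -> tuple[int, int]:
    """Split a logical address into (local address a, PPC index b)."""
    a, b = divmod(adr, m)  # a = adr // m is the local memory address,
    return a, b            # b = adr mod m selects the PPC (memory bank)

# Consecutive logical addresses land in consecutive PPCs, so a unit-stride
# vector is spread evenly over all m parallel memory systems.
m = 31  # a prime PPC count, as recommended below
banks = [decompose(adr, m)[1] for adr in range(8)]
```

With this mapping, m independent PPCs can be read, written, or issued to in parallel for any unit-stride access pattern.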
[0136] As drafted in
[0137] The latter principle also applies to all machine instructions that compute a result as the compiler then will have to specify which logical address to write said result to. As according to
[0138] According to the invention, the basic operation mode of the core (70) is the general purpose processing (GPP) of a stack machine (SM) described by PAT1.
[0139] Now, the “virtual register stack” of the SM of PAT1 is addressable and handled by the CU (71); it therefore can be used to pass information from the software down to the hardware via a fast parallel interface (77, 78).
[0140] Thus, the invention proposes the definition of SIMD machine instructions in the IP core's ISA that are parameterized by descriptors that are located on the stack. The general structure of such a descriptor is exemplified in
[0143] Said mathematical condition refers to an aliasing phenomenon caused by a periodicity conflict between memory access stride s and interleaving stride m which does not occur if both strides s and m are relatively prime as demonstrated in
[0144] If on the other hand m is prime, e. g. m=31, the performance is constant except for cases in which one of the vector strides involved is an integer multiple of m (
[0145] Therefore, to avoid aliasing as best as possible, m should be chosen as a prime number. An efficient address decoding mechanism (cf.
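The aliasing argument can be replayed with a short simulation, assuming the bank index is adr mod m as described above (sweep length and strides are illustrative):

```python
# Count how many distinct interleaved banks a strided sweep touches.
def banks_hit(stride: int, m: int, n: int = 1024) -> int:
    """Distinct bank indices visited by an n-element sweep with the given stride."""
    return len({(i * stride) % m for i in range(n)})

# m = 256 (power of two): stride 8 shares a factor with m, so the sweep
# cycles through only 256/gcd(8, 256) = 32 of the 256 banks (aliasing).
# m = 31 (prime): any stride that is not a multiple of 31 visits all banks;
# only strides that are integer multiples of m collapse onto a single bank.
```

This is the periodicity conflict in miniature: with a prime m, performance is uniform except for the rare strides divisible by m.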
[0146] As an alternative to using explicit machine operation descriptors (150)—which is intended for the implementation of special procedures at the machine level—pointers to data descriptors (151) may also be directly passed to scalar machine instructions. In such a case, said instructions are executed as SIMD operations (e. g. vector dot product, vector scalar product, matrix vector multiplication, tensor contraction, and so forth).
[0147] One way to signal to the CU whether a value provided to an instruction represents a scalar value or a pointer to a data descriptor (151) is using typed data, e. g. as disclosed in PAT2. Said method additionally allows polymorphic data processing as well as pointer safe, type safe, and cyber secure memory addressing.
[0148] With SIMD instructions, Fortran array, BLAS, and SQL primitives can be defined, leading to a lean yet powerful ISA which allows programming at a level of abstraction which hides (and internally manages) hardware details like the PPC count m of the targeted IP core. In other words, IP SIMD processing is “vector length agnostic” as vectors are treated as whole objects and need not be explicitly decomposed at the SW level into chunks that fit into a (hardware-dependent) “vector unit”; this job is transparently performed in the background by the CU (71).
[0149] To further ensure software portability between various IP cores (
[0151] This means that in the steady state each iteration of the pipelined loop simultaneously issues three input reads (a, b, c), four multiplications, one negation, two subtractions, one square root calculation, one addition, two divisions, and two output writes (x1, x2). On a core (83), these operations are encoded in a VLIW which implements the entire DFG—including data paths—within one single parallel superinstruction (
[0152] Before starting the loop, said VLIW is fetched by the CU via instruction ports (44) and decoded. During loop execution, the CU iterates said VLIW without further instruction load or decode. Data read/writes are pipelined.
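Judging by the operation counts listed above (three reads a, b, c; four multiplications; one negation; two subtractions; one square root; one addition; two divisions; two writes x1, x2), the loop body appears to compute the classic quadratic roots x1,2 = (−b ± √(b²−4ac)) / (2a). A scalar sketch of such a DFG — an assumption for illustration, not the patent's actual code:

```python
import math

def quadratic_roots(a: float, b: float, c: float):
    """One DFG iteration: 4 mult, 1 neg, 2 sub, 1 sqrt, 1 add, 2 div."""
    t1 = b * b            # mult 1
    t2 = a * c            # mult 2
    t3 = 4.0 * t2         # mult 3
    t4 = 2.0 * a          # mult 4
    nb = -b               # negation
    d = t1 - t3           # subtraction 1: discriminant b^2 - 4ac
    r = math.sqrt(d)      # square root
    x1 = (nb + r) / t4    # addition, division 1
    x2 = (nb - r) / t4    # subtraction 2, division 2
    return x1, x2
```

In the pipelined loop, every one of these operations runs concurrently on a different wavefront, one result pair per iteration.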
[0153] As explained above, the parallel instructions of the DFG are interpreted as virtual operators (VO) and for an intuitive understanding of systolic loop pipelining, the timing diagram of
[0154] Evidently, buffer pipelines synchronize all data transports between the (internally pipelined) VOs of the DFG on a systolic cycle-to-cycle basis. In an ASIC implementation of
[0155] Translating said FIFO synchronization concept to general purpose IP, buffer pipelines are, according to the invention, implemented as chained sets of “pipelined memory” addresses called “latches”. In
[0156] In pipelined memory, data moves from latch to latch each time a loop iteration starts; but instead of physically moving said data, this behavior is virtually created by a memory address mapping method as follows:
[0157] According to
[0158] While executing the loop, the CU (71) uses a hardware loop counter i which is set to 0 when entering and incremented when iterating the loop. Each latch of a pipelined memory block (194) is dynamically mapped to a relative logical memory address rel per rel=(latch+i)∧(2^n−1); in other words, the relative logical address rel of the latch comprises the lower n bits of the loop-dependent sum latch+i.
[0160] In other words, latch 44 sequentially writes to memory without overwriting older data as required for the functioning of a buffer pipeline. It does so at least for 2^n=2^7=128 iterations which is sufficient to not interfere with any of the other latches 0 . . . 80 required by the timing diagram of
[0161] Next, following the result computed at i=1000 in “mult output pipeline”, with each loop iteration, said result appears to move from latch to latch where it can be accessed at latches=44, 43, 42, 41, 40, . . . This mimics the desired FIFO data transport behavior of the buffer pipeline.
[0162] In iteration 1004—i. e. four cycles after the write operation at i=1000—the value stored in rel=20 can be read in from latch 40 by the subtraction operator (“sub reads”) for further processing. Looking back in time, in the first iteration at i=1000, data that had been calculated by the multiplication at i=996 and—at that time—stored under relative logical address rel=16 is read from the subtraction's input latch 40. This is how both multiplication and subtraction can work simultaneously without interfering via their data dependence: both operators are decoupled because they simultaneously process different loop stages (i. e. iterations of the original loop).
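The address arithmetic of this walkthrough can be replayed in a few lines (a sketch only; n=7 and the latch numbers are those of the example above):

```python
def rel(latch: int, i: int, n: int = 7) -> int:
    """Latch-to-relative-address mapping rel = (latch + i) & (2^n - 1)."""
    return (latch + i) & ((1 << n) - 1)

# The multiplication writes to latch 44 each iteration; because the mapping
# advances with i, successive writes do not overwrite each other for
# 2^7 = 128 iterations. Four iterations after the write at i = 1000, the
# same physical address reappears under input latch 40, where the
# subtraction reads it without any physical data movement.
```

No data is copied between latches: only the address mapping rotates with the loop counter.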
[0163] From an LLVM compiler's point of view, each edge of the DFG of
[0164] Buffer pipelines can also implement “loop-carried dependences”. In a pipelined loop, the effects of the latter are best understood in “wavefront notation” by which each travelling wavefront is numbered by an index k assigned by the loop iteration index i at which wavefront k enters the pipeline in stage 0. Wavefront and pipelined loop indices are interrelated by k=i−s (s=stage). Each wavefront k represents one iteration k of the software-defined “algorithmic loop”. As “loop-carried dependence” means that algorithmic loop iteration k accesses data computed by some earlier iteration k−Δk, at some stage s, wavefront k accesses wavefront k−Δk. Such an access is implemented by reading from a latch that is positioned Δk stages to the right of said wavefront k.
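The wavefront bookkeeping can be sketched as follows (the values of i, s, and Δk are illustrative, not taken from a figure):

```python
def wavefront(i: int, s: int) -> int:
    """Index k of the wavefront sitting in pipeline stage s at iteration i."""
    return i - s  # k = i - s, as defined above

# A loop-carried dependence with distance dk: at stage s, wavefront k needs
# data of wavefront k - dk, which at the same iteration i sits dk stages
# further to the right -- hence the read from a latch dk stages to the right.
i, s, dk = 1000, 5, 3
k = wavefront(i, s)
```

Reading stage s+dk at iteration i thus yields exactly wavefront k−Δk, i. e. the result of algorithmic iteration k−Δk.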
[0165] Also, buffer pipelines are used when expanding complex operations like division, square root, sine, cosine, polynomial, digital filter, and so forth.
[0166] Finally, buffer pipelines can compensate runtime variations in the latency of operations. As specified above, ALU operations are required to have a guaranteed minimum latency and an upper latency limit above which the operation—when unfinished—stalls the CU until it has delivered its result. When planning the buffer write operation of an operator A, the compiler has to plan its latch allocation for the minimum latency of A; when planning a later buffer read by some operator B which consumes the result of A the compiler instead has to assume the maximum latency of A. The latch-to-address translation is performed by the CU and accordingly communicated to A. Therefore, A will always write its result to the same logical memory address, regardless of A's latency; integrity of B's read access is guarded by the above rules.
[0167] The above-mentioned runtime variations in the latency of ALU operations—called “jitter”—are needed to resolve memory access conflicts caused by latency differences between the ALU operation types. Taking
[0168] The same is true for non-ALU operations like e. g. memory read and write as exemplified in
[0169] Apart from the above-mentioned hardware buffering—in this case: in the CU instead of in the ALUs—and neglecting the costly option of increasing port numbers the following solutions of mitigating memory access conflicts are viable which have to be supported by ISA, compiler, and CU hardware: eliminating redundant memory read accesses—e. g. in
[0175] Notably in this context, the timing diagram
[0179] In effect, memory pipelining offers a trade-off between memory and throughput. In principle, the pipelined memory could have an arbitrarily large access latency; this releases all restrictions regarding the size of the interleaved M1 memory block—which is mainly intended for use as pipelined memory reservoir—except for data transport energy restrictions.
[0180] But while this is true for throughput-oriented workloads it may be not for real-time applications in which case a second trade-off must be stricken regarding M1 latency versus throughput.
[0181] To summarize: using latches instead of logical addresses, loop-invariant machine code can be compiled (cf.
[0182] On a general note, the above-explained method of memory pipelining is not restricted to handling word-sized (on a 64 bit machine: 8 B) data items only but can also be applied to wider data types such as quad precision floating point or 64 bit complex numbers. Then, a second pipelined memory block (194) must be allocated which “rotates” twice as fast as the standard block—i. e. transforms latches to relative logical addresses with a doubled step width: rel=(latch+2*i)∧(2^n−1). In assembler code, latch types then need to be distinguished by suffixes, e. g. ‘21p’ denoting a single-step latch and ‘13p2’ a double-step latch.
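The double-step mapping differs from the single-step one only in the multiplier on the loop counter (a sketch; n=7 is illustrative):

```python
def rel_step(latch: int, i: int, step: int, n: int = 7) -> int:
    """Generalized latch mapping: rel = (latch + step*i) & (2^n - 1)."""
    return (latch + step * i) & ((1 << n) - 1)

# step = 1: standard single-step block; step = 2: the faster-rotating block
# for two-word data types, which advances by two addresses per iteration so
# that a pair of words written at consecutive addresses is never overwritten.
```

With step=2 the block's capacity in iterations is halved (2^n / 2), matching the doubled data width.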
[0184] Table (a) shows a case in which the number of algorithmic iterations N is greater than the number of pipelined loop stages S. Under this condition, “prolog” (loop filling), “kernel” (steady state), and “epilog” (loop draining) phases can clearly be distinguished and a static compiler method (Rau) would assemble S prolog VLIW, one kernel VLIW plus control structure, and S epilog VLIW; well-known downsides of such static compilation are: [0185] The more stages S the loop has the bigger is its prolog/epilog overhead in relation to the kernel with respect to code size; [0186] Epilog latency is “dead time” as during epilog the loop cannot be restarted as shown in
[0189] The control tag (221) consists of three control bits, namely: [0190] TERM (223), used to stream parallel control flow information for algorithmic loop termination in conditional pipelined loops; [0191] NULL (224) which is not only used to stream parallel control flow information within predicated conditional statements but also for the scenarios of
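A possible bit-level encoding of the augmented data word (220) can be sketched as follows; the field widths and bit positions are assumptions for illustration, as the text does not fix a concrete layout:

```python
# Hypothetical layout: 64-bit data word (222) plus a 3-bit control tag (221).
TERM = 1 << 66  # loop termination control bit (223)
NULL = 1 << 65  # predication / idle-wavefront control bit (224)
ESC  = 1 << 64  # "data word carries further control information" bit (225)
WORD_MASK = (1 << 64) - 1

def tag(word: int, term=False, null=False, esc=False) -> int:
    """Augment a 64-bit data word with its control tag."""
    t = word & WORD_MASK
    if term: t |= TERM
    if null: t |= NULL
    if esc:  t |= ESC
    return t

w = tag(0xCAFE, term=True)
```

The augmented word is stored, processed, and communicated as a whole, so the tag travels through pipelines and channels with its payload.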
[0193] With control/data, the scenarios of dynamic loop control of
[0194] As said non-blocking mode avoids pipeline stalls and drains the pipeline as fast as possible when reading from data sources with jittering latency, it can be used e. g. for real-time signal processing. In non-blocking mode, the pipeline is always “live”, clearing out wavefronts as fast as possible, but has the disadvantage of also processing idle wavefronts which costs data transport energy along paths (75, 77-79) and overall performance (as the pipeline never deschedules and hence blocks all other threads).
[0195] In order to minimize energy costs and processing overhead, the pipeline automatically switches to the below-explained “blocking” mode when it is fully drained. In blocking mode, when stalling, a thread is descheduled by the temporal multithreading (TMT) scheduler and only rescheduled when new data arrives and the pipeline resumes its former non-blocking mode.
[0196] Additional hardware support to reduce data transport energy of idle data may take into account that in the idle case only the NULL bit needs to be stored and communicated via the internal busses.
[0197] In pure data processing workloads, instead of latency, optimal resource utilization is of interest. For such workloads the “blocking” operation mode is suitable in which the pipeline is descheduled when no input is available; the pipeline processes valid wavefronts only. Hereafter, blocking mode is assumed as the default IP pipeline operation mode.
[0198] A pipeline can be drained by an “end of transmission” (EOT) wavefront, i. e. a wavefront which comprises only EOT data (
[0199] Synchronization of all RD and WR operators (which may be necessary when channel or physical I/O is involved) is enabled by the fact that VLIW execution can be stalled by each machine instruction—here: RD or WR—it comprises; said synchronization also works in the above-mentioned case in which one pipelined loop iteration consists of a sequence of VLIW.
[0200] Notably, all I/O operators like RD and WR are executed by the CU (71) which as a central unit coordinates the above-mentioned synchronization. The CU also holds pipeline state information, e. g. “blocking vs. non-blocking mode” (discussed above), “SOT may interrupt draining vs. full drain always” etc.
[0201] Accordingly, packet processing can be implemented with “start/end of package” (SOP/EOP) wavefronts, including the option to either drain the pipeline after each packet or to keep it filled in blocking mode, both modes having perks and drawbacks with respect to latency-bound real-time vs. throughput-bound standard data processing, as discussed above.
[0202] TERM is used in pipelined conditional loops as follows: A comparison operator determines whether the condition to exit the algorithmic loop is met and accordingly sets the TERM bit on data paths which lead to “conditional write” (CWR) and/or “conditional read” (CRD) operators.
[0203] Over CWR operators, the conditional algorithmic loop can deliver results. CWRs write out TERM=1 data only. In the result the TERM bit is cleared.
[0204] A CRD operator is a “two-way selector switch” which feeds a wavefront back into the pipeline for further iteration if TERM=0 or replaces it by a read-in wavefront to start the next algorithmic loop iteration (TERM=1).
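The CWR/CRD behavior on tagged wavefronts can be sketched as follows (a model built on a hypothetical `(term, data)` wavefront representation; names are illustrative):

```python
def cwr(wavefront, output):
    """Conditional write: deliver only TERM=1 wavefronts, clearing the bit."""
    term, data = wavefront
    if term:
        output.append((False, data))  # TERM bit is cleared in the result

def crd(wavefront, source):
    """Conditional read: recirculate TERM=0 wavefronts, else take new input."""
    term, data = wavefront
    return wavefront if not term else (False, next(source))

out = []
cwr((True, 42), out)   # finished algorithmic iteration: result delivered
cwr((False, 7), out)   # unfinished iteration: nothing written
src = iter([99])
kept = crd((False, 7), src)    # keeps iterating in the pipeline
fresh = crd((True, 42), src)   # replaced by a freshly read-in wavefront
```

Together, CWR drains finished wavefronts while CRD refills their slots, so the conditional loop runs without branching.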
[0206] Parallel execution of many control flows as exemplified in
[0207] In anticipation of the below-explained concept of “superpipelining” it is noted that embedding an IP pipeline's internal control flow into a dedicated WL control/data stream allows any IP pipeline to be “split” into a sequence of concatenated IP pipelines which can be executed by parallel threads that then may—if deemed appropriate—be distributed over a net of distinct cores or even processors. This option of “control flow distribution” and “pipelining pipelines” is a unique feature of the invention and expected to be useful when pipelining large complex workloads (workload balancing as discussed below).
[0208] In an IP pipeline, state-of-the-art predication methods can be applied to transform algorithmic conditional if-else statements into elementary blocks of “predicated instructions” which are executed if a certain condition holds and otherwise skipped. Such predicated instructions are executed/skipped without branching and thus can be used within the IP pipeline concept.
[0209] Now within the context of the invention, control/data carried predication—i. e. conditional operator execution—is available per NULL bit (
[0215] Now as disclosed below, each of said CDFGs can be compiled into an independent new IP pipeline which will be executed by a parallel thread. The channel communication between sender and each of the two receiver IP pipelines is guarded by “multichannel fork” operators that behave like a multichannel version of the write operator WR by sending only valid data and skipping idle data. In such a parallel setup, the join operator has to be replaced by a “multichannel join” operator which accepts only valid input data from both parallel IP pipelines and ignores idle data.
[0216] Because of latency differences of the CDFG pipelines, communication, and multithreading/multiprocessing jitter, such a parallel setup will tend to reorder the original wavefront sequence and thus, “reserializing” by the “multichannel join” will be necessary. Overflow protection is automatically provided by the CSP (“communicating sequential processes”, Roscoe) data flow execution model under which, according to the invention, parallel threads are run. Multichannel fork and join operators are implemented by IP SIMD machine instructions that accept channel lists as arguments (see above and cf. Occam).
[0217] Compared to predicated execution, guarded IP pipelines save energy. It nevertheless may sometimes be desirable to mix predicated with guarded execution e. g. in complex switch or other nested conditional statements.
[0218] An alternative to a guarded IP pipeline is the conditional sequential call to a subroutine. A use case for such a sequential call might be a handler routine which is invoked by the fork operator when receiving an ESC word on (252). Such a sequential handler call temporarily suspends parallel IP pipelined operation by invoking the GPP stack machine mode but offers flexibility for sporadic system responses which cost too much energy and performance when implemented in predicated code or parallel pipelines.
[0219] In principle, an IP stack machine could sequentially run any number of IP pipelines but according to the invention only one IP pipeline per IP stack machine is used. This simplifies the memory allocation scheme of
[0220] Secondly, according to PAT1 stack memory is accessed via TOS-relative (TOS=top of stack) addresses, which are typically “small”. Thus, an ISA can be defined in which stack and pipelined memory accesses can be freely intermixed without giving rise to exceedingly wide instruction sizes; address types can be distinguished by postfixed letters ‘s’ and ‘p’ in the machine code (
[0221] Thirdly, using the “safe pointer” method disclosed in PAT2, access to allocations in raw interleaved memory blocks (190) can be indirected via “safe pointer” objects which are located on the stack (and hence accessed via “small” stack addresses). The latter method can be used to mix read and write accesses to memory allocations in (190) with access to pipelined memory in VLIWs as needed for the execution of
[0222] According to PAT1, thread switches require switching special purpose register (SPR) sets which hold the thread context consisting of instruction pointer, thread and other IDs, memory allocation and other information. The CU maintains a SPR set queue for the presently active and some number of presently waiting threads. Said SPR set queue is managed by the TMT scheduler which is partly implemented in hardware to ensure software-configured but at runtime mostly software-independent execution of the CSP data flow model (Roscoe, Occam). An arbitrary number of SPR sets—one per thread—are kept in a backup SPR block in raw interleaved memory (190); over the CU's parallel ports SPR sets can be saved to and restored from said SPR backup block.
[0223] Along with the SPR set queue, the TMT scheduler also prepares the waiting threads by allocating/preloading their pipelined memory blocks (194) from their backup positions into fast access/low latency M1 SPM, accordingly adjusting the relevant pointers in their SPR sets. Similarly, M1 stack segments are allocated and preloaded.
[0224] Now when allocating such temporary M1 memory blocks during runtime, said blocks can be aligned in such a way that only local DMA-operated internal memory ports (
[0225] Accordingly, the TMT scheduler can always maintain a certain number of threads waiting in the “ready to run in M1 mode” state in which their M1 blocks are preloaded. M1 blocks can be configured to performance needs. In principle, threads could also be completely—or temporarily—executed out of M2, M3 or M4 (DRAM) without being subject to M1 pre/offload.
[0226] At runtime, thread switches only require altering a thread pointer (TP) that selects which element of the SPR set queue—i. e. which thread—will be executed in the next cycle. Accordingly, in the intended standard use case of “low thread switching frequencies”, TMT operates with zero-cycle thread switches which allows the above-mentioned fine-grained latency hiding by thread switching in cases of a CU stall where, as long as the stall lasts, the CU (71) issues VLIWs of other threads to the PPC set (72).
[0227] In this context, as is usual in CSP implementations (Roscoe, Occam), CU stalls caused by a stalling inter-thread channel cause the TMT scheduler to deschedule the stalling thread. Then, the SPR set and M1 memory blocks occupied by said thread can—but need not—be offloaded, i.e., written back to their original backup allocations, waiting there until the thread is reloaded and made ready for its next execution round.
[0228] Thereby, the TMT scheduler manages M1 as a queue for the pipelined memory blocks and topmost stack segments of the active and the waiting threads, which it selects according to the CSP data flow execution scheme by availability of communicated data packets and by priority. This activity is supported by an additional address decoding mechanism which presents the M1 SPM to the software as (virtually) infinite memory that is mapped to the finite M1 memory volume by a modulo operation, thus allowing for temporary M1 memory allocation blocks which “wrap” at the M1 boundaries, making optimal use of the available M1 memory space.
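The modulo-based address decoding can be sketched as follows; the M1 size and the addresses are illustrative assumptions, not values from the specification:

```python
M1_SIZE = 4096  # bytes of physical M1 SPM (illustrative size)

def m1_physical_address(virtual_addr: int) -> int:
    """Map a (virtually infinite) M1 address onto the finite SPM by a
    modulo operation, so allocation blocks may wrap at the M1 boundary."""
    return virtual_addr % M1_SIZE

# a block allocated near the top of M1 wraps around to the bottom
block_base, block_len = 4000, 200
physical = [m1_physical_address(block_base + i) for i in range(block_len)]
```

Because the mapping is a plain modulo, no per-access lookup state is needed—consistent with the claim in paragraph [0229] that this virtualization replaces cache-style data lookup circuitry.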
[0229] As all stack and pipelined memory accesses are managed by virtualization layers (191, 193), the above-described physical M1 memory allocation and loading/offloading activity of the TMT scheduler is invisible to the operating system, middleware, and all software layers, which access said memory types per relative addresses, namely stack addresses and latches which—at the runtime of each thread—always point to the same physical data, regardless of the actual physical data location. This mechanism was hinted at above when claiming a “memory virtualization technique which replaces the energy and chip area intensive data lookup circuitry found in today's caches”. The only explicit activity required of the software, in this context, is to load often-used heap data (190) onto the stack (194)—where it remains accessible via powerful pointer operators (PAT2) but at minimum latency (PAT1)—and to offload it after use. In principle, said activity could be eliminated by re-introducing cache-type memory elements which, if desired, requires weighing area/energy overhead against application-dependent performance improvement.
[0230] On top of the above-described TMT, “simultaneous multithreading” (SMT) can, as shown in
[0235] According to the invention, software is understood as a parallel net of sequential threads—i.e., processes that sequentially execute VLIWs—which obey the “communicating sequential processes” (CSP) paradigm (Roscoe). As in transputer technology (Occam), CSP is a foundational building principle of the ISA. This means that channel communication and other OS primitives related to parallel processing are implemented as machine instructions which, according to the invention, comply with contemporary message passing interface (MPI) technology.
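By way of illustration, a CSP-style channel between two sequential threads can be approximated in software with a small bounded queue. In the claimed apparatus, send and receive are machine instructions of the ISA; the following is merely a hypothetical model of their blocking semantics:

```python
import queue
import threading

class Channel:
    """CSP-style channel model: a bounded queue approximates the rendezvous;
    a sender blocks (is 'descheduled') while the channel is full."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, msg):   # a machine instruction in the real ISA
        self._q.put(msg)

    def recv(self):        # likewise a machine instruction
        return self._q.get()

ch = Channel()
results = []

def producer():
    for x in (1, 2, 3):
        ch.send(x * x)

def consumer():
    for _ in range(3):
        results.append(ch.recv())

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```

Blocking on a full or empty channel corresponds to the descheduling behavior described in paragraph [0227].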
[0236] Now, most relevant computer workloads execute repetitive work which is typically organized in the form of loops. Sometimes, such loops can comprise complex control structures and hierarchical levels of nested subroutine calls which, at the lower levels, might reach into the middleware and operating system realm. The next step of the invention shows how any loop can be deconstructed in a way that allows it to be executed by a distributed network of IP pipelines, which is called an “IP superpipeline”.
[0237] At first again reconsidering the example of
[0238]
[0239] The caller-callee channel (263) is fixed but, as the callee is a subroutine—which in principle could be called by other callers as well—its output channel must be dynamically vectorized, as it might have to deliver the results to consignees which are not known at compile time. To do so, the SM which operates the caller's IP pipeline opens the call via a CALL sent to the callee and transmits dynamic channel (264) configuration regarding the targeted consignee per LINK (
[0240] The callee's pipeline, while responding to the caller's connection request, triggers the exception handlers required to dynamically configure its output channels and then processes the caller's data, delivering its sqrt results to the consignee. During call execution, the callee encodes the caller's ID and call number in its WL (cf.
[0241] Finally, the caller closes the call per RET which is answered by ACK.
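The CALL/LINK/RET/ACK handshake of paragraphs [0239]–[0241] can be modeled, purely for illustration, as follows. The method names mirror the transaction names above; the Python structure itself is a hypothetical sketch, not the claimed implementation:

```python
import math

class Callee:
    """Sketch of the sqrt subroutine pipeline: CALL opens the call,
    LINK configures the dynamic output channel (264) to a consignee,
    RET closes the call and is answered by ACK."""
    def __init__(self):
        self.consignee = None
        self.open_call = None

    def CALL(self, caller_id, call_no):
        self.open_call = (caller_id, call_no)

    def LINK(self, consignee):
        self.consignee = consignee  # dynamic output channel setup

    def process(self, value):
        # tag each result with caller ID and call number (cf. WL encoding)
        self.consignee.append((self.open_call, math.sqrt(value)))

    def RET(self):
        self.open_call = None
        return "ACK"

consignee_buf = []
callee = Callee()
callee.CALL(caller_id=1, call_no=42)
callee.LINK(consignee_buf)
callee.process(9.0)
ack = callee.RET()
```

The tagging step models why the consignee can attribute results even though the callee may serve several callers.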
[0242] The above-discussed
[0243] The above-hinted recursion method can also be applied to inner loops of a non-elementary loop. Again regarding
[0244]
[0245] With this final step, the recursive deconstruction method is complete and can be applied to any outer loop which—organized as an IP superpipeline—can be distributed over a network of IP cores, levelling out and optimizing core and processor loads to achieve the best possible VHTC performance. On the software layer, the “placement” of the individual pipelines on cores and processors is invisible; it is configured at the start of the software but can be altered at runtime, which is useful for dynamic workload balancing.
[0246] At the topmost level, a task will always start in sequential von Neumann GPP stack machine mode as one initial thread. The above recursive loop deconstruction method is optional and may, according to the invention, be applied to at most one outer loop comprised in said initial thread; this limitation to “one loop only” is not necessary but is a design choice which simplifies memory management (at most one pipelined memory allocation per thread). If several outer loops are selected for deconstruction in said initial thread, then the initial thread needs to be decomposed into several communicating threads (cf. the caller-consignee setup of
[0247] Notably, in each step, the above-explained recursive deconstruction method creates new threads, for each of which an independent decision can be made whether to process its input control/data in sequential GPP or in parallel VHTC pipeline mode (referring back, the same is true for the example stated in
[0248] Applications of the invention are thus expected, in practice, to consist of a balanced mixture of GPP, SIMD, and VHTC, which also has implications for IP core hardware design:
[0249] IP cores designed for throughput-oriented “number crunching” applications, in which latency is less important, are expected to have a large M1 to accommodate large (and many) pipelined memory blocks which sustain VHTC. Due to the large M1 latency, such cores will underperform when operated under real-time conditions in GPP mode, especially compared to traditional register machine cores (as registers have very low latencies).
[0250] On the other hand, IP cores designed for real-time processing workloads will have very small M1 SPM, possibly reaching down to the size of typical register files. In such cores, the TMT will allocate its temporary pipelined memory blocks in M2 SPM but not—as discussed above—in M1. As pipelined VHTC processing is almost insensitive to buffer latency, this measure will only slightly degrade VHTC while yielding good GPP real-time performance.
[0251] The latter IP core type is expected to be found in industrial applications with artificial intelligence aspects, the former rather in the data center.