METHOD OF INTERLEAVED PROCESSING ON A GENERAL-PURPOSE COMPUTING CORE

20230367604 · 2023-11-16

    Inventors

    Cpc classification

    International classification

    Abstract

    A method of “interleaved processing” (IP) is proposed which generalizes the functional principle of memory interleaving by extending the interleaved memory system into the processor chip and prepending each write access to one of the extended interleaved memory banks by a data transforming operation. The method opens a new dimension of large scale software parallelization and is implemented in autonomous processing units called “parallel processing channels” (PPC) that integrate processor and memory at a very low machine balance—which solves the memory wall problem— and execute on-chip machine transactions at a 1 Flop/cycle throughput. IP computing systems are linearly performance scalable and capable of pipelined execution of very large and complex HPC workloads. They have unique performance advantages in strided vector, tensor, and data set operations; for relevant HPC workload types, up to 10×-100× per-Watt single-processor performance gains compared to today's technologies are expected.

    Claims

    1. A programmable data processing apparatus (70) comprising: a combined data processing and storage unit (72) comprising at least one arithmetic logic unit (41) capable of executing data-transforming operations specified by an instruction set architecture of the apparatus (70), such as arithmetic, logic, or other foundational data-transforming operations on data of all data types supported by the apparatus (70) as per the instruction set architecture of the apparatus (70), and at least one memory system (42) for storing the data and code pertaining to said operations, and a programmable control unit (71) for executing further, for example, control-related operations specified by the instruction set architecture other than the data-transforming operations, such as pointer dereferencing, data fetch, data store, comparison, instruction fetch, decode, dispatch or branch, the control unit (71) being adapted to, under control by code fetched from the processing and storage unit (72) by data transport means (76), obtain operand data from the processing and storage unit (72) by operand data transport means (77, 78), and to control the processing and storage unit (72), by a set of mutually independent parallel instruction transport means (75), using transaction control words specifying the data-transforming operations to be performed, the operand data to be processed by said data-transforming operations, and destination locations or addresses for results of said data-transforming operations, characterized in that the apparatus (70) is configured to, in at least one operational mode, write the results of the data-transforming operations directly to a destination address within the memory system (42) and any setup of the processing and storage unit (72) with a set (73) of mutually independent arithmetic logic units (41) for executing the data-transforming operations in parallel, with a set (74) of mutually independent memory systems (42) for storing the results of said parallel data-transforming operations in parallel, and with a set (79) of mutually independent parallel data transport means (47) for directly transferring the results of said parallel data-transforming operations from the set (73) of mutually independent arithmetic logic units (41) to the set (74) of mutually independent memory systems (42) in the at least one operational mode of the apparatus (70) is adapted to execute at least one very long transaction word comprising a set of parallel transactions issued by the control unit (71) via the set of instruction transport means (75) to the set (73) of mutually independent arithmetic logic units (41) of the processing and storage unit (72), said transaction word comprising at least one transaction which differs from at least one second transaction within the same transaction word with respect to its instruction type.

    2. The apparatus (70) of claim 1, wherein the at least one transaction word comprises at least one transaction which causes the arithmetic logic unit (41) to write the value of one operand directly to one memory system (42) of the set (74) of memory systems.

    3. The apparatus (70) of claim 1, wherein each arithmetic logic unit (41) of the set (73) of mutually independent arithmetic logic units is connected to one instruction transport means (43) of the set (75) of mutually independent parallel instruction transport means (75) and said instruction transport means (43) is connected exclusively to said arithmetic logic unit (41).

    4. The apparatus (70) of claim 1, wherein each memory system (42) of the set (74) of memory systems is connected to one arithmetic logic unit (41) of the set (73) of mutually independent arithmetic logic units via one data transport means (47) of the set (79) of mutually independent parallel data transport means (75) and said arithmetic logic unit (41) is connected exclusively to said memory system (42) via said data transport means (47), the apparatus (70) preferably comprising a prime number of arithmetic logic units (41).

    5. The apparatus (70) of claim 1 claims, wherein at least one of the memory systems (42) is partitioned into a set of mutually independent memory subsystems (53) interconnected by data transport means (50, 51).

    6. The apparatus (70) of claim 4, wherein at least one of the memory subsystems (53) has further independent data transport means (44, 45) for exchanging data with units of the apparatus (70) other than the connected arithmetic logic unit (41), such as a further control unit (71), or I/O interfaces such as to a further apparatus (70), file system, processor interconnect, user console, or other external real-time data source or sink.

    7. The apparatus (70) of claim 1, wherein the control unit (71) is further adapted to read the data and code from the at least one memory system (42) of the set (74) of memory systems via a set of mutually independent parallel data transport means (44, 45).

    8. The apparatus (70) of claim 4, wherein the control unit (71), by means of an interleaved memory abstraction layer (190), is adapted to access the mutually independent memory systems (42) of the set (74) of memory systems as one single unified block of interleaved logical memory such that any logical memory address uniquely specifies the memory system (42) of the set (74) of memory systems and the destination address within said memory system (42) physically holding the data.

    9. The apparatus (70) of claim 8, wherein the control unit (71) comprises an address decoder implementing the interleaved memory abstraction layer (190).

    10. A method of using the programmable data processing apparatus (70) of claim 1, wherein, by means of a pipelined memory abstraction layer (193), a virtual data flow is generated, said virtual data flow being controlled by an index and an address transformation defined within a coherent memory block forming a pipeline (194) such that said index is either subtracted from or added to a relative address located within the pipeline (194), the transformation yielding a latch address.

    11. The method of claim 10, wherein by a further address transformation, for example, by means of a modulo transformation, the pipeline (194) is organized as a ring buffer.

    12. The method of claim 11, wherein latch addresses are used, as per the instruction set architecture of the apparatus (70), as instruction source and destination addresses, such as by means of address decoder hardware adapted to, depending on the index associated with the pipeline (194), translating the latch addresses into logical addresses.

    13. The method of claim 12, wherein the pipeline (194) is associated with at least one software loop for either incrementing or decrementing the index associated with the pipeline (194).

    14. The method of claim 13, wherein the software loop is executed by a dedicated thread.

    15. The method of claim 11, wherein pipelines (194) communicate, preferably via channels.

    16. The method of claim 15, wherein the communicating pipelines (194) form a network.

    17. The method of claim 11, wherein control flow information is embedded into data streams processed by or communicated between pipelines.

    18. The method of claim 17, wherein the control flow information is embedded into a data stream by augmenting each data word (222) of said stream by a control tag (221), the augmented data word (220) being stored, processed and communicated as a whole, and the control tag (221) comprising a combination of bits (223, 224, 225) for controlling execution, such as at least one bit (223) controlling loop termination, one bit (224) controlling conditional execution of operations on the data word (222), or one bit (225) indicating that the data word (222) contains further control information.

    19. The method of claim 10 wherein, such as by means of a stack memory abstraction layer (191), memory blocks (192) are allocated as an addressable stack of data referenced relatively to a top of the stack.

    Description

    BRIEF DESCRIPTION OF DRAWINGS

    [0092] The invention is presented by way of example, and not by way of limitation, in the figures of the accompanying drawings:

    [0093] FIG. 1: PRIOR ART—today's top 25 supercomputers and evaluations

    [0094] FIG. 2: PRIOR ART—rooflines of leading data processing architectures

    [0095] FIG. 3: PRIOR ART—read-outs from FIG. 2 for relevant HPC workloads

    [0096] FIG. 4: Design of a parallel processing channel (PPC)

    [0097] FIG. 5: Distributed hierarchical memory system of one PPC

    [0098] FIG. 6: Simple processing core built with one single PPC

    [0099] FIG. 7: Interleaved Processing (IP) core built from m PPCs

    [0100] FIG. 8: Performance scalability—rooflines of various IP cores

    [0101] FIG. 9: The IP cores of FIG. 8: expected data

    [0102] FIG. 10: Interleaved Processing is based on memory word interleaving

    [0103] FIG. 11: Addition of unit stride vectors—Dense PPC port usage

    [0104] FIG. 12: Addition of strided vectors—Dense PPC port usage

    [0105] FIG. 13: Aliasing effects caused by a non-prime PPC count m

    [0106] FIG. 14: Aliasing effects caused by a prime PPC count m

    [0107] FIG. 15: Types of descriptors involved in a vector addition

    [0108] FIG. 16: Data Flow Graph (DFG) of an elementary loop body

    [0109] FIG. 17: Systolic timing diagram for the DFG of FIG. 16

    [0110] FIG. 18: PRIOR ART—data of the numerical operations in FIG. 17

    [0111] FIG. 19: Memory abstraction layers

    [0112] FIG. 20: Pipelined memory—the functioning of one buffer pipeline

    [0113] FIG. 21: Advanced operation modes of a pipelined loop

    [0114] FIG. 22: Binary layout of a “control/data” word

    [0115] FIG. 23: Preliminary draft of an ESC encoding scheme

    [0116] FIG. 24: Scheduling of a pipelined conditional loop

    [0117] FIG. 25: Conditional fork operator and a join operator

    [0118] FIG. 26: IP superpipeline of a parallel subroutine call

    [0119] FIG. 27: IP superpipeline of a parallel if-else statement

    [0120] FIG. 28: Technological classification according to Flynn's taxonomy

    [0121] FIG. 29: IP operational modes

    [0122] FIG. 30: Simultaneous Multithreading (SMT) on IP cores

    [0123] FIG. 31: SMT on IP core (83)—prioritized per-PPC combination of VLTWs

    [0124] FIG. 32: Scheduling—necessary ALU jitter

    [0125] FIG. 33: Compiled DFG of FIG. 16—VLIW assembler code

    DESCRIPTION OF EMBODIMENTS

    [0126] With reference to the known method of memory interleaving—by which memory is organized as a set of parallel and independent memory banks on a DRAM module—each such interleaved DRAM bank (56) is extended as illustrated in FIG. 5 to form a distributed hierarchical memory system (42) that reaches from the DRAM bank (56) into a processor chip via a private chip-to-chip interconnect (52); the same “principle of privacy” further applying to the on-chip SRAM banks (53-55) of each distributed hierarchical memory system (42) and its internal interconnects (50, 51).

    [0127] Each memory system (42) thus is a stand-alone unit which interacts with other units of the processor chip only through independent ports (44-47).

    [0128] As drafted in FIG. 4, each memory system (42) is augmented by a private local data processing unit ALU (41) which consists of pipelined elementary processing elements (adders, multipliers etc.) and sustains a 1 Flop/cycle throughput for all types of data transformation it can execute according to the processor's instruction set architecture (ISA) (for compatibility with existing literature the term “Flop”—floating point operation—used in this application always includes any type of data transforming operation “op”). Furthermore, the latencies of each of said types of data transformation are required to have a firmly guaranteed lower bound and an upper limit above which the ALU is guaranteed to signal a stall request to the CU (71).

    [0129] The assembly (40) is a stand-alone distributed processing element which interacts with its on-chip environment only via independent ports (43-46). Such an element (40) is called a “parallel processing channel” (PPC).

    [0130] When creating a processor-memory system with 3D chip packaging and differential serial processor-to-memory interconnects (52), low-MB PPCs can be built. This is exemplified in FIG. 9, row (87), in which a 2 GHz clock, 1 Flop/cycle, and therefore a 2 GFlop/s throughput for the ALU (41) was assumed, and for the interconnect (52) a bidirectional differential bit link of 2×25 Gb/s bandwidth, which yields MB=(2 GFlop/s)/(6.25 GB/s), i. e. an (exemplary) machine balance of MB=0.32 Flop/B.

    [0131] The method of interleaved processing (IP) is drafted in FIG. 7 from which it is clear that when building cores (70) with m PPCs (72), the accumulated processing and data transfer bandwidths both scale linearly with m. Thus, core-level IP is linearly performance scalable (LPS). With such cores (70), LPS multicore/DRAM packages with thousands of PPCs can be built, a rough order of magnitude for the upper limit being given by comparison with modern GPU designs (80×32 FP64 ALUs in a Nvidia Tesla V100).

    [0132] Furthermore, high bandwidth parallel memory mapped I/O can be directly integrated in the memory system (74) which makes the design well-suited for HPC signal processing and future high-speed HPC interconnects.

    [0133] The linear performance scalability of IP is illustrated in FIG. 8 for cores (80, 81, and 83-87)—PPC counts m are noted in FIG. 9—and visualized by the circumstance that altering the PPC count m of a core shifts its ridge point RP vertically—i. e. without altering the core's MB.

    [0134] A second dimension of performance scaling is associated to the choice of the number n of bit lanes and transmission rate of the interconnect (52) which shifts the RP horizontally as indicated by the circle raster in FIG. 8. By such scaling, the MB can be adapted to workload AI requirements.

    [0135] IP cores (70) use memory interleaving: each logical memory address adraccessed by software is decomposed by the CU (71) into a quotient a=adr/m and a remainder b=adr mod m. The latter b selects the PPC (40) from which to read data via its local ports (44-46) or to which to issue instructions via local port (43); the former quotient a selects the local memory address in the memory (42) of said selected PPC (40) where the data is physically located. This is illustrated by FIG. 10 which exemplifies the distribution of a logical address range of adr=0 . . . 154 over all local PPC memories for an IP core (83) with a PPC count of m=31.

    [0136] As drafted in FIG. 7, all read and write accesses to memory (42) are unarbitrated. The latter are issued indirectly as data write instructions to the ALU (41) of the current PPC (40) which executes the pursuant write operation—again unarbitratedly—via local write port (47) to memory (42).

    [0137] The latter principle also applies to all machine instructions that compute a result as the compiler then will have to specify which logical address to write said result to. As according to FIG. 7, the latter logical address can only be accessed by its local ALU (41), the same address decomposition of the CU (71) that governs memory interleaving when reading and writing data also selects which PPC an instruction is sent to for execution. This is the method of “interleaved processing” IP.

    [0138] According to the invention, the basic operation mode of the core (70) is the general purpose processing (GPP) of a stack machine (SM) described by PAT1. FIG. 19 illustrates how on top of the above-explained interleaved memory abstraction layer (190), a second memory abstraction layer (191) implements the specific “virtual register stack” addressing method of PAT1 which means that the SM inherits the core's VLIW processing capabilities.

    [0139] Now, the “virtual register stack” of the SM of PAT1 is addressable and handled by the CU (71); it therefore can be used to pass information from the software down to the hardware via a fast parallel interface (77, 78).

    [0140] Thus, the invention proposes the definition of SIMD machine instructions in the IP core's ISA that are parameterized by descriptors that are located on the stack. The general structure of such a descriptor is exemplified in FIG. 15 for a vector addition x=y+z: with a pointer to the descriptor (150) of said exemplary vector addition, the CU can read in its operation type selector (153)—“vector addition”—and pointers (154) to descriptors (151) of the vector data that have to be processed. In the given example, these describe vectors x, y, z each by a pointer (156) to memory (159), vector length (157), and stride (158). Via pointers (156), the CU can access the components of the vectors x, y, z for execution.

    [0141] FIG. 11 exemplifies a SIMD vector addition x=y+z on a core (83) which has 31 PPCs; vectors x, y, z have 100 components each and unit stride. Data accesses 1-3 of the addition fully utilize all 2×31 read ports (77, 78) and 2×31 write ports (75, 79). Thus, the core executes the first addition steps with full m=31 Flop/cycle throughput, regardless of “alignment”.

    [0142] FIG. 12 illustrates that under one below-explained mathematical condition all of the features of a unit-stride vector addition are preserved in the case of a strided vector addition. The only thing that changes are the patterns of the m-fold interleaved memory accesses but all ports are still fully utilized; the frist three steps of the shown strided vector addition are still executed with full m=Flop/cycle throughput, regardless of “alignment”:

    [0143] Said mathematical condition refers to an aliasing phenomenon caused by a periodicity conflict between memory access stride s and interleaving stride m which does not occur if both strides s and m are relatively prime as demonstrated in FIG. 13 for m=32 PPCs and varying vector strides s.

    [0144] If on the other hand m is prime, e. g. m=31, the performance is constant except for cases in which one of the vector strides involved is an integer multiple of m (FIG. 14). In such a case, all components of the vector reside in one of the memories (74) only, and accordingly, performance drops to the level of a scalar m=1 core (87). In all other cases, it is mathematically guaranteed that any m subsequent accesses to the vector components of x, y, z are evenly distributed over the memory set (74) such that the data transfer bandwidths of all four port sets (77, 78, 75, 79) are fully exploited.

    [0145] Therefore, to avoid aliasing as best as possible, m should be chosen as a prime number. An efficient address decoding mechanism (cf. FIG. 10) for prime m is described by Diamond and a selection of IP cores with such prime numbered PPC counts m is presented in FIGS. 8 and 9.

    [0146] As an alternative to using explicit machine operation descriptors (150)—which is intended for the implementation of special procedures at the machine level—pointers to data descriptors (151) may also be directly passed to scalar machine instructions. In such a case, said instructions are executed as SIMD operations (e. g. vector dot product, vector scalar product, matrix vector multiplication, tensor contraction, and so forth).

    [0147] One way to signal to the CU whether a value provided to an instruction represents a scalar value or a pointer to a data descriptor (151) is using typed data e. g. as disclosed in PAT2 Said method additionally allows polymorphic data processing as well as pointer safe, type safe, and cyber secure memory addressing.

    [0148] With SIMD instructions, Fortran array, BLAS, and SQL primitives can be defined leading to a lean but yet powerful ISA which allows programming at a level of abstraction which hides (and internally manages) hardware details like the PPC count m of the targeted IP core. In other words, IP SIMD processing is “vector length agnostic” as vectors are treated as whole objects and need not to be explicitly decomposed at the SW level into chunks that fit into a (hardware-dependent) “vector unit”; this job is transparently performed in the background by the CU (71).

    [0149] To further ensure software portability between various IP cores (FIG. 9) with respect to SIMD processing it is not necessary to implement the full range of SIMD instructions in CU hardware on each single IP core; a CU may alternatively emulate such instructions by exception handler routines. Such a method may be recursively invoked giving it a broad applicability.

    [0150] FIG. 16 shows—in the form of a data flow graph (DFG)—an exemplary algorithm which for vectors a, b, c calculates roots x1, x2 of the quadratic equation a*x{circumflex over ( )}2+b*x+c=0; and FIG. 17 shows a timing diagram for said DFG which illustrates how its software-defined algorithm can be executed—when iterated in a systolic pipelined loop—at 11 Flop/cycle throughput on an IP core with sufficient PPC resources.

    [0151] This means that in the steady state each iteration of the pipelined loop simultaneously issues three input reads (a, b, c), four multiplications, one negation, two subtractions, one square root calculation, one addition, two divisions, and two output writes (x1, x2). On a core (83), these operations are encoded in a VLIW which implements the entire DFG—including data paths—within one single parallel superinstruction (FIG. 33) (this is a simple example; as mentioned above, square roots and divisions are expanded into sequences of 1 Flop/cycle FMA—fused multiply-add—operations like e. g. in the Intel ltanium architecture which increases Flops and throughput —i. e. the ILP—of the pipelined loop implementation FIG. 33 of FIG. 16).

    [0152] Before starting the loop, said VLIW is fetched by the CU via instruction ports (44) and decoded. During loop execution, the CU iterates said VLIW without further instruction load or decode. Data read/writes are pipelined.

    [0153] As explained above, the parallel instructions of the DFG are interpreted as virtual operators (VO) and for an intuitive understanding of systolic loop pipelining, the timing diagram of FIG. 17 might be imagined as a scheme of parallel “assembly lanes” (lanes 0-3) comprising “production machines” (input, numerical, output operators) interconnected by “conveyor belts” (“buffer pipelines”) and mechanisms to pass objects between the parallel assembly lanes (arrows). Arrows indicate synchronized data transfer.

    [0154] Evidently, buffer pipelines synchronize all data transports between the (internally pipelined) VOs of the DFG on a systolic cycle-to-cycle basis. In an ASIC implementation of FIG. 17, said buffer pipelines would be built as FiFOs which make sure that inputs a, b, c of each read-in are correctly processed in-time without interfering with any inputs a′, b′, c′ of other read-ins or any of the intermediate results derived from such a′, b′, c′.

    [0155] Translating said FIFO synchronization concept to general purpose IP, buffer pipelines are, according to the invention, implemented as chained sets of “pipelined memory” addresses called “latches”. In FIG. 17, said latches are shown enumerated in the buffer pipelines.

    [0156] In pipelined memory, data moves from latch to latch each time a loop iteration starts; but instead of physically moving said data, this behavior is virtually created by a memory address mapping method as follows:

    [0157] According to FIG. 19, pipelined memory is allocated from interleaved memory (190) in blocks (194) of 2{circumflex over ( )}n logical words (n is a positive integer and on a 64 bit machine, one logical word comprises 8 B). If the pipelined memory block (194) is allocated at a logical offset off, any relative logical address ret pointing to a logical word in said memory block corresponds to an absolute logical address adr=rel+off

    [0158] While executing the loop, the CU (71) uses a hardware loop counter i which is set to 0 when entering and incremented when iterating the loop. Each latch of a pipelined memory block (194) is dynamically mapped to a relative logical memory address ret per ret=(latch+i)∧(2{circumflex over ( )}n−1); in other words, the relative logical address ret of the latch comprises the lower n bits of the loop-dependent sum latch #i.

    [0159] FIG. 20 illustrates the dynamic relation for latch 44—in FIG. 17 located in lane 1 after the multiplication—for loop iterations i starting at i=1000. According to a dynamic mapping with n=7, at i=1000 the multiplication result is written to rel=20 as noted under “mult writes” where it is shown that during the following loop iterations, subsequent writes to latch 44 are directed to subsequent logical addresses rel=21, 22, 23, 24, . . . .

    [0160] In other words, latch 44 sequentially writes to memory without overwriting older data as required for the functioning of a buffer pipeline. It does so at least for 2{circumflex over ( )}n=2{circumflex over ( )}7=128 iterations which is sufficient to not interfere with any of the other latches 0 . . . 80 required by the timing diagram of FIG. 17.

    [0161] Next, following the result computed at i=1000 in “mult output pipeline”, with each loop iteration, said result appears to move from latch to latch where it can be accessed at latches=44, 43, 42, 41, 40, . . . This mimics the desired FIFO data transport behavior of the buffer pipeline.

    [0162] In iteration 1004—i. e. four cycles after the write operation at i=1000−the value stored in rel=20 can be read in from latch 40 by the subtraction operator (“sub reads”) for further processing. Looking back in time, in the first iteration at i=1000 data that had been calculated by the multiplication at i=996 and—at that time—stored under relative logical address rel=16 is read from the subtraction's input latch 40. This is how both multiplication and subtraction can work simultaneously without interfering via their data dependence: both operators are decoupled because they simultaneously process different loop stages (L e. iterations of the original loop).

    [0163] From an LLVM compiler's point of view, each edge of the DFG of FIG. 16 is a fully SSA-compatible data transport implemented as a buffer pipeline. The compiler has to ensure that buffer pipelines do not overlap and uses the freedom of choice when allocating buffer pipelines within one pipelined memory block to resolve resource conflicts regarding ALUs and read/write ports of the given PPC set.

    [0164] Buffer pipelines can also implement “loop-carried dependences”. In a pipelined loop, the effects of the latter are best understood in “wavefront notation” by which each travelling wavefront is numbered by an index k assigned by the loop iteration index i at which wavefront k enters the pipeline in stage 0. Wavefront and pipelined loop indices are interrelated by k=i−s (s=stage). Each wavefront k represents one iteration k of the software-defined “algorithmic loop”. As “loop-carried dependence” means that algorithmic loop iteration k accesses data computed by some earlier iteration k−Δk, at some stage s, wavefront k accesses wavefront k−Δk. Such an access is implemented by reading from a latch that is positioned Δk stages to the right of said wavefront k.

    [0165] Also, buffer pipelines are used when expanding complex operations like division, square root, sine, cosine, polynomial, digital filter, and so forth.

    [0166] Finally, buffer pipelines can compensate runtime variations in the latency of operations. As specified above, ALU operations are required to have a guaranteed minimum latency and an upper latency limit above which the operation—when unfinished—stalls the CU until it has delivered its result. When planning the buffer write operation of an operator A, the compiler has to plan its latch allocation for the minimum latency of A; when planning a later buffer read by some operator B which consumes the result of A the compiler instead has to assume the maximum latency of A. The latch-to-address translation is performed by the CU and accordingly communicated to A. Therefore, A will always write its result to the same logical memory address, regardless of A's latency; integrity of B's read access is guarded by the above rules.

    [0167] The above-mentioned runtime variations in the latency of ALU operations—called “jitter”—are needed to resolve memory access conflicts caused by latency differences between the ALU operation types. Taking FIG. 18 as an example, a multiplication issued in one cycle n would write out its result in cycle n+5. If unattended, in the very same cycle n+5 an addition issued in cycle n+2 would also need write access to memory. This resource conflict can be resolved by delaying the write-out of the later operation (addition) to the next cycle or even later as exemplified in FIG. 32 which shows how the ALU schedules local bus accesses on a first-come-first-serve basis. For stall-free operation, each subunit of the ALU (adder, multiplier etc.) has to be buffered by a hardware output FIFO; each such FIFO must at least provide enough capacity to—where necessary—extend the latency of the subunit to the maximum latency among the subunits (in FIG. 18: sqrt latency of 21 cycles).

    [0168] The same is true for non-ALU operations like e. g. memory read and write as exemplified in FIG. 33 in which the systolic timing diagram of FIG. 17 is compiled into assembler code. It was assumed here that pointers to the input and output variables a, b, c, x1, x2 of the loop are located on the addressable GPP stack (PAT1) in top-of-stack-relative positions 0 s . . . 4s which in logical address space are fixed. As in the same address space latches “virtually move” with each iteration (cf. FIG. 20, “mult writes”), access conflicts on memory ports (77-79) and transaction ports (75) are inevitable; this applies not only to pointers a, b, c, x1, x2 but also to their dereferenced values.

    [0169] Apart from the above-mentioned hardware buffering—in this case: in the CU instead of in the ALUs—and neglecting the costly option of increasing port numbers the following solutions of mitigating memory access conflicts are viable which have to be supported by ISA, compiler, and CU hardware: eliminating redundant memory read accesses—e. g. in FIG. 33: [0170] multiple read accesses to 45p, 48p, 38p, Op needed for opcodes #5, #6, #10, #11, #12, #13 where the CU can internally distribute read-in values to construct said opcodes; [0171] eliminating memory read accesses by contracting opcodes, e. g. opcode #0 with #3 to mutt[0s]++, 217, 31p and so on; [0172] using pre-loadable special purpose I/O registers for pointers, e. g. mutt [0r]++, 2h, 31p with 0r denoting such an I/O register; using instruction read ports (76) for data read—as explained above, during loop execution the CU iterates pre-loaded VLIWs without further instruction load/decode, so these ports are “free” (this method is also useful when evaluating polynomials etc.); [0173] increasing port capacity by using more VLIWs per loop iteration; [0174] allowing pipeline stalls—as indicated above and explained below the TMT scheduler is capable of cycle-to-cycle thread switches, therefore, some other thread can utilize the PPC set as long as such a stall lasts.

    [0175] Notably in this context, the timing diagram FIG. 17 oversimplifies matters regarding M1 memory access and CU latencies as e. g. in lane 2, latch 34, data between both multiplications have to be transferred from an ALU (41) via local port (47) to M1 memory (51) and from there via one of the ports (45, 46) through the CU (71) and instruction port (43) to another ALU (41). But as long as all operations involved are fully pipelined, this overall delay can be balanced off by accordingly [0176] planning with overall operator latencies which include memory read and write access latencies; [0177] partly overlapping operations as the second operation must be issued before the result of the first is available—timing has to be adjusted in such a way that the memory handshake fits; [0178] decoupling the memory handshake by allocating a second latch for it—thereby, under no circumstances data will be overwritten before having been consumed.

    [0179] In effect, memory pipelining offers a trade-off between memory and throughput. In principle, the pipelined memory could have an arbitrarily large access latency; this releases all restrictions regarding the size of the interleaved M1 memory block—which is mainly intended for use as pipelined memory reservoir—except for data transport energy restrictions.

    [0180] But while this is true for throughput-oriented workloads it may be not for real-time applications in which case a second trade-off must be stricken regarding M1 latency versus throughput.

    [0181] To summarize: using latches instead of logical addresses, loop-invariant machine code can be compiled (cf. FIG. 33) that virtually executes all data transports needed to implement the systolic data flow depicted in FIG. 17 without having to physically move the data in dedicated hardware buffers. Functionally, this method resembles the “register renaming” scheme of “rotating register files” (Rau; Itanium) but here, it is applied to potentially many and large allocations of interleaved memory—i. e. standard SRAM—instead of a small number of dedicated registers (Itanium: 96) and hence can serve as a foundational basis for the below-described VHTC method.

    [0182] One a general note, the above-explained method of memory pipelining is not restricted to handling word-sized (on a 64 bit machine: 8 B) data items only but can also be applied to wider data types as e. g. quad floating point or 64 bit complex numbers. Then, a second pipelined memory block (194) must be allocated which “rotates” twice as fast as the standard block—i. e. transforms latches to relative logical addresses with a doubled step width: rel=(latch+2*i)∧(2{circumflex over ( )}n—1). Then, in assembler code, latch types need to be distinguished by suffixes, e. g. ‘21p’ denoting a single-step latch and ‘13p2’ a double-step latch.

    [0183] FIG. 21 shows that in general, pipelined loops need dynamic loop control.

    [0184] Table (a) shows a case in which the number of algorithmic iterations N is greater than the number of pipelined loop stages S. Under this condition, “prolog” (loop filling), “kernel” (steady state), and “epilog” (loop draining) phases can clearly be distinguished and a static compiler method (Rau) would assemble S prolog VLIW, one kernel VLIW plus control structure, and S epilog VLIW; well-known downsides of such static compilation are: [0185] The more stages S the loop has the bigger is its prolog/epilog overhead in relation to the kernel with respect to code size; [0186] Epilog latency is “dead time” as during epilog the loop cannot be restarted as shown in FIG. 21 (b) in which an “old” data packet is drained (epilog, right side) while a “new” data packet already fills the pipeline (prolog, left side); [0187] Scenario (c)—packet is shorter than the pipeline—and the multi-sized multi-packet scenario (d) cannot be implemented as both require parallel execution of multiple control flows.

    [0188] FIG. 22 discloses how all of the above-discussed scenarios (a-d) can be handled by embedded control tags (221) which augment each data word (222) thereby creating a control/data word (220) which is processed as a whole, atomic item of information. Accordingly, each component of the IP core (70) has to support such control/data words (220) (which hereinafter sometimes are also simply referred to as “data”).

    [0189] The control tag (221) consists of three control bits, namely: [0190] TERM (223), used to stream parallel control flow information for algorithmic loop termination in conditional pipelined loops; [0191] NULL (224) which is not only used to stream parallel control flow information within predicated conditional statements but also for the scenarios of FIG. 21; [0192] ESC (225) which is used to embed various types of information into the data stream. A draft of the encoding of such information is shown in FIG. 23. Generally, numerical operators ignore ESC data and pass it on. ESC data is prioritized over standard (ESC=0) data if a numerical operator receives conflicting types at its inputs. Conflicting ESC data at operator inputs constitute an error.

    [0193] With control/data, the scenarios of dynamic loop control of FIG. 21 are managed by introducing a “non-blocking” operation mode of the read (RD) operator in which it writes “valid data” (NULL=0) to the pipeline if a read was successful and “idle data” (NULL=1) if unsuccessful (which may be the case if the read operation stalls). Subsequent operators then either process the data in their predefined way (case: valid) or transmit idle data to the next operator (case: idle). Accordingly, the write (WR) operator acts on valid data only but skips idle data.

    [0194] As said non-blocking mode avoids pipeline stalls and drains the pipeline as fast as possible when reading from data sources with jittering latency, it can be used e. g. for real-time signal processing. In non-blocking mode, the pipeline is always “live”, clearing out wavefronts as fast as possible, but has the disadvantage of also processing idle wavefronts which costs data transport energy along paths (75, 77-79) and overall performance (as the pipeline never deschedules and hence blocks all other threads).

    [0195] In order to minimize energy costs and processing overhead, the pipeline automatically switches to the below-explained “blocking” mode when it is fully drained. In blocking mode, when stalling, a thread is descheduled by the temporal multithreading (TMT) scheduler and only rescheduled when new data arrives and the pipeline resumes its former non-blocking mode.

    [0196] Additional hardware support to reduce data transport energy of idle data may take into account that in the idle case only the NULL bit needs to be stored and communicated via the internal busses.

    [0197] In pure data processing workloads, instead of latency, optimal resource utilization is of interest. For such workloads the “blocking” operation mode is suitable in which the pipeline is descheduled when no input is available; the pipeline processes valid wavefronts only. Hereafter, blocking mode is assumed as the default IP pipeline operation mode.

    [0198] A pipeline can be drained by an “end of transmission” (EOT) wavefront, i. e. a wavefront which comprises only EOT data (FIG. 23). After detecting (and passing on) the EOT wavefront through its set of RD operators, the pipeline switches from blocking to the above-described non-blocking mode in which it is drained (L e. filled with idle wavefronts) until either a “start of transmission” (SOT) wavefront is received—in which case the IP pipeline returns to blocking mode as shown in FIG. 21 (b)—or the EOT wavefront has reached the set of WR operators which passes the EOT wavefront on. All of the valid wavefronts have then been written out; the pipelined loop accordingly terminates; execution control is passed on to its surrounding algorithm (L e. to the GPP stack machine which executes the pipelined loop which upon EOT terminates so that the next following sequential stack machine instruction can be executed).

    [0199] Synchronization of all RD and WR operators (which may be necessary when channel or physical I/O is involved) is enabled by the fact that VLIW execution can be stalled by each machine instruction—here: RD or WR-it comprises; said synchronization also works in the above-mentioned case in which one pipelined loop iteration consists of a sequence of VLIW.

    [0200] Notably, all I/O operators like RD and WR are executed by the CU (71) which as a central unit coordinates the above-mentioned synchronization. The CU also holds pipeline state information, e. g. “blocking vs. non-blocking mode” (discussed above), “SOT may interrupt draining vs. full drain always” etc.

    [0201] Accordingly, packet processing can be implemented with “start/end of package” (SOP/EOP) wavefronts including the option to either drain the pipeline after each packet or to keep it filled in blocking mode both modes having perks and drawbacks with respect to latency-bound real-time vs. throughput-bound standard data processing, as discussed above.

    [0202] TERM is used in pipelined conditional loops as follows: A comparison operator determines whether the condition to exit the algorithmic loop is met and accordingly sets the TERM bit on data paths which lead to “conditional write” (CWR) and/or “conditional read” (CRD) operators.

    [0203] Over CWR operators, the conditional algorithmic loop can deliver results. CWRs write out TERM=1 data only. In the result the TERM bit is cleared.

    [0204] A CRD operator is a “two-way selector switch” which feeds a wavefront back into the pipeline for further iteration if TERM=0 or replaces it by a read-in wavefront to start the next algorithmic loop iteration (TERM=1).

    [0205] FIG. 24 demonstrates how, using CRD and CWR operators, a pipelined conditional loop can gaplessly process wavefronts but will tend to reorder their sequence. This problem is solved by adding a “wavefront lane” (WL) to the DFG (cf. FIG. 17) on which upon read-in each wavefront is tagged by a serial WAVE number (FIG. 23). Said WAVE number is evaluated by the CWR operator when restoring the original wavefront sequence in its output buffer. Output buffer overflow protection requests—to temporarily block CRD input and instead insert idle wavefronts—are communicated over WL via NULL bit. The above-discussed TERM bit is also signaled via WL, and optionally, intermediate loop results can be written out using additional WLs. Simply put, WLs are used as the physical carrier of the pipelined loop's parallel control flow.

    [0206] Parallel execution of many control flows as exemplified in FIG. 24 cannot be implemented on von Neumann architectures. The pipelined loop thus constitutes the above-announced new generic type of an elementary data processing structure called “IP pipeline”.

    [0207] In anticipation of the below-explained concept of “superpipelining” it is noted that embedding an IP pipeline's internal control flow into a dedicated WL control/data stream allows to “split” any IP pipeline into a sequence of concatenated IP pipelines which can be executed by parallel threads that then may—if deemed appropriate—be distributed over a net of distinct cores or even processors. This option of “control flow distribution” and “pipelining pipelines” is a unique feature of the invention and expected to be useful when pipelining large complex workloads (workload balancing as discussed below).

    [0208] In an IP pipeline, state-of-the-art predication methods can be applied to transform algorithmic conditional if-else statements into elementary blocks of “predicated instructions” which are executed if a certain condition holds and otherwise skipped. Such predicated instructions are executed/skipped without branching and thus can be used within the IP pipeline concept.

    [0209] Now within the context of the invention, control/data carried predication—i. e. conditional operator execution—is available per NULL bit (FIG. 22). Accordingly, algorithmic branch instructions “if” and “else” are translated into “fork” and “join” operators which constitute control/data flow elements of the DFG and are executed by the CU (FIG. 25): [0210] The fork operator (250) conditionally overwrites data in a buffer pipeline (253)—which in FIG. 25 appears as both input and output of the fork—and feeds a second buffer pipeline (254) with data. To do so, it reads a data word W from (253). If W is idle it writes idle words to both (253) and (254). If on the other hand W is valid, then—depending on the content of the control word (252)—fork conditionally either leaves (253) unaltered but writes an idle word to (254) or, alternatively, writes an idle word to (253) and copies the valid word W to (254). (253, 254) can be viewed as “if” and “else” branch pipelines of the fork operator. [0211] The join operator (251) unites “if” and “else” branch pipelines (253, 254) to one result pipeline (255) by discarding all idle data and promoting valid data only—whichever branch provides it. [0212] The decision which of the “branches” (253, 254) is selected to process the data is made by the fork operator (250) according to “control data” (252) depending on fork's comparison type (“equal to zero”, “less than zero”, and so on) as specified by the algorithm of the IP pipeline. Alternatively, a fork can also decide based on ESC control data (252); this additional option allows to use control flow (CALL, LINK, RET, TRAP), protocol (SOT, EOT, SOP, EOT), or administrative information like system status messages etc. for the conditional handling of all kinds of data flow states (FIG. 23). [0213] According to the generating algorithmic if-else statement, the pair of “if” and “else” branches (253, 254) may be substituted by a pair of “conditional” DFGs (CDFG) which are not connected to each other but exchange control/data with the surrounding global DFG via a multichannel version of the fork operator (250) which has only one control input (252) but n inputs/“if branches” (253) and n “else branches” (254), a set of m joins (251) delivering m outputs (255)—m not necessarily being equal to n—and an optional set of k additional unconditional inputs that are available to both CDFGs. [0214] Finally note that in the context of the above-discussed structure “speculative execution” can be implemented by making the fork operator unconditional (L e. removing it) and instead either using a “conditional join” or the above join plus intermediate operators which at some point conditionally “nullify” the “if” and “else” data streams (conditional NULL operator).

    [0215] Now as disclosed below, each of said CDFGs can be compiled into an independent new IP pipeline which will be executed by a parallel thread. The channel communication between sender and each of the two receiver IP pipelines is guarded by “multichannel fork” operators that behave like a multichannel version of the write operator WR by sending only valid data and skipping idle data. In such a parallel setup, the join operator has to be replaced by a “multichannel join” operator which accepts only valid input data from both parallel IP pipelines and ignores idle data.

    [0216] Because of latency differences of the CDFG pipelines, communication, and multithreading/multiprocessing jitter, such a parallel setup will tend to reorder the original wavefront sequence and thus, “reserializing” by the “multichannel join” will be necessary. Overflow protection is automatically provided by the CSP (“communicating sequential processes”, Roscoe) data flow execution model under which, according to the invention, parallel threads are run. Multichannel fork and join operators are implemented by IP SIMD machine instructions that accept channel lists as arguments (see above and cf. Occam).

    [0217] Compared to predicated execution, guarded IP pipelines save energy. It nevertheless may sometimes be desirable to mix predicated with guarded execution e. g. in complex switch or other nested conditional statements.

    [0218] An alternative to a guarded IP pipeline is the conditional sequential call to a subroutine. A use case for such a sequential call might be a handler routine which is invoked by the fork operator when receiving an ESC word on (252). Such a sequential handler call temporarily suspends parallel IP pipelined operation by invoking the GPP stack machine mode but offers flexibility for sporadic system responses which cost too much energy and performance when implemented in predicated code or parallel pipelines.

    [0219] In principle, an IP stack machine could sequentially run any number of IP pipelines but according to the invention only one IP pipeline per IP stack machine is used. This simplifies the memory allocation scheme of FIG. 19, as each block of pipelined memory (194) is linked to one block of stack memory (192). The logical memory address offset of said block (194) and its size thus are intrinsic properties of the stack machine which are stored in its special purpose register (SPR) set. Therefore, to access a latch of said pipelined memory block a machine instruction only needs to specify its (relative) latch address—which e. g. in FIG. 33 comprises only 7 bit.

    [0220] Secondly, according to PAT1 stack memory is accessed via TOS-relative (TOS=top of stack) addresses, which are typically “small”. Thus, an ISA can be defined in which stack and pipelined memory accesses can be freely intermixed without giving rise to exceedingly wide instruction sizes; address types can be distinguished by postfixed letters ‘s’ and ‘p’ in the machine code (FIG. 33) and by type bits in the respective address fields.

    [0221] Thirdly, using the “safe pointer” method disclosed in PAT2, access to allocations in raw interleaved memory blocks (190) can be indirected via “safe pointer” objects which are located on the stack (and hence accessed via “small” stack addresses). The latter method can be used to mix read and write accesses to memory allocations in (190) with access to pipelined memory in VLIWs as needed for the execution of FIG. 17.

    [0222] According to PAT1, thread switches require switching special purpose register (SPR) sets which hold the thread context consisting of instruction pointer, thread and other IDs, memory allocation and other information. The CU maintains a SPR set queue for the presently active and some number of presently waiting threads. Said SPR set queue is managed by the TMT scheduler which is partly implemented in hardware to ensure software-configured but at runtime mostly software-independent execution of the CSP data flow model (Roscoe, Occam). An arbitrary number of SPR sets—one per thread—are kept in a backup SPR block in raw interleaved memory (190); over the CU's parallel ports SPR sets can be saved to and restored from said SPR backup block.

    [0223] Along with the SPR set queue, the TMT scheduler also prepares the waiting threads by allocating/preloading their pipelined memory blocks (194) from their backup positions into fast access/low latency M1 SPM, accordingly adjusting the relevant pointers in their SPR sets. Similarly, M1 stack segments are allocated and preloaded.

    [0224] Now when allocating such temporary M1 memory blocks during runtime, said blocks can be aligned in such a way that only local DMA-operated internal memory ports (FIG. 5) are needed for M1 pre- and offloading. In that case, the latter TMT activities do not compete with software M1 stack and pipelined memory accesses—which use a different port set (FIG. 5)—and hence can be performed as a parallel background hardware activity (provided that enough M1 capacity is available and the thread switching frequencies are not “too high”).

    [0225] Accordingly, the TMT scheduler can always maintain a certain number of threads waiting in the “ready to run in M1 mode” state in which their M1 blocks are preloaded. M1 blocks can be configured to performance needs. In principle, threads could also be completely—or temporarily—executed out of M2, M2 or M4 (DRAM) without being subject to M1 pre/offload.

    [0226] At runtime, thread switches only require altering a thread pointer (TP) that selects which element of the SPR set queue—i. e. which thread—will be executed in the next cycle. Accordingly, in the intended standard use case of “low thread switching frequencies”, TMT operates with zero-cycle thread switches which allows the above-mentioned fine-grained latency hiding by thread switching in cases of a CU stall where, as long as the stall lasts, the CU (71) issues VLIWs of other threads to the PPC set (72).

    [0227] In this context, as usual in CSP implementations (Roscoe, Occam), CU stalls caused by a stalling inter-thread channel cause the TMT scheduler to deschedule the stalling thread. Then, SPR set and M1 memory blocks occupied by said thread can—but need not to—be offloaded, i. e. written back to their original backup allocations, waiting there until the thread is reloaded and made ready for its next execution round.

    [0228] Thereby, the TMT scheduler manages M1 as a queue for the pipelined memory blocks and topmost stack segments of the active and the waiting threads which it selects according to the CSP data flow execution scheme by availability of communicated data packets and priority. This activity is supported by an additional address decoding mechanism which presents the M1 SPM to the software as (virtually) infinite memory which is mapped to the finite M1 memory volume by a modulo operation thus allowing for temporary M1 memory allocation blocks which “wrap” at the M1 boundaries making optimal use of the available M1 memory space.

    [0229] As all stack and pipelined memory accesses are managed by virtualization layers (191, 193), the above-described physical M1 memory allocation and loading/offloading activity of the TMT scheduler is invisible to the operating system, middleware, and all software layers which access said memory types per relative addresses, namely stack addresses and latches which—at the runtime of each thread—always point to the same physical data, regardless of actual physical data location. This mechanism was hinted above when claiming a “memory virtualization technique which replaces the energy and chip area intensive data lookup circuitry found in today's caches”. The only explicit activity required to be executed by the software, in this context, is to load often-used heap data (190) onto the stack (194)—where it remains accessible via powerful pointer operators (PAT2) but at minimum latency (PAT1)- and offload it after use. In principle, said activity could be eliminated by re-introducing cache-type memory elements which, if desired, requires consideration regarding area/energy overhead versus application-dependent performance improvement.

    [0230] On top of the above-described TMT, “simultaneous multithreading” (SMT) can, as shown in FIG. 30, be performed by a set (311) of control units (71) which parallelly process p threads but share read (76-78) and write ports (75) via a “control dispatch unit” (CDU) (312). According to thread priority, said CDU grants the p CUs (71) read access and combines instructions of the p VLTWs issued by the p CUs (71) into one single VLTW. It does so in a prioritized way on a per-PPC basis by “filling in gaps” as exemplified in FIG. 31. Notably: [0231] The illustrated SMT method guarantees that the highest priority non-empty VLTW will always fully execute. [0232] Flop/VLTW densities of (latency bound) GPP threads are limited by available ILP and expected to be low (in practice, superscalar von Neumann processors seldom achieve an ILP above 2). GPP ILP can be increased by increasing block lengths via predicated processing of conditional constructs (this is a known technique). [0233] There may be cycles in which a GPP thread issues no VLTW (“empty VL TW” in FIG. 30 exemplified in the priority 2 VLTW). This can be understood by example of FIG. 17: In GPP mode, for the duration of the sqrt calculation no instruction can be issued. Also, conditional branching will produce empty VLTWs (note that the architecture proposed by the invention does not need to rely on branch prediction: conditional branches are “waited out”). [0234] For VHTC threads, SMT does not conflict with memory pipelining as each VHTC thread maintains an independent pipelined memory block and loop counter i which is incremented upon completion of a loop iteration—regardless how many cycles said iteration takes.

    [0235] According to the invention, software is understood as a parallel net of sequential threads—i. e. processes that sequentially execute VLIWs—which obey the “communicating sequential processes” (CSP) paradigm (Roscoe). Like in transputer technology (Occam), CSP is a foundational building principle of the ISA. This means that channel communication and other OS primitives related to parallel processing are implemented as machine instructions which according to the invention comply with contemporary message passing interface (MPI) technology.

    [0236] Now, most of the relevant computer workloads execute repetitive work which is typically organized in the form of loops. Sometimes, such loops can comprise complex control structures and hierarchical levels of nested subroutine calls which at the lower levels might reach into the middleware and operating system realm. It is shown by the next step of the invention how any loop can be deconstructed in a way that it can be executed by a distributed network of IP pipelines which is called an “IP superpipeline”.

    [0237] At first again reconsidering the example of FIG. 17, its square root (“sqrt”) operator could in principle be implemented by a subroutine which performs the square root calculation by iteration; according to the invention, such an iteration can be executed by a conditional IP pipeline.

    [0238] FIG. 26 shows a channel layout for the operation of said “sqrt pipeline”. Avoiding deadlocks, the IP pipeline shown in FIG. 17 is split in a “caller” (260, stages 0-15) and a “consignee” of the sqrt call (262, stages 37-56). Channels (263-265) are buffered. For systolic operation, the buffer pipeline of lane 3 is shortened to latches 24-31 and the buffer pipeline of lane 0 to latches 69-80. Both lanes are sent from the caller (260) to the consignee (262) via channel (265). Also, the caller sends its lane 1 data via channel (262) to the callee (261) which on its part sends the computed sqrt results to the consignee (262).

    [0239] The caller-callee channel (263) is fixed but as the callee is a subroutine—which in principle could be called by other callers as well—its output channel must be dynamically vectorized as it might have to deliver the results to consignees which are not known at compile time. To do so, the SM which operates the caller's IP pipeline opens the call via a CALL sent to the callee and transmits dynamic channel (264) configuration regarding the targeted consignee per LINK (FIG. 23). After the callee has accordingly established the required dynamic channel configuration with the consignee it sends an ACK to the caller which ensuingly starts its IP pipeline.

    [0240] The callee's pipeline, while responding to the callers connection request, triggers the exception handlers required to dynamically configure its output channels and then processes the caller's data delivering its sqrt results to the consignee. During call execution, the callee encodes the caller’ ID and call number in its WL (cf. FIG. 23). The callee's CWR operator accordingly vectorizes its output to the configured proper consignee channel (264).

    [0241] Finally, the caller closes the call per RET which is answered by ACK.

    [0242] The above-discussed FIG. 26 shows a simple IP superpipelining example which outlines a method how to recursively deconstruct a non-elementary algorithmic loop which has a hierarchical control structure that consists of nested subroutine calls but has no conditional constructs (for or while loop, if-else or switch statements etc.). The resulting IP superpipeline is a net of IP pipelines which communicate via CSP channels; each IP pipeline is operated by a GPP stack machine which is run in a parallel thread.

    [0243] The above-hinted recursion method can also be applied to inner loops of a non-elementary loop. Again regarding FIG. 17, instead of the above-discussed subroutine call, the sqrt operator could alternatively have been algorithmically specified by a while-loop comprised in the DFG of FIG. 16. In this case factoring out and externalizing said while-loop as a “private” subroutine, the decomposition would in principle again lead to the plot of FIG. 26, the only distinction consisting in the fact that for said “private” subroutine no dynamic channel configuration (CALL, LINK, RET, ACK) is needed as now the callee-to-consignee channel is known at compile time and hence can be configured as a fixed CSP channel.

    [0244] FIG. 27 illustrates how the conditional if-else structure shown in FIG. 25 translates when compiling the above-mentioned CDFGs as IP pipelines (271, 271) which are used as “private” subroutines. This scheme can be generalized to cover nested conditional structures and switch statements; one possible method that can be applied in the latter cases is, as already mentioned, splitting the statement in a (predicated) prelude that calculates an execution index and a follow-up consisting of a vectorized call to a set of indexed conditional branch pipelines (271, 272, . . . ).

    [0245] With this final step, the recursive deconstruction method is complete and can be applied to any outer loop which—organized as IP superpipeline—can be distributed over a network of IP cores, levelling out and optimizing core and processor loads to achieve the best possible VHTC performance. On the software layer, the “placement” of the individual pipelines on cores and processors is invisible; it is configured at the start of the software but can be altered at runtime which is useful for dynamic workload balancing.

    [0246] At the topmost level, a task will always start in sequential von Neumann GPP stack machine mode as one initial thread. The above recursive loop deconstruction method is optional and may, according to the invention, be applied for at most one outer loop which is comprised in said initial thread; this limitation to “one loop only” is not necessary but a design choice which simplifies memory management (at most one pipelined memory allocation per thread). If several outer loops are selected for deconstruction in said initial thread then the initial thread needs to be decomposed into several communicating threads (cf. the caller-consignee setup of FIG. 26, 27) of which each comprises at most one loop selected for deconstruction.

    [0247] Notably, in each step, the above-explained recursive deconstruction method creates new threads, for each of which an independent decision can be made whether to process its input control/data in sequential GPP or in parallel VHTC pipeline mode (referring back, the same is true for the example stated in FIG. 16). This freedom of choice allows the compiler to tune the code of each such thread according to workload's characteristics in terms of a trade-off between throughput, latency, energy consumption and memory requirements:

    [0248] Applications of the invention thus are expected to in practice consist of a balanced mixture of GPP, SIMD, and VHTC which also has implications regarding IP core hardware design: [0249] IP cores designed for throughput-oriented “number crunching” applications in which latency is less important are expected to have large M1 to accommodate for large (and many) pipelined memory blocks which sustain VHTC. Due to the large M1 latency, such cores will underperform when operated under real-time conditions in GPP mode, especially if compared to traditional register machine cores (as registers have very low latencies).

    [0250] On the other hand, IP cores designed for real-time processing workloads will have very small M1 SPM, possibly reaching down to the size of typical register files. In such cores, the TMT will allocate its temporary pipelined memory blocks in M2 SPM but not—as discussed above—in M1. As pipelined VHTC processing is almost insensitive to buffer latency, this measure will only slightly degrade VHTC while yielding good GPP real-time performance.

    [0251] The latter IP core type is expected to be found in industrial applications with artificial intelligence aspects, the former rather in the data center.