Thread optimized multiprocessor architecture
09934196 · 2018-04-03
Inventors
CPC classification
G06F9/38
PHYSICS
G06F9/3885
G06F15/80
G06F12/0848
G06F9/30145
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06F9/30032
G06F13/28
International classification
Abstract
In one aspect, the invention comprises a thread optimized multiprocessor prepared by a semiconductor manufacturing process, comprising the steps of: (a) interconnecting less than 4 layers of metal on at least one die; (b) embedding at least one processor in said at least one die; and (c) mounting said at least one die on a dual inline memory module.
Claims
1. A thread optimized multiprocessor prepared by a semiconductor manufacturing process, comprising the steps of: (a) interconnecting less than 16 layers of metal on at least one die; (b) embedding at least one processor in said at least one die; and (c) mounting said at least one die on a dual inline memory module.
2. A thread optimized multiprocessor prepared by a semiconductor manufacturing process, comprising the steps of: (a) interconnecting less than 8 layers of metal on at least one die; (b) embedding at least one processor in said at least one die; and (c) mounting said at least one die on a dual inline memory module.
3. A thread optimized multiprocessor prepared by a semiconductor manufacturing process, comprising the steps of: (a) interconnecting less than 4 layers of metal on at least one die; (b) embedding at least one processor in said at least one die; and (c) mounting said at least one die on a dual inline memory module.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS
(36) The TOMI architecture of at least one embodiment of the present invention preferably uses the minimum logic possible to operate as a general purpose computer. The most common operations are given priority. Most operations are visible, regular, and available for compiler optimization.
(37) In one embodiment, the TOMI architecture is a variation on accumulator, register, and stack architectures, as illustrated in
(38) 1. Like an accumulator architecture, the accumulator is always one of the operands, except for the increment instruction.
(39) 2. Like a register architecture, the destination is always one of the operand registers.
(40) 3. The accumulator and program counter are also in the register space and may therefore be operated on.
(41) 4. Three special registers auto-increment and auto-decrement and are useful for creating stacks and streams of input and output.
(42) 5. All instructions are 8-bits in length, simplifying and speeding instruction decode.
(43) 6. There is no BRANCH or JUMP instruction.
(44) 7. There are just seven basic instructions enabling 3 bits of operator selection from an 8-bit instruction, as illustrated by
(45) Some of the benefits of the preferred embodiment include:
(46) 1. All operations run at the maximum speed allowed by the logic, rather than being held to the uniform stage timing that a pipeline imposes. Logic operations are fastest.
(47) Math operations are next fastest. Operations requiring memory access are slowest.
(48) 2. The architecture scales to any data width, limited only by package pins, adder carry times, and usefulness.
(49) 3. The architecture is near the minimum possible functionality necessary to perform all the operations of a general purpose computer.
(50) 4. The architecture is very transparent, very regular, and most operations are available to the optimizing compiler.
(51) The architecture is designed to be simple enough to be replicated numerous times on a single monolithic chip. One embodiment embeds multiple copies of the CPU monolithically with memory. A simplified 32-bit CPU may be implemented in fewer than 1,500 gates, with most of the gates defining the registers. Nearly 1,000 TOMI CPUs of a preferred embodiment can be implemented using the same number of transistors as a single Intel Pentium 4.
(52) The reduced instruction set in the TOMI CPU is factored to execute the operations necessary for a general purpose computer. The smaller a processor's instruction set, the less decode logic it requires and the more efficiently it runs. The TOMI CPU is designed with an extraordinarily low number of instructions compared to modern processor architectures. For example, one embodiment of the TOMI CPU has 25 instructions, compared to the Intel Pentium processor with 286 instructions, the Intel Itanium Montecito processor with 195 instructions, the StrongARM processor with 127 instructions, and the IBM Cell processor with over 400 instructions.
(53) The basic set of instructions for the TOMI CPU are simplified and designed to execute in a single system clock cycle, as opposed to 30 clock cycles required for the latest generation Pentium processor. The TOMI CPU architecture is a pipelineless architecture. This architecture and single clock cycle instruction execution significantly reduce or eliminate stalls, dependencies and wasted clock cycles found in other parallel processing or pipelined architectures. While the basic instructions only require a single clock cycle to execute, as clock speeds increase (and clock cycle times decrease), the time required for the execution result to propagate through the circuit transistor gates for complex mathematical instructions (such as ADD) may reach the limit of a single clock cycle. In such instances, it may be optimal to allow two clock cycles for the execution of the particular instruction, so as not to slow down the execution of the faster instructions. This will depend upon the optimization of the CPU design to the system clock speed, manufacturing process, and circuit layout.
(54) The TOMI simplified instruction set allows a 32-bit TOMI CPU to be built with fewer than 5,000 transistors (not including the caches). A top-level schematic of an embodiment of a single 32-bit TOMI CPU is shown in
(55) Due to the compact size of the TOMI CPU, multiple CPUs can be built on the same silicon chip. This also allows multiple CPUs to be built on the same chip as the main memory, such as a DRAM, at little additional manufacturing cost, beyond the manufacturing cost of the DRAM chip itself. Thus, multiple TOMI CPUs can be placed on a single chip for parallel processing with only a minimal increase in die size for the DRAM chip and manufacturing cost. For example, a 512 MB DRAM contains approximately 700 million transistors. 64 TOMI CPUs (assuming 200,000 transistors for a single TOMI CPU) would only add 12.8 million transistors to any DRAM design. For a 512 MB DRAM, 64 TOMI CPUs would increase the die size by less than 5%.
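The transistor arithmetic above can be checked directly; note that 200,000 transistors per TOMI CPU is the text's stated assumption, and the 5% figure in the text refers to die size rather than transistor count:

```python
# Back-of-envelope check of the figures quoted above.
dram_transistors = 700_000_000    # ~512 MB DRAM (approximate, per the text)
transistors_per_cpu = 200_000     # assumed per-CPU count from the text
cpus = 64

added = cpus * transistors_per_cpu
print(added)                      # 12,800,000 added transistors
print(added / dram_transistors)   # fraction of the DRAM transistor budget (~1.8%)
```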
(56) The TOMI CPUs are designed to be manufactured using existing inexpensive commodity memory manufacturing processes, such as for DRAM, SRAM, and FLASH memory devices. The low transistor count for a TOMI CPU means that the CPU can be built in a small area and can easily be interconnected in silicon using an inexpensive semiconductor manufacturing process with 2 layers of metal interconnect, rather than complex and expensive manufacturing processes used to manufacture large microprocessor chips (such as the Intel Pentium) utilizing 8 or more layers of metal interconnect or other logic processes. Modern DRAMs and other commodity memory chips utilize simpler and lower-cost semiconductor manufacturing processes with fewer layers (e.g., 2) of metal interconnect to achieve lower manufacturing costs, higher product volume, and higher product yields. The semiconductor manufacturing process for commodity memory devices is generally characterized by low current leakage device operation; while processes used to build modern microprocessors strive for high speed and high performance characteristics, rather than transistor-level low current leakage values. The ability of the TOMI CPU to be efficiently implemented with the same manufacturing process used for DRAMs and other memory devices allows the TOMI CPU to be embedded within an existing DRAM chip (or other memory chip) and to take advantage of low-cost, high yield memory chip manufacturing processes. This also provides the advantage that TOMI CPUs can be manufactured using the same packaging and device pin layout (conforming, for example, to JEDEC standards for memory devices), fabrication facilities, test fixtures, and test vectors currently in industry use for DRAMs and other memory chips. A preferred embodiment describes embedding CPUs into a legacy DRAM. This contrasts favorably with the alternative of embedding a DRAM into a legacy microprocessor. 
Due to its high transistor count compared to a preferred embodiment of the present invention, a legacy microprocessor would likely generate much greater levels of electrical noise and heat which can adversely affect the performance and reliability of the embedded DRAM. Furthermore, a DRAM embedded on a legacy microprocessor would likely require a process with 8 or more layers of metal interconnect compared to a preferred embodiment's 2 layers. In comparison, the resulting DRAM embedded into a legacy microprocessor would likely be higher cost, lower yield, higher power consumption, and, ultimately, lower performance.
(57) Another advantage of the preferred embodiment is that the TOMI CPUs are small enough (and require little power) that they can physically reside next to the DRAM (or other memory) circuitry and allow the CPUs access to the ultra-wide internal DRAM data bus. In modern DRAMs, this bus is 1024, 4096, or 8192 bits wide (or an integer multiple thereof), which also typically corresponds to the width of one row of data in the data banks within the DRAM design. (By comparison, the Intel Pentium data bus is 64 bits and the Intel Itanium bus is 128 bits wide.) The internal caches of the TOMI CPU can be sized to match the DRAM row size so that the CPU cache can be filled (or flushed) in a single DRAM memory read or write cycle. The TOMI CPU uses the ultra-wide internal DRAM data bus as the data bus for the TOMI CPU. The TOMI CPU caches may be designed to mirror the design of the DRAM row and/or column latch circuitry for efficient layout and circuit operation, including data transfers to the TOMI CPU caches.
(58) Another advantage of the preferred embodiment is the low level of electrical noise generated by the TOMI CPUs due to the low transistor count, and because the CPU utilizes the ultra-wide internal DRAM data bus to access memory, rather than constantly driving I/O circuitry to access off-chip memory for data. The on-chip CPU caches allow for immediate access to data for processing, thus minimizing the need for off-chip memory accesses.
(59) A design objective of processor architectures is to maximize processing capacity and speed, while minimizing the power required to achieve that processing speed. The TOMI CPU architecture is a high-speed processor, with extremely low power consumption. The power consumption of a processor is directly related to the number of transistors used in the design. The low transistor count for the TOMI CPU minimizes its power consumption. The simplified and efficient instruction set also allows the TOMI CPU to reduce its power consumption. In addition, the TOMI CPU caches and access to on-chip memory using the wide internal DRAM data bus eliminate the need to constantly drive the I/O circuitry for off-chip memory access. A single TOMI CPU operating at 1 GHz clock speed will consume approximately 20 to 25 milliwatts of power. In contrast, the Intel Pentium 4 processor requires 130 watts at 2.93 GHz, Intel Itanium processor requires 52 watts at 1.6 GHz, StrongARM processor requires 1 watt at 200 MHz, and the IBM Cell processor requires 100 watts at 3.2 GHz. It is well known that heat generation in a processor is directly related to the amount of power required by a processor. The extremely low power TOMI CPU architecture eliminates the need for fans, large heat sinks, and exotic cooling mechanisms found in current microprocessor architectures. At the same time, the low-power TOMI CPU architecture makes new low-power battery and solar powered applications feasible.
(60) Instruction Set
(61) The seven basic instructions in an exemplary instruction set are shown in
(62) Addressing Modes
(64) The addressing modes are:
(65) Immediate
(66) Register
(67) Register Indirect
(68) Register Indirect Auto-increment
(69) Register Indirect Auto-decrement
(70) Special Cases
(71) Register 0 and Register 1 both refer to the Program Counter (PC). In one embodiment, all operations with Register 0 (the PC) as the operand are conditional on the accumulator carry bit (C) equal to 1. If C=1, the old value of PC is swapped into the accumulator (ACC). All operations with Register 1 (the PC) as the operand are unconditional.
(72) In an alternative embodiment, write operations with Register 0 (PC) as the destination are conditional on the carry bit (C) equal to 0. If C=1, no operation is performed. If C=0, the value in the accumulator (ACC) is written to the PC, and program control changes to the new PC address. Write operations with Register 1 (PC) as the destination are unconditional. The value in the accumulator (ACC) is written to the PC, and program control changes to the new PC address.
(73) Read operations, with Register 0 as the source, load the value of the PC+3. In this way, the address of the top of a loop may be read and stored for later use. In most cases, the loop address will be pushed on the stack (S). Read operations with Register 1 as the source load the value pointed to by the next full word addressed by the PC. In this way, a 32-bit immediate operand may be loaded. The 32-bit immediate operand must be word aligned, but the LOADACC instruction may be at any byte position in the 4-byte word immediately preceding the 32-bit immediate operand. Following execution of the read, the PC will be incremented such that it addresses the first word aligned instruction following the 32-bit immediate operand.
(74) There is No Branch
(75) Branch and Jump operations are usually a problem for CPU designers because they require many bits of precious op-code space. The branching function may be created by loading the desired branch address into the ACC using LOADACC, xx and then using the STOREACC, PC instruction to effect the branch. A branch may be made conditional on the state of C by storing to Register 0.
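The branch-via-store idiom can be sketched behaviorally, using the alternative-embodiment semantics of paragraph (72): a write to Register 1 (PC) always branches, while a write to Register 0 (PC) branches only when C=0. The dictionary-based machine state and example addresses are illustrative assumptions:

```python
# Behavioral sketch of STOREACC, PC (not circuit-accurate).
def store_to_pc(state, register):
    if register == 1:                   # write to Register 1: unconditional
        state["pc"] = state["acc"]
    elif register == 0 and state["c"] == 0:
        state["pc"] = state["acc"]      # write to Register 0: taken when C=0
    # register == 0 with C=1: no operation is performed

state = {"pc": 0x10, "acc": 0x40, "c": 1}
store_to_pc(state, 0)                   # conditional branch not taken (C=1)
assert state["pc"] == 0x10
store_to_pc(state, 1)                   # unconditional: PC becomes 0x40
```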
(76) Skip
(77) A skip may be created by executing INC, PC. Execution will require two cycles, one for the current PC increment cycle to complete and one for the INC. A skip may be made conditional on the state of C by incrementing Register 0.
(78) A Relative Branch
(79) A relative branch may be created by loading the desired offset into the ACC and then executing the ADD, PC instruction. A relative branch may be made conditional on the state of C by adding to Register 0.
(80) Forward Branches
(81) Forward branches are more useful than rearward branches since the location of the rearward branches necessary for loops is easily captured by saving the PC as the program steps through the top of the loop the first time.
(82) A more efficient forward branch than the relative branch may be created by loading the least significant bits of the branch endpoint into the ACC and then storing to the PC. Since the PC can be accessed both conditionally or unconditionally depending upon the use of Register 0 or Register 1, the forward branch may also be conditional or unconditional by the selection of the PC register (Register 0 or Register 1) as the destination operand.
(83) For example:
(84) LOADI, #1C
(85) STOREACC, PC
(86) If the most significant bits of the ACC are zero, only the least significant 6 bits are transferred to the PC register. If the least significant 6 bits of the current PC are smaller than the ACC value to be loaded, the most significant bits of the PC remain unchanged, since the branch target lies within the same 64-byte block. If the least significant 6 bits of the current PC are greater than the ACC value to be loaded, the branch target lies in the next block, so the PC is incremented starting with the 7th bit.
(87) This effectively allows branches of up to 31 instructions forward. This method of forward branching should be used whenever possible because not only does it require 2 instructions versus 3 instructions for the relative branch, but it does not require a pass through the adder, which is one of the slowest operations.
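The 6-bit substitution described in paragraph (86) can be sketched as follows; this is a behavioral model rather than a circuit-accurate one, and the example addresses are hypothetical:

```python
# Sketch of the forward-branch PC update: the low 6 bits of the PC are
# replaced with the ACC value; if the target's low bits are behind the
# current position, the upper bits carry (increment starting at bit 7).
def forward_branch(pc, acc_low6):
    low = pc & 0x3F
    high = pc & ~0x3F
    if acc_low6 < low:         # target wrapped past the 64-byte boundary
        high += 0x40           # increment starting with the 7th bit
    return high | acc_low6

print(hex(forward_branch(0x105, 0x1C)))  # same 64-byte block: 0x11c
print(hex(forward_branch(0x13C, 0x04)))  # next block: 0x144
```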
(88) Loops
(89) The top of a loop may be saved using LOADACC, PC. The resulting pointer to the top of the looping construct may then be stored in a register or pushed into one of the autoindexing registers. At the bottom of the loop, the pointer may be retrieved with LOADACC, EA and restored to the PC using STOREACC, PC, thereby causing a backwards loop. The loop may be made conditional on the state of C by storing to Register 0 thereby causing a conditional backwards loop.
(90) Self Modifying Code
(91) It is possible to write self-modifying code using STOREACC, PC. An instruction may be created or fetched into the ACC and then stored into the PC where it will execute as the next instruction. This technique may be used to create a CASE construct.
(92) Assume a jump table array in memory consisting of N addresses and a base address of JUMPTABLE. For convenience, JUMPTABLE might be in low memory so its address can be created with LOADI, or a LOADI followed by one or more left shifts (ADD, ACC).
(93) Assume that the index into the jump table is in ACC and the base address of the jump table is in a general purpose register named JUMPTABLE:
ADD, JUMPTABLE \\ Add the index to the base address of the jump table
LOADACC, (JUMPTABLE) \\ Load the indexed address
STOREACC, PC \\ Execute the jump
(94) If low order memory starting at 0000 is allocated to system calls, each system call may be executed as follows, where SPECIAL_FUNCTION is the name of an immediate operand 0-63:
LOADI, SPECIAL_FUNCTION \\ Load the system call number
LOADACC, (ACC) \\ Load the address of the system call
STOREACC, PC \\ Jump to the function
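The jump-table CASE idiom above can be modeled compactly; the table contents and handler addresses below are illustrative assumptions:

```python
# Behavioral sketch of the CASE construct: index -> handler address, with the
# final STOREACC, PC modeled as returning the address the PC would receive.
jump_table = {0: 0x0100, 1: 0x0140, 2: 0x0180}   # hypothetical handlers

def dispatch(index):
    # ADD, JUMPTABLE ; LOADACC, (JUMPTABLE) ; STOREACC, PC
    return jump_table[index]   # the address stored into the PC

print(hex(dispatch(1)))        # jumps to handler 1 at 0x140
```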
(95) Right Shift
(96) The basic architecture does not envision a right shift operation. Should such an operation be required, the solution of a preferred embodiment is to designate one of the general purpose registers as the right shift register. A STOREACC, RIGHTSHIFT would store the ACC right shifted a single position into the right shift register where its value could be read with LOADACC, RIGHTSHIFT.
(97) Architectural Scalability
(98) The TOMI architecture preferably features 8-bit instructions, but the data width need not be restricted.
(99) The preferred TOMI architecture is implemented as a Von Neumann memory configuration for simplicity, but a Harvard architecture implementation (with separate data and instruction buses) is also possible.
(100) Common Math Operations
(101) Two's complement math can be done several ways. A general purpose register may be preconfigured as all 1s and named ALLONES. The operand will be assumed to be in a register named OPERAND:
LOADACC, ALLONES
XOR, OPERAND
INC, OPERAND \\ The 2's complement is left in OPERAND
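The XOR-with-all-ones-then-increment sequence is standard two's-complement negation; a quick check, with a 32-bit register width assumed:

```python
# Model of the LOADACC, ALLONES / XOR, OPERAND / INC, OPERAND sequence.
def twos_complement(x, width=32):
    allones = (1 << width) - 1
    return ((x ^ allones) + 1) & allones   # XOR then INC, wrapping at width

assert twos_complement(5) == (-5) & 0xFFFFFFFF
print(hex(twos_complement(1)))             # all ones: 0xffffffff
```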
(102) Common Compiler Constructions
(103) Most computer programs are generated by a compiler. Therefore, a useful computer architecture should be adept at common compiler constructions.
(104) A C compiler will normally maintain a stack for passing parameters to function calls. The S, X, or Y registers may be used as the stack pointer. The function call will push the parameters onto one of the autoindexing registers acting as the stack, using, for example: STOREACC, (X)+. Upon entering the function, the parameters will be popped into general purpose registers for use.
(105) Stack Relative Addressing
(106) There will be times when there are more elements passed in the function call than can conveniently fit in the general purpose registers. For the purposes of the following example it is assumed that a stack push operation decrements the stack. If S is being used as the stack register, to read the Nth item relative to the top of the stack:
(107) LOADI, N
(108) STOREACC, X
(109) LOADACC, S
(110) ADD, X
(111) LOADACC, (X)
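The five-instruction sequence above can be sketched behaviorally; the descending stack is as stated in the text, so the Nth item sits at address S + N, and the dict-based memory with word-unit addressing is an illustrative assumption:

```python
# Sketch of LOADI, N / STOREACC, X / LOADACC, S / ADD, X / LOADACC, (X).
def read_stack_relative(memory, s, n):
    x = n              # LOADI, N ; STOREACC, X
    x = s + x          # LOADACC, S ; ADD, X
    return memory[x]   # LOADACC, (X)

memory = {0x80: 7, 0x81: 42, 0x82: 99}   # hypothetical stack contents
s = 0x80                                 # S points at the top of the stack
print(read_stack_relative(memory, s, 2)) # the 2nd item below the top: 99
```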
(112) Indexing into an Array
(113) Upon entry into the array function, the base address of the array is located in a general purpose register named ARRAY. To read the Nth element in the array:
(114) LOADI, N
(115) STOREACC, X
(116) LOADACC, ARRAY
(117) ADD, X
(118) LOADACC, (X)
(119) Indexing into an N Word Element Array
(120) Sometimes an array will be allocated for elements N words wide. The base address of the array is located in a general purpose register named ARRAY. To access the first word of the Nth element in a 5 word wide array:
LOADI, N
STOREACC, X \\ Store in temporary register
ADD, ACC \\ Multiply by 2
ADD, ACC \\ By 2 again = 4
ADD, X \\ Plus 1 = 5
LOADACC, ARRAY
ADD, X \\ Plus the base address of the array
LOADACC, (X)
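The listing multiplies the index by 5 without a multiplier: the copy saved in X, two self-adds (x2, then x4), and one add of the copy (x5). A sketch of the address computation, with word-unit addressing assumed:

```python
# Behavioral model of the 5-word-element address calculation above.
def element_address(array_base, n):
    x = n              # LOADI, N ; STOREACC, X (save a copy)
    acc = n
    acc += acc         # ADD, ACC -> 2*N
    acc += acc         # ADD, ACC -> 4*N
    acc += x           # ADD, X   -> 5*N
    return array_base + acc   # LOADACC, ARRAY ; ADD, X

print(element_address(100, 3))   # first word of element 3: base + 15
```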
(121) Instruction Set Extensions
(122) Another embodiment of the invention includes extensions of the seven basic instructions shown in
(123) SAVELOOP: This instruction pushes the current value of the program counter onto the stack. Saveloop will most likely be executed at the top of a looping construct. At the bottom of the loop, the saved program counter value will be copied from the stack and stored to the program counter, effecting a backward jump to the top of the loop.
(124) SHIFTLOADBYTE: This instruction shifts the ACC left 8 bits, reads the 8-bit byte following the instruction, and places it into the least significant 8 bits of ACC. In this manner, long immediate operands may be loaded with a sequence of instructions. For example, to load a 14-bit immediate operand:
(125)
LOADI, #14 \\ Load the 6 most significant bits of the 14-bit operand
SHIFTLOADBYTE \\ Shift the 6 bits 8 positions to the left and load the following 8-bit value
CONSTANT #E8 \\ Eight-bit immediate operand
(126) The resulting hex value in ACC will be 14E8.
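The example can be checked arithmetically: LOADI supplies the top 6 bits (hex 14), and SHIFTLOADBYTE shifts left 8 and merges in the following byte (hex E8):

```python
# Check of the 14-bit immediate example above.
acc = 0x14                # LOADI, #14
acc = (acc << 8) | 0xE8   # SHIFTLOADBYTE, consuming CONSTANT #E8
print(hex(acc))           # the 14-bit operand: 0x14e8
```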
(127) LOOP: This instruction copies the top of the stack to the program counter. Loop will most likely be executed at the bottom of a looping construct following execution of Saveloop to store the program counter at the top of the loop. When Loop executes, the saved program counter will be copied from the stack and stored to the program counter, effecting a backward jump to the top of the loop.
(128) LOOP_IF: This instruction copies the top of the stack to the program counter. It performs a conditional loop, based on the value of C. Loop_if will most likely be executed at the bottom of a looping construct following execution of Saveloop to store the program counter at the top of the loop. When Loop_if executes, if C=0, the saved program counter will be copied from the stack and stored to the program counter, effecting a backward jump to the top of the loop. If C=1, the program counter will increment to point to the next sequential instruction.
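The Saveloop/Loop_if pair can be sketched behaviorally; the single-element-instruction PC increment and the example addresses are modeling assumptions:

```python
# Sketch of SAVELOOP at the loop top and LOOP_IF at the loop bottom.
stack = []

def saveloop(pc):
    stack.append(pc)              # push the loop-top PC once

def loop_if(pc, c):
    # branch back to the saved PC while C == 0, else fall through
    return stack[-1] if c == 0 else pc + 1

saveloop(0x200)
print(hex(loop_if(0x220, 0)))     # back to the loop top: 0x200
print(hex(loop_if(0x220, 1)))     # fall through: 0x221
```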
(129) NOTACC: Complement each bit of ACC. If ACC=0, set C to 1. Otherwise, set C to 0.
(130) ROTATELEFT8: Rotate the ACC 8 bits left. At each rotate step, the MSB shifted out of ACC is shifted into the LSB of ACC.
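ROTATELEFT8 can be modeled directly; the 32-bit ACC width is an assumption (the text notes the architecture scales to any data width):

```python
# Model of ROTATELEFT8: the top 8 bits wrap around into the bottom 8 bits.
def rotate_left8(acc, width=32):
    mask = (1 << width) - 1
    return ((acc << 8) | (acc >> (width - 8))) & mask

print(hex(rotate_left8(0x12345678)))   # high byte wraps to the low end
```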
(131) ORSTACK: Perform a logical OR on ACC and the value in the top of stack. Place the result in ACC. If ACC=0, set C to 1. Otherwise, set C to 0.
(132) ORSTACK+: Perform a logical OR on ACC and the value in the top of stack. Place the result in ACC. After the logical operation, increment the stack pointer, S. If ACC=0, set C to 1. Otherwise, set C to 0.
(133) RIGHTSHIFTACC: Shift ACC right by a single bit. The LSB of ACC is shifted into C.
(134) SETMSB: Set the most significant bit of ACC. There is no change to C. This instruction is used in performing signed comparison.
(135) Local TOMI Caching
(136) A cache is memory smaller in size and faster in access than the main memory. The reduced access time and the locality of program and data accesses allow cache operations to increase performance of a preferred TOMI processor for many operations. From another perspective, a cache increases parallel processing performance by increasing independence of a TOMI processor from main memory. The relative performance of cache to main memory and the number of cycles the TOMI processor can execute before requiring another main memory load or store to or from cache determine the amount of performance increase due to TOMI processor parallelism.
(137) TOMI local caches enhance the performance increase due to TOMI processor parallelism. Each TOMI processor preferably has three associated local caches, as illustrated in
(138) Instruction: associated with the PC
(139) Source: associated with the X register
(140) Destination: associated with the Y register
(141) Since the caches are associated with specific registers, rather than data or instruction fetches, the cache control logic is simplified and cache latency is significantly reduced. The optimal dimensions of these caches are application dependent. A typical embodiment may require 1024 bytes for each cache. In other words, 1024 instructions, and 256 32-bit words of source and destination. At least two factors determine the optimum size of the caches. First is the number of states the TOMI processor can cycle before another cache load or store operation is required. Second is the cost of the cache load or store operation from main memory relative to the number of TOMI processor execution cycles possible during the main memory operation.
(142) Embedding TOMI Processors in RAM
(143) In one embodiment, a wide bus connects the large embedded memory to the caches, so a load or store operation to the caches can occur quickly. With TOMI processors embedded in RAM, a load or store of an entire cache would consist of a single memory cycle to a RAM column. In one embodiment, the embedded memory will be responding to requests of 63 TOMI processors, so the response time of a cache load or store for one TOMI processor may be extended while the load or store of another TOMI processor cache completes.
(144) Caches may be stored and loaded based on changes of the associated memory addressing registers X, Y, PC, as illustrated in
(145) Cache Double Buffering
(146) A secondary cache may be loaded in anticipation of the cache load requirement. The two caches would be identical and alternately selected and deselected based on the contents of the upper 14 bits of the PC. In the example above, when the upper 14 bits of the PC change to match the tag of the data pre-cached in the secondary cache, the secondary cache becomes selected as the primary cache. The old primary cache then becomes the secondary cache. Since most computer programs progress linearly through memory, one embodiment of the invention would have the secondary cache always fetching the contents of main memory referenced by the upper 14 bits of the current PC plus 1.
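The selection rule can be sketched as tag matching; the 24-bit PC and the 1024-byte line size (consistent with the 1024-byte instruction cache described earlier) are modeling assumptions:

```python
# Sketch of double-buffer cache selection on the upper 14 bits of the PC.
def cache_tag(pc):
    return pc >> 10                    # upper 14 bits of an assumed 24-bit PC

def select_cache(pc, primary_tag, secondary_tag):
    # When the PC moves into the pre-fetched line, the secondary cache
    # becomes primary and the old primary becomes the new secondary.
    if cache_tag(pc) == secondary_tag:
        return secondary_tag, primary_tag
    return primary_tag, secondary_tag

primary, secondary = 1, 2              # hypothetical line tags (PC+1 rule)
primary, secondary = select_cache(2 << 10, primary, secondary)
print(primary, secondary)              # roles have swapped
```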
(147) The addition of secondary caches will reduce the time the TOMI processor must wait for memory data to be fetched from main memory when moving outside of the boundary of the current cache. The addition of secondary caches nearly doubles the complexity of a TOMI processor. For an optimal system, this doubling of complexity should be offset by a corresponding doubling of performance of the TOMI processor. Otherwise, two simpler TOMI processors without secondary cache can be implemented with the same transistor count.
(148) High Speed Multiply, Floating Point Operations, Additional Functionality
(149) Integer multiplication and all floating point operations require many cycles to perform, even with special purpose hardware. Thus, these operations could be factored into other processors rather than included in the basic TOMI processor. However, a simple 16-bit × 16-bit multiplier can be added to the TOMI CPU (using fewer than 1,000 transistors) to provide additional functionality and versatility to the TOMI CPU architecture.
(150) Digital Signal Processing (DSP) operations often use deeply pipelined multipliers that produce a result every cycle even though the total multiplication may require many cycles. For signal processing applications that repeat the same algorithm over and over, such a multiplier architecture is optimal and may be incorporated as a peripheral processor to a TOMI processor, but it would likely increase complexity and reduce overall performance if it were incorporated directly in the TOMI processor.
(151) Accessing Adjacent Memory Banks
(152) The physical layout design of the memory circuit in memory chips is often designed so that the memory transistors are laid out in large banks of memory cells. The banks are usually organized as equal sized rectangular areas and placed in two or more columns on the chip. The layout of memory cells in large banks of cells may be used to speed up the memory read and/or write accesses.
(153) In one embodiment of the invention, one or more TOMI processors may be placed between two columns of memory cell banks in a memory chip. The row data lines of two memory banks may be interleaved such that using the logic shown in
(154) TOMI Interrupt Strategy
(155) An interrupt is an event external to the normal sequential operation of a processor that causes the processor to immediately change its sequence of operation. Examples of interrupts might be completion of an operation by an external device or an error condition by some hardware. Traditional processors go to great lengths to quickly stop normal sequential operation, save the state of current operation, begin performing some special operation to handle whatever event caused the interrupt, and when the special operation is completed restore the previous state and continue sequential operation. The primary metric of interrupt handling quality is the time to respond.
(156) Interrupts pose several problems for traditional processors. They make execution time indeterminate. They waste processor cycles storing and then restoring status. They complicate processor design and can introduce delays that slow every processor operation.
(157) Immediate interrupt response is unnecessary for most processors, with the exceptions being error handling and those processors directly interfacing to real world activity.
(158) In one embodiment of a multiprocessor TOMI system, only one processor possesses primary interrupt capability. All other processors run uninterrupted until they complete some assigned work and stop themselves or until they are stopped by the coordinating processor.
(159) Input/Output (I/O)
(160) In one embodiment of the TOMI processor environment, a single processor is responsible for all interfacing to the external world.
(161) Direct Memory Access (DMA) Control
(162) In one embodiment, immediate response to the external world in a TOMI processor system occurs via a DMA controller. A DMA controller, when requested by an external device, transfers data from the external device to the internal data bus for writing to the system RAM. The same controller also transfers data from the system RAM to the external device when requested. A DMA request would have the highest priority for internal bus access.
(163) Organizing an Array of TOMI Processors
(164) The TOMI processor of preferred embodiments of the invention is designed to be replicated in large numbers and combined with additional processing functionality, a very wide internal bus, and system memory on a monolithic chip. An exemplary memory map for such a system is illustrated in
(165) The memory map for each processor dedicates the first 32 locations (1F hex) to the local registers for that processor (see
(166) In one embodiment, 64 TOMI processors are implemented monolithically with memory. A single master processor is responsible for managing the other 63. When one of the slave processors is idle, it is not clocking so it consumes little or no power and generates little or no heat. On initialization, only the master processor is operational. The master begins fetching and executing instructions until a time that a thread should be started. Each thread has been precompiled and loaded into memory. To start a thread, the master allocates the thread to one of the TOMI CPUs.
(167) Processor Availability
(168) Coordination of the availability of TOMI processors to do work preferably is handled by the Processor Availability Table shown in
(169) 1. Push the calling parameters for a slave processor onto its stack, including but not limited to the execution address of the thread, the source memory, and the destination memory.
(170) 2. Start a slave processor.
(171) 3. Respond to a slave processor thread completion event either by polling or by responding to an interrupt.
(172) Requesting a Processor
(173) The coordinating processor may request a processor from the availability table.
(174) The number of the lowest processor with an available_flag set to 0 is returned. The coordinating processor may then set the available_flag associated with the available processor to 1, thereby starting the slave processor. If no processor is available, the request will return an error. Alternatively, processors may be allocated by the coordinating processor based upon a priority level associated with the requested work to be performed. Techniques to allocate resources based upon priority schemes are well-known in the art.
(175) Step-by-Step Starting a Slave Processor
(176) 1. Coordinating processor pushes the parameters for the thread to run onto its own stack. Parameters may include: starting address of the thread, source memory for the thread, destination memory for the thread, and last parameter_count.
(177) 2. Coordinating processor requests an available processor.
(178) 3. Processor allocation logic returns either the number of the numerically lowest slave processor that has both its associated available_flag set and its associated done_flag cleared, or an error.
(179) 4. If an error was returned, the coordinating processor may either retry the request until a slave processor becomes available or perform some special operation to handle the error.

(180) 5. If an available processor number was returned, the coordinating processor clears the available_flag for the indicated processor. This operation pushes the parameter_count number of stack parameters to the stack of the selected slave processor. The done_flag is cleared to zero.
(181) 6. The slave processor retrieves the top stack item and transfers it to the slave processor's program counter.
(182) 7. The slave processor then fetches the memory column indicated by the program counter into the instruction cache.
(183) 8. The slave processor begins executing instructions from the beginning of the instruction cache. The first instructions will likely retrieve the calling parameters from the stack.
(184) 9. The slave processor executes the thread from the instruction cache. When the thread completes, it checks the state of its associated done_flag. If the done_flag is set, it waits until the done_flag is cleared, indicating the coordinating processor has handled any previous results.
(185) 10. If the interrupt vector associated with the slave processor is set to -1, no interrupt will be created by setting the done_flag. The coordinating processor may therefore poll for the done_flag to be set.
(186) When the coordinating processor detects that the done_flag is set, it may handle the slave processor's results and possibly reassign the slave processor to do new work. When the slave processor's results have been processed, the associated coordinating processor will clear the associated done_flag.
(187) If the interrupt vector associated with the slave processor is not equal to -1, setting the associated done_flag will cause the coordinating processor to be interrupted and begin executing an interrupt handler at the interrupt vector address.
(188) If the associated available_flag has been set also, the coordinating processor may also read the return parameters pushed on the slave processor's stack.
(189) The interrupt handler will handle the slave processor's results and possibly reassign the slave processor to do new work. When the slave processor's results have been processed, the interrupt handler running on the coordinating processor will clear the associated done_flag.
(190) 11. If the done_flag is clear, the slave processor sets its associated done_flag and saves the new start_time. The slave processor may continue to do work or may return to the available state. To return to the available state, the slave processor may push return parameters onto its stack, followed by a stack count and set its available_flag.
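The allocation portion of the sequence above can be sketched in software. This is an illustrative model, not the hardware: class and method names are hypothetical, and the flag polarity follows steps 3 and 5 above (available_flag set and done_flag clear means the slave may be claimed; claiming it clears the available_flag).

```python
# Sketch of the processor-request logic in steps 2-5 above. Flag
# conventions follow paragraphs (178) and (180). Names are hypothetical.

class AvailabilityTable:
    def __init__(self, num_processors):
        self.available_flag = [1] * num_processors  # 1 = free to claim
        self.done_flag = [0] * num_processors       # 1 = results waiting

    def request_processor(self):
        """Return the lowest-numbered claimable slave, or -1 on error."""
        for n in range(len(self.available_flag)):
            if self.available_flag[n] == 1 and self.done_flag[n] == 0:
                self.available_flag[n] = 0  # claim the slave (step 5)
                return n
        return -1  # no slave available (step 4 handles this error)
```

A coordinating processor would call request_processor() before pushing parameters to the selected slave's stack.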
(191) Managing TOMI Processors Using Memory Mode Register
(192) One technique for implementing and managing multiple TOMI processors is to mount the TOMI processors on a dual inline memory module (DIMM) as shown in
(193) One or more bits can be allocated in a memory mode register to enable or disable the TOMI CPUs. For example, when the TOMI CPUs are disabled by the mode register, the memory containing the TOMI CPUs would function as a normal DRAM, SRAM or FLASH memory. When the mode register enables TOMI CPU initialization, the sequence will be performed as described in
(194) When the general purpose CPU has completed any reads or writes of the DRAM, SRAM, or FLASH memory, the external memory controller will write the mode register bit from HALT to RUN, and the TOMI CPUs will continue executing from where they left off.
(195) Adjusting Processor Clock Speed
(196) Processor clock speed determines processor power dissipation. The TOMI architecture enables low power dissipation by allowing all but one processor to be stopped. Furthermore, each processor other than the Master Processor may have its clock speed adjusted to optimize either performance or power consumption using the logic shown in
(197) Another Embodiment of TOMI Processor Management
(198) Some computer software algorithms are recursive. In other words, the primary function of the algorithm calls itself. The class of algorithms known as divide and conquer is often implemented using recursive techniques. Divide and conquer is applicable to searches and sorts of data, and certain mathematical functions. It may be possible to parallelize such algorithms with multiple processors such as those available with the TOMI architecture. In order to implement such algorithms, one TOMI CPU must be able to pass work to another TOMI CPU and receive the results from that CPU. Another embodiment of the TOMI processor allows any processor to be a Master Processor and treat any other available TOMI processor as a Slave Processor. Starting and stopping TOMI processors, communicating between processors, and managing independent and dependent threads is supported in this embodiment of processor management.
(199) Stopping a TOMI CPU
(200) A TOMI CPU may be stopped by writing all 1's to its PC. When a TOMI CPU is stopped, its clock is not running and it consumes no power. Any TOMI CPU may write all 1's to its own PC.
(201) Starting a TOMI CPU
(202) A TOMI CPU may start executing when a value other than all 1's is written to its PC. The master processor has a value of 0 written to its PC when it is reset by the mode register as shown in
(203) Independent Processor Threads
(204) When multiple threads execute on a single general purpose processor, those threads may be very loosely coupled, only communicating infrequently. Rather than run, return results, and stop, some threads may run forever, delivering results continuously and in perpetuity. An example of such threads would be a network communications thread or a thread reading a mouse device. The mouse thread runs continuously and delivers mouse position and click information either to a shared memory area where it can be polled or to a callback routine where it is immediately presented.
(205) Such independent threads are primarily used to simplify programming rather than to accelerate performance. Similar threads can be executed on a multiprocessor system such as TOMI. The results may be delivered to a shared memory area. In some cases, communications may be accomplished through a shared variable.
(206) In the TOMI architecture, a shared variable may be more efficient than shared memory since the variable can avoid the necessity for a memory RAS cycle to load a complete row to the X_cache or Y_cache. An example of use of a variable would be an input pointer to a receive buffer for a TOMI CPU monitoring network traffic. The network monitoring CPU would increment the variable as data is received. The data consuming CPUs would read the variable from time to time and when enough data was present, perform an operation to load a memory row into either the X_cache or Y_cache.
(207) The received network data could then be read from the cache up to the value indicated by the shared variable.
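The shared-variable handshake in the two paragraphs above can be sketched as follows. This is a software illustration only: the class and function names and the 64-byte cache-row width are assumptions, and the rule that only the owning CPU writes its Shared Variable follows paragraph (219) below.

```python
# Sketch of the network-buffer example above: the monitoring CPU publishes
# an input pointer through its shared variable; consumers poll it and load
# whole cached rows once enough data is present. Names and the row size
# are illustrative assumptions.

ROW_SIZE = 64  # assumed cache-row width in bytes

class NetworkMonitor:
    def __init__(self):
        self.shared_variable = 0  # input pointer: bytes received so far

    def on_receive(self, nbytes):
        # Only the owning CPU ever writes its own shared variable
        self.shared_variable += nbytes

def rows_ready(monitor, consumed_so_far):
    """Return how many complete rows a consumer may now load into a cache."""
    available = monitor.shared_variable - consumed_so_far
    return available // ROW_SIZE
```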
(208) Dependent Processor Threads
(209) Some algorithms (such as those classified as divide and conquer) can achieve parallelism by simultaneously executing pieces of the algorithm on several processors and combining the results. In such a design, a single Master Processor will request work from several Slave Processors and then wait while the slaves perform the work in parallel. In this way, the Master Processor is dependent on work being completed by the slave processors.
(210) When a Slave Processor's work is completed, the Master Processor will read the partial results and combine them into the final result. This capability enables the TOMI architecture to efficiently process a class of algorithms known as divide and conquer. Some of the more common simple divide and conquer algorithms are searches and sorts.
(211) Multiprocessor Management Instruction Set Extensions
(212) A series of extensions to the basic TOMI instruction set enable both independent and dependent thread management. These instructions would be implemented using some of the available NOOP codes shown in
(213) GETNEXTPROCESSOR: This instruction queries the processor availability table and loads the ACC with a number associated with the next available processor.
(214) SELECTPROCESSOR: This instruction writes the ACC to the Processor Select Register. The Processor Select Register selects which processor will be evaluated by TESTDONE and READSHAREDVARIABLE.
(215) STARTPROCESSOR: This instruction writes to the PC of the processor selected by the Processor Select Register. This instruction is most likely to be executed when a master processor wants to start a slave processor that is stopped. A slave processor is stopped if its PC is all 1's. By writing a value to the PC of the slave processor, the master processor will cause the slave processor to begin running a program at the location of the written PC. If the operation was successful, the ACC will contain the PC value being written to the selected processor. If the operation was not successful, the ACC will contain -1. The most likely reason for a failed operation is that the selected processor was unavailable.
(216) TESTDONE: This instruction tests the PC of the processor selected by the Processor Select Register and sets the C bit of the calling processor to 1 if the PC=all 1's. In this manner a loop testing the PC may be created as follows:
    LOADI, processorNumber
    SELECTPROCESSOR
    LOADACC, LOOPADDRESS
TOP TESTDONE
    STOREACC, PC_COND  //Loop to TOP until the selected processor is stopped with PC=all 1's.
(217) TESTAVAILABLE: This instruction tests the available bit in the Processor Availability Table for the processor selected by the Processor Select Register and sets the C bit of the calling processor to 1 if the selected processor is available. In this manner a loop to test the availability for further work may be created as follows:
    LOADI, processorNumber
    SELECTPROCESSOR
    LOADACC, LOOPADDRESS
TOP TESTAVAILABLE
    STOREACC, PC_COND  //Loop to TOP until the selected processor is available.
(218) SETAVAILABLE: This instruction sets the available bit in the Processor Availability Table for the processor selected by the Processor Select Register. This instruction will most likely be executed by one processor, which has requested that another processor do work as shown in
(219) READSHAREDVARIABLE: This instruction reads the Shared Variable of the processor selected by the Processor Select Register. This instruction will most likely be executed by one processor, which has requested that another processor do work. The Shared Variable may be read by any processor to determine the progress of the assigned work. As an example, the working processor might be assigned to process data being received from a high speed network. The Shared Variable would indicate the amount of data that had been read and was available to other processors. Each processor contains a Shared Variable. That Shared Variable may be read by any other processor, but a Shared Variable may only be written by its own processor.
(220) STORESHAREDVARIABLE: This instruction writes the value of the ACC to the Shared Variable of the processor executing the instruction. The Shared Variable may be read by any other processor and is used to communicate status and results to other processors.
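A software model of a few of these extensions may clarify their semantics. This is an illustrative sketch only, not the hardware: class and method names are hypothetical, and the constant ALL_ONES stands in for a program counter of all 1's (the stopped state).

```python
# Hypothetical software model of SELECTPROCESSOR, STARTPROCESSOR, and
# TESTDONE as described above. ALL_ONES models a PC of all 1's; a failed
# STARTPROCESSOR returns -1 in the ACC, per paragraph (215).

ALL_ONES = 0xFFFFFFFF  # stand-in for an all-1's program counter

class TomiModel:
    def __init__(self, num_cpus):
        self.pc = [ALL_ONES] * num_cpus  # every slave starts stopped
        self.psr = 0                     # Processor Select Register
        self.c_bit = 0

    def selectprocessor(self, n):
        self.psr = n                     # ACC -> PSR

    def startprocessor(self, addr):
        if self.pc[self.psr] == ALL_ONES:   # only a stopped CPU may start
            self.pc[self.psr] = addr
            return addr                     # success: ACC = written PC
        return -1                           # failure: selected CPU busy

    def testdone(self):
        # Set C if the selected processor has stopped (PC = all 1's)
        self.c_bit = 1 if self.pc[self.psr] == ALL_ONES else 0
        return self.c_bit
```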
(221) Interprocessor Communications Using Data Ready Latches
(222)
(223) The top half of
(224) The status of the shared register may be read into the C bit of either the selecting or selected processors. In operation, a processor writes data to its shared register thereby setting the associated data ready flag. The connected processor reads its shared register until the C bit indicates that the associated data ready flag has been set. The read operation clears the ready flag so the process can repeat.
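The write-sets, read-clears handshake just described can be sketched as a small state machine. The class and field names below are illustrative assumptions, not the patent's register layout.

```python
# Minimal model of the data-ready-latch handshake above: writing the
# shared register sets the associated data ready flag; a read returns the
# flag (as the C bit) and clears it so the exchange can repeat.

class SharedRegister:
    def __init__(self):
        self.value = 0
        self.ready = 0  # data ready flag

    def write(self, value):
        self.value = value
        self.ready = 1  # write sets the associated ready flag

    def read(self):
        """Return (value, c_bit); the read clears the ready flag."""
        c_bit, self.ready = self.ready, 0
        return self.value, c_bit
```

The reader polls read() until the returned C bit is 1, mirroring the loop a connected processor would run in hardware.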
(225) Arbitrating Processor Allocation
(226) As described above, a TOMI processor can delegate work to another TOMI processor whose availability has been determined by GETNEXTPROCESSOR.
(227) GETNEXTPROCESSOR determines if a processor is available. An available processor is one that is not currently performing work and is not holding results of previous work which have not yet been retrieved by READSHAREDVARIABLE.
(228)
(229) 1. The requesting processor executes a GETNEXTPROCESSOR instruction which pulls down the Request Next Available Processor line to the Arbitration Logic.
(230) 2. The Arbitration Logic produces a number corresponding to a TOMI CPU on the Next Available Processor lines.
(231) 3. The requesting processor executes a SELECTPROCESSOR instruction which stores the number in the requesting processor's PSR (Processor Select Register).
(232) 4. The requesting processor then executes a STARTPROCESSOR instruction which writes to the PC of the processor selected by the Processor Select Register. If the operation was successful, the number of the selected processor is also stored to the Arbitration Logic to indicate that the selected processor is no longer available to be assigned to do work. If the operation was unsuccessful, the reason is probably that the selected processor is not available. The requesting processor will execute another GETNEXTPROCESSOR to find another available processor.
(233) 5. When the selected processor executes STORESHAREDVARIABLE to make its results available, the Arbitration Logic is notified that the selected processor has results waiting to be read.
(234) 6. When the selected processor is stopped by writing -1 to its PC, the Arbitration Logic is notified that the selected processor is available.
(235) 7. When the selected processor's results are retrieved by READSHAREDVARIABLE, the Arbitration Logic is notified that the selected processor's results have been read.
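The bookkeeping the Arbitration Logic performs across steps 1 through 7 above can be sketched as follows. This is an assumed software model: the class and method names are hypothetical, and per paragraph (227) a processor is available only when it is neither running nor holding unread results.

```python
# Sketch of the Arbitration Logic in steps 1-7 above. Per paragraph (227),
# an available processor is one that is not performing work and is not
# holding results that have not yet been retrieved.

class ArbitrationLogic:
    def __init__(self, num_cpus):
        self.running = [False] * num_cpus
        self.results_waiting = [False] * num_cpus

    def get_next_processor(self):
        """GETNEXTPROCESSOR: lowest CPU neither running nor holding results."""
        for n in range(len(self.running)):
            if not self.running[n] and not self.results_waiting[n]:
                return n
        return -1

    def start(self, n):          # STARTPROCESSOR succeeded (step 4)
        self.running[n] = True

    def store_shared(self, n):   # slave executed STORESHAREDVARIABLE (step 5)
        self.results_waiting[n] = True

    def stop(self, n):           # -1 written to the slave's PC (step 6)
        self.running[n] = False

    def read_shared(self, n):    # READSHAREDVARIABLE retrieved results (step 7)
        self.results_waiting[n] = False
```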
(236) Memory Locking
(237) TOMI processors read and write system memory through their caches. A completely cached column is read or written at one time. Any processor may read any portion of system memory. An individual processor may lock a column of memory for its exclusive writing. This locking mechanism avoids memory writing conflicts between processors.
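The exclusive-write rule above can be modeled with a simple per-column lock table. This is an illustrative sketch under assumed names; the patent does not specify the lock's data structure.

```python
# Toy model of the column-locking rule above: any CPU may read any column,
# but a column must be locked for exclusive writing by one CPU at a time.

class ColumnLocks:
    def __init__(self):
        self.owner = {}  # column number -> CPU holding the write lock

    def lock(self, column, cpu):
        """Try to take the write lock on a column; return True on success."""
        if self.owner.get(column) in (None, cpu):
            self.owner[column] = cpu
            return True
        return False  # another CPU holds the column

    def unlock(self, column, cpu):
        if self.owner.get(column) == cpu:
            del self.owner[column]
```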
(238) Suggested Applications
(239) Parallelism effectively accelerates applications that may be factored into independent pieces of work for individual processors. One application that factors nicely is image manipulation for robotic vision. Image manipulation algorithms include correlation, equalization, edge identification, and other operations. Many are performed by matrix manipulation. Very often the algorithms factor nicely, as illustrated in
(240) In
(241)
(242) In operation, the coordinating processor has allocated a DMA channel to transfer the image pixels from an external source to the internal system RAM every fixed amount of time. A typical speed of image capture might be 60 images per second.
(243) The coordinating processor then enables slave processor 1 by pushing the address of the imagemap to be used by the X register, the address of the processed image to be used by the Y register, the parameter count of 2, and the address of the image processing algorithm. The coordinating processor subsequently and similarly enables processors 2 through 25. The processors continue to execute independently and in parallel until the image processing algorithm is completed.
(244) When the algorithm is completed, each processor sets its associated done_flag in the Processor Availability Table. The results are handled by the coordinating processor, which is polling for completion or responding to an interrupt on the done event.
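The factoring described above — one horizontal strip of the fixed imagemap per slave processor — can be sketched as a row-partitioning helper. The function name is hypothetical, and any specific image size or processor count below is an illustrative assumption rather than a figure from the patent.

```python
# Sketch of the image factoring above: split an image's rows into equal
# strips, one per slave processor, with the last strip absorbing any
# leftover rows. Names are illustrative.

def strip_bounds(num_rows, num_processors):
    """Return (first_row, last_row) per slave processor, inclusive."""
    rows_per_cpu = num_rows // num_processors
    return [(n * rows_per_cpu,
             (n + 1) * rows_per_cpu - 1 if n < num_processors - 1
             else num_rows - 1)
            for n in range(num_processors)]
```

The coordinating processor would push each strip's source and destination addresses to the corresponding slave before starting it.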
(245)
(246) TOMI Peripheral Controller Chip (TOMIPCC)
(247) One embodiment of the TOMI architecture embeds one or more TOMI CPUs in a standard DRAM die. The resulting die is packaged in the standard DRAM package with the standard DRAM pinout. The packaged part may be mounted on the standard DRAM DIMM (dual inline memory module).
(248) In operation, this embodiment behaves like a standard DRAM except when the embedded TOMI CPUs are enabled through the DRAM mode register. When the TOMI CPUs are enabled and operating, they execute programs loaded to the DRAM by an external processor. The results of TOMI CPU calculations are provided to the external processor through the shared DRAM.
(249) In some applications the external processor will be provided by a PC. In other applications, a specialized processor may be provided to perform the following functions for the TOMI DRAM chip(s).
(250) 1. Provide a mechanism to transfer data and instructions from an associated permanent storage device to the TOMI chip DRAM for use. In many systems, the permanent storage device may be a flash RAM.
(251) 2. Offer an input/output interface between the TOMI chip(s) and real world devices such as displays and networks.
(252) 3. Execute a small set of operating system functions necessary to coordinate TOMI CPU operations.
(253)
(254) The flash RAM will contain the dictionary of phonemes and syntax that define spoken language as well as the instructions necessary to interpret and translate the spoken language from one form to another.
(255) The TOMIPCC will receive the analog spoken language (or its equivalent), convert it to a digital representation, and present it to the TOMI DRAM for interpretation and translation. The resulting digitized speech will be passed back from the TOMI DRAM to the TOMIPCC, converted to an analog voice representation, and then output to the cellphone user.
(256)
(257) Such a system will likely be built with TOMI DIMMs (dual in-line memory modules) incorporating a plurality of TOMI CPU chips. The datapath to a standard 240-pin DIMM is 64 bits. Therefore, the TOMIPCC in a Memory-centric Database application will drive a 64-bit wide database, as indicated by D0-D63.
(258) It will be appreciated that the present invention has been described by way of example only and with reference to the accompanying drawings, and the invention is not limited by the specific embodiments described herein. As will be recognized by those skilled in the art, improvements and modifications may be made to the invention and the illustrative embodiments described herein without departing from the scope or spirit of the invention.