PROCESSOR FOR CONFIGURABLE PARALLEL COMPUTATIONS
20250272094 · 2025-08-28
Inventors
Cpc classification
G06F9/38873
PHYSICS
G06F9/30065
PHYSICS
International classification
Abstract
A programmable data processor includes numerous configurable pipeline circuits, each including numerous arithmetic and logic operator circuits that can be configured into an execution pipeline that can be controlled according to a state machine. Each configurable pipeline circuit also includes numerous building block circuits that can be configured into a sequencer for the state machine. The building block circuits may include (i) state elements for representing a state in the state machine, and (ii) loop elements for representing a loop in the state machine.
Claims
1. A processor receiving a clock signal, comprising: a memory circuit having a plurality of independently accessible first and second memory portions, each memory portion holding data words in locations that the data words are individually addressable using a designated address in a linear address space; first and second address generation circuits each receiving an enable signal, a base address and an increment, each address generation circuit further comprising a counter circuit that (i) holds a count value which is incremented by an offset at each cycle of the clock signal; and (ii) sums the incremented count value to the base address to form a generated address; a first cross-bar switch network configurable at each cycle of the clock signal to route the generated address of each addressable generation circuit to one of the memory portions; and a programmable controller the control circuit executes a sequence of instructions that (i) configures the cross-bar switch network, selecting at least one of the memory portions to receive the generated address of one of the address generation circuits, and (ii) asserts the enable signal of at least one of the address generation circuits.
2. The processor of claim 1, wherein each address generation circuit further receives a maximum count value and an initial count value, wherein when the count value reaches the maximum count value, the address generation circuit (i) asserts a completion signal; and (ii) sets the count value to the initial count value in an immediately following cycle of the clock signal.
3. The processor of claim 2, wherein the programmable controller causes the completion signal of the first address generation circuit to be provided as the enable signal of the second address generation circuit.
4. The processor of claim 1, wherein each address generation circuit further receives a skip signal, which disables incrementing the count value during each clock cycle in which the skip signal is asserted.
5. The processor of claim 1, wherein one of the memory portions comprises one or more sections of static random-access memory (SRAM).
6. The processor of claim 1, wherein one of the memory portions further comprises a register file.
7. The processor of claim 6, wherein the register file comprises one or more groups of registers, each holding one of the data words.
8. The processor of claim 7, each group of registers being associated with designated addresses that can be generated as generated addresses of one of the address generation circuits.
9. A signal processor, comprising: a first external data interface for communicating with an external host processor; a second external data interface for receiving digitized data of an analog signal; a first set of registers; a first-level programmable interconnection network; a plurality of first-level processors, each first-level processor being configured (i) to access any of the first set of registers, and (ii) to be interconnected to the interface or to one or more of other first-level processors over the first programmable interconnection network, wherein each first-level processor comprises a plurality of second-level processors, each second-level processor comprising: a memory circuit; a second set of registers; a plurality of computational elements; a second programmable interconnection network; and a control circuit accessible to both the first and second set of registers and being configured to execute a sequence of instructions in the memory circuit by which the control circuit (i) configures the computational elements into one or more pipelines; (ii) configures the second programmable interconnection network (a) to interconnect the second-level processor to one or more second-level processors over their respective second programmable interconnection networks; and (iii) to interconnect the pipelines to the second external interface, thereby enabling the pipelines to access the digitized data.
10. The signal processor of claim 9, wherein the control circuit, by executing the sequence of instruction, configures the second programmable interconnection network to interconnect the second-level processor to the first programmable interconnection network, thereby interconnecting the second-level processor to one of the first-level processors.
11. The signal processor of claim 9, wherein the external host processor provides to each second-level processor their respective sequence of instructions over the first external interface.
12. The signal processor of claim 11, wherein each second-level processor further comprises a plurality of building blocks configurable to form one or more state machines, and wherein the sequence of instruction causes (i) the control circuit to further configure out of the building block circuits for each pipeline a state machine to control the pipeline; and (ii) synchronization among the state machines.
13. The signal processor of claim 12, wherein the synchronization among the state machines is achieved using a barrier mechanism.
14. The signal processor of claim 12, wherein pipelines involving two or more second-level processors are formed such that data is communicated between second-level processors over the second programmable interconnection network.
15. The signal processor of claim 9, wherein each second-level processor further comprises an address generation circuit that generates algorithmically a sequence of addresses for accessing the memory circuit so as to access elements of a data structure stored in the memory circuit in a predetermined order.
16. The signal processor of claim 15, wherein the data structure comprises a multi-dimensional array of data.
17. The signal processor of claim 16, wherein the multi-dimensional array comprises a 2-dimensional matrix and wherein the elements are accessed in column-major order.
18. A configurable state machine for controlling a data processing operation, comprising: one or more sequencers each comprising a plurality of building block circuits that are programmable to connect with each other to define an operation for the sequencer, wherein each building block circuit comprises either (i) one or more state elements, each representing a state in the state machine, or (ii) one or more loop elements, each representing a loop in the state machine, wherein (i) each state element keeps track of a programmable duration for which the sequencer is to remain in the state represented by the state element, and (ii) each loop element keeps track of a number of iterations for which the sequencer is to traverse the loop represented by the loop element; and a plurality of configurable interconnection elements, wherein a selected group of the configurable interconnection elements are configured to interconnect the building block circuits.
19. The configurable state machine of claim 18, wherein the data processing operation is carried out in one or more digital circuits configurable into one or more pipelines.
20. A programmable data processor, comprising a plurality of configurable pipeline circuits, each configurable pipeline circuit comprising a plurality of configurable interconnection elements and a plurality of arithmetic and logic circuits, wherein the configurable pipeline circuit is configured by interconnecting a selected group of the arithmetic or logic circuits by the configurable interconnection elements into a pipeline for carrying out a predetermined arithmetic or logic function under control of a configurable state machine.
21. The programmable data processor of claim 20, wherein the configurable state machine comprises one or more sequencers each comprising a plurality of building block circuits that are programmable to connect with each other through the configurable interconnection elements to define an operation for the sequencer, wherein each building block circuit comprises either (i) one or more state elements, each representing a state in the state machine, or (ii) one or more loop elements, each representing a loop in the state machine, wherein (i) each state element keeps track of a programmable duration for which the sequencer is to remain in the state represented by the state element, and (ii) each loop element keeps track of a number of iterations for which the sequencer is to traverse the loop represented by the loop element.
22. The programmable data processor of claim 20, wherein each configurable state machine further comprises a control circuit that initiates operation of each sequencer of the configurable state machine.
23. The programmable data processor of claim 20, wherein a first set of results generated from an operation in a first one of the configurable pipeline circuits is provided as input data for an operation in a second one of the configurable pipeline circuits.
24. The programmable data processor of claim 23, wherein the control circuit of the first configurable pipeline circuit enters a first state in which the first configurable pipeline circuit suspends execution until the control circuit of a third one of the configurable pipeline circuits enters a second state, in which the third configurable pipeline circuit sends a predetermined vector.
25. The programmable data processor of claim 24, wherein the predetermined vector links the first state and the second state.
26. The programmable data processor of claim 24, further comprising a barrier controller circuit for providing a synchronizing signal to allow the control circuit of the first configurable pipeline circuit and the control circuit of the third configurable pipeline circuit to initiate operations of their respective state machines simultaneously.
27. The programmable data processor of claim 20, wherein the configurable pipeline circuits are organized into a plurality of groups, each group of the configurable pipeline circuits further comprising a stream processor, and wherein the stream processors are interconnected to each other by a plurality of stream processor-level programmable interconnection elements.
28. The programmable data processor of claim 27 wherein, within each group of configurable pipeline circuits, the pipeline of a first one of the configurable pipeline circuits is connected to the pipeline of a second one of configurable pipeline circuits by configuring their respective interconnection elements.
29. The programmable data processor of claim 27, wherein between a first group of configurable pipeline circuits and a second group of configurable pipeline circuits, a first one of the configurable pipeline circuits in the first group is connected to the pipeline of a configurable pipeline circuits of the second group by configuring both their respective interconnection elements and the stream processor-level interconnection elements.
30. The programmable data processor of claim 29, wherein both the first configurable pipeline circuit and the second configurable pipeline circuit are within one of the groups of the programmable pipeline processors.
31. The programmable data processor of claim 29, wherein the first configurable pipeline circuit is part of a first one of the groups of configurable pipeline circuits and wherein the second configurable pipeline circuit is part of a second one of the groups of configurable pipeline circuits.
32. The programmable data processor of claim 26, wherein the barrier controller implements a plurality of barriers, each barrier is associated with a predetermined number of configurable pipeline circuits allowed to wait on the barrier.
33. The programmable data processor of claim 21, wherein the programmable data processor provides a periodic timing signal serving each group of configurable pipeline circuits, wherein the programmable duration is specified by a number of cycles in the periodic timing signal.
34. The programmable data processor of claim 33, wherein each configurable pipeline circuit further comprises a gating circuit for the timing signal, the gating circuit selectively enabling and disabling propagation of the timing signal among the programmable arithmetic or logic circuits.
35. The programmable data processor of claim 20, wherein each configurable pipeline circuit further comprises a plurality of registers for storing operands and results.
36. The programmable data processor of claim 35, wherein each configurable pipeline circuit further comprises a memory circuit for storing the operands and the results.
37. The programmable data processor of claim 36, wherein the control circuit of each configurable pipeline circuit executes a program stored in the memory circuit, the program comprising instructions of a common instruction set.
38. The programmable data processor of claim 37, wherein the common instruction set includes a wait instruction to be executed by a control circuit of a first one of the configurable pipeline circuits and a release instruction to be executed by a second one of the configurable pipeline circuits, wherein upon executing the wait instruction, the first configurable pipeline circuit enters a first state in which the first pipeline circuit suspends execution of its pipeline until the second configurable pipeline circuit executes the release instruction whereby the second configurable pipeline circuit sends a predetermined vector.
39. The programmable data processor of claim 37, wherein the common instruction set comprises instructions for data transfers between the registers and the memory.
40. The programmable data processor of claim 20, further comprising an interface with an external host processor.
41. The programmable data processor of claim 40, further comprising a first plurality of configuration registers accessible to the external host processor, the configuration registers being provided for the external host processor to configure the pipeline in each configurable pipeline circuit in the programmable data processor.
42. The programmable data processor of claim 20, further comprising a plurality of look-up tables, each designated for storing a configuration of the programmable interconnection elements, wherein the control circuit retrieves one or more of the configurations from the look-up tables to configure the programmable interconnection elements.
43. The programmable data processor of claim 20, wherein each configurable pipeline circuit further comprises: a memory circuit having a plurality of independently accessible first and second memory portions, each memory portion holding data words in locations that the data words are individually addressable using a designated address in a linear address space; first and second address generation circuits each receiving an enable signal, a base address and an increment, each address generation circuit further comprising a counter circuit that (i) holds a count value which is incremented by an offset at each cycle of a clock signal; and (ii) sums the incremented count value to the base address to form a generated address; and a first cross-bar switch network configurable at each cycle of the clock signal to route the generated address of each addressable generation circuit to one of the memory portions; and wherein the control circuit executes a sequence of instructions that (i) configures the cross-bar switch network, selecting at least one of the memory portions to receive the generated address of one of the address generation circuits, and (ii) asserts the enable signal of at least one of the address generation circuits.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030] To facilitate cross-referencing between figures, like elements in the figures are provided like reference numerals.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0031] According to one embodiment of the present invention,
[0032] At the top level of
[0033] For convenience of reference, the registers in registers and logic circuits 15 are also referred to herein as top-level registers. Processor 40 may be configured, for example, by the external host processor over AHB interface 10 writing into selected configuration registers in top-level registers 15. Thus, the external host processor may control operations of SPUs 101-1 to 101-4 and interconnection fabric 16. As discussed in further detail below, the external host processor may access local registers and memories in each SPU over the same AHB interface 10.
[0034] A clock circuit, controllable through the top-level registers, provides a global timing signal (clock) that serves as a time base for all data processing circuits in processor 40. As shown in
[0035] In one embodiment, each SPU may be implemented to be structurally identical, but individually configurable, to perform customized functions. For example, in one embodiment, SPU 14-1 and SPU 14-2 may be configured to receive digitized samples of an RF signal from phase and quadrature channels over RF and GPIO interface 11. Likewise, SPU 14-3 and SPU 14-4 may receive input data and may provide output data over input and output ports allocated on RF and GPIO interface 11. Over internal bus 17, SPUs 14-1 to 14-4 may each access top-level registers and logic circuits 15.
[0036]
[0037] SPU-level registers 25 may include configuration registers for configuring each of APC 24-1 to 24-4 and interconnection fabric 26. In one embodiment, APC 24-1 to 24-4 may access SPU-level registers 25 over internal bus 17. As discussed in further detail below, SPU 45 may include static random-access memory (SRAM) circuits (e.g., SRAM circuits in each of APC 24-1 to 24-4) that are accessible internally by each APC and accessible across SPUs over memory bus 20. As mentioned above, through its associated input and output ports to top-level interconnect fabric 16, interconnection fabric 26 of SPU 45 connects with other SPUs in processor 40. In addition, interconnection fabric 26 provides the additional input and output ports to allow SPU 45 to access or be accessed over the external interfaces (e.g., AHB interface 10 and RF and GPIO interfaces 11). Therefore, in one embodiment, with four APCs in each SPU to interconnect, interconnection fabric 26 may be implemented by a 6×6 cross-bar switch network.
[0038]
[0039] Interconnection fabric 26 also includes input port in6 and output port out6, which allow interconnection fabric 26 to access or be accessed by the external interfaces. In one module, the APC module of SPU 14-1 is coupled to one of two RF channels, the APC module of SPU 14-2 is coupled to the other one of the two RF channels, and the APC module of SPU 14-3 is coupled to designated GPIO terminals of processor 40.
[0040] Interconnection fabric 16 and interconnection fabric 26 may be dynamically reconfigured by the host processor or by any APC in any of the SPUs. To configure interconnection fabric 16 from the host processor, the configuration, which is specified in a 50-bit vector communicated over a system bus, is written into a configuration register in registers 15 of processor 40. To configure from an APC, however, the configuration is stored in one of a predetermined number of look-up tables.
[0041] As shown in
[0042] To dynamically configure interconnection fabric 16, control signal regfile designate causes multiplexer 19-2 to grant access to a selected one of the SPUs to place a 50-bit vector on the system bus. This 50-bit vector specifies one of look-up tables 18. The configuration bits in the selected look-up table are then loaded into interconnection fabric 16 to configure it. In this manner, interconnection fabric 16 is dynamically configured within one clock cycle without intervention by the external host device.
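The table-driven reconfiguration described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `LutConfiguredCrossbar`, the four-port size, and the encoding of each table as a list of source-port indices are all hypothetical, and the 50-bit selection vector is reduced to a plain integer index.

```python
from typing import List, Optional, Sequence


class LutConfiguredCrossbar:
    """Toy model of a cross-bar whose routing is loaded from pre-stored tables.

    Hypothetical encoding: table[out] is the input port that drives output
    port `out`, or None if that output is left idle.
    """

    def __init__(self, n_ports: int,
                 lookup_tables: Sequence[Sequence[Optional[int]]]) -> None:
        for table in lookup_tables:
            if len(table) != n_ports:
                raise ValueError("each table needs one entry per output port")
        self.n_ports = n_ports
        self.lookup_tables = [list(t) for t in lookup_tables]
        self.routing: List[Optional[int]] = [None] * n_ports

    def select(self, table_index: int) -> None:
        """Load the selected table into the fabric (a single step in this model)."""
        self.routing = list(self.lookup_tables[table_index])

    def route(self, inputs: Sequence[object]) -> List[object]:
        """Return the value seen at each output port for the given inputs."""
        return [None if src is None else inputs[src] for src in self.routing]


tables = [
    [0, 1, 2, 3],     # table 0: straight-through connections
    [1, 0, None, 2],  # table 1: swap ports 0 and 1, leave output 2 idle
]
xbar = LutConfiguredCrossbar(4, tables)
xbar.select(1)
outputs = xbar.route(["a", "b", "c", "d"])
```

Because the full routing is pre-stored, switching configurations in this model is a single table load, which mirrors the single-cycle reconfiguration the paragraph describes.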
[0043]
[0044]
[0045] In some embodiments, all APCs in processor 40 may have substantially identical architecture. Each APC may include a controller (i.e., APC controller 31 or k-controller) that executes programs that control all operations of the APC. APC controller 31 may be implemented by any general-purpose processor, whether or not Turing-complete, a proprietary processor or controller, any commercially available third-party processor or controller (e.g., RISC-V or ARM), or any suitable derivation of the above. For certain applications (e.g., signal processing applications), a controller with a minimal instruction set is preferred.
[0046] As shown in
[0047] As shown in
[0048] As shown in
[0049] As shown in
[0050] As shown in
[0051] According to one embodiment of the present invention, a generator that generates algorithmically a sequence of addresses for reading or writing data of a data structure stored in SRAM module 32 or flop matrices of register file 34 may be composed from one or more building-block address iterator circuits.
[0055] The logic circuitry supporting this operation may be automatically synthesized, for example, from a register-transfer level hardware description (e.g., Verilog), as known to those of ordinary skill in the art. In essence, 1-D iterator circuit 90 provides a memory or register address z_out that is based on base address base and an offset value count_out, count_out being a count accumulated in an internal counter, which is incremented or decremented by numerical value stride in each clock cycle. Numerical value count_in sets the initial offset at the beginning of count accumulation. 1-D iterator circuit 90, for example, may be used to access elements of a two-dimensional matrix one element at a time in a row-major or column-major manner by suitably setting numerical values base, count_in and stride. 1-D iterator circuit 90 may also be used to generate an address sequence to access the transpose of the matrix. Numerical value count_out resets to 0 when it equals numerical value acc_max.
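As a rough behavioral illustration of the 1-D iterator described above, the following Python sketch models one clock cycle per call. The names base, stride, count_in and acc_max follow the text; the class name `Iterator1D`, the `enable` argument, and the choice to output the address before advancing the counter are assumptions made only for this sketch.

```python
class Iterator1D:
    """Behavioral sketch of a 1-D address iterator (not a hardware description)."""

    def __init__(self, base: int, stride: int, count_in: int = 0,
                 acc_max: int = 0) -> None:
        self.base = base
        self.stride = stride
        self.acc_max = acc_max
        self.count_out = count_in  # count_in sets the initial offset

    def step(self, enable: bool = True) -> int:
        """Model one clock cycle: return z_out, then update the counter."""
        z_out = self.base + self.count_out
        if enable:
            if self.count_out == self.acc_max:
                self.count_out = 0            # reset when the count equals acc_max
            else:
                self.count_out += self.stride
        return z_out


# A 3x4 matrix stored row-major at address 100: walking column 0 uses a
# stride equal to the row length (4) and wraps after the last row (offset 8).
column_walker = Iterator1D(base=100, stride=4, count_in=0, acc_max=8)
column_addresses = [column_walker.step() for _ in range(4)]

# Holding enable low pauses the sequence at the current address.
paused = Iterator1D(base=0, stride=1, acc_max=3)
paused_addresses = [paused.step(enable=False), paused.step(), paused.step()]
```

With a stride of 1 and acc_max set to the last offset, the same object would walk the matrix in row-major order.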
[0056] According to one embodiment of the present invention, two or more 1-D iterator circuits may be chained to extend the generated address sequence to a higher-dimensional matrix.
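One way to picture the chaining is to let the completion of an inner counter enable a single step of an outer counter and to sum both offsets with the base address. The generator below is a minimal sketch of that idea; the function name, parameter names, and the decision to stop after the outer counter completes (rather than wrapping) are assumptions for illustration only.

```python
from typing import Iterator, List


def chained_addresses(base: int, inner_stride: int, inner_max: int,
                      outer_stride: int, outer_max: int) -> Iterator[int]:
    """Yield addresses from two chained counters (hypothetical software model)."""
    inner = 0
    outer = 0
    while True:
        yield base + outer + inner
        if inner == inner_max:
            inner = 0                 # inner counter completes and resets
            if outer == outer_max:
                return                # stop here instead of wrapping
            outer += outer_stride     # completion enables one outer step
        else:
            inner += inner_stride


# 3x4 matrix stored row-major at address 0, visited in column-major order:
# the inner counter steps down a column, the outer counter selects the column.
column_major: List[int] = list(
    chained_addresses(base=0, inner_stride=4, inner_max=8,
                      outer_stride=1, outer_max=3))
```

The same sequence is the row-major order of the transposed matrix, which is why a pair of chained counters is enough to read a stored matrix as its transpose.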
[0057] Operations of addressor module 38 may be controlled clock cycle-by-clock cycle and synchronized through a program instruction executed in APC controller 31, in conjunction with the operations of timing belt module 35. Consequently, the memory circuits (e.g., in SRAM module 32 and register file 34) may be accessed for reading or writing operation according to the generated address sequence. The address sequence may also be paused or resumed on a clock cycle-by-clock cycle basis. Thus, addressor module 38 provides the hardware to facilitate highly complex, structured computational processes to be programmed.
[0058] To achieve power efficiency, timing signal clock is separately gated at each APC (e.g., according to the dynamic clock-gating scheme). Within each APC, local SRAM circuits 32, the registers in register file 34, and the circuits in addressor module 38 are active only when timing signal clock is allowed to propagate. When an APC is not processing data, timing signal clock is often not propagated, to conserve power. In one embodiment, gating of timing signal clock is controlled by one or more registers (clock-gating registers) in registers and logic circuits 15, where appropriate, under a static clock-gating scheme and, in each APC, by APC controller 31, through register files 32 and logic circuits 33, under a dynamic gating scheme. Even though timing signal clock may not be propagating in an APC, the contents of SRAM circuits 32 and register file 34 are held. In addition, some embodiments may implement in processor 40 a light sleep mode, a deep sleep mode, a powered-down mode, or any combination of these modes known to those skilled in the art. SRAM circuits 32 may serve as local memory to support logic circuits 33 and may be shared with other APCs within SPU 45 and other SPUs in processor 40 over memory bus 20.
[0059] In addition, each APC includes a set of task-specific operators within logic circuits 33 that can be connected serially to form an execution pipeline, which performs a programmable sequence of arithmetic or logic operations on data without intervention by either APC controller 31 or the host processor. The data may come into the execution pipeline as a data stream over the configured interconnection fabrics (i.e., interconnection fabric 16 and interconnection fabric 26), or may be retrieved from the local memory of the APC (i.e., SRAM module 32 and register file 34). At predetermined points during the computation on the pipeline (e.g., completion), APC controller 31 interrupts the host processor. The host processor may retrieve, for example, the results of the execution pipeline from the local memory over the AHB-Lite bus.
[0060] A proprietary minimal-instruction set APC controller 31 has the advantage of greater power efficiency over a commercially available microprocessor or controller (e.g., a RISC-V processor). The power efficiency results from two factors. Firstly, the task-specific operators on each APC may be optimized for the desired operations of the target application. For example, in a navigation application, optimized operations may be designed for calculating correlations between a digitized global navigation satellite system (GNSS) signal and a peak signal-to-noise ratio (PSNR) bitstream. Secondly, the execution pipeline may process data either directly from the external RF channels, or from data in its local memory. The execution pipeline is designed to process long sequences of data without intervention by the host processor. In other words, when the operators are optimized to the nature of the data in the target application, a locality of computation can be achieved in the architecture of the present invention, leading to a much higher performance than is possible under a commercial processor. Without the ability to exploit the locality of data, a commercial processor is often bogged down by frequent accesses to data (e.g., the RF signals) in the memory over the system bus.
[0061] In some embodiments, each APC may configure the interconnection fabrics at both the baseband module level and the SPU module level (i.e., both interconnection fabric 16 and interconnection fabric 26). In those embodiments, when an SPU and an APC both attempt to configure the interconnection fabric, the SPU may yield. In this manner, each APC may reconfigure the interconnection fabric within its APC, and the interconnection fabric at the baseband-level interconnection ports of its SPU, during the operations of the execution pipeline.
[0062] Processor 40 of
[0063] In one embodiment, the dynamic clock gating scheme is provided in each APC to synchronize the clock signals that are applied to the APC's registers 34, logic circuits 33, and timing belts 35. In one embodiment, memory bus 20, SRAM circuits 32, register file 34, addressor module 38, logic circuits 33, timing belt module 35, and APC controller 31 in each APC operate under the static clock-gating scheme. The dynamic clock gating scheme in each APC is controlled by one or more pre-specified signals sent over signal buses or cross-bar switches (e.g., cross-bar switches of interconnection fabric 26) of the configurable pipeline fabric (PLF). In general, the pre-specified signals are generated by an upstream APC. The dynamic clock gating scheme allows the associated gating circuits to be switched between active and inactive states cycle-by-cycle. Thus, the dynamic clock gating scheme provides a powerful additional synchronization mechanism. In one embodiment, APC controller 31 of each APC may override the dynamic clock-gating scheme by setting an instruction bit at the execution of a start_timing_belt instruction. The start_timing_belt instruction is described in further detail below.
[0064] As mentioned above, each APC may include APC controller 31 that is implemented as a programmable processor executing a relatively simple instruction set. In one embodiment, the APCs in an SPU are implemented to be structurally identical, but individually configurable, to facilitate performing different configurable functions by suitably connecting multiple task-specific operators in logic circuits 33. The task-specific operators may each be configured to perform one or more specific arithmetic or logic operations. These operators may take operands from either register file 34 or SRAM circuits 32 and may write back results into either register file 34 or SRAM circuits 32. Furthermore, the operators may be configured into a data processing pipeline (i.e., an execution pipeline). SPU 45 may extend the execution pipeline by connecting it with other execution pipelines configured within SPU 45. At the top level, one or more execution pipelines of each SPU in processor 40 may also be connected to execution pipelines of other SPUs in processor 40.
[0065] Some task-specific operators, and their interconnections within the APC (i.e., with the first and second interconnectivity elements), may be configured as needed to any of multiple pre-set configurations. For example, during operation of an execution pipeline, under control of a state machine (e.g., timing belt module 35, discussed in further detail below), an operator and its interconnections may be reconfigured among its pre-set configurations. The configurations may be stored, for example, in lookup-tables and selected by control vectors programmed into configuration registers. For example, timing belt module 35 may issue the control vectors cycle-by-cycle, such that the interconnections of an operator may be changed on a cycle-by-cycle basis, thus allowing high flexibility in constructing complex streaming calculations.
[0066] According to one embodiment of the present invention, timing belt module 35 of each APC may include programmable circuits for configuring one or more sequencers, each sequencer implementing a state machine for controlling an execution pipeline in the APC. In one embodiment, the programmable circuits in timing belt module 35 include at least two types of building blocks: (a) holders; and (b) passers. Each instance of each building block includes an internal counter. A holder building block is provided to represent a state in the state machine. In the normal course, each state is associated with a control vector that represents the values of all control signals provided to control the execution pipeline configured in logic circuits 33, including controlling the task-specific operators therein. A passer building block is provided to implement an iterative loop encompassing two or more states in the state machine.
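The interplay of holders and passers can be sketched in software as a tiny interpreter that emits one control vector per clock cycle. Everything in this sketch beyond the holder and passer roles named above is an assumption: the field names, the use of a list index as the loop target, and the choice that a passer consumes no clock cycle of its own are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Holder:
    control_vector: int  # control signals driven while in this state
    duration: int        # number of clock cycles to remain in the state


@dataclass
class Passer:
    loop_start: int      # index of the first block in the loop body
    iterations: int      # total number of times the loop body runs


def run_sequencer(blocks: List[Union[Holder, Passer]]) -> List[int]:
    """Return the control vector issued in each clock cycle."""
    trace: List[int] = []
    passes_left: Dict[int, int] = {}  # per-passer internal counter
    pc = 0
    while pc < len(blocks):
        block = blocks[pc]
        if isinstance(block, Holder):
            trace.extend([block.control_vector] * block.duration)
            pc += 1
            continue
        left = passes_left.get(pc, block.iterations) - 1
        if left > 0:
            passes_left[pc] = left
            pc = block.loop_start        # traverse the loop again
        else:
            passes_left.pop(pc, None)    # counter re-arms for any outer loop
            pc += 1
    return trace


program = [
    Holder(control_vector=0x1, duration=2),
    Holder(control_vector=0x2, duration=1),
    Passer(loop_start=0, iterations=3),   # loop over the two states above
    Holder(control_vector=0x9, duration=1),
]
cycle_trace = run_sequencer(program)
```

Here the two-state loop body lasts three cycles and is traversed three times before the final state is entered, so the sequencer runs for ten cycles without any controller involvement.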
[0067]
[0068]
[0069]
[0070] Once configured, computations on a large amount of data (e.g., a large array of signal samples or any intermediate data sets) may be carried out during data processing operations under a sequencer of the present invention (an example of the functions of timing belt module 35), without intervention by APC controller 31, any of the SPUs, or the host processor.
[0071] In one embodiment, to activate an execution pipeline, a token is passed into the associated sequencer of the execution pipeline, concurrently with activating distribution of timing signal clock into the execution pipeline (e.g., through the static clock-gating scheme programmed into configuration registers in the SPU-level). Execution pipelines of multiple APCs (whether within the same SPU or across SPUs) may be synchronized using a barrier mechanism.
[0072] According to one embodiment of the present invention, barrier controller 17 is provided at the top-level of processor 40, as shown in
[0073] When an execution pipeline in an APC is ready for execution, its corresponding APC controller (e.g., APC controller 31 of APC 48 of
[0074] Two APCs connected by interconnection fabric (e.g., within interconnection fabric 26, or through the combination of interconnection fabric 26 and interconnection fabric 16) may synchronize their execution pipelines using the send_pulse and the wait_pulse instructions. APC controller 31 of one APC may suspend its execution pipeline by executing a wait_pulse instruction until it receives a corresponding code word from APC controller 31 of another APC executing the send_pulse instruction. Executing a send_pulse instruction writes a one-cycle vector (i.e., the corresponding code word) to the input internal data bus of the recipient APC over the relevant interconnection fabric. The one-cycle vector links the pair of send_pulse and wait_pulse instructions. In one embodiment, the one-cycle vector is a 28-bit word and may encode configuration information to be exchanged between the APCs. Because the vector is present on the bus for only one cycle, the programs in the APCs should be carefully constructed such that the wait_pulse instruction is executed before the corresponding send_pulse instruction is executed.
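The ordering requirement between wait_pulse and send_pulse may be illustrated by the following minimal sketch. The class Apc, its tick method and the cycle model are hypothetical and are assumed only for illustration; the sketch does not represent the actual bus timing.

```python
MASK_28 = (1 << 28) - 1  # the one-cycle vector is a 28-bit word


class Apc:
    """Hypothetical model of the receiving side of an APC."""

    def __init__(self):
        self.bus = None          # word on the input internal data bus this cycle
        self.waiting_for = None  # code word expected by a pending wait_pulse
        self.released = False

    def wait_pulse(self, code_word):
        """Suspend until the matching code word is seen on the bus."""
        self.waiting_for = code_word & MASK_28
        self.released = False

    def tick(self):
        """End of a clock cycle: sample the bus, then the one-cycle vector is gone."""
        if self.waiting_for is not None and self.bus == self.waiting_for:
            self.waiting_for = None
            self.released = True
        self.bus = None


def send_pulse(recipient, code_word):
    """Write a one-cycle vector to the recipient's input bus."""
    recipient.bus = code_word & MASK_28


# Correct order: the receiver is already waiting when the pulse arrives.
rx = Apc()
rx.wait_pulse(0x123)
send_pulse(rx, 0x123)
rx.tick()

# Wrong order: the pulse has vanished before the receiver starts waiting.
late = Apc()
send_pulse(late, 0x123)
late.tick()
late.wait_pulse(0x123)
late.tick()
```

Under these assumptions, the first receiver is released, while the second remains suspended because the pulse was not latched.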
[0075] Any portion of an execution pipeline in an APC may be constructed by appropriately configuring interconnection fabric 26 or interconnection fabric 16. Data may be sourced, for example, from a data stream over interconnection fabric 26, from local SRAM circuits 32, or from register file 34. Once a pipelined computation is complete, APC controller 31 may assert an interrupt to the host processor, which may then retrieve the results over AHB interface 10 by a memory read request that directs APC 48 to transfer the results of the pipelined computation from SRAM 32 over memory bus 20.
[0076] With the task-specific operators in logic circuits 33 tailored to allow configurations for performing a set of special purpose functions (e.g., calculating correlation and processing global navigational satellite system (GNSS) signals and peak signal-to-noise ratio (PSNR) bitstreams), processor 40 provides a programmable power-efficient baseband processor. Furthermore, each APC in processor 40 may operate an execution pipeline that continuously processes long sequences of data received either directly from an RF channel or from its local memory (e.g., SRAM circuits 32), without intervention from the host processor, thereby achieving much greater efficiency than baseband processors of the prior art. In addition, during data processing operations, each APC of processor 40 may access and configure interconnection fabric 16 and interconnection fabric 26, thereby allowing the APC to affect or change its execution pipeline configuration involving other APCs, directly, or through the additional input and output ports of its SPU. In some embodiments, multiple processors, each similarly configured as processor 40, may operate together. In those configurations, each processor may likewise affect or change execution pipeline configurations among themselves.
[0077] In some embodiments, top-level registers 15 and SPU-level registers 25 may be mapped to a first region in a memory address space of the host processor. Likewise, the local SRAM circuits in the APCs may be mapped to a second region in the same memory address space of the host processor. Register file 34 in each APC may also be mapped to the same first region or to a separate region in the host processor's address space, as appropriate.
[0078] From the point of view of the host processor, processor 40 may be viewed as having three operational stages: (i) APC programming stage, (ii) APC running stage, and (iii) a result-fetching stage. In one embodiment, during the APC programming stage, the host processor loads instructions for each APC into SRAM circuits (e.g., SRAM module 32) accessible by that APC. A control circuit in each APC (e.g., APC controller 31) executes the loaded instructions. Those instructions include instructions for loading a bit stream into configuration registers associated with the execution pipeline, thereby configuring that portion of the execution pipeline.
[0079] In the APC running stage, the configured execution pipeline processes the data streams flowing into the execution pipelines (e.g., the digital samples from the RF signal source). Note that more than one execution pipeline may be configured and operated concurrently. At the completion of pipeline execution or under certain predetermined conditions, processor 40 asserts an interrupt signal to the host processor to indicate termination of the APC running stage. The host processor then initiates the result-fetching stage to retrieve the results of the computations in the execution pipeline, or to examine any exception conditions encountered in processor 40, as appropriate. Upon completion of the result-fetching stage, the host processor may initiate the next computational cycle by initiating another APC programming stage. In some embodiments, for each APC that is to be programmed, participate in pipeline execution, or provide results, the host processor (i) may write into the clock-gating registers at the beginning of the stage to activate the APC for the intended operation, and (ii) may write into the clock-gating registers at the end of the stage to deactivate the APC.
[0080] During the APC programming stage, the host processor programs the computation tasks to be carried out on processor 40. In particular, task-specific operators in each APC may be configured into an execution pipeline, and the execution pipelines of the APCs in each SPU may be connected to form one or more extended pipelines. Likewise, the extended pipelines of the SPUs may also be connected with extended execution pipelines of other SPUs. Data processing operations in each APC of processor 40 are controlled by APC controller 31, which executes a sequence of instructions written into SRAM module 32 of the APC to carry out its control functions. APC controller 31's instruction set may include instructions for (i) data transfer among SRAM circuits 32 and register file 34; (ii) transfer of control (e.g., jump or branch instructions, including conditional transfers of control); (iii) raising an interrupt signal to the host processor; (iv) resetting state elements in logic circuits 33; (v) setting interconnection fabric 26 and interconnection fabric 16; (vi) arithmetic and logic instructions; and (vii) a synchronized beginning of execution in the APC for an execution pipeline. The synchronized beginning of execution of an execution pipeline may be initiated by the start_timing_belt instruction.
[0081] At the beginning of the APC running stage, the host processor configures interconnections among the SPUs and the APCs in interconnection fabric 16 or interconnection fabric 26, by appropriately writing into interconnection configuration registers in top-level registers 15 or the SPU-level registers 25. The interconnections in these interconnection fabrics may be fixed interconnections between SPUs and APCs (i.e., interconnections that stay unchanged throughout the APC running stage) or dynamically switched interconnections that may be effected by one or more APCs during the APC running stage. The host processor then sets a reset vector for each APC by writing into the reset vector registers in the top-level registers. A reset vector is a 16-bit address that is mapped to the location in SRAM circuits 32 of the first instruction in the program to be executed by APC controller 31 during the APC running stage.
[0082] The host processor then allows the APC controllers to run their respective programs by writing into a trigger register in top-level registers 15 of processor 40. In one embodiment, the trigger register is a 32-bit register, capable of supporting up to 32 APCs, with each bit being dedicated to a corresponding one of the implemented APCs. In one embodiment, a 1 in the corresponding bit in the trigger register signals that the APC is to be activated. Thus, all the activated APCs are synchronized at the beginning of their respective executions. Synchronization of beginning of execution in execution pipelines of APCs within the same SPU or across SPUs is accomplished through the start_timing_belt instruction.
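As a minimal sketch of how a host program might compose the value written into the 32-bit trigger register, the following assumes that bit i corresponds to APC i; the function name trigger_value and that bit assignment are hypothetical.

```python
def trigger_value(apc_indices, num_apcs=32):
    """Build a trigger word with a 1 in the bit of every APC to be activated.

    Assumes bit i of the register is dedicated to APC i.
    """
    value = 0
    for index in apc_indices:
        if not 0 <= index < num_apcs:
            raise ValueError(f"APC index {index} is outside 0..{num_apcs - 1}")
        value |= 1 << index
    return value


word = trigger_value([0, 3, 31])  # activate APCs 0, 3 and 31 in a single write
```

Because all selected bits are set by one register write, the selected APC controllers begin running their programs together.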
[0083] In one embodiment, for power conservation reasons, before APC controller 31 in each APC executes a start_timing_belt instruction, the operators and associated circuits of the execution pipeline in logic circuits 33 are, in the normal course, not active, as propagation of signal clock is normally disabled by the clock-gating register. When APC controller 31 executes the start_timing_belt instruction, a barrier_id specified in the instruction is sent to barrier controller 17 to indicate that the execution pipeline in the APC is ready and waiting for the barrier corresponding to the barrier_id, except when the instruction specifies a zero value for the barrier_id. When the barrier_id is zero, no waiting at a barrier is required, and APC controller 31 allows the execution pipeline to begin execution immediately. Beginning of execution may be effectuated, for example, by passing a token into the execution pipeline. See, e.g., the example of
[0084] When the last one of the APCs waiting at the barrier corresponding to barrier_id arrives at barrier controller 17, barrier controller 17 sends a barrier_release signal to each of the waiting APCs simultaneously. At each APC controller, at the beginning of the next cycle of timing signal clock, the execution pipeline begins execution. Beginning of pipeline execution may be accomplished, for example, by passing a token into the corresponding sequencer. See, e.g.,
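The barrier behavior described above may be sketched as follows. The class BarrierController below is a hypothetical software model; in particular, how the barrier controller learns which APCs participate in a given barrier is not specified here and is assumed to be preconfigured.

```python
class BarrierController:
    """Hypothetical model: releases all waiting APCs when the last one arrives."""

    def __init__(self, participants):
        # participants: barrier_id -> APCs expected at that barrier (assumed config)
        self.expected = {bid: set(apcs) for bid, apcs in participants.items()}
        self.arrived = {bid: set() for bid in participants}

    def arrive(self, apc, barrier_id):
        """Return the APCs released by this arrival (empty while still waiting)."""
        if barrier_id == 0:
            return [apc]  # zero barrier_id: no waiting, start immediately
        self.arrived[barrier_id].add(apc)
        if self.arrived[barrier_id] == self.expected[barrier_id]:
            released = sorted(self.arrived[barrier_id])
            self.arrived[barrier_id] = set()  # barrier may be reused
            return released
        return []


ctrl = BarrierController({5: ["apc0", "apc1", "apc2"]})
first = ctrl.arrive("apc0", 5)   # still waiting
second = ctrl.arrive("apc2", 5)  # still waiting
last = ctrl.arrive("apc1", 5)    # last arrival releases everyone
solo = ctrl.arrive("apc7", 0)    # zero barrier_id bypasses the barrier
```

In this sketch, the non-empty list returned on the last arrival stands in for the barrier_release signal sent simultaneously to each waiting APC.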
[0085] In the normal course, as illustrated by
[0086] Upon completion of each pipeline execution, a send_interrupt instruction may cause the APC controller of an APC in the execution pipeline to raise an interrupt signal, which may be accomplished, for example, by setting a corresponding bit in an interrupt register in top-level registers 15. In some instances, a second interrupt register in top-level registers 15 may be provided, in which the interrupt bit of each APC is written after gating by a corresponding bit in a mask register. After the last one of the APCs completes execution, the trigger register is reset and the host processor is interrupted according to the value held in the interrupt register. The host processor then examines the interrupt register to determine the state of each APC at the respective completions of execution of their execution pipelines.
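The relationship between the interrupt register, the mask register and the second (gated) interrupt register may be illustrated by the following minimal sketch. The function names and the mask polarity (a 1 in the mask passes the interrupt) are assumptions made for illustration.

```python
def raise_interrupt(interrupt_reg, apc_index):
    """Set the interrupt bit of one APC (bit i is assumed to belong to APC i)."""
    return interrupt_reg | (1 << apc_index)


def gate(interrupt_reg, mask_reg):
    """Value of the second interrupt register: raw bits gated by the mask (1 = pass, assumed)."""
    return interrupt_reg & mask_reg


def apcs_flagged(reg, num_apcs=32):
    """List the APC indices whose bit is set, as a host processor might when examining the register."""
    return [i for i in range(num_apcs) if (reg >> i) & 1]


raw = 0
for apc in (1, 4):
    raw = raise_interrupt(raw, apc)  # APCs 1 and 4 execute send_interrupt
masked = gate(raw, mask_reg=0b10000)  # only APC 4 is unmasked
```

Here the raw register records both interrupts, while the gated register reports only the APC whose mask bit is set.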
[0087] In the normal course, at the completion of its pipeline execution, an APC writes the result of its data processing into SRAM circuits 32. In some instances, the results may be input to a subsequent execution pipeline to be operated on the same or another APC. When all computational tasks are complete, the host processor may read the final results from processor 40. Such final results may be, for example, a mere single word, or any number of words. The final results may be provided at the local SRAM module 32 of a designated APC or distributed across SRAM modules in numerous APCs.
[0088] According to one embodiment of the present invention,
[0089] In the embodiment shown in
[0090] In one data processing application for satellite-based navigation, processor 100 may serve as a digital baseband circuit that processes in real time digitized samples from a radio frequency (RF) front-end circuit. In that application, the input data samples received into processor 100 at input data buses 106-1 and 106-2 are in-phase and quadrature components of a signal received at an antenna, after signal processing at the RF front-end circuit. The received signal includes the navigation signals transmitted from numerous positioning satellites.
[0091]
[0092] As shown in
[0093] The enable signal to an APC may be memory-mapped to allow it to be accessed over internal processor bus 209. Through this arrangement, when multiple APCs are configured in a pipeline, the host CPU or SPU 200, as appropriate, may control enabling the APCs in the proper order, e.g., enabling the APCs in the reverse order of the data flow in the pipeline, such that all the APCs are ready for data processing when the first APC in the data flow is enabled.
[0094] Multiplexer 205 switches control of internal processor bus 209 between the host CPU and control unit 203. SPU 200 includes memory blocks 207-1, 207-2, 207-3 and 207-4, which are accessible over internal processor bus 209 by the host CPU or SPU 200, and by APCs 201-1, 201-2, . . . , 201-8 over the internal data buses during the computation phase. Switches 208-1, 208-2, 208-3 and 208-4 each switch access to memory blocks 207-1, 207-2, 207-3 and 207-4 between internal processor bus 209 and a corresponding one of internal data buses 210-1, 210-2, 210-3 and 210-4. During the configuration phase, the host CPU may configure any element in SPU 200 by writing into configuration registers over global bus 104, which is extended into internal processor bus 209 by multiplexer 205 at this time. During the computation phase, control unit 203 may control operation of SPU 200 over internal processor bus 209, including one or more clock signals that allow APCs 201-1, 201-2, . . . , 201-8 to operate synchronously with each other. At appropriate times, one or more of APCs 201-1, 201-2, . . . , 201-8 may raise an interrupt on interrupt bus 211, which is received into SPU 200 for service. SPU 200 may forward the interrupt signals and its own interrupt signals to the host CPU over interrupt bus 105. Scratch memory 206 is provided to support instruction execution in control unit 203, such as for storing intermediate results, flags and interrupts. Switching between the configuration phase and the computation phase is controlled by the host CPU.
[0095] In one embodiment, memory blocks 207-1, 207-2, 207-3 and 207-4 are accessed by control unit 203 using a local address space, which may be mapped into an allocated part of a global address space of processor 100. Configuration registers of APCs 201-1, 201-2, . . . , 201-8 are likewise accessible from both the local address space and the global address space. APCs 201-1, 201-2, . . . , 201-8 and memory blocks 207-1, 207-2, 207-3 and 207-4 may also be directly accessed by the host CPU over global bus 104. By setting multiplexer 205 through a memory-mapped register, the host CPU can connect and allocate internal processor bus 209 to become part of global bus 104.
[0096] Control unit 203 may be a microprocessor of a type referred to by those of ordinary skill in the art as a minimal instruction set computer (MISC) processor, which operates under supervision of the host CPU. In one embodiment, control unit 203 manages lower-level resources (e.g., APCs 201-1, 201-2, 201-3 and 201-4) by servicing certain interrupts and by locally configuring the configuration registers in the resources, thereby reducing the supervisory requirements of these resources on the host CPU. In one embodiment, the resources may operate without participation by control unit 203, i.e., the host CPU may directly service the interrupts and the configuration registers. Furthermore, when a configured data processing pipeline requires participation by multiple SPUs, the host CPU may control the entire data processing pipeline directly.
[0097]
[0098]
[0099] Within a configured pipeline, the output data stream of each operator is provided as the input data stream for the next operator. As shown in
[0100] Some operators may be configured to request data from an associated memory block (i.e., memory blocks 207-1, 207-2, 207-3 or 207-4). For example, one operator may receive data from the associated memory block and may write the data onto its output data stream into the pipeline. One operator may read data from its input data stream in the pipeline and send data to be written into the associated memory block. Some operators may require data from the RF digital data stream (e.g., over RF interfaces 106-1 and 106-2; see,
[0101] One or more buffer operators may be provided in an APC. A buffer operator may be configured to read or write from a local buffer (e.g., a FIFO buffer). When congestion occurs at a buffer operator, the buffer operator may assert a pause signal to pause the current pipeline. The pause signal disables all related APCs until the congestion subsides. The buffer operator then resets the pause signal to resume the pipeline operation.
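The pause mechanism of a buffer operator may be modeled, by way of a minimal sketch, as follows. The class BufferOperator is hypothetical, and the congestion condition (the FIFO being full) is an assumption; the actual threshold is not specified here.

```python
from collections import deque


class BufferOperator:
    """Hypothetical FIFO-backed buffer operator with a pause signal."""

    def __init__(self, depth):
        self.fifo = deque()
        self.depth = depth
        self.pause = False  # asserted while the buffer is congested

    def _update_pause(self):
        # Congestion is assumed to mean "FIFO full".
        self.pause = len(self.fifo) >= self.depth

    def push(self, word):
        if self.pause:
            raise RuntimeError("pipeline must stall while pause is asserted")
        self.fifo.append(word)
        self._update_pause()

    def pop(self):
        word = self.fifo.popleft()
        self._update_pause()  # pause is reset once congestion subsides
        return word


buf = BufferOperator(depth=2)
buf.push(10)
paused_after_one = buf.pause
buf.push(11)
paused_when_full = buf.pause
drained = buf.pop()
paused_after_drain = buf.pause
```

In this sketch the upstream stages observe the pause flag before pushing, which stands in for the pause signal disabling the related APCs.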
[0102] In one embodiment, specialized memory or register circuits (flop matrices) may be provided in addition to the memory blocks (e.g., memory blocks 207-1 to 207-4), or as part of the memory blocks. Each flop matrix is organized as n rows × m columns of memory words, with access ports optimized for accessing the memory words by row or by column. These flop matrices are particularly useful when data, state information and configuration information can be modeled and manipulated using matrix operations.
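As an illustration only, row-wise and column-wise access to a flop matrix may be modeled as in the following minimal sketch. The class FlopMatrix and its method names are hypothetical and do not describe the actual access ports.

```python
class FlopMatrix:
    """Hypothetical model of an n-row by m-column matrix of memory words."""

    def __init__(self, n_rows, m_cols):
        self.n_rows = n_rows
        self.m_cols = m_cols
        self.words = [[0] * m_cols for _ in range(n_rows)]

    def write_row(self, row, values):
        if len(values) != self.m_cols:
            raise ValueError("a row holds exactly m words")
        self.words[row] = list(values)

    def read_row(self, row):
        return list(self.words[row])

    def write_col(self, col, values):
        if len(values) != self.n_rows:
            raise ValueError("a column holds exactly n words")
        for row, value in enumerate(values):
            self.words[row][col] = value

    def read_col(self, col):
        return [self.words[row][col] for row in range(self.n_rows)]


matrix = FlopMatrix(n_rows=2, m_cols=3)
matrix.write_row(0, [1, 2, 3])
matrix.write_row(1, [4, 5, 6])
column_1 = matrix.read_col(1)
```

Writing by row and reading by column, as in this usage, yields a transposed view of the data without a separate transpose step, which is one way such a structure may support matrix-style operations.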
[0103]
[0104] The above detailed description is provided to illustrate specific embodiments of the present invention and is not intended to be limiting. Numerous modifications and variations within the scope of the invention are possible. For example, memory units at the APC-level (e.g., memory units 207-1 to 207-4 of