SYSTEMS AND METHODS FOR MAPPING HARDWARE FIFO TO PROCESSOR ADDRESS SPACE
20220349417 · 2022-11-03
Assignee
Inventors
Cpc classification
G06F13/4022
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
F16J15/32
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
International classification
F04D29/10
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
F04D13/06
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
F04D29/046
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
F04D29/12
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
F16C33/74
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
Abstract
An apparatus for a microprocessor computer system and method for configuring the same where said microprocessor computer system comprises a processor core and at least one hardware buffer FIFO with memory-mapped head and tail that handles data movement among the processor cores, networks, raw data input and outputs, and memory. The method for configuring said microprocessor computer system comprises utilizing a FIFO auxiliary processor to process said data traversing said hardware FIFO; utilizing said hardware FIFOs to efficiently pipe data through functional blocks; and utilizing a FIFO controller to perform DMA operations that include non-unit-stride access patterns and transfers among processor cores, networks, raw data input and outputs, memory, and other memory-mapped hardware FIFOs.
Claims
1. A method of configuring a microprocessor computer system, wherein said microprocessor computer system comprises: a plurality of external data streams, each associated with a distinct memory-mapped FIFO; and a processor core that processes multiple sequential iterations of a vector operation, each iteration requiring the processing of a datum from each of the data streams before proceeding to the next iteration; said method comprising, for each iteration of the vector operation, configuring said vector operation to access each memory-mapped FIFO in turn and to stall if any such access is invalid.
2. (canceled)
3. The method of claim 21, wherein the plurality of external data streams is a plurality of external input data streams; wherein each of said input data streams in enqueued in a distinct memory-mapped FIFO and said processor core must dequeue and process one datum from each memory-mapped FIFO for each iteration; wherein an invalid access of a memory-mapped FIFO comprises executing a read operation against an empty FIFO; and wherein iterating said vector operation results in processing the external input data streams.
4. (canceled)
5. The method of claim 1, wherein the plurality of external data streams is a plurality of external output data streams; wherein each of said output data streams is dequeued from a distinct memory-mapped FIFO and said processor core must enqueue one datum to each memory-mapped FIFO for each iteration; wherein an invalid access of a memory-mapped FIFO comprises executing a write operation against an full FIFO; and wherein iterating said vector operation results in generating the external output data streams.
6. (canceled)
7. The method of claim 1, wherein said external data streams are synchronously sampled data from a multiplicity of analog-to-digital converters.
8. The method of claim 1, wherein each said memory-mapped FIFO is accessed by said processor core as an address.
9. The method of claim 8, wherein the said processor core reads said memory-mapped FIFO by reading said address.
10. The method of claim 9, wherein said step of reading said address dequeues a value at the head of a queue.
11. The method of claim 8, wherein said processor core writes to said memory-mapped FIFO by writing said address.
12. The method of claim 11, wherein said step of writing said address enqueues a value at the head of a queue.
13. The method of claim 12, wherein said processor core will stall when writing to a full memory-mapped FIFO.
14. The method of claim 1, wherein when said plurality of external data streams are stored sequentially in said memory-mapped FIFO, said memory-mapped FIFO is addressed by specifying a vector stride of zero.
15. The method of claim 14, wherein said step of specifying a vector stride of zero reads a same address repeatedly to access all data in said memory-mapped FIFO.
16. The method of claim 1, wherein when said plurality of external data streams are to be stored sequentially in said memory-mapped FIFO, a write to the same memory-mapped FIFI address will fill said memory-mapped FIFO when a vector stride of zero is specified.
16. The method of claim 1, wherein said processor core will read and dequeue data at the head of a non-empty said memory-mapped FIFO and will block said read operation when said memory-mapped FIFO is empty.
17. The method of claim 1, wherein said processor core will read data at the head of a non-empty said memory-mapped FIFO and will block said read operation when said memory-mapped FIFO is empty.
18. The method of claim 1, wherein said processor core will read and dequeue data at the head of a non-empty said memory-mapped FIFO and will return an invalid value when said memory-mapped FIFO is empty.
19. The method of claim 1, wherein said processor core will read data at the head of a non-empty said memory-mapped FIFO and return an invalid value when said memory-mapped FIFO is empty.
20. The method of claim 18, wherein said processor core remains un-stalled when said invalid value is returned.
21. The method of claim 19, wherein said processor core remains un-stalled when said invalid value is returned.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
DETAILED DESCRIPTION OF THE INVENTION
[0042] The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Those of ordinary skill in the art realize that the following descriptions of the embodiments of the present invention are illustrative and are not intended to be limiting in any way. Other embodiments of the present invention will readily suggest themselves to such skilled persons having the benefit of this disclosure. Like numbers refer to like elements throughout.
[0043] In this detailed description of the present invention, a person skilled in the art should note that directional terms, such as “above,” “below,” “upper,” “lower,” and other like terms are used for the convenience of the reader in reference to the drawings. Also, a person skilled in the art should notice this description may contain other terminology to convey position, orientation, and direction without departing from the principles of the present invention.
[0044] Furthermore, in this detailed description, a person skilled in the art should note that quantitative qualifying terms such as “generally,” “substantially,” “mostly,” and other terms are used, in general, to mean that the referred to object, characteristic, or quality constitutes a majority of the subject of the reference. The meaning of any of these terms is dependent upon the context within which it is used, and the meaning may be expressly modified.
[0045] Referring to
[0046] Certain embodiments of the invention, as shown and described by the various figures and accompanying text, may overcome the problems in the art described above by delivering the following advantages, as described in more detail herein below:
[0047] 1) Consumes/generates/moves data efficiently at high data rates characteristic of network switches and converters (e.g., analog-to-digital, digital-to-analog)
[0048] 2) Efficiently operates available buffers to mitigate latency penalties for memory access and avoids undesirable results of clogging of the queued data pipeline (e.g., dropped data, corrupted data streams)
[0049] Referring now to
[0050] A) Hardware FIFO(s) 100 that efficiently enqueues data on behalf of a processor core 110.
[0051] B) Controllable FIFO(s) 200 capable of processing data while the processor core performs other activities.
[0052] C) Hardware FIFO bank(s) 300 comprising at least one of at least one hardware memory-mapped FIFO(s) 100 and at least one of controllable hardware memory-mapped FIFO(s) 200.
[0053] D) Hardware FIFO(s) 100 backed by expandable buffer space in at least one of cache memory 440, higher-level memory 450, and additional buffer memory 510.
[0054] E) Processor core(s) 110 operably coupled with at least one of hardware FIFOs (100) and FIFO banks(s) 300 and cache memory 440 via shared busses 131 and 150 and shared control signals 114.
[0055] F) Processor core(s) 110 operably coupled with at least one of hardware FIFOs 100 and FIFO banks(s) 300, cache memory 440, and higher-level memory 450 through arbiters that mediate resource contention.
[0056] G) Automating data processing by a controllable FIFO based upon processor configuration. Illustration of Direct Memory Access (DMA) setup and operation as a specific example data processing comprising data movement using FIFO control structures (e.g.,
[0057] E) Solution space-specific configurations of a DMA-based microprocessor computer system, including split FIFOs (
[0058] F) Auxiliary processor(s) for data-dependent pre- and post-processing, including addition of a modulus to the non-unit-stride output of a hardware FIFO(s) and/or a memory-mapped FIFO(s).
[0059] Memory-Mapped First-In First-Out (FIFO) Queue
[0060] Referring more specifically to
[0061] As described above, because a hardware FIFO bank 100 is characterized by a single head 120 and tail 140, such a FIFO 100 may be mapped to a single physical address of the processor core 110 or a virtual address of a process running on said processor core. One embodiment of the present invention may employ one address for the head 120 of the FIFO 100 and another address for the tail 140. An alternative embodiment of the present invention may use a single address for both head 120 and tail 140, but may recognize the logical distinction that a write to that address is enqueuing data at the tail 140 and a read is dequeuing data from the head 120 of the queue. Mapping a FIFO 100 into the address space of the processor core 110, as opposed to a dedicated internal register, is an advantage of the present invention.
[0062] The FIFO buffer 130 provides storage for data enqueued at the FIFO tail that has not yet arrived at the FIFO head and may comprise any of several storage selected from the group consisting of a static random-access memory (SRAM), a flip-flop, a register file, and a latch. In one embodiment of the present invention, the depth of the FIFO buffer 130 is fixed.
[0063] When writing (enqueuing) on the memory-mapped FIFO 100, the processor core 110 may simply write a value to the FIFO tail memory-mapped address. For example, and without limitation, this WRITE action may enqueue the value on the memory-mapped FIFO. When the memory-mapped FIFO is full, one embodiment of the present invention may be configured to treat the write as a cache miss and operate to stall the processor core 110 or the requesting process or thread. However, an alternative embodiment may be to operate to allow a write to fail.
[0064] A hardware FIFO 100 may be mapped to a memory location in the physical address space of the processor core 110. Additionally, said hardware FIFO 100 may be mapped into the virtual address space of one or more processes executing on the processor core.
[0065] In one embodiment of the present invention, each hardware FIFO 100 may present either its head 120 or its tail 140 to the processor core 110 if its tail is written to or its head is read from, respectively, at least one of the Network 190, Raw I/O 180, other storage, and other FIFOs 100. For example, and incoming Raw data input from an external analog-to-digital converter may asynchronously strobe data into the FIFO tail and this may be the only means of enqueuing data on the FIFO, whereas the processor may read (dequeue) data from the head of the processor using a typical LOAD operation. Alternatively, or in addition, the present invention may be configured such that both ends 120, 140 of the FIFO 100 may be exposed to the processor core 110 for tasks such as inter-thread or inter-process communication.
[0066] How the processor core 110 may interact with a memory-mapped FIFO 100 will now be described in detail. In one embodiment, the present invention may include architecturally treating the FIFO similar to a cache memory as commonly understood to those skilled in the art. Such architectural mapping may advantageously support use of the existing cache control signals 114 to handle exceptions such as an empty or full FIFO. For example, and without limitation, when implementing a READ FIFO using an embodiment of the present invention, if the processor core 110 is to consume a value from a memory-mapped hardware FIFO 100, it may simply read (fetch) a data structure from the memory exactly as it would from a cache memory. This action may have the effect of dequeuing the value at the head of the queue and returning it to the processor core 110. If the FIFO is empty, the processor core 110 may observe a condition similar to a “cache miss” and may respond like it would to any cache miss—generally, by stalling the processor core 110 or the requesting process or thread executing on the processor core 110 until the data is available. The essential semantics of a cache miss is that the data is not yet ready, and that is the case for an empty FIFO as well. The key difference is the cause of unavailability. In the case of a traditional cache miss, the data must be retrieved from higher in the memory hierarchy. In the case of a FIFO, the processor core 110 is waiting for another value to be enqueued. In the specific case of SPv2, the processor core stalls on a cache miss. Because such blocking behavior may be undesirable, the present invention may include alternative cache miss handling features (as described in detail below).
[0067] Variants of this typical READ behavior may have valuable use cases that require enhancements to the processor core 110 beyond typical cache interfaces such as, for example, and without limitation, a non-dequeuing read operation (often referred to as a PEEK operation) that does not modify the state of the FIFO. Another interface enhancement, also for example, and without limitation, is a non-blocking read operation that signals invalid data returned due to an empty FIFO, but that allows processing to continue. Building upon this idea is the ability to suspend a thread that is blocked on a FIFO rather than stalling the processor core 110 completely.
[0068] Referring more specifically to
[0069] Referring more specifically to
[0070] Referring more specifically to
[0071] In an embodiment of the present invention that does not require a FIFO arbiter 420, the processor core 110 may use the same busses 411 and 450 and control signals 414 to interact with the FIFO bank(s) 300 as it does for the cache memory 440.
[0072] Referring more specifically to
[0073] In one embodiment of the present invention comprising a controlled hardware FIFO and at least one of additional buffer storage 510, cache memory 440 and higher-level memory 450, data may be transferred from the FIFO buffer 130 to said storage as a single block more efficiently than with many individual transfer operations. Similarly, blocks may be transferred from said storage into a hardware FIFO with greater efficiency than individual transfer operations.
[0074] Referring now to
[0075] Referring to
[0076] Continuing to refer to
[0077] Referring to
[0078] Continuing to refer to
[0079] Referring to
[0080] The potentially wide value word may require multiple accesses to process if it exceeds the width of the data bus 112. To facilitate this processing, both dequeuing and non-dequeuing read operations may be employed: one that “consumes” the value (and thus results in the next value in line coming to the queue head) and one that only “peeks” at some portion of the value (up to and including the full value). Employing said combinations of operations, an exemplar the processor core 110 may “walk” down the 512-bit wide data word 64-bits at a time performing non-dequeuing reads until it reaches the last 64-bit word. It then performs a dequeuing read and consumes the data at the head of the queue, which will have the effect of moving the next data value, if available, to the head of the FIFO. Similarly for a write operation, an exemplar the processor core 110 with a 64-bit data bus first performs a non-enqueuing write that writes the first 64 bits of data into the FIFO tail. As a non-enqueuing write, the written data cannot move towards the head of the FIFO. Subsequently, the processor core 110 may “walk” down the 512-bit wide data word 64-bits at a time performing non-enqueuing writes until it reaches the last value to be written. It then performs an enqueuing write that will release the 512-bit word to move towards the head of the FIFO. That word will now be unaffected by subsequent write operations on the hardware FIFO.
[0081] Applying the memory-mapped FIFO architectural constructs defined above, various embodiments of microprocessor computing systems employing those constructs will now be discussed.
[0082] Implementing Vector and Matrix Instructions
[0083] As a matter of definition, a vector processor (or array processor) is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors and multi-dimensional arrays of data called matrices. A “stride” of an array (also referred to as increment, pitch or step size) may be defined as the number of locations in memory between beginnings of successive array elements. The SPv2 is a vector processor that has many vector-oriented instructions, such as vector dot product. Because vectors may be unit-stride (e.g., 8 bytes for a single-precision complex values) or non-unit-stride, the vector instructions have a parameter defining the stride (in bytes) between adjacent values. However, if a vector operation is operating upon data stored sequentially in a memory-mapped FIFO that has a single head address, addressing to a FIFO-oriented architecture may be accomplished by simply specifying a vector of stride zero. The result will be that the same address may be read over and over again (thus draining the memory-mapped FIFO) but each access will retrieve a subsequent value from the vector. Similarly, vector writes may fill a memory-mapped FIFO.
[0084] Similarly, matrices may be processed using an array of FIFOs, each holding a row or column of a matrix. To perform a matrix-vector product such as beamforming over a stream of incoming data, the processor core 110 may store the weights in internal registers and accumulate a dot product of these weights and each row of the matrix by dequeuing a value from the head of each FIFO in turn and adding the product of that value with the corresponding weight to the accumulated sum. This extension to this concept of vector instruction processing is particularly valuable when processing data in lock step from several input sources, such as a bank of analog-to-digital (A/D) converters from a radio frequency phased array presenting data as raw input data 180. Rather than relying on complex software synchronization and moving data blocks to memory before processing, a processor core 110 may be configured to read from a plurality of hardware FIFOs 100 in turn. Doing so may guarantee that all data that starts in synchronization remains in synchronization. If, perhaps due to network congestion, one input falls slightly behind, resulting in an empty FIFO, the processor core 110 may simply pause (block) until the data arrives and then may continue. If the addresses for the relevant memory-mapped FIFOs have a fixed stride, such a configuration may make it even easier to use existing fixed-stride vector instructions. Similar use cases may apply to WRITE FIFOs. Continuing the example above, if multiple beams are being formed by computing dot products on the input samples from the A/D converters, these synchronized beams may be sent out on WRITE FIFOs for subsequent processing either by the processor core 110 or by another servicing component. For operations that require a FIFO value to be used several times before moving to the next value, the processor core 110 may perform non-dequeuing reads until it is time to move to the next value at which point a dequeuing read is performed.
[0085] Note that because FIFOs implemented as described herein are memory-mapped, the processor core 110 may treat them just like any other memory location using efficient primitives like LOAD and STORE operations. Thus, memory-mapped FIFOs advantageously may be operable with existing compilers, using keywords like “volatile” in the C programming language, and assemblers without need for modification. Expensive operations (such as invoking the operating system and interrupt handling) may not generally be required to interact with the memory-mapped FIFOs.
[0086] Moving Data Efficiently to/from Memory
[0087] Referring to
[0088] Referring to
[0089] Chaining FIFOs and FIFO-Like Stream Processors
[0090] Continuing to refer to
[0091] In certain embodiments of the present invention, elements being chained need not actually be FIFOs, but instead need only implement the FIFO interface. For example, and without limitation, non-FIFO functional blocks that may implement a FIFO interface (e.g., enqueue and dequeue) may include integer-to-floating point converters (and reverse), TCP/IP offload engines, encryptors/decryptors, encoders/decoders, and checksum generators/checkers. Without programming, said functional blocks may stream process their inputs (sources) into outputs streamed to their outputs (sinks). In addition to forming a means to readily insert potentially significant hardware accelerators into an architecture, this embodiment of the present invention may allow such processing to be strung together; as long as the functional blocks implement FIFO interfaces, such elements may be chained by “forwarding” the head of one FIFO into the tail of another. By repeating this process, a chain of arbitrary length may be constructed. Unlike known techniques for chaining FIFOs together, chaining of memory-mapped FIFOs may implement data manipulations and may be addressed using the same techniques as transfers to memory.
[0092] Splitting FIFO Streams
[0093] Referring to
[0094] For example, and without limitation, splitting may be accomplished using a source FIFO head 120 and a plurality of consumers (FIFO tails 140), wherein the FIFO head controller forwards the data word at the head to each of the plurality of tail FIFOs before the word is dequeued. Also for example, and without limitation, splitting may be accomplished within a processing step whereby the processor core 110 (see
[0095] Merging FIFO Streams
[0096] Referring to
[0097] Another form of merging is at the “Frame” level. In this case, the processor may alternate in arbitrary order among a plurality of incoming FIFOs and may forward a plurality of data words from that FIFO before selecting the next FIFO. An example of frame-level merging is forwarding from multiple incoming network queues into a single TCP/IP offload engine. The data packets may be transferred in their entirety as a block/frame without insertion of other values.
[0098] FIFO Status
[0099] Referring now to
[0100] Accommodating Differing Datum Sizes
[0101] Referring again to
[0102] For example, and without limitation, one implementation may include a plurality of data elements concatenated on a single cache line. As data is read, according to the word size of the fetch instruction, the data may be shifted and masked as necessary to align the data by either the processor core 110 or the FIFO controller 250. Another embodiment may define a composite record (up to the size of the cache line) that may contain a plurality of sub-records. This processor core 110 may read the record and may deconstruct the sub-records in software or, alternatively, may parallel-load the sub-records into as many registers as appropriate. The semantics enforced may dictate that the record is not available until the entire set of bytes comprising the record is available. Such an implementation may require that the sender and receivers agree on the datum length apriori. One implementation option may be that data is framed and that the frame header indicates the datum length for the data within it. Similar mechanisms may be used on both enqueuing and dequeuing.
[0103] Cache Line “Pinning”
[0104] Referring to
[0105] As described above, various embodiments of the present invention exemplify how a small state machine may convert a simple FIFO mechanism into a powerful tool to map data into memory and/or forward data across a distributed computing architecture. In addition, if an FPGA or small microprocessor or microcontroller is added to the FIFO control design, additional functions such as data-dependent processing and routing and format translation may be performed on the fly that may be valuable for various applications. For example, and without limitation, converting 16-bit integer data into IEEE single-precision floating point representation, and dropping or capping outlier data, multi-step parallel sorting algorithms may be implemented where FIFOs represent bins in a multi-step sort, and moving averages may be computed.
[0106] Referring again to
[0107] Because of the ease of operating the FIFO interfaces using simple processor instructions (e.g., LOAD and STORE), much of the complexity commonly associated with messaging may be avoided, thus facilitating very low latency processing that may be advantageous to time-sensitive applications such as financial trading. For example, and without limitation, a processor core with a writable micro-store may be able to process entire messages with a single compound instruction. To continue the example of financial decisions, complex event processing may execute without interruption and without fetching instructions from memory.
[0108] While one advantageous use of the invention may be for handling high data rate computation and microprocessor computer systems, the FIFO abstraction and the mapping into simple processor instructions may allow low-power processors to more efficiently handle incoming and out-going data, which may be particularly appropriate and advantageous for internet-of-things (IoT) devices.
[0109] Multiprocessor Systems
[0110] Referring to
[0111] Some of the illustrative aspects of the present invention may be advantageous in solving the problems herein described and other problems not discussed which are discoverable by a skilled artisan.
[0112] While the above description contains much specificity, these should not be construed as limitations on the scope of any embodiment, but as exemplifications of the presented embodiments thereof. Many other ramifications and variations are possible within the teachings of the various embodiments. While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best or only mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Also, in the drawings and the description, there have been disclosed exemplary embodiments of the invention and, although specific terms may have been employed, they are unless otherwise stated used in a generic and descriptive sense only and not for purposes of limitation, the scope of the invention therefore not being so limited. Moreover, the use of the terms first, second, etc. do not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another. Furthermore, the use of the terms a, an, etc. do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item.
[0113] Thus the scope of the invention should be determined by the appended claims and their legal equivalents, and not by the examples given.