MICROPROCESSOR ARCHITECTURE HAVING ALTERNATIVE MEMORY ACCESS PATHS

Abstract

The present invention is directed to a system and method which employ two memory access paths: 1) a cache-access path in which block data is fetched from main memory for loading to a cache, and 2) a direct-access path in which individually-addressed data is fetched from main memory. The system may comprise one or more processor cores that utilize the cache-access path for accessing data. The system may further comprise at least one heterogeneous functional unit that is operable to utilize the direct-access path for accessing data. In certain embodiments, the one or more processor cores, cache, and the at least one heterogeneous functional unit may be included on a common semiconductor die (e.g., as part of an integrated circuit). Embodiments of the present invention enable improved system performance by selectively employing the cache-access path for certain instructions while selectively employing the direct-access path for other instructions.

Claims

1. A method comprising: fetching, by at least one processor, at least one instruction of a first instruction set from a cache memory for execution by the at least one processor, wherein the at least one instruction of the first instruction set is loaded to the cache memory in a fixed-size data block fetched from main memory via a block oriented cache-access path that provides fixed-size data block access to the main memory; and offloading, by the at least one processor, at least one instruction of a second instruction set for execution by at least one heterogeneous functional unit, wherein the at least one instruction of the second instruction set is fetched directly from the main memory to the at least one heterogeneous functional unit via an address oriented cache-bypass path that provides individually-addressed data access to the main memory.

2. The method of claim 1, wherein the at least one processor is configured to execute instructions of the first instruction set, and the at least one heterogeneous functional unit is configured to execute instructions of the second instruction set that is different from the first instruction set.

3. The method of claim 1, wherein fetching the at least one instruction of the first instruction set from the cache memory for execution by the at least one processor includes: determining whether the at least one instruction is present in cache memory; fetching the fixed-size data block including the at least one instruction of the first instruction set from the main memory to the cache memory via the block oriented cache-access path that provides fixed-size data block access to the main memory; and loading the at least one instruction of the first instruction set from the cache memory to the at least one processor.

4. The method of claim 1, wherein the fixed-size data block loaded to the cache memory from the main memory includes instructions in addition to the at least one instruction of the first instruction set, and wherein the at least one instruction of the second instruction set is fetched directly from the main memory to the at least one heterogeneous functional unit is fetched individually without other instructions.

5. The method of claim 1, further comprising determining whether an instruction to be executed is an instruction of the first instruction set or an instruction of the second instruction set.

6. The method of claim 5, further comprising: fetching the at least one instruction of the first instruction set from the cache memory for execution by the at least one processor in response to determining that the instruction to be executed is an instruction of the first instruction set; and offloading the at least one instruction of the second instruction set for execution by at least one heterogeneous functional unit in response to determining that the instruction to be executed is an instruction of the second instruction set.

7. The method of claim 1, wherein the main memory comprises scatter/gather memory configured for individually-addressed data access, and is configured to emulate block data access for fetching the fixed-size data block fetched from the main memory to the cache memory via the block oriented cache-access path.

8. A system comprising: at least one processor configured to fetch at least one instruction of a first instruction set for execution by the at least one processor from a main memory via a block oriented cache-access path, wherein the block oriented cache-access path is configured to provide fixed-size data block access to the main memory; and at least one heterogeneous functional unit configured to fetch at least one instruction of a second instruction set from the main memory via an address oriented cache-bypass path, wherein the address oriented cache-bypass path provides individually-addressed data access to the main memory, and wherein the at least one instruction of the first instruction set is offloaded from the at least one processor to the at least one heterogeneous functional unit for execution by the at least one heterogeneous functional unit.

9. The system of claim 8, wherein the at least one processor is configured to execute instructions of the first instruction set, and the at least one heterogeneous functional unit is configured to execute instructions of the second instruction set that is different from the first instruction set.

10. The system of claim 8, wherein the block oriented cache-access path couples the main memory to a cache memory, and wherein the configuration of the at least one processor to fetch the at least one instruction of the first instruction set via the block oriented cache-access path includes configuration of the at least one processor to: cause at least one fixed-size data block to be fetched from the main memory and to be loaded to the cache memory when the at least one instruction of the first instruction set is absent from the cache memory, the at least one fixed-size data block including at least one instruction of the first instruction set; and fetch the at least one instruction of the first instruction set from the cache memory.

11. The system of claim 8, wherein the fixed-size data block access to the main memory returns data in addition to data referenced by a cache memory access by the at least one processor, and wherein the individually-addressed data access returns only data referenced by a physical address access by the at least one heterogeneous functional unit.

12. The system of claim 8, wherein the at least one processor is further configured to determine whether an instruction to be executed is an instruction of the first instruction set or an instruction of the second instruction set.

13. The system of claim 12, wherein the at least one processor is further configured to: fetch the at least one instruction of the first instruction set from a cache memory for execution by the at least one processor in response to a determination that the instruction to be executed is an instruction of the first instruction set; and offloading the at least one instruction of the second instruction set for execution by at least one heterogeneous functional unit in response to determining that the instruction to be executed is an instruction of the second instruction set.

14. The system of claim 8, wherein the main memory comprises scatter/gather memory configured for individually-addressed data access, and is configured to emulate block data access for fetching a fixed-size data block from the main memory via the block oriented cache-access path.

15. The system of claim 8, further comprising: a cache interrogation path coupling a cache memory to the at least one heterogeneous functional unit, wherein the cache interrogation path is configured to provide information regarding encached data to the at least one heterogeneous functional unit in response to an interrogation by the at least one heterogeneous functional unit regarding referenced data to be accessed by the at least one heterogeneous functional unit, the referenced data including the at least one instruction of the second instruction set.

16. The system of claim 15, wherein the cache interrogation path is configured to initiate loading a fixed-size cache block containing the referenced data to the main memory for individually-addressed data access of the referenced data from the main memory by the at least one heterogeneous functional unit using the address oriented cache-bypass path.

17. The system of claim 15, wherein the cache interrogation path is configured to invalidate the referenced data in the cache memory in association with individually-addressed data access of the referenced data from the main memory by the at least one heterogeneous functional unit using the address oriented cache-bypass path.

18. A method comprising: accessing, by at least one processor, a first portion of data from a main memory via a block oriented cache-access path, wherein the block oriented cache-access path is configured to provide fixed-size data block access to the main memory; and accessing, by at least one heterogeneous functional unit, a second portion of data from the main memory via an address oriented cache-bypass path, wherein the address oriented cache-bypass path provides individually-addressed data access to the main memory, and wherein the second portion of data includes at least one instruction offloaded from the at least one processor to the at least one heterogeneous functional unit for execution by the at least one heterogeneous functional unit.

19. The method of claim 18, wherein accessing the first portion of data from the main memory via the block oriented cache-access path includes: determining whether the first portion of data is present in cache memory; fetching a fixed-size data block including the first portion of data from the main memory to the cache memory via the block oriented cache-access path; and loading the first portion of data from the cache memory to the at least one processor.

20. The method of claim 18, wherein accessing the second portion of data from the main memory via the address oriented cache-bypass path includes: offloading, by the at least one processor, execution of the at least one instruction included in the second portion of data to the at least one heterogeneous functional unit; and fetching the at least one instruction directly from the main memory to the at least one heterogeneous functional unit via the address oriented cache-bypass path by referencing an individual address of the at least one instruction in the main memory to individually-addressed data access the address of the at least one instruction.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0043] For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:

[0044] FIG. 1 shows a block diagram of an exemplary system architecture of the prior art;

[0045] FIG. 2 shows a block diagram of an exemplary system architecture of an embodiment of the present invention; and

[0046] FIGS. 3A-3B show an exemplary operational flow diagram according to one embodiment of the present invention.

DETAILED DESCRIPTION

[0047] FIG. 2 shows a block diagram of a system 200 according to one embodiment of the present invention. System 200 comprises two subsystems: 1) main memory (physical memory) subsystem 201 and processor subsystem (semiconductor die) 208. The combination of subsystems 201 and 208 permit programs to be executed, i.e. instructions are executed in processor subsystem 208 to process data stored in main memory subsystem 201. As described further herein, processor subsystem 208 comprises one or more processor cores (two processor cores, 202A and 202B, in the illustrated example), cache 203, and a heterogeneous functional unit 204. In the illustrated example, all elements of processor subsystem 208 are implemented on a common die.

[0048] System 200 employs two memory access paths: 1) a cache-access path in which block data is stored/loaded to/from main memory 201 to/from cache 203, and 2) a direct-access path in which individually-addressed data is stored/loaded to/from main memory 201 (e.g., along path 206 in system 200). For instance, system 200 employs a cache-access path in which block data may be stored to main memory 201 and in which block data may be loaded from main memory 201 to cache 203. Additionally, system 200 employs a direct-access path in which individually-addressed data, rather than a fixed-size block of data, may be stored to main memory 201 and in which individually-addressed data may be loaded from main memory 201 (e.g., along path 206 in system 200) to a processor register (e.g., of heterogeneous functional unit 204).

[0049] System 200 comprises two processor cores, 202A and 202B, that utilize the cache-access path for accessing data from main memory 201. System 200 further comprises at least one heterogeneous functional unit 204 that is operable to utilize the direct-access path for accessing data from main memory 201. As described further herein, embodiments of the present invention enable improved system performance by selectively employing the cache-access path for certain instructions (e.g., selectively having the processor core(s) 202A/202B process certain instructions) while selectively employing the direct-access path for other instructions (e.g., by offloading those other instructions to the heterogeneous functional unit 204).

[0050] Embodiments of the present invention provide a system in which two memory access paths are employed for accessing data by two or more processing nodes. A first memory access path (which may be referred to herein as a “cache-access path” or a “block-oriented access path”) is a path in which a block of data is fetched from main memory 201 to cache 203. This cache-access path is similar to the traditional memory access described above with FIG. 1, whereby the processor core decodes an instruction and determines a physical address 210 of desired data. If the desired data (i.e., at the referenced physical address) is present in cache 203 it is accessed from cache 203, and if the desired data is not present in cache 203, the physical address is used to fetch (via path 207) the data from main memory 201, which is loaded into cache 203. Such fetching from main memory 201 may load not only the desired data residing at the referenced physical address into cache 203, but may also load some fixed block of data, commonly referred to as a “cache block” as discussed above (e.g., a 64-byte cache block such as that discussed with Table 1). A second memory access path (which may be referred to herein as a “direct-access path”, “cache-bypass path”, or “address-oriented access”) enables cache 203 to be bypassed to retrieve data directly from main memory 201. In such a direct access, data of an individual physical address that is requested may be retrieved from main memory 201, rather than retrieving a fixed-size block of data that encompasses more than what is desired.

[0051] According to certain embodiments of the present invention the main memory is implemented as non-sequential access main memory that supports random address accesses as opposed to block accesses. That is, upon requesting a given physical address, the main memory may return a corresponding operand (data) that is stored to the given physical address, rather than returning a fixed block of data residing at physical addresses. In other words, rather than returning a fixed block of data (e.g., a 64-byte block of data) independent of the requested physical address, the main memory is implemented such that it is dependent on the requested physical address requested (i.e., is capable of returning only the individual data residing at the requested physical address).

[0052] According to certain embodiments, processor cores 202A and 202B are operable to access data in a manner similar to that of traditional processor architectures (e.g., that described above with FIG. 1). That is, processor cores 202A and 202B are operable to access data via the cache-access path, in which a fixed-size block of data is fetched from main memory 201 for loading into cache 203, such as described above with exemplary Table 1. In addition, in certain embodiments, processor cores 202A and 202B are operable to off-load (e.g., via control line 209) certain instructions for processing by heterogeneous functional unit 204, which is operable to access data via the direct-access path 206.

[0053] When being accessed directly (via the “direct-access path” 206), main memory 201 returns the data residing at a given requested physical address, rather than returning a fixed-size block of data that is independent (in size) of the requested physical address. Thus, rather than a block-oriented access, an address-oriented access may be performed in which only the data for the requested physical address is retrieved. Further, when being accessed via the cache-access path, main memory 201 is capable of returning a cache block of data. For instance, the non-sequential access main memory 201 can be used to emulate a block reference when desired for loading a cache block of data to cache 203, but also supports individual random address accesses without requiring a block load (e.g., when being accessed via the direct-access path 206). Thus, the same non-sequential access main memory 201 is utilized (with the same physical memory addresses) for both the cache-access path (e.g., utilized for data accesses by processor cores 202A and 202B in this example) and the direct-access path (e.g., utilized for data access by heterogeneous functional unit 204). According to one embodiment, non-sequential access main memory 201 is implemented by scatter/gather DIMMs (dual in-line memory modules) 21.

[0054] Thus, main memory subsystem 201 supports non-sequential memory references. According to one embodiment, main memory subsystem 201 has the following characteristics:

[0055] 1) Each memory location is individually addressed. There is no built-in notion of a cache block.

[0056] 2) The entire physical memory is highly interleaved. Interleaving means that each operand resides in its individually controlled memory location.

[0057] 3) Thus, full memory bandwidth is achieved for a non-sequentially referenced address pattern. For instance, in the above example of the DO loop that accesses every fourth memory address, the full memory bandwidth is achieved for the address reference pattern: Address1, Address5, Address9, and Address13.

[0058] 4) If the memory reference is derived from a micro-core, then the memory reference pattern is sequential, e.g., physical address reference pattern: Address1, Address2, Address3, Address8 (assuming a cache block of 8 operands or 8 words).

[0059] 5) Thus, the memory system can support full bandwidth random physical addresses and can also support full bandwidth sequential addresses.

[0060] Given a memory system 201 as described above, a mechanism is further provided in certain embodiments to determine whether a memory reference is directed to the cache 203, or directly to main memory 201. In a preferred embodiment of the present invention, a heterogeneous functional unit 204 provides such a mechanism.

[0061] FIGS. 3A-3B show an exemplary operational flow diagram for processing instructions of a program being executed by processor subsystem 208 (of FIG. 2) according to one embodiment of the present invention. According to embodiments, an executable file is provided that includes a first portion of instructions to be processed by a first instruction set (such as a first instruction set of a first processor in a multi-processor system) and a second portion of instructions to be processed by a second instruction set (such as a second instruction set of a second processor in a multi-processor system). In the example of FIGS. 3A-3B, operation of system 200 works as follows: a processor core 202A/202B fetches referenced an instruction (e.g., referenced by a program counter (PC)) of the program being executed in operational block 31. In block 32, the processor core 202A/202B decodes the instruction and determines a physical address 210 at which the desired data resides. In block 33, the processor core determines whether the instruction is to be executed in its entirety by the processor core 202A/202B or whether it is to be executed by heterogeneous functional unit 204. According to one embodiment, as part of the definition of the instruction (i.e., the instruction set architecture), it is a priori determined if the instruction is executed by processor core 202A/202B or heterogeneous functional unit 204. If determined in block 33 that the instruction is to be executed by the processor core 202A/202B, operation advances to block 34 where the processor core 202A/202B accesses data (by referencing its physical address 210) for processing the instruction via a cache-access path. If, on the other hand, it is determined in block 33 that the instruction is to be executed by the heterogeneous functional unit 204, operation advances to block 35 (of FIG. 3B) where the processor core 202A/202B communicates the instruction to the heterogeneous functional unit 204, and then in block 36 the heterogeneous functions unit accesses data for processing the instruction via the direct-access path. Exemplary operations that may be performed in each of the cache-access path 34 and the direct-access path 36 in certain embodiments are described further below.

[0062] In certain embodiments, the determination in block 33 may be made based, at least in part, on the instruction that is fetched. For instance, in certain embodiments, the heterogeneous functional unit 204 contains some operational instructions (in its instruction set) that are part of the native instruction set of the processor cores 202A/202B. For instance, in certain embodiments, the x86 (or other) instruction set may be modified to include certain instructions that are common to both the processor core(s) and the heterogeneous functional unit. For instance, certain operational instructions may be included in the native instruction set of the processor core(s) for off-loading instructions to the heterogeneous functional unit.

[0063] For example, in one embodiment, the instructions of an application being executed are decoded by the processor core(s) 202A/202B, wherein the processor core may fetch (in operational block 31) a native instruction (e.g., X86 instruction) that is called, as an example, “Heterogeneous Instruction 1”. The decode logic of the processor core decodes the instruction in block 32 and determines in block 33 that this is an instruction to be off-loaded to the heterogeneous functional unit 204, and thus in response to decoding the Heterogeneous Instruction 1, the processor core initiates a control sequence (via control line 209) to the heterogeneous functional unit 204 to communicate (in operational block 35) the instruction to the heterogeneous functional unit 204 for processing.

[0064] In one embodiment, the cache-path access 34 includes the processor core 202A/202B querying, in block 301, the cache 203 for the physical address to determine if the referenced data (e.g., operand) is encached. In block 302, the processor core 202A/202B determines whether the referenced data is encached in cache 203. If it is encached, then operation advances to block 304 where the processor core 202A/202B retrieves the referenced data from cache 203. If determined in block 302 that the referenced data is not encached, operation advances to block 303 where a cache block fetch from main memory 201 is performed to load a fixed-size block of data, including the referenced data, into cache 203, and then operation advances to block 304 where the processor core retrieves the fetched data from cache 203.

[0065] In one embodiment, the direct-access path 36 (of FIG. 3B) includes the heterogeneous functional unit 204 interrogating (via path 205 of FIG. 2), in block 305, cache 203 to determine whether the referenced data has been previously encached. For instance, all memory references by heterogeneous functional unit 204 may use address path (bus) 205 of FIG. 2 to reference physical main memory 201. Data is loaded or stored via bus 206 of FIG. 2. Control path 209 of FIG. 2 is used to initiate control and pass data from processor core 202A/202B to heterogeneous functional unit 204.

[0066] In block 306, heterogeneous functional unit 204 determines whether the referenced data has been previously encached in cache 203. If it has not, operation advances to block 310 where the heterogeneous functional unit 204 retrieves the referenced data of the individually-referenced physical address (e.g., physical address 210 and 207 of FIG. 2) from main memory 201. That is, the referenced data is received, via path 206, from the individual-referenced physical address of main memory, rather than receiving a fixed-size block of data (such as a cache block), as is returned from main memory 201 in the cache-path access 34.

[0067] If determined in block 306 that the referenced data has been previously cached, then in certain embodiments different actions may be performed depending on the type of caching employed in the system. For instance, in block 307, a determination may be made as to whether the cache is a write-back caching technique or a write-through caching technique, each of which are well-known caching techniques in the art and are thus not described further herein. If a write-back caching technique is employed, then the heterogeneous functional unit 204 writes the cache block of cache 203 that contains the referenced data back to main memory 201, in operational block 308. If a write-through caching technique is employed, then the heterogeneous functional unit 204 invalidates the referenced data in cache 203, in operational block 309. In either case, operation then advances to block 310 to retrieve the referenced data of the individually-referenced physical address (e.g., physical address 210 and 207 of FIG. 2) from main memory 201, as discussed above.

[0068] In certain embodiments, if a hit is achieved from the cache in the direct-access path 36 (e.g., as determined in block 306), then the request may be completed from the cache 203, rather than requiring the entire data block to be written back to main memory 201 (as in block 308) and then referencing the single operand from main memory 201 (as in block 310). That is, in certain embodiments, if a hit is achieved for the cache 203, then the memory access request (e.g., store or load) may be satisfied by cache 203 for the heterogeneous functional unit 204, and if a miss occurs for cache 203, then the referenced data of the individually-referenced physical address (e.g., physical address 210 and 207 of FIG. 2) may be accessed in main memory 201, as discussed above (e.g., as in block 310). Thus, certain embodiments permit memory access of cache 203 by heterogeneous functional unit 204 (rather than bypassing the cache 203) when the memory access request can be satisfied by cache 203, but when the memory access request cannot be satisfied by cache 203 (i.e., a miss occurs), then an individually-referenced physical address (rather than a block-oriented access) is made of main memory 201.

[0069] For all traditional microprocessors of the prior art, main memory (e.g., 101 of FIG. 1) is block-oriented. Block-oriented means that even if one 64-bit word is referenced, 8 to 16 words ARE ALWAYS fetched and loaded into the microprocessor's cache (e.g., cache 103 of FIG. 1). As discussed above, this fixed-size block of 8 to 16 words are called the “cache block”. For many applications, only one word of the 8 to 16 words of the cache block that are fetched is used. Consequently, a large amount (e.g., 87%) of the memory bandwidth is wasted (not used). This results in reduced application performance.

[0070] Typical of these types of applications are those that reference memory using a vector of indices. This is called “scatter/gather”. For example, in the following FORTRAN code: [0071] do i=1,n

a(i)=b(i)+c(i) [0072] enddo
all the elements of a, b, and c are sequentially referenced.

[0073] In the following FORTRAN code: [0074] do i=1,n

a(j(i))=b(j(i))+c(j(i)) [0075] enddo
a, b, and c are referenced through an index vector. Thus, the physical main memory system is referenced by non-sequential memory addresses.

[0076] According to certain embodiments, main memory 201 of system 200 comprises a memory dimm that is formed utilizing standard memory DRAMs, that provides full bandwidth memory accesses for non-sequential memory addresses. Thus, if the memory reference pattern is: 1, 20, 33, 55; then only memory words, 1, 20, 33, and 55 are fetched and stored. In fact, they are fetched and stored at the maximum rate permitted by the DRAMs.

[0077] In the above example, with the same memory reference pattern, a block-oriented memory system, with a block size of 8 words, would fetch 4 cache blocks to fetch 4 words: [0078] {1 . . . 8}—for word 1; [0079] {17 . . . 24}—for word 20; [0080] {33 . . . 40}—for word 33; and [0081] {51 . . . 56}—for word 55.

[0082] In the above-described embodiment of system 200 of FIG. 2, since full bandwidth is achieved for non-sequential memory accesses, full memory bandwidth is achieved for sequential accesses. Accordingly, embodiments of the present invention enable full bandwidth for memory accesses to be achieved for both non-sequential and sequential memory accesses.

[0083] Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

MICROPROCESSOR ARCHITECTURE HAVING ALTERNATIVE MEMORY ACCESS PATHS

Inventors

Cpc classification

Classification Explorer

G06F12/1027

PHYSICS

Classification Explorer

G06F12/0877

PHYSICS

Classification Explorer

G06F2212/60

PHYSICS

Classification Explorer

G06F2212/68

PHYSICS

Classification Explorer

G06F12/0888

PHYSICS

Classification Explorer

G06F12/0844

PHYSICS

International classification

Classification Explorer

G06F12/0888

PHYSICS

Classification Explorer

G06F12/0844

PHYSICS

Classification Explorer

G06F12/0877

PHYSICS

Classification Explorer

G06F12/1027

PHYSICS

Abstract

Claims

Description