STLB prefetching for a multi-dimension engine
09852081 · 2017-12-26
Assignee
Inventors
CPC classification
G06F12/1027
PHYSICS
G06F12/1081
PHYSICS
International classification
G06F12/08
PHYSICS
G06F12/1027
PHYSICS
Abstract
A multi-dimension engine, connected to a system TLB, generates sequences of addresses and issues page address translation prefetch requests in advance of predictable accesses to elements within data arrays. Prefetch requests are filtered to avoid redundant requests for translations of the same page. Prefetch requests run ahead of data accesses but are tethered to within a reasonable range, and the number of pending prefetches is limited. A system TLB stores a number of translations, the number being relative to the dimensions of the range of elements accessed from within the data array.
Claims
1. A multi-dimension engine, comprising an interface, wherefrom an address translation prefetch request is sent to a system translation lookaside buffer (STLB), wherein the STLB comprises a number of translations approximately equal to one of two times a height of a fetch access region when a prefetch window width is one page, three times the height of the fetch access region when the prefetch window width is one page, 2+(a prefetch window size/a page size) times the height of the fetch access region, or 1+(the prefetch window width/a maximum page size supported by the multi-dimension engine).
2. The multi-dimension engine of claim 1 comprising an address generator.
3. The multi-dimension engine of claim 1 wherein each page accessed during an access sequence receives no more than one of the address translation prefetch request.
4. The multi-dimension engine of claim 1 wherein the address translation prefetch request is sent only for one address within a page.
5. The multi-dimension engine of claim 4 wherein the one address is aligned on a page boundary.
6. The multi-dimension engine of claim 4 wherein the one address corresponds to a starting boundary of the fetch access region.
7. The multi-dimension engine of claim 1 wherein a data request is subsequently sent to the same page as the address translation prefetch request.
8. The multi-dimension engine of claim 7 wherein the address translation prefetch request is constrained to an address range relative to the data request.
9. The multi-dimension engine of claim 8 wherein the address range is exactly one page.
10. The multi-dimension engine of claim 8 wherein the address range is less than one page.
11. The multi-dimension engine of claim 8 wherein the address range is more than one page.
12. The multi-dimension engine of claim 1 wherein the address translation prefetch request is limited.
13. The multi-dimension engine of claim 12 wherein the limiting is based on a bandwidth.
14. The multi-dimension engine of claim 12 wherein the limiting is based on a number of outstanding prefetch requests.
15. The multi-dimension engine of claim 12 wherein the limiting is based on a maximum latency.
16. A non-transitory computer-readable storage medium arranged to represent Hardware Description Language (HDL) source code, the HDL source code representing a multi-dimension engine, comprising an interface, wherefrom an address translation prefetch request is sent to a system translation lookaside buffer (STLB), wherein the STLB comprises a number of translations approximately equal to one of two times a height of a fetch access region when a prefetch window width is one page, three times the height of the fetch access region when the prefetch window width is one page, 2+(a prefetch window size/a page size) times the height of the fetch access region, or 1+(the prefetch window width/a maximum page size supported by the multi-dimension engine).
17. A system translation lookaside buffer (STLB) comprising a number of translations approximately equal to two times a height of a fetch access region when a prefetch window width is one page.
18. A system translation lookaside buffer (STLB) comprising a number of translations approximately equal to three times a height of a fetch access region when a prefetch window width is one page.
19. A system translation lookaside buffer (STLB) comprising a number of translations approximately equal to 1+(a prefetch window width/a maximum page size supported by a multi-dimension engine).
20. A system translation lookaside buffer (STLB) filtered to fix page entries at a start of an access region until reaching an end of the access region, wherein the STLB comprises a number of translations approximately equal to one of two times a height of a fetch access region when a prefetch window width is one page, three times the height of the fetch access region when the prefetch window width is one page, 2+(a prefetch window size/a page size) times the height of the fetch access region, or 1+(the prefetch window width/a maximum page size supported by the STLB).
21. A non-transitory computer-readable medium arranged to represent Hardware Description Language (HDL) source code, the HDL source code representing a system translation lookaside buffer (STLB), wherein the STLB comprises a number of translations approximately equal to at least one of: two times a height of a fetch access region when a prefetch window width is one page, three times the height of the fetch access region when the prefetch window width is one page, 2+(a prefetch window size/a page size) times the height of the fetch access region, or 1+(the prefetch window width/a maximum page size supported by a multi-dimension engine).
22. A method for accessing a data set comprising issuing a translation prefetch from a multi-dimension engine to a system translation lookaside buffer (STLB), wherein the STLB comprises a number of translations approximately equal to one of two times a height of a fetch access region when a prefetch window width is one page, three times the height of the fetch access region when the prefetch window width is one page, 2+(a prefetch window size/a page size) times the height of the fetch access region, or 1+(the prefetch window width/a maximum page size supported by the multi-dimension engine).
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(9) A multi-dimension engine such as a rotation engine takes a 2D surface and writes it with x-y coordinates reversed.
(10) According to an aspect of the invention, the memory address of each pixel of a surface, based on its coordinates, is given by the following formula:
Addr = BASE + y*WIDTH + x*PIX_SIZE
(11) where:
(12) x and y are the coordinates of the pixel within the surface;
(13) BASE is the base address of the surface;
(14) WIDTH is the distance (in bytes) between the start of a row and the start of the next one; and
(15) PIX_SIZE is the size of a pixel in bytes (typically 2 or 4 bytes).
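As a concrete illustration, the address formula above can be sketched as a small function (a hypothetical helper; parameter names follow the definitions above):

```python
def pixel_addr(x, y, base, width, pix_size):
    """Memory address of the pixel at (x, y): Addr = BASE + y*WIDTH + x*PIX_SIZE."""
    return base + y * width + x * pix_size

# Hypothetical surface: 8 pixels per row, WIDTH = 32 bytes, PIX_SIZE = 4 bytes.
assert pixel_addr(0, 0, base=0, width=32, pix_size=4) == 0
assert pixel_addr(1, 0, base=0, width=32, pix_size=4) == 4   # next pixel in a row
assert pixel_addr(0, 1, base=0, width=32, pix_size=4) == 32  # next row down
```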
(16) According to other aspects of the invention, other formulas describe the arrangement of pixels at memory addresses.
(17) Source surface 110 and destination surface 120 need not have the same parameters (BASE, WIDTH, PIX_SIZE).
(18) A problem for conventional multi-dimension engines is that while one surface can be stepped through in incremental addresses of adjacent data (except potentially at the end of a row), the other surface must be stepped through in addresses with relatively large steps. This is shown in
(19) source surface (0)=>destination surface (0)
(20) source surface (32)=>destination surface (4)
(21) source surface (64)=>destination surface (8)
(22) source surface (96)=>destination surface (12)
(23) source surface (128)=>destination surface (16)
(24) The destination surface is written at incremental addresses of adjacent data (with PIX_SIZE=4 bytes) while the source surface is read with big jumps between pixels.
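The mapping listed above can be reproduced with a short sketch (assuming BASE = 0 for both surfaces, a source WIDTH of 32 bytes, and PIX_SIZE = 4 bytes):

```python
def rotate_column_to_row(height, src_width, pix_size):
    """(source_addr, dest_addr) pairs for one destination row of a rotation.

    Reading down a source column takes large jumps (src_width bytes apart),
    while the destination row is written at adjacent addresses (pix_size apart).
    """
    return [(y * src_width, y * pix_size) for y in range(height)]

pairs = rotate_column_to_row(height=5, src_width=32, pix_size=4)
# -> [(0, 0), (32, 4), (64, 8), (96, 12), (128, 16)], matching the mapping above
```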
(25) Memories, such as dynamic random access memories (DRAMs), where surfaces might be shared between writing and reading agents are not efficient when accessing small units of data. In the example of
(26) This is traditionally solved in 2 steps:
(27) (1) Fetching from the source surface in larger blocks
(28) (2) Adding some intermediate storage to the multi-dimension engine so that the not-yet-needed data from the large block fetch is kept long enough that it is still in the intermediate storage when the multi-dimension engine needs it.
(29) In
(30) DRAMs typically behave near optimally for 64-256 byte bursts, so a rectangular access region might be 16-128 pixels on one side. To reduce buffering, one dimension of the rectangle may be reduced.
(31) Another problem arises when the addresses accessed by the multi-dimension engine are virtual addresses.
(32) In a virtual addressing system, memory is composed of pages (a typical size being 4 KB). The mapping of virtual addresses (VA) to physical addresses (PA) tends to be irregular, so that pixels at adjacent VAs, that cross a page boundary, might be far apart in physically-addressed memory.
(33) Surfaces to be rotated within chips may exceed a WIDTH of 4 KB with PIX_SIZE of 4 B. With virtually addressed page sizes of 4 KB, this means that a single row of pixels in a surface spans more than one page. As a consequence, pixels within a column are not on the same page. Even with WIDTH smaller than the page size, the page locality of pixels in a column can be low enough to cause substantial performance problems due to STLB misses.
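To make the locality problem concrete, a sketch (with hypothetical parameters) counting the distinct 4 KB pages touched by a single column traversal:

```python
PAGE_SIZE = 4096  # typical virtual page size, per the description

def pages_touched_by_column(x, height, base, width, pix_size, page_size=PAGE_SIZE):
    """Distinct virtual pages read when stepping down one column of a surface."""
    return {(base + y * width + x * pix_size) // page_size for y in range(height)}

# When WIDTH equals the page size, every row of the column is on a different page:
assert len(pages_touched_by_column(x=0, height=64, base=0, width=4096, pix_size=4)) == 64
# With a much smaller WIDTH, several rows share each page:
assert len(pages_touched_by_column(x=0, height=64, base=0, width=256, pix_size=4)) == 4
```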
(35) In a virtual memory system, the multi-dimension engine is connected to the memory through a system memory management unit (SMMU). The SMMU takes VAs and converts them to PAs suitable for memory.
(36) According to an aspect of the invention, as shown in
(37) Walker 524 takes from 2 to more than 20 memory accesses to resolve a translation. Two memory accesses are enough for a small VA space; 20 or more are required for large VA spaces, such as those represented with 64 bits, or with “nested paging,” which adds an extra layer of virtualization.
(38) Because of this, the memory access traffic generated by walker 524 during a traversal of surface 410 in the vertical direction far exceeds the traffic to access the pixels themselves and the duration of the stalls due to STLB misses can dramatically decrease throughput. Therefore, it is critical to cache the translations in STLB 522.
(39) An appropriate number of entries to cache in an STLB is the number of pages touched by a vertical traversal of a surface. When the memory used by rows of pixels exceeds VA page sizes, one entry should be cached for each row in the surface.
(40) Sizing the STLB to a number of entries equal to the height of the access region still presents problems:
(41) (A) The flow of rotation reads and writes is interrupted (sometimes for long periods of time) when a row access reaches a new page, causing an STLB miss.
(42) (B) For well-aligned surfaces, such as ones where the WIDTH is an integer number of pages, STLB misses occur back-to-back for all rows each time a row access reaches a new page. This creates a large burst of traffic from the SMMU walker, delaying pixel traffic for a long time.
(43) According to an aspect of the invention, a translation prefetching mechanism is used in conjunction with an STLB to reduce or eliminate delay due to STLB misses. The STLB receives prefetch commands from the multi-dimension engine (or another coordinated agent) to trigger the walker to fetch a translation in anticipation of its near-future use. The walker places the new translation in the STLB so that it is available in advance of, or in a reduced amount of time after, its being requested by the multi-dimension engine.
(45) According to some aspects of the invention, as shown in
(46) According to other aspects of the invention, as shown in
(47) According to another aspect of the invention, the prefetch generator is constrained to stay within a certain range of addresses of the regular stream.
(48) According to another aspect of the invention, the distance is one page, so that for any row being accessed the translation for the next page to be encountered may be prefetched, but not the following one.
(49) According to other aspects of the invention, the distance may be set to less than a page or more than a page depending on the buffering and the latency required to cover the walking time of the prefetch requests.
(50) Referring now to
(51) The raw stream is filtered to send just one prefetch per page. In particular, addresses are filtered out if they are not aligned on a page boundary. Thus, a prefetch of the next page is sent immediately after the last data element of the previous page is accessed, and translations need be available for exactly two pages per row.
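This filter can be sketched as follows (a minimal sketch covering the steady-state case only; function names and addresses are hypothetical):

```python
PAGE_SIZE = 4096

def filter_prefetch_stream(raw_addresses, page_size=PAGE_SIZE):
    """Keep only page-aligned addresses, so that each page in the raw
    prefetch stream produces at most one translation prefetch request."""
    return [addr for addr in raw_addresses if addr % page_size == 0]

# Raw stream running one page ahead of the column accesses:
raw = [16, 4112, 8208, 4096, 8192]
assert filter_prefetch_stream(raw) == [4096, 8192]
```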
(52) At the right edge of surface 800, the data access column wraps to the beginning of the next group of eight rows, starting from left edge 830 of the surface. Upon wrapping, each access will cause a translation miss. According to another aspect of the invention, prefetch requests of addresses corresponding to left edge 830 are sent, despite the fact that some (most) are not perfectly aligned to a page boundary. This corresponds to a start condition for a new access region to be transferred when the data on the starting edge of an access region is not aligned to a page boundary.
(53) According to some aspects of the invention, the prefetch traffic is limited so that it does not overwhelm the walker or the memory system. That is, the multi-dimension engine discards or delays the issuance of prefetch requests based on its state. Limits are possible based on bandwidth of prefetch requests, the current number of outstanding prefetch requests, and maximum latency, among others.
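One of the listed limits, the cap on the number of outstanding prefetch requests, might be sketched as follows (class and method names are hypothetical; the cap value is an arbitrary example):

```python
class PrefetchLimiter:
    """Discard or defer prefetch issues once too many are outstanding."""

    def __init__(self, max_outstanding=4):
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def try_issue(self):
        """Return True if a prefetch may be issued now."""
        if self.outstanding >= self.max_outstanding:
            return False
        self.outstanding += 1
        return True

    def complete(self):
        """Called when a translation walk finishes."""
        self.outstanding -= 1

limiter = PrefetchLimiter(max_outstanding=2)
assert limiter.try_issue() and limiter.try_issue()
assert not limiter.try_issue()   # third request deferred or discarded
limiter.complete()
assert limiter.try_issue()       # capacity freed, may issue again
```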
(54) According to some aspects of the invention, the STLB is sized to a number of translations equal to twice the height of the fetch access region when the prefetch window is limited to one page of width. This is because the whole prefetch window can only contain two pages (current, next) per row.
(55) According to other aspects of the invention, the STLB is sized to a number of translations equal to 1+(prefetch window width/page size) for the largest page size that is supported by the system.
(56) These settings are optimal in steady state (i.e. when the prefetch window is not touching the edges of the surface). However, when the prefetch window is at the starting edge or straddles access regions there is discontinuity in the pages to prefetch as the new access region typically uses totally different pages.
(57) According to some aspects of the invention, the STLB is sized to 3 times fetch height (for a page-wide prefetch window) or 2+(prefetch window size/page size) times the fetch height for other sizes. This allows the prefetch window to cover 2 different access regions with no interruption in the prefetching.
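The sizing rules from the last few paragraphs can be gathered into one helper (a sketch, under the assumption that the prefetch window width divides evenly into pages):

```python
def stlb_entries(fetch_height, prefetch_window, page_size, cover_region_boundary=False):
    """Approximate number of translations the STLB should hold.

    Steady state: 2 pages per row for a page-wide window (current + next),
    or 1 + window/page pages per row for wider windows. Covering two access
    regions at once (at a region boundary) adds one more page per row.
    """
    pages_per_row = max(1, prefetch_window // page_size)
    per_row = (1 + pages_per_row) + (1 if cover_region_boundary else 0)
    return fetch_height * per_row

# Page-wide prefetch window, 16-row fetch access region:
assert stlb_entries(16, 4096, 4096) == 32                              # 2x height
assert stlb_entries(16, 4096, 4096, cover_region_boundary=True) == 48  # 3x height
```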
(58) In unaligned cases, partially used pages at the left of the fetch access region are also used on the previous row at the right of the access region. On a wide enough surface, the page would be replaced in the TLB by the time the prefetch window reaches the right side of the access region, and so the page would have to be prefetched again. Increasing the raw prefetch stream filter size or adding special logic can make the repeated fetching unnecessary.
(59) According to some aspects of the invention, the STLB is sized to 3 times the fetch height (for a page-wide prefetch window) or 2+(prefetch window size/page size) times the fetch height for other sizes and the TLB is filtered to fix the page entries at the start of an access region until reaching the end of the access region.
(60) As will be apparent to those of skill in the art upon reading this disclosure, each of the aspects described and illustrated herein has discrete components and features which may be readily separated from or combined with the features and aspects to form embodiments, without departing from the scope or spirit of the present invention. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.
(61) Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, representative illustrative methods and materials are now described.
(62) All publications and patents cited in this specification are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference and are incorporated herein by reference to disclose and describe the methods and/or system in connection with which the publications are cited. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.
(63) Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein.
(64) In accordance with the teaching of the present invention a computer and a computing device are articles of manufacture. Other examples of an article of manufacture include: an electronic component residing on a mother board, a server, a mainframe computer, or other special purpose computer each having one or more processors (e.g., a Central Processing Unit, a Graphical Processing Unit, or a microprocessor) that is configured to execute a computer readable program code (e.g., an algorithm, hardware, firmware, and/or software) to receive data, transmit data, store data, or perform methods.
(65) The article of manufacture (e.g., computer or computing device) includes a non-transitory computer readable medium or storage that includes a series of instructions, such as computer readable program steps or code encoded therein. In certain aspects of the present invention, the non-transitory computer readable medium includes one or more data repositories. Thus, in certain embodiments that are in accordance with any aspect of the present invention, computer readable program code (or code) is encoded in a non-transitory computer readable medium of the computing device. The processor, in turn, executes the computer readable program code to create or amend an existing computer-aided design using a tool. In other aspects of the embodiments, the creation or amendment of the computer-aided design is implemented as a web-based software application in which portions of the data related to the computer-aided design or the tool or the computer readable program code are received or transmitted to a computing device of a host.
(66) An article of manufacture or system, in accordance with various aspects of the present invention, is implemented in a variety of ways: with one or more distinct processors or microprocessors, volatile and/or non-volatile memory and peripherals or peripheral controllers; with an integrated microcontroller, which has a processor, local volatile and non-volatile memory, peripherals and input/output pins; discrete logic which implements a fixed version of the article of manufacture or system; and programmable logic which implements a version of the article of manufacture or system which can be reprogrammed either through a local or remote interface. Such logic could implement a control system either in logic or via a set of commands executed by a soft-processor.
(67) Accordingly, the preceding merely illustrates the various aspects and principles of the present invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the various aspects discussed and described herein. Rather, the scope and spirit of the present invention are embodied by the appended claims.