Memory management device capable of managing memory address translation table using heterogeneous memories and method of managing memory address thereby
11704018 · 2023-07-18
Assignee
Inventors
Cpc classification
G06F12/1027
PHYSICS
G06F3/0659
PHYSICS
G06F3/0604
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
Provided are a GPU and a method of managing a memory address thereby. TLBs are configured using an SRAM and an STT-MRAM so that a storage capacity is significantly improved compared to a case where a TLB is configured using an SRAM. Accordingly, the page hit rate of a TLB can be increased, thereby improving the throughput of a device. Furthermore, after a PTE is first written in the SRAM, a PTE having high frequency of use is selected and moved to the STT-MRAM. Accordingly, an increase in TLB update time which may occur due to the use of the STT-MRAM having a low write speed can be prevented. Furthermore, a read time and read energy consumption can be reduced.
Claims
1. A memory management device comprising: a first translation lookaside buffer (TLB) implemented as a static random access memory (SRAM) and suitable for storing a page table configured with multiple page table entries (PTEs) for translating, into a physical memory address, a virtual memory address applied by at least one corresponding processor; and a second TLB implemented as a spin-transfer torque magnetic random access memory (STT-MRAM) and suitable for receiving and storing a PTE of a hot page, having a relatively high frequency of use, which is selected among the multiple PTEs stored in the first TLB, wherein the PTE in the second TLB is identical to one of the multiple PTEs in the first TLB, wherein, when the virtual memory address is applied by the at least one corresponding processor, the first TLB and the second TLB simultaneously search for PTEs corresponding to the applied virtual memory address and translate the virtual memory address into the physical memory address.
2. The memory management device of claim 1, wherein the second TLB has a greater storage capacity than the first TLB, so that the second TLB is capable of storing more PTEs than the first TLB.
3. The memory management device of claim 1, wherein, when a PTE corresponding to the virtual memory address is found in the search, the first TLB increases a count of the retrieved PTE and transmits, to the second TLB, a PTE whose count is greater than or equal to a set threshold.
4. The memory management device of claim 3, wherein the second TLB has a storage space in which the PTE is stored and which is divided into an upper space and a lower space, and stores, in one of the upper space and the lower space, the PTE transmitted by the first TLB, based on a least significant bit (LSB) of a TLB tag in the virtual memory address.
5. The memory management device of claim 3, further comprising a page table walker suitable for generating a PTE by translating the virtual memory address into a physical memory address using a set method when the PTE is not found in the first TLB and the second TLB and adding the count to the generated PTE.
6. The memory management device of claim 5, wherein the page table walker transmits and stores the generated PTE in the first TLB.
7. The memory management device of claim 1, wherein: the memory management device is implemented as a graphics processing unit (GPU) comprising multiple streaming multiprocessors each comprising multiple stream processors, and the first TLB and the second TLB comprise multiple instances, respectively, and at least one instance of each of the first TLB and the second TLB is included in each of the multiple streaming multiprocessors.
8. The memory management device of claim 1, wherein a PTE in the first TLB is updated earlier than a PTE in the second TLB, regarding a same virtual memory address.
9. The memory management device of claim 8, wherein the PTE in the second TLB is updated when it is determined that an updated PTE in the first TLB is associated with the hot page, regarding the same virtual memory address.
10. A method of managing a memory address by a memory management device, the method comprising: when a virtual memory address is applied by at least one corresponding processor, simultaneously searching, by a first TLB and a second TLB, a page table, which is configured with multiple page table entries (PTEs) previously stored, for a PTE corresponding to the applied virtual memory address, in order to translate the virtual memory address into a physical memory address, the first translation lookaside buffer (TLB) implemented as a static random access memory (SRAM) and the second TLB implemented as a spin-transfer torque magnetic random access memory (STT-MRAM); and transmitting a PTE of a hot page, having a relatively high frequency of use, which is selected among the multiple PTEs stored in the first TLB, so that the PTE is stored in the second TLB, wherein the PTE in the second TLB is identical to one of the multiple PTEs in the first TLB.
11. The method of claim 10, wherein the transmitting of the PTE comprises: increasing a count of a retrieved PTE when the PTE corresponding to the virtual memory address is found in the search; and transmitting, to the second TLB having a greater storage capacity than the first TLB, a PTE whose count is greater than or equal to a set threshold.
12. The method of claim 11, wherein the transmitting of the PTE comprises: storing, in one of an upper storage space and a lower storage space of the second TLB, the PTE transmitted by the first TLB, based on a least significant bit (LSB) of a TLB tag in the virtual memory address.
13. The method of claim 11, further comprising: when the PTE is not found in the first TLB and the second TLB, generating a PTE by translating the virtual memory address into a physical memory address using a set method, and adding the count to the generated PTE; and transmitting and storing, into the first TLB, the PTE to which the count has been added.
14. A system comprising: multiple stream processors, each stream processor including a first translation lookaside buffer (TLB) implemented as a static random access memory (SRAM), and a second TLB implemented as a spin-transfer torque magnetic random access memory (STT-MRAM), wherein the first TLB stores multiple page table entries (PTEs), and wherein the stream processor: selects a hot PTE having a relatively high access count among the multiple PTEs in the first TLB; stores the hot PTE in the second TLB, wherein the PTE in the second TLB is identical to one of the multiple PTEs in the first TLB; simultaneously searches the first TLB and the second TLB for a PTE corresponding to a virtual memory address; and translates the virtual memory address into a physical memory address using the PTE found in the search.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(8) Features and operational advantages of the present disclosure and how they are achieved are described with reference to the accompanying drawings.
(9) Although various embodiments are disclosed herein, the present disclosure may be implemented in various different ways. Thus, the present invention is not limited to the described embodiments. Furthermore, in order to clearly describe the present disclosure, descriptions of well-known material are omitted. The same reference numerals are used throughout the specification to refer to the same or like parts.
(10) Throughout the specification, when it is said that a component or the like “includes” an element, the component may include one or more other elements, unless explicitly stated to the contrary. Furthermore, a term such as “module” when used herein means a component that performs at least one function or operation, and which may be implemented by hardware or software or a combination of hardware and software. Also, throughout the specification, a reference to “an embodiment,” “another embodiment,” or “the present embodiment” does not necessarily refer to only one embodiment, and different occurrences of any such phrase do not necessarily refer to the same embodiment(s). The term “embodiments” when used herein does not necessarily mean all embodiments.
(11) In the present disclosure, in order to overcome the limitations of the existing L1 translation lookaside buffer (TLB) implemented as an SRAM, an L1 TLB of a device is implemented using a spin-transfer torque magnetic random access memory (STT-MRAM) together with an SRAM.
(12) STT-MRAM is a nonvolatile memory having read latency similar to that of SRAM and has higher density and less energy consumption for a read operation than SRAM. That is, STT-MRAM has advantages in that more data can be stored in the same area and power consumption can be reduced.
(13) Table 1 illustrates the results of analysis of characteristics when TLBs are configured with an SRAM and an STT-MRAM.
(14) TABLE 1

                            -------- TLB Config. 1 --------   -------- TLB Config. 2 --------
                                     STT-MRAM   STT-MRAM               STT-MRAM   STT-MRAM
                            SRAM     1T1J       2T2J          SRAM     1T1J       2T2J
    Technology              (1K)     (4K)       (2K)          (16K)    (64K)      (32K)
    Memory Area (mm^2)      0.037    0.043      0.043         0.434    0.505      0.505
    Sensing Time (ns)       0.49     1.13       0.71          0.49     1.13       0.71
    Total Read Time (ns)    1.78     1.81       1.39          7.24     6.78       6.36
    Total Write Time (ns)   2.48     11.37      11.37         6.89     14.02      14.02
    Read Energy (nJ)        0.12     0.06       0.06          0.13     0.09       0.09
    Write Energy (nJ)       0.13     0.39       0.79          0.15     0.49       0.98
    Leakage Power (nW)      62.49    21.86      21.86         522.67   157.48     157.48
(15) In Table 1, 1T1J (one transistor, one magnetic tunnel junction) and 2T2J (two transistors, two magnetic tunnel junctions) denote STT-MRAM cell designs distinguished by their sensing schemes.
(16) Referring to Table 1, it may be seen that, in the same memory area, the 1T1J STT-MRAM has a storage capacity four times greater than that of the SRAM and the 2T2J STT-MRAM has a storage capacity two times greater than that of the SRAM. Moreover, the 1T1J and 2T2J STT-MRAMs have a total read time comparable to or shorter than that of the SRAM. That is, when the L1 TLB is implemented using STT-MRAM, more PTEs can be stored and the page hit rate can be significantly increased.
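The capacity comparison in Table 1 can be checked with a few lines of arithmetic. This is an illustrative sketch: the 1K/4K/2K labels are read here as PTE counts, which is an assumption about the table's units.

```python
# Capacity and density figures taken from Table 1, small configuration.
area = {"SRAM": 0.037, "1T1J": 0.043, "2T2J": 0.043}   # memory area, mm^2
entries = {"SRAM": 1024, "1T1J": 4096, "2T2J": 2048}   # assumed PTE capacity

# Entries per mm^2: both STT-MRAM designs are denser than SRAM.
density = {k: entries[k] / area[k] for k in area}

# In roughly the same area, 1T1J stores 4x and 2T2J stores 2x
# as many PTEs as SRAM.
ratio_1t1j = entries["1T1J"] / entries["SRAM"]
ratio_2t2j = entries["2T2J"] / entries["SRAM"]
print(ratio_1t1j, ratio_2t2j)  # prints: 4.0 2.0
```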
(17) Furthermore, the STT-MRAM can reduce power consumption of the TLB because it has much lower read energy consumption than the SRAM.
(18) However, as illustrated in Table 1, STT-MRAM has a write time roughly four times greater than that of SRAM. That is, a TLB update, in which a new PTE is written into the TLB, takes considerably longer. Accordingly, when the L1 TLB is implemented using only STT-MRAM, it is difficult to improve the performance of the TLB, and thus the throughput of a device cannot be improved.
(22) In the present embodiment, the L1 TLB is implemented using both SRAM and STT-MRAM by considering the characteristics of the GPU illustrated in
(23) In this case, it is assumed that the 2T2J type is adopted for the STT-MRAM included in the L1 TLB. The 2T2J STT-MRAM has a lower storage capacity than the 1T1J STT-MRAM in the same area and has the same write time as the 1T1J STT-MRAM, but has advantages such as a shorter read time and a lower error rate. However, the present invention is not limited to the 2T2J STT-MRAM; the 1T1J STT-MRAM may be applied.
(26) When a virtual memory address to be read from the multiple SPs 211 is transmitted, the L1 TLB 212 searches for PTEs stored in the first L1 TLB 2121 and the second L1 TLB 2122, respectively. Furthermore, when a PTE corresponding to the virtual memory address is found in the first L1 TLB 2121 or the second L1 TLB 2122, the L1 TLB 212 translates the virtual memory address into a physical memory address based on the retrieved PTE.
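The parallel lookup just described can be sketched in software as follows. This is an illustrative model, not the hardware design: class and field names are hypothetical, a 4 KiB page is assumed, and the two dictionary probes stand in for the simultaneous search of the SRAM and STT-MRAM arrays.

```python
PAGE_BITS = 12  # assumed 4 KiB pages

class HybridL1TLB:
    """Toy model of an L1 TLB split into an SRAM part and an STT-MRAM part."""

    def __init__(self):
        self.sram = {}      # first L1 TLB (SRAM): VPN -> PPN
        self.stt_mram = {}  # second L1 TLB (STT-MRAM): VPN -> PPN

    def translate(self, va):
        vpn = va >> PAGE_BITS
        offset = va & ((1 << PAGE_BITS) - 1)
        # Both TLBs are searched for the same virtual page number;
        # in hardware this happens in parallel.
        ppn = self.sram.get(vpn)
        if ppn is None:
            ppn = self.stt_mram.get(vpn)
        if ppn is None:
            return None  # L1 miss: fall through to the L2 TLB / page walk
        return (ppn << PAGE_BITS) | offset

tlb = HybridL1TLB()
tlb.sram[0x42] = 0x7       # PTE held in the SRAM part
tlb.stt_mram[0x50] = 0x8   # hot PTE mirrored in the STT-MRAM part
assert tlb.translate(0x42_345) == 0x7_345
assert tlb.translate(0x50_001) == 0x8_001
```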
(27) The second L1 TLB 2122 of the L1 TLB 212 is implemented as an STT-MRAM having high density, and thus may store more PTEs in the same area. Accordingly, the hit rate of the L1 TLB 212 is greatly increased, which may significantly improve the throughput of a device. Furthermore, the power consumption and the read time of the L1 TLB 212 may be reduced due to a short read time and small power consumption of the STT-MRAM.
(28) When a PTE corresponding to the virtual memory address is not found in either the first L1 TLB 2121 or second L1 TLB 2122 of the L1 TLB 212, the L1 TLB 212 searches an L2 TLB 230, that is, a shared TLB shared by the multiple SMs 210, for a PTE corresponding to the virtual memory address. When the PTE is found in the L2 TLB 230, the L1 TLB 212 translates the virtual memory address into a physical memory address. However, when the PTE is not found even in the L2 TLB 230, a page table walker 250 accesses an L2 cache 240, that is, a shared cache, and a global memory 220, obtains the PTE corresponding to the virtual memory address, and updates a TLB with the obtained PTE.
(29) In this case, the page table walker 250 may receive the start address of a page table stored in the PTBR 213, and may search for the PTE. Furthermore, the page table walker 250 may add, to the PTE, a count bit C having a set number of bits. The page table walker 250 may update the L2 TLB 230 by transmitting the PTE to the L2 TLB 230.
(30) Furthermore, the L2 TLB 230 increases a count bit C whenever a corresponding PTE is found, that is, whenever the PTE is found in at least one SM 210. When the PTE count bit C reaches a set first threshold, the L2 TLB 230 may update the L1 TLB 212 by transmitting the PTE to the L1 TLB 212.
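The miss path described above can be sketched as follows: the page table walker attaches a count bit C to a newly fetched PTE, the L2 TLB increments C on every hit, and a PTE whose count reaches the first threshold is copied into the SRAM-based first L1 TLB. The threshold value and data layout here are hypothetical.

```python
FIRST_THRESHOLD = 3  # illustrative value for the "first threshold"

l2_tlb = {}        # shared L2 TLB: VPN -> {"ppn": ..., "count": C}
first_l1_tlb = {}  # SRAM side of the L1 TLB: VPN -> PPN

def walk_page_table(page_table, vpn):
    """Model of the page table walker: fetch the PTE and attach count C = 0."""
    ppn = page_table[vpn]
    l2_tlb[vpn] = {"ppn": ppn, "count": 0}
    return ppn

def l2_hit(vpn):
    """On an L2 hit, increment C; promote to the first L1 TLB at the threshold."""
    entry = l2_tlb[vpn]
    entry["count"] += 1
    if entry["count"] >= FIRST_THRESHOLD:
        first_l1_tlb[vpn] = entry["ppn"]
    return entry["ppn"]

page_table = {0x10: 0x99}
walk_page_table(page_table, 0x10)
for _ in range(3):   # three L2 hits reach the threshold
    l2_hit(0x10)
assert first_l1_tlb[0x10] == 0x99
```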
(31) In this case, the L2 TLB 230 updates the L1 TLB 212 by transmitting the PTE to the first L1 TLB 2121 of the L1 TLB 212. The reason for this is that the first L1 TLB 2121 implemented as an SRAM has a much shorter write time than the second L1 TLB 2122 implemented as an STT-MRAM.
(32) That is, in the present embodiment, the first L1 TLB 2121 is implemented as an SRAM, and the update of a PTE for the L1 TLB 212 is performed only in the first L1 TLB 2121. Accordingly, an increase in write time which may occur due to the second L1 TLB 2122 implemented as an STT-MRAM may be prevented.
(33) In some cases, the L2 TLB 230 may be omitted. In this case, the page table walker 250 may update a TLB by transmitting a PTE to the first L1 TLB 2121.
(34) Thereafter, the first L1 TLB 2121 increases a count bit C whenever a PTE is found. When the PTE count bit C reaches a set second threshold, the first L1 TLB 2121 transmits the PTE to the second L1 TLB 2122. That is, the first L1 TLB 2121 treats, as a hot page, a PTE having a count bit C higher than the second threshold, and stores the PTE in the second L1 TLB 2122.
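The hot-page promotion described above can be sketched in the same style: the first (SRAM) L1 TLB counts hits per PTE and, once a PTE's count C reaches the second threshold, mirrors it into the second (STT-MRAM) L1 TLB. The threshold value is illustrative.

```python
SECOND_THRESHOLD = 4  # illustrative value for the "second threshold"

first_l1 = {}   # SRAM L1 TLB: VPN -> {"ppn": ..., "count": C}
second_l1 = {}  # STT-MRAM L1 TLB: VPN -> PPN (hot pages only)

def first_l1_hit(vpn):
    entry = first_l1[vpn]
    entry["count"] += 1
    if entry["count"] >= SECOND_THRESHOLD:
        # The page is treated as hot; the PTE stored in the STT-MRAM TLB
        # is an identical copy of the one in the SRAM TLB.
        second_l1[vpn] = entry["ppn"]
    return entry["ppn"]

first_l1[0x20] = {"ppn": 0x55, "count": 0}
for _ in range(4):   # four hits reach the second threshold
    first_l1_hit(0x20)
assert second_l1[0x20] == 0x55
```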
(35) The second L1 TLB 2122 implemented as an STT-MRAM has a longer write time. However, the relatively low write speed may be offset because devices such as GPUs access hot pages far more frequently than other pages, as illustrated in
(36) When moving a PTE to the second L1 TLB 2122, the first L1 TLB 2121 may transmit the PTE to the second L1 TLB 2122, which is divided into an upper space and a lower space, based on the least significant bit (LSB) of a TLB tag. This enables more PTEs to be stored in the second L1 TLB 2122 than in the first L1 TLB 2121, because the first L1 TLB 2121 implemented as an SRAM and the second L1 TLB 2122 implemented as an STT-MRAM have different densities.
(37) In some embodiments, the first L1 TLB 2121, the second L1 TLB 2122 and the L2 TLB 230 may evict a stored PTE. For example, the first L1 TLB 2121, the second L1 TLB 2122 and the L2 TLB 230 may sequentially evict PTEs in ascending order of hit frequency according to the least recently used (LRU) scheme so that other PTEs may be stored.
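The LRU eviction mentioned above can be sketched with an ordered map: a hit moves an entry to the most-recently-used end, and the least recently used entry is popped when the TLB is full. The capacity and names are illustrative.

```python
from collections import OrderedDict

class LruTLB:
    """Fixed-capacity TLB model with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # VPN -> PPN, in recency order

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        return None

    def insert(self, vpn, ppn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[vpn] = ppn

tlb = LruTLB(capacity=2)
tlb.insert(1, 0xA)
tlb.insert(2, 0xB)
tlb.lookup(1)        # VPN 1 becomes most recently used
tlb.insert(3, 0xC)   # evicts VPN 2, the least recently used
assert tlb.lookup(2) is None and tlb.lookup(1) == 0xA
```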
(40) Each of the PTEs configuring the page tables of the first and second L1 TLBs 2121 and 2122 may be configured to include a valid bit V, a count bit C, a tag, and a physical page number PPN. The valid bit V indicates whether a corresponding page is stored in a memory. The count bit C is a bit added to determine whether a corresponding page is a hot page. The tag indicates the address of each PTE in the page table, and may correspond to the TLBT and the TLBI of the virtual memory address VA.
(42) When the count bit C of a specific PTE is a second threshold or more, the first L1 TLB 2121 moves the specific PTE to the second L1 TLB 2122. In some embodiments, the first L1 TLB 2121 may move the specific PTE to the upper or lower address space of the second L1 TLB 2122 with reference to a least significant bit (LSB) of the TLBT in the virtual memory address VA. For example, the first L1 TLB 2121 may move the specific PTE to the upper address space of the second L1 TLB 2122 when the LSB of the TLBT is 0, and may move the specific PTE to the lower address space of the second L1 TLB 2122 when the LSB of the TLBT is 1.
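The LSB-steered placement described above can be sketched as follows. Modeling the upper and lower spaces as two separate maps is a simplification, and the tag values are hypothetical.

```python
# Second (STT-MRAM) L1 TLB modeled as two halves, selected by the
# least significant bit of the TLB tag (TLBT).
upper_space = {}  # receives PTEs whose TLBT LSB is 0
lower_space = {}  # receives PTEs whose TLBT LSB is 1

def place_in_second_tlb(tlbt, ppn):
    """Steer a promoted PTE into the upper or lower half by the TLBT's LSB."""
    target = upper_space if (tlbt & 1) == 0 else lower_space
    target[tlbt] = ppn
    return target

place_in_second_tlb(0b1010, 0x1)  # LSB 0 -> upper space
place_in_second_tlb(0b1011, 0x2)  # LSB 1 -> lower space
assert 0b1010 in upper_space and 0b1011 in lower_space
```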
(43) The present embodiment illustrates the GPU as an example of the device and illustrates that the hybrid TLB is included in each of the multiple SMs 210 as the L1 TLB 212, but the present disclosure is not limited thereto. The hybrid TLB may be included in various multiprocessor devices and may be configured to correspond to each of the processors. For example, the hybrid TLB may also be applied to a device such as a CPU including multiple processor cores.
(46) The method of managing, by the memory management device, a memory address according to the present embodiment, which is illustrated in
(47) Furthermore, the L1 TLB 212 determines whether a PTE corresponding to the virtual memory address VA is obtained from at least one of the first L1 TLB 2121 and the second L1 TLB 2122 (S40). When a PTE corresponding to the virtual memory address VA is not obtained, i.e., not found, the L2 TLB 230 searches a page table stored therein for a PTE corresponding to the virtual memory address VA (S50).
(48) Furthermore, the L2 TLB 230 determines whether a PTE corresponding to the virtual memory address VA is obtained (S60). When a PTE corresponding to the virtual memory address VA is not found in the L2 TLB 230, the L1 TLB 212 translates the virtual memory address VA into a physical memory address PA using a set method, and searches the cache 240 and the global memory 220 (S70).
(49) When a PTE corresponding to the virtual memory address VA is obtained in the L1 TLB 212 or the L2 TLB 230, the L1 TLB 212 or the L2 TLB 230 translates the virtual memory address VA into the physical memory address PA based on the obtained PTE, and sequentially accesses the cache 240 and the global memory 220 based on the translated physical memory address (S80).
(50) Furthermore, the L1 TLB 212 or the L2 TLB 230 updates the TLB with a PTE based on the address of data obtained by searching the cache 240 and the global memory 220 or updates the TLB by moving a PTE stored in the L1 TLB 212 or the L2 TLB 230 (S90).
(52) When it is determined that a page fault has not occurred in the L1 TLB 212 or the L2 TLB 230, the count bit C of the PTE is increased (S913). Furthermore, it is determined whether the PTE has been found in the L2 TLB 230 (S914). When it is determined that the PTE has been found in the L2 TLB 230, it is determined whether the count bit C of the PTE is greater than or equal to a set first threshold (S915). When the count bit C is greater than or equal to the first threshold, the PTE stored in the L2 TLB 230 is moved to the L1 TLB 212, particularly, the first L1 TLB 2121 implemented as an SRAM and having a short write time (S916).
(53) Furthermore, it is determined whether the PTE has been found in the L1 TLB 212 (S917). When it is determined that the PTE has been found in the L1 TLB 212, it is determined whether the count bit C of the PTE is greater than or equal to a set second threshold (S918). When the count bit C is greater than or equal to the second threshold, the PTE stored in the first L1 TLB 2121 is moved to the second L1 TLB 2122 (S919).
(54) Furthermore, a storage space in which an additional PTE may be stored is secured by evicting PTEs stored in the L1 TLB 212 and the L2 TLB 230 using a set method. For example, a PTE may be evicted using the LRU scheme.
(55) It is assumed that a TLB is divided into the L1 TLB 212 and the L2 TLB 230, but in another embodiment the L2 TLB 230 may be omitted. When the L2 TLB 230 is omitted, the operation S50 of searching the L2 TLB and the operation S60 of determining whether a PTE is obtained therefrom may be omitted. Furthermore, even in the operation S90 of updating the TLB, the operation S914 of checking whether a PTE is found in the L2 TLB to the operation S916 of moving the PTE to the first L1 TLB 2121 may be omitted.
(56) Accordingly, in the GPU and method of managing a memory address thereof according to embodiments, TLBs are configured using an SRAM and an STT-MRAM, and thus storage capacity may be significantly improved compared to a case where a TLB is configured using an SRAM. Accordingly, the page hit rate of a TLB is increased, thereby improving the throughput of a device. Furthermore, after a PTE is first written in the SRAM, a PTE having high frequency of use is selected and moved to the STT-MRAM. Accordingly, an increase in TLB update time, which may occur due to the use of the STT-MRAM having a lower write speed, may be prevented. Furthermore, read time and read energy consumption may be reduced.
(57) The method according to embodiments of the present disclosure may be implemented as a computer program stored in a medium for executing the method in a computer. In this case, the computer-readable medium may be accessible by a computer, and may include all suitable computer storage media. Examples of such computer storage media include all of volatile and nonvolatile media and separation type and non-separation type media implemented by a given method or technology for storing information, such as a computer-readable instruction, a data structure, a program module or other data, and may include a read only memory (ROM), a random access memory (RAM), a compact disk (CD)-ROM, a digital video disk (DVD)-ROM, a magnetic tape, a floppy disk, and an optical data storage device.
(58) The present disclosure has been illustrated and described in connection with various embodiments, but the embodiments are merely exemplary. A person having ordinary skill in the art to which the present disclosure pertains will understand that various modifications and other equivalent embodiments are possible from the embodiments.
(59) Accordingly, the present invention encompasses all variations that fall within the scope of the claims.