Bit Efficient Memory Error Correcting Coding And Decoding Scheme
20240289212 · 2024-08-29
Cpc classification
International classification
G06F11/10
PHYSICS
H03M13/15
ELECTRICITY
Abstract
Aspects of the disclosed technology include techniques and mechanisms for an efficient error correction coding scheme that can detect and correct data errors that may occur in a memory. In general, the scheme comprises segmenting the data that would be transferred as part of a data request into different parts and applying error correction codes to the separate parts.
Claims
1. A method for encoding data associated with a request access for one or more DRAM devices, comprising: segmenting a number of beats defined for a burst access to the one or more DRAMs into at least a first set of beats and a second set of beats; defining a first error correction code (ECC) for a first set of data associated with the first set of beats; and defining a second ECC for a second set of data associated with the second set of beats; wherein the first ECC comprises a first set of symbols, each symbol of the first set being associated with the first set of beats, wherein the second ECC comprises a second set of symbols, each symbol of the second set of symbols being associated with the second set of beats, and wherein the first set of symbols comprises a different number of symbols than the second set of symbols and an error associated with the second ECC is correctable without knowing a location of the error.
2. The method of claim 1, wherein the first set of beats and the second set of beats equal the number of beats defined for the burst access.
3. The method of claim 1, wherein the one or more DRAMs comprise DDR5 DRAMs and the number of beats defined for the burst access comprises 16 beats.
4. The method of claim 3, wherein the first set of beats and second set of beats each comprise 8 beats.
5. The method of claim 4, wherein the one or more DRAMs each include 4 data pins.
6. The method of claim 5, wherein the one or more DRAMs comprise 10 DRAMs and the first error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
7. The method of claim 6, wherein the second error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
8. The method of claim 7, comprising defining a second 4 byte symbol associated with the second error ECC, the second 4 byte symbol comprising metadata associated with a memory tag extension or another ECC scheme.
9. A memory system, comprising: one or more DRAMs; and a memory controller communicatively coupled to the one or more DRAMs, the memory controller having logic that implements the following function in response to a request access to the one or more DRAMs: segment a number of beats defined for a burst access to the one or more DRAMs in at least a first set of beats and a second set of beats; define a first error correction code (ECC) for a first set of data associated with the first set of beats; define a second ECC for a second set of data associated with the second set of beats; wherein the first ECC comprises a first set of symbols, each symbol of the first set being associated with the first set of beats, wherein the second ECC comprises a second set of symbols, each symbol of the second set of symbols being associated with the second set of beats, and wherein the first set of symbols comprises a different number of symbols than the second set of symbols and an error associated with the second ECC is correctable without knowing a location of the error.
10. The memory system of claim 9, wherein the first set of beats and the second set of beats equal the number of beats defined for the burst access.
11. The memory system of claim 9, wherein the one or more DRAMs comprises DDR5 DRAMs and the number of beats defined for the burst access comprises 16 beats.
12. The memory system of claim 11, wherein the first set of beats and the second set of beats each comprise 8 beats.
13. The memory system of claim 12, wherein the one or more DRAMs each include 4 data pins.
14. The memory system of claim 13, wherein the one or more DRAMs comprise 10 DRAMs and the first error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
15. The memory system of claim 13, wherein the second error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
16. The memory system of claim 15, wherein the logic functions to define a second 4 byte symbol associated with the second error ECC, the second 4 byte symbol comprising metadata associated with a memory tag extension or another ECC scheme.
17. The memory system of claim 9, wherein the logic comprises hardware logic comprising an encoder and a decoder.
18. The memory system of claim 17, wherein the encoder encodes the first set of data using the first ECC in a first cycle of a 64 byte transaction and the encoder encodes the second set of data using the second ECC in a second cycle of the 64 byte transaction.
19. The memory system of claim 17, wherein the decoder decodes the first set of data using the first ECC in a first cycle of a 64 byte transaction and the decoder decodes the second set of data using the second ECC in a second cycle of the 64 byte transaction.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
DETAILED DESCRIPTION
[0020] Aspects of the disclosed technology include techniques and mechanisms for an efficient error correction coding scheme that can detect and correct data errors that may occur in a memory. In general, the scheme comprises segmenting the data that would be transferred as part of a data request into different parts and applying error correction codes to the separate parts. The scheme is efficient in that fewer bits can be used to code the different data parts and robust in that it has the same detection and correction capability as existing ECCs, e.g., can correct up to four data output pin (DQ) errors. The scheme also frees up ECC bits for other functions (e.g., store metadata or form a secondary error detection and/or correction code scheme) without impacting the capability of the ECC to detect and correct errors.
[0021]
[0022] Upon receiving the memory access request to write data to memory, the data is segmented for encoding based on a number of beats, block 120. For example, let's assume the ECC scheme is being applied in an environment having DDR5 DRAM devices, though the scheme may be employed in environments that use other DDR standards. A burst access to such a DRAM device is assumed to comprise 16 beats and data is transferred in 64 byte data blocks. A typical DDR5 server configuration is the 10×4 configuration, i.e., 10 DRAM devices each having 4 DQs. Upon request, 4 DQs will drive a four bit data bus 16 times (1 bit per DQ for each of 16 beats) resulting in 64 bits or 8 bytes for each DRAM device. For 10 devices, a burst access results in 640 bits or 80 bytes of data. The convention is to use 64 bytes (8 DRAMs) to write data and 16 bytes (2 DRAMs) for ECCs. Segmenting in accordance with the disclosed technology comprises, for example, the error coding scheme shown in
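The burst arithmetic in the preceding paragraph can be checked with a short sketch. The helper names (`burst_bytes`, `split_beats`) are hypothetical, and the split into two contiguous sets of 8 beats is an assumption for illustration; the text only states that the beats are segmented into sets.

```python
# Burst geometry for the 10x4 DDR5 configuration described in the text.
DEVICES = 10          # DRAM devices per access
DQ_PER_DEVICE = 4     # data pins (DQs) per device
BEATS_PER_BURST = 16  # beats in one burst access
DATA_DEVICES = 8      # devices conventionally used for data


def burst_bytes(devices, dq, beats):
    """Bytes moved by `devices` devices, `dq` pins each, over `beats` beats."""
    return devices * dq * beats // 8


def split_beats(beats, parts=2):
    """Split beat indices into equal contiguous sets (assumed layout)."""
    size = beats // parts
    return [list(range(i * size, (i + 1) * size)) for i in range(parts)]


total = burst_bytes(DEVICES, DQ_PER_DEVICE, BEATS_PER_BURST)      # whole burst
data = burst_bytes(DATA_DEVICES, DQ_PER_DEVICE, BEATS_PER_BURST)  # data part
ecc = total - data                                                # ECC part

first, second = split_beats(BEATS_PER_BURST)
per_segment = burst_bytes(DEVICES, DQ_PER_DEVICE, len(first))     # one segment
```

Under these assumptions each 8-beat segment carries 40 bytes across the 10 devices, which lines up with a 40-symbol codeword of 8-bit symbols.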
[0023] The error coding scheme is then applied within each segment, as in block 130 of
[0024] The error coding scheme may comprise a Reed Solomon (RS) error capability having 8 bits/symbol, 32 data symbols, and 8 ECC symbols (i.e., nECC), commonly referred to as RS(40, 32, 8). Such a scheme allows for the capability to (i) detect and correct up to nECC/2 erroneous symbols, assuming the location(s) of the erroneous symbols are unknown, or (ii) detect and correct up to nECC erroneous symbols, assuming the location(s) of the erroneous symbols are previously known (sometimes referred to as erasure code capability). This scheme is often used with DDR5 DRAM devices.
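As an illustration of the RS(40, 32, 8) parameters, the following is a minimal sketch of systematic Reed Solomon encoding and syndrome-based error detection over GF(2^8). The field polynomial (0x11d), the generator roots, and all function names are assumptions chosen for the example; the disclosure does not specify them, and the sketch does not implement error correction. The `correctable` helper expresses the standard RS bound that a pattern is decodable when twice the number of unknown-location errors plus the number of known-location erasures does not exceed nECC.

```python
# GF(2^8) antilog/log tables. Primitive polynomial 0x11d is an assumption.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply two polynomials (coefficients highest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def generator_poly(n_ecc):
    """Product of (x + alpha^i) for i = 0 .. n_ecc-1 (assumed root choice)."""
    g = [1]
    for i in range(n_ecc):
        g = poly_mul(g, [1, EXP[i]])
    return g


def rs_encode(data, n_ecc):
    """Systematic encode: data symbols followed by n_ecc parity symbols."""
    gen = generator_poly(n_ecc)
    rem = list(data) + [0] * n_ecc
    for i in range(len(data)):
        coef = rem[i]
        if coef:
            for j in range(1, len(gen)):
                rem[i + j] ^= gf_mul(gen[j], coef)
    return list(data) + rem[len(data):]


def poly_eval(p, point):
    """Evaluate a polynomial at `point` with Horner's rule."""
    y = 0
    for c in p:
        y = gf_mul(y, point) ^ c
    return y


def syndromes(codeword, n_ecc):
    """All-zero syndromes mean no error was detected."""
    return [poly_eval(codeword, EXP[i]) for i in range(n_ecc)]


def correctable(errors, erasures, n_ecc):
    """Standard RS bound: 2 * errors + erasures <= n_ecc."""
    return 2 * errors + erasures <= n_ecc


data_symbols = list(range(32))           # 32 data symbols of 8 bits each
codeword = rs_encode(data_symbols, 8)    # 40 symbols total
corrupted = list(codeword)
corrupted[5] ^= 0x5A                     # inject a single symbol error
```

With nECC = 8, the bound reproduces the two cases in the text: up to 4 symbol errors at unknown locations, or up to 8 at known locations.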
[0025] In the example discussed above in relation to
[0026] Once the data is encoded as described in accordance with block 130 of
[0027]
[0028] At block 320, the ECC code for the second segment is decoded and, along with information learned from decoding the first segment, data errors associated with the second segment may be corrected. For example, assume that an RS(40, 32, 8) code was used to encode the 32 bytes of the first segment and an RS(40, 32, 8) code was used to encode the 32 bytes of the second segment.
[0029] In addition, a bounded fault map 400 for the DDR5 DRAMs is assumed as shown in
[0030] Returning to block 310 in
[0031] As another example, let's assume that errored data symbols are detected in the first segment or part of the data associated with the read access burst request. This means that the fault is a full 4 DQ error (map 9 in
[0032] Therefore, in accordance with the disclosed technology, the technique and/or mechanism provides the same detection and error correction capability as existing schemes and can correct up to 4 DQ errors, while using fewer bits. Specifically, in accordance with the disclosed technology, the ECC size is reduced by 4 bytes to 12 bytes as compared to the 16 bytes required by comparable conventional ECC schemes.
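The byte savings stated above amount to simple arithmetic, sketched below. The function name `ecc_budget` and the overhead ratios are illustrative only; the 64 byte data block is taken from the DDR5 example earlier in the description.

```python
CONVENTIONAL_ECC_BYTES = 16  # conventional scheme, per 64 byte data block
REDUCTION_BYTES = 4          # reduction stated in the text
DATA_BYTES = 64              # data block size from the DDR5 example


def ecc_budget(conventional, reduction):
    """Return ECC bytes used and bytes freed for other functions."""
    return {
        "ecc_bytes": conventional - reduction,
        "freed_bytes": reduction,
        "freed_bits": reduction * 8,
    }


budget = ecc_budget(CONVENTIONAL_ECC_BYTES, REDUCTION_BYTES)
overhead_conventional = CONVENTIONAL_ECC_BYTES / DATA_BYTES
overhead_disclosed = budget["ecc_bytes"] / DATA_BYTES
```

The freed 4 bytes (32 bits) are what the description says can hold metadata or a secondary detection and/or correction code.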
[0033] Turning now to
[0034] Turning now to
[0035] Turning now to
[0036] The DDR controller 722 includes logic 732 that implements the method or process discussed above in relation to
[0037] The system of
[0038]
[0039] As shown in
[0040] The instructions 832 may be any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor 812. For example, the instructions may be stored as computing device code on the computing device-readable medium. In that regard, the terms instructions and programs may be used interchangeably herein. The instructions may be stored in object code format for direct processing by the processor, or in any other computing device language, including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. Processes, functions, methods, and routines of the instructions are explained in more detail below.
[0041] The data 834 may be retrieved, stored, or modified by processor 812 in accordance with the instructions 832. As an example, data 834 associated with memory 816 may comprise data used in supporting services for one or more client devices, an application, etc. Such data may include data to support hosting web-based applications, file share services, communication services, gaming, sharing video or audio files, or any other network-based services.
[0042] The one or more processors 812 may be any conventional processor, such as commercially available CPUs. Alternatively, the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor. Although
[0043] Computing device 810 may also include a display 820 (e.g., a monitor having a screen, a touch-screen, a projector, a television, or other device that is operable to display information) that provides a user interface that allows for controlling the computing device 810. Such control may include, for example, using a computing device to cause data to be uploaded through input system 828 to cloud system 850 for processing, cause accumulation of data on storage 836, or more generally, manage different aspects of a customer's computing system. While input system 828 may be used to upload data, e.g., a USB port, computing system 800 may also include a mouse, keyboard, touchscreen, or microphone that can be used to receive commands and/or data.
[0044] The network 840 may include various configurations and protocols including short range communication protocols such as Bluetooth, Bluetooth LE, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi, HTTP, etc., and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces. Computing device 810 interfaces with network 840 through communication interface 824, which may include the hardware, drivers, and software necessary to support a given communications protocol.
[0045] Cloud computing systems 850 may comprise one or more data centers that may be linked via high speed communications or computing networks. A given data center within system 850 may comprise dedicated space within a building that houses computing systems and their associated components, e.g., storage systems and communication systems. Typically, a data center will include racks of communication equipment, servers/hosts, and disks. The servers/hosts and disks comprise physical computing resources that are used to provide virtual computing resources such as VMs. To the extent that a given cloud computing system includes more than one data center, those data centers may be at different geographic locations within relatively close proximity to each other, chosen to deliver services in a timely and economically efficient manner, as well as provide redundancy and maintain high availability. Similarly, different cloud computing systems are typically provided at different geographic locations.
[0046] As shown in
[0047] Aspects of the disclosed technology may be embodied in a method, process, apparatus, or system. Those examples may include one or more of the following features (e.g., F1 through F19):
[0048] F1. A method for encoding data associated with a request access for one or more DRAM devices, comprising:
[0049] segmenting a number of beats defined for a burst access to the one or more DRAMs in at least a first set of beats and a second set of beats;
[0050] defining a first error correction code (ECC) for a first set of the data associated with the first set of beats; and
[0051] defining a second ECC for a second set of the data associated with the second set of beats;
[0052] wherein the first ECC comprises a first set of symbols, each symbol of the first set being associated with the first set of beats, and
[0053] wherein the second ECC comprises a second set of symbols, each symbol of the second set of symbols being associated with the second set of beats.
[0054] F2. The method of F1, wherein the first set of beats and the second set of beats equal the number of beats defined for the burst access.
[0055] F3. The method of any one of F1 and F2, wherein the one or more DRAMs comprise DDR5 DRAMs and the number of beats defined for the burst access comprises 16 beats.
[0056] F4. The method of F3, wherein the first set of beats and second set of beats each comprise 8 beats.
[0057] F5. The method of F4, wherein the one or more DRAMs each include 4 data pins.
[0058] F6. The method of F5, wherein the one or more DRAMs comprise 10 DRAMs and the first error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
[0059] F7. The method of F6, wherein the second error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
[0060] F8. The method of F7, comprising defining a second 4 byte symbol associated with the second error ECC, the second 4 byte symbol comprising metadata associated with a memory tag extension or another ECC scheme.
[0061] F9. A memory system, comprising:
[0062] one or more DRAMs; and
[0063] a memory controller communicatively coupled to the one or more DRAMs, the memory controller having logic that implements the following function in response to a request access to the one or more DRAMs:
[0064] segment a number of beats defined for a burst access to the one or more DRAMs in at least a first set of beats and a second set of beats;
[0065] define a first error correction code (ECC) for a first set of the data associated with the first set of beats;
[0066] define a second ECC for a second set of the data associated with the second set of beats;
[0067] wherein the first ECC comprises a first set of symbols, each symbol of the first set being associated with the first set of beats, and
[0068] wherein the second ECC comprises a second set of symbols, each symbol of the second set of symbols being associated with the second set of beats.
[0069] F10. The memory system of F9, wherein the first set of beats and the second set of beats equal the number of beats defined for the burst access.
[0070] F11. The memory system of any one of F9 and F10, wherein the one or more DRAMs comprises DDR5 DRAMs and the number of beats defined for the burst access comprises 16 beats.
[0071] F12. The memory system of F11, wherein the first set of beats and the second set of beats each comprise 8 beats.
[0072] F13. The memory system of F12, wherein the one or more DRAMs each include 4 data pins.
[0073] F14. The memory system of F13, wherein the one or more DRAMs comprise 10 DRAMs and the first error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
[0074] F15. The memory system of F13, wherein the second error correction code comprises a Reed Solomon code with 8 ECCs for 32 data symbols and 8 bits/symbol such as RS(40, 32, 8).
[0075] F16. The memory system of F15, wherein the logic functions to define a second 4 byte symbol associated with the second error ECC, the second 4 byte symbol comprising metadata associated with a memory tag extension or another ECC scheme.
[0076] F17. The memory system of any one of F9 through F16, wherein the logic comprises hardware logic comprising an encoder and a decoder.
[0077] F18. The memory system of F17, wherein the encoder encodes the first set of data using the first ECC in a first cycle of a 64 byte transaction and the encoder encodes the second set of data using the second ECC in a second cycle of the 64 byte transaction.
[0078] F19. The memory system of F17, wherein the decoder decodes the first set of data using the first ECC in a first cycle of the 64 byte transaction and the decoder decodes the second set of data using the second ECC in a second cycle of the 64 byte transaction.
[0079] Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.