Parallel turbine ternary content addressable memory for high-speed applications
09792988 · 2017-10-17
Cpc classification
G11C15/00
PHYSICS
Abstract
A parallel turbine ternary content addressable memory includes one or more atoms in each of one or more rows, wherein each of the one or more atoms includes a memory with N addresses and a width of M logical lookup entries, wherein N and M are integers, one or more result registers, each with a width of M, wherein a number of the one or more result registers equals a number of one or more keys each with a length of N, and a read pointer configured to cycle through a row of the N addresses per clock cycle for comparison between the M logical lookup entries and the one or more keys with a result of the comparison stored in an associated result register for each of the one or more keys.
Claims
1. A parallel turbine ternary content addressable memory, comprising: one or more atoms in each of one or more rows, wherein each of the one or more atoms comprises a memory with N addresses and a width of M logical lookup entries, wherein N and M are integers, two or more result registers, wherein a number of the two or more result registers equals a number of two or more keys and each result register has a length of M, and a read pointer configured to cycle through a row of the N addresses per clock cycle for comparison between the M logical lookup entries and the two or more keys each having a length of N with a result of the comparison stored in an associated result register for each of the two or more keys, wherein a number of the two or more keys is selected relative to the N addresses to provide a number of lookups in the memory per clock cycle, and wherein the number of lookups equals the number of the two or more keys divided by N.
2. The parallel turbine ternary content addressable memory of claim 1, wherein the M logical lookup entries each include a logical value and a match/don't care value, and a memory size is N*M*2 representing a theoretical minimum size for a fully arbitrary wild card ternary content addressable memory.
3. The parallel turbine ternary content addressable memory of claim 1, wherein the one or more atoms are implemented in a Field Programmable Gate Array or an Application Specific Integrated Circuit.
4. The parallel turbine ternary content addressable memory of claim 1, wherein the memory logically operates as a barrel with the read pointer cycling incrementally through the N addresses during each clock cycle, and wherein, after the read pointer reaches an address N−1, a next increment returns the read pointer back to an address 0 thereby finishing an entire revolution of the barrel.
5. The parallel turbine ternary content addressable memory of claim 4, wherein the two or more result registers contain valid results after the entire revolution, with latency for obtaining the valid results being directly correlated with the length N of the two or more keys.
6. The parallel turbine ternary content addressable memory of claim 4, wherein the number of the two or more keys is selected relative to the N addresses to provide a number of lookups in the memory per clock cycle, and wherein the number of lookups equals the number of the two or more keys divided by N.
7. The parallel turbine ternary content addressable memory of claim 1, wherein the one or more rows include X of the one or more atoms to support X*M logical entries, X being an integer.
8. The parallel turbine ternary content addressable memory of claim 1, wherein a number of the one or more rows includes Y to support a Y*N key length for the two or more keys.
9. The parallel turbine ternary content addressable memory of claim 8, wherein the two or more result registers associated with each of the one or more rows are combined with a result concentrator.
10. The parallel turbine ternary content addressable memory of claim 1, wherein each of the one or more atoms further comprises a write pointer configured to cycle through a different row of the N addresses per clock cycle, concurrently with the read pointer, and read modify write circuitry configured to perform a striping read/write for host access.
11. The parallel turbine ternary content addressable memory of claim 10, wherein the memory logically operates as a barrel with the read pointer cycling incrementally through the N addresses during each clock cycle, and when the read pointer reaches an address N−1, a next increment returns the read pointer back to an address 0 thereby finishing an entire revolution of the barrel, and wherein a complete read and write operation is performed in N clock cycles+latency, where latency is a number of clock cycles between the read and a final write of the striping read/write.
12. The parallel turbine ternary content addressable memory of claim 10, wherein the read modify write circuitry comprises a write register having host access configured to write into the memory, a scan out register tied to the read pointer configured to read from the memory, and a scan in register tied to the write pointer configured to write to the memory subsequent to the read from the memory.
13. The parallel turbine ternary content addressable memory of claim 1, wherein the memory is formed in a Field Programmable Gate Array with the M logical lookup entries being stored vertically and each different logical lookup entry being stored horizontally.
14. A parallel turbine ternary content addressable memory atom circuit, comprising: a memory with N addresses and a width of M logical entries, wherein N and M are integers; two or more result registers, wherein a number of the two or more result registers equals a number of two or more keys and each result register has a length of M; and a read pointer configured to cycle through a row of the N addresses per clock cycle for comparison between the M logical entries and the two or more keys each having a length of N with a result stored in an associated result register for each of the two or more keys, wherein a number of the two or more keys is selected relative to the N addresses to provide a number of lookups in the memory per clock cycle, and wherein the number of lookups equals the number of the two or more keys divided by N.
15. The parallel turbine ternary content addressable memory atom circuit of claim 14, wherein the memory logically operates as a barrel with the read pointer cycling incrementally through the N addresses during each clock cycle, and when the read pointer reaches an address N−1, a next increment returns the read pointer back to an address 0 thereby finishing an entire revolution of the barrel.
16. The parallel turbine ternary content addressable memory atom circuit of claim 14, further comprising a write pointer configured to cycle through a different row of the N addresses per clock cycle concurrently with the read pointer, and read modify write circuitry to perform a striping read/write for host access.
17. A method in a parallel turbine ternary content addressable memory, comprising: operating a circuit, wherein the circuit comprises i) a memory with N addresses and a width of M logical entries, wherein N and M are integers, ii) two or more result registers each with a width of M, wherein a number of the two or more result registers equals a number of two or more keys, each of the two or more keys has a length of N, iii) a read pointer, and iv) a clock; incrementing the read pointer through a row of the N addresses per clock cycle of the clock for comparison between the M logical entries and the two or more keys each having a length of N; updating the two or more result registers based on the comparison; and providing a valid result from the two or more result registers after N clock cycles, wherein a number of the two or more keys is selected relative to the N addresses to provide a number of lookups in the memory per clock cycle, and wherein the number of lookups equals the number of the two or more keys divided by N.
18. The method of claim 17, wherein the incrementing step further comprises cycling incrementally the read pointer through the N addresses during each clock cycle, and wherein, after the read pointer reaches an address N−1, a next increment returns the read pointer back to an address 0 thereby finishing an entire revolution of the barrel such that the memory logically operates as a barrel with the read pointer.
19. The method of claim 17, wherein the circuitry has Y rows to support a Y*N key length for the two or more keys, Y being an integer, and wherein the providing step comprises combining the two or more result registers associated with each of the Y rows with a result concentrator.
20. The method of claim 17, further comprising performing a striping read/write for host access with read modify write circuitry in the circuitry.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The present disclosure is illustrated and described herein with reference to the various drawings, in which like reference numbers are used to denote like system components/method steps, as appropriate, and in which:
DETAILED DESCRIPTION OF THE DISCLOSURE
(10) In various exemplary embodiments, a parallel turbine ternary content addressable memory (TCAM) for high-speed applications is described. The parallel turbine TCAM was derived to meet the functionality of a high-performing TCAM yet fitting within system constraints such as space, power, cost, etc. In an exemplary aspect, the parallel turbine TCAM reduces overall FPGA memory utilization to theoretical minimums with a tradeoff of some latency and logical gates while still providing a full wild-carding solution. The parallel turbine TCAM can also be extended to higher-speed applications and fully flexible pattern matching, such as those found in Software Defined Networking (SDN) or flow switching applications.
(11) The parallel turbine TCAM described herein utilizes memory space at key_size*number of entries*2, which represents the theoretical minimum size a fully arbitrary wild card TCAM could possibly be. The parallel turbine TCAM describes an FPGA implementation of a TCAM that uses standard memory arrays organized in a columnar fashion, along with multiple key registers, to create a TCAM structure capable of performing multiple simultaneous lookups that can be configured to achieve more than one result per internal clock at a defined latency. The ability of the parallel turbine TCAM to return more than one result per clock becomes critical for speeds up to 400G and beyond, where as many as four minimum sized frames may be present across a 256 byte wide bus internally per system clock.
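For concreteness, the key_size*number of entries*2 figure can be computed with a trivial behavioral sketch (the function name is illustrative, not part of the disclosure):

```python
# Theoretical minimum TCAM storage: each of the logical lookup
# entries needs key_size bits of data plus key_size bits of
# match/don't-care mask.
def tcam_bits(key_size, num_entries):
    return key_size * num_entries * 2

# e.g. 8-bit keys and 16 logical entries need 8 * 16 * 2 = 256 bits.
```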
(12) The parallel turbine TCAM uses virtual memory barrels that rotate against a variable number of keys. The parallel turbine TCAM was improved by virtue of altering the key quantity, which in turn controlled the rate of lookups. In certain configurations and implementations, the parallel turbine TCAM can produce results that exceed one lookup per clock, making it a potential solution for 100G and 400G packet networking applications. Tangible benefits of this parallel turbine TCAM include an FPGA based building block required for high-speed rich classification, such as service aware operating system (SAOS) based devices or future SDN/flow switch devices. Another benefit is that the parallel turbine TCAM is malleable and its performance can be flexibly adjusted to provide either multiple lookups per clock or single lookup per minimum frame time slice. This allows for trading off more expensive external TCAMs and board real estate with incremental additional FPGA resources.
(13) Referring to
(14) Referring to
(15) The addresses 22 can support addresses of length N (e.g., 8—ADDR0 . . . ADDR7). The addresses 22 are stored vertically in the memory 20 (rows in the barrel). Each of the logical lookup entries 24 is two bits—a first bit for a logical value 26 corresponding to the address 22 and a second bit representing a match state 28, i.e. match or don't care. Specifically, with the ternary CAM, each entry requires two bits to account for 0, 1, or X. Thus, the logical lookup entry 24 is, at minimum, two bits wide, representing Logical 0, Logical 1, or Don't Care, so the memory width will be 2× the number of logical elements implemented.
(16) The TCAM atom 14 includes keys 30 which are of length N, equal to the number of physical addresses. The keys 30 are shown vertically and they represent the query being sought in the memory 20. In this exemplary embodiment, there are two keys 30 shown, namely key X, key Y, each with associated logical values. Answers to the query are shown horizontally in result registers 32 which have a width equal to M, corresponding to the width of the logical lookup entries 24.
(17) In operation, the TCAM atom 14 includes a read pointer 34 that is set to cycle through each of the addresses 22 across all of the logical lookup entries 24 in the memory 20. The TCAM atom 14 also includes a write pointer 36, used by a striping mechanism for host access described in additional detail herein. The read and write pointers 34, 36 are related by a latency factor, set to either free run or software enabled free run. In the example shown in
(18) In order to achieve one lookup per clock, it is required that the number of keys 30 be equal to the number of physical addresses 22. On average, the number of lookups per clock is given as #keys/physical addresses. As the equation suggests, doubling the number of keys 30 relative to the number of physical addresses 22 will result in 2 lookups per clock. If it were desired that only 2 lookups be performed per full revolution of the memory 20 (as illustrated in
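The #keys/physical addresses relationship can be sketched as follows (an illustrative function, not part of the circuit):

```python
# Average lookups per clock for a TCAM atom with a key length
# equal to the number of physical addresses N and a configurable
# number of key registers.
def lookups_per_clock(num_keys, num_addresses):
    return num_keys / num_addresses

# With N = 8 addresses, 8 keys yield one lookup per clock and
# 16 keys yield two lookups per clock; 2 keys yield one lookup
# every four clocks, i.e. two lookups per full revolution.
```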
(19) The TCAM atom 14 enables searching all of the memory 20 for the keys 30, simultaneously. Computation of the result logic represents perhaps the largest portion of logic in the TCAM atom 14. The reason for this is that comparison of a key against the current read pointer 34 data is key 30 and result specific, so with the addition of keys 30, additional result registers 32 and comparison logic are required. At search initiation, the result vector, stored in the result registers 32, is set to all ones. In
(20) The barrel functionality is implemented by the read pointer 34 cycling incrementally through the physical addresses 22 during each clock cycle. Again, there are N physical addresses 22, so when the read pointer 34 reaches address N−1, the next increment returns the read pointer 34 back to address 0, thereby finishing an entire revolution of the barrel in the TCAM atom 14.
(21) The following equation is an example computation for the result vector in the result registers 32:
Result_X[p] = new_srch ? 1'b1 : Result_X[p] & (READ_REG[2*p+1] | ~(READ_REG[2*p] ^ Key_X[RD_PTR]))  (Equation 1)
This equation is used in conjunction with a loop and assumes that a memory has been partitioned as shown in
Result_X[11] = new_srch ? 1'b1 : Result_X[11] & (READ_REG[23] | ~(READ_REG[22] ^ Key_X[RD_PTR]))
(22) TABLE-US-00001
New Result_X[11] | Current Result_X[11] | new_srch | Key | Entry 11 Bit 22 (Data) | Entry 11 Bit 23 (Mask)
1 | X | 1 | X | X | X
0 | 0 | 0 | X | X | X
1 | 1 | 0 | X | X | 1
1 | 1 | 0 | 0 | 0 | 0
0 | 1 | 0 | 1 | 0 | 0
0 | 1 | 0 | 0 | 1 | 0
1 | 1 | 0 | 1 | 1 | 0
(23) On initiation of a new search, the new_srch bit resets the Result_X for entry 11 to a logical 1. As the memory barrel turns, any mismatch will cause Result_X for entry 11 to become zero. Once the Result_X for entry 11 becomes zero, the result will remain zero, indicating that entry 11 is not a match for Key_X. This equation loses one clock every time the new_srch bit is set; in other words, it cannot compare when that bit is set.
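The per-entry behavior of Equation 1 over one barrel revolution can be modeled behaviorally in Python; this is a sketch under an assumed (data, mask) pairing per address, not the RTL itself:

```python
# Behavioral model of Equation 1 for one logical entry: the result
# bit starts at 1 on new_srch and is cleared by any unmasked
# mismatch as the read pointer sweeps addresses 0..N-1.
def search_entry(entry, key):
    """entry: list of (data_bit, mask_bit) pairs, one per address;
    key: list of key bits of the same length N."""
    result = 1  # new_srch sets the result register to all ones
    for rd_ptr, (data, mask) in enumerate(entry):
        # match when masked (don't care) or when data equals the key bit
        compare = mask | (1 ^ (data ^ key[rd_ptr]))
        result &= compare  # once zero, the result stays zero
    return result
```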
(24) An alternative equation which allows compare when the new_srch bit is set is defined as:
Result_X[11] = (new_srch | Result_X[11]) & (READ_REG[23] | ~(READ_REG[22] ^ Key_X[RD_PTR]))
(25) TABLE-US-00002
New Result_X[11] | Current Result_X[11] | new_srch | Key | Entry 11 Bit 22 (Data) | Entry 11 Bit 23 (Mask)
1 | X | 1 | 1 | 1 | X
1 | X | 1 | 0 | 0 | X
1 | X | 1 | X | X | 1
1 | X | 1 | X | X | 1
1 | 1 | X | 1 | 1 | X
1 | 1 | X | 0 | 0 | X
1 | 1 | X | X | X | 1
1 | 1 | X | X | X | 1
0 | 0 | 0 | X | X | X
0 | X | 1 | 0 | 1 | 0
0 | X | 1 | 1 | 0 | 0
0 | 1 | X | 0 | 1 | 0
0 | 1 | X | 1 | 0 | 0
(26) This equation performs the comparison on the new_srch assertion and will return the new Result_X to a 1 if a match is present, regardless of the state of the current Result_X. Equation 1 uses the term 2*p when referring to data bits within the TCAM atom 14 memory. This is due to the two-bit nature of the data stored there. In the equation, an assumption is made that odd bits contain the match/don't care setting, 0 and 1 respectively, and even bits store the logical bit match value.
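The even/odd bit convention behind the 2*p and 2*p+1 terms can be illustrated with a small sketch; modeling READ_REG as a plain Python integer is an assumption of this example:

```python
# Extract the data and mask bits for logical entry p from a read
# register in which even bits hold the logical value and odd bits
# hold the match/don't-care flag, matching the 2*p / 2*p+1 indexing.
def entry_bits(read_reg, p):
    data = (read_reg >> (2 * p)) & 1       # even bit: logical value
    mask = (read_reg >> (2 * p + 1)) & 1   # odd bit: 1 = don't care
    return data, mask

# For entry 11, the data bit is bit 22 and the mask bit is bit 23.
```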
(27) Referring to
(28) Referring to
(29) Referring to
Result_X = ResultX_0 & ResultX_1 & ResultX_2
Result_Y = ResultY_0 & ResultY_1 & ResultY_2
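These row-level ANDs generalize to any number of rows Y; as a behavioral sketch (the function name is illustrative, not from the disclosure), a result concentrator can be modeled as:

```python
# Result concentrator: a key spanning Y rows matches a logical
# entry only if every row's result bit for that entry is 1, so the
# per-row result vectors are reduced with bitwise AND.
def concentrate(row_results):
    """row_results: list of per-row result integers, one bit per entry."""
    combined = ~0  # all ones, any width
    for row_result in row_results:
        combined &= row_result
    return combined
```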
(30) Referring to
(31) Referring to
(32) In the example of
scan_in <= #TP scan_wreg[ADDR4] ? scan_out | 1'b1 << scan_waddr : scan_out & ~(1'b1 << scan_waddr);
(33) What this equation says is: read out the 256 bits from the memory at the current memory address ADDR_X, which is stored in the variable scan_out. If there is a 1 in location ADDR_X of the scan_wreg, OR a 1 left shifted by the address 22 into scan_out and save it to ADDR_X. If, on the other hand, there is a 0 in location ADDR_X of the scan_wreg, then AND the inverted pattern of the left shifted 1 over the value of scan_out.
(34) The equation above allows a read modify write action to occur based on data read from a memory and then writing back the same data, while only updating one bit. This is the crux of writing striped memory: because the memory is still traditional RAM, all of the data bits must always be written, but only one of the bits for each address is being updated while all the rest are left alone.
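The single-bit read-modify-write can be mirrored in Python to make the update explicit; this is a sketch in which the names follow the Verilog above but the function itself is illustrative:

```python
# Read-modify-write for striped host writes: read the full memory
# word, set or clear exactly one bit, and write the whole word back.
def striped_rmw(scan_out, write_bit, scan_waddr):
    """scan_out: word read at the current address; write_bit: the
    scan_wreg bit for this address; scan_waddr: bit position to update."""
    if write_bit:
        return scan_out | (1 << scan_waddr)    # set the striped bit
    return scan_out & ~(1 << scan_waddr)       # clear the striped bit
```

Applying this to the worked example below reproduces the 0x4D and 0x0D bytes shown there for a set and a clear of bit 22, respectively.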
(35) In
(36) TABLE-US-00003
ADDR 4 | 0x00010203_04050607_08090A0B_0C0D0E0F
ADDR 5 | 0x10111213_14151617_18191A1B_1C1D1E1F
ADDR 6 | 0x40414243_44454647_48494A4B_4C4D4E4F
(37) The scan_wreg[6:4] has the bits 011. Each bit offset relates to the corresponding address offset in the memory. For bit 4 and ADDR_4, a 1 needs to be loaded into entry 11 so the scan_out would capture:
(38) TABLE-US-00004 0x00010203_04050607_08090A0B_0C0D0E0F
(39) Now using the equation:
scan_in <= #TP scan_wreg[ADDR4] ? scan_out | 1'b1 << scan_waddr : scan_out & ~(1'b1 << scan_waddr);
scan_wreg[ADDR4] is 1, so the equation reduces to
scan_in <= #TP scan_out | 1'b1 << scan_waddr;
scan_in <= #TP 0x00010203_04050607_08090A0B_0C0D0E0F | 1'b1 << 22; // loading the data bit location
scan_in <= #TP 0x00010203_04050607_08090A0B_0C0D0E0F | 0x00000000_00000000_00000000_00400000;
scan_in <= #TP 0x00010203_04050607_08090A0B_0C4D0E0F; // Notice the 4D here
(40) The next address would be calculated using scan_wreg[ADDR5]=1:
scan_in <= #TP 0x10111213_14151617_18191A1B_1C1D1E1F | 0x00000000_00000000_00000000_00400000;
scan_in <= #TP 0x10111213_14151617_18191A1B_1C5D1E1F; // Notice the 5D here
(41) The next address would be calculated as follows using scan_wreg[ADDR6]=0, which makes use of the other portion of the equation and reduces to:
scan_in <= #TP scan_out & ~(1'b1 << scan_waddr);
scan_in <= #TP 0x40414243_44454647_48494A4B_4C4D4E4F & ~(0x00000000_00000000_00000000_00400000); // There are still logical ops to reduce out.
scan_in <= #TP 0x40414243_44454647_48494A4B_4C4D4E4F & (0xFFFFFFFF_FFFFFFFF_FFFFFFFF_FFBFFFFF); // Invert complete. Still need to do the AND op.
scan_in <= #TP 0x40414243_44454647_48494A4B_4C0D4E4F; // The bit at location 22 is now clear
(42) Each time this is done, the data is written back to the RAM at the appropriate location. Data is extracted from ADDR5, where the read pointer 34 is, and placed in the scan out register 42, which now holds the entire ADDR5 line.
(43) At Clock 2, ADDR4, where the write pointer 36 is placed, is loaded with the data in a scan_in register 44, the data being the result of the scan_in previously calculated during Clock 1; a new scan_in is calculated based on the ADDR5 content, e.g.
scan_in <= #TP scan_wreg[ADDR5] ? scan_out | 1'b1 << scan_waddr : scan_out & ~(1'b1 << scan_waddr);
and data is extracted from ADDR6 where the read pointer 34 is and placed in the scan out register 42, which now holds the entire ADDR6 line.
(44) At Clock 3, ADDR5, where the write pointer 36 is placed, is loaded with the data in the scan_in register 44, the data being the result of the scan_in previously calculated during Clock 2; a new scan_in is calculated based on the ADDR6 content, e.g.
scan_in <= #TP scan_wreg[ADDR6] ? scan_out | 1'b1 << scan_waddr : scan_out & ~(1'b1 << scan_waddr);
and data is extracted from ADDR7 where the read pointer 34 is and placed in the scan out register 42, which now holds the entire ADDR7 line. This process continues for all of the physical addresses 22, i.e. during a single revolution+HOST_LATENCY.
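The striping pass above can be summarized with a behavioral Python sketch that collapses the read/compute/write pipeline into one loop iteration per address; the names and the list-based memory model are assumptions of this sketch:

```python
# Behavioral sketch of one striping revolution: for each address,
# read the word, apply the single-bit read-modify-write dictated by
# the host's scan_wreg, and write the word back. The real circuit
# pipelines the read and write with a latency offset; this model
# ignores that offset since the net result per address is the same.
def striping_revolution(memory, scan_wreg, scan_waddr):
    """memory: list of words, one per address; scan_wreg: one host
    bit per address; scan_waddr: bit position to stripe."""
    for addr, word in enumerate(memory):
        if (scan_wreg >> addr) & 1:
            word |= 1 << scan_waddr      # set the striped bit
        else:
            word &= ~(1 << scan_waddr)   # clear the striped bit
        memory[addr] = word
    return memory
```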
(45) Referring to
(46) TABLE-US-00005 TABLE 1
Logical Entry | Bit Offsets
0 | 0:1
1 | 2:3
2 | 4:5
3 | 6:7
4 | 8:9
5 | 10:11
6 | 12:13
7 | 14:15
8 | 16:17
9 | 18:19
10 | 20:21
11 | 22:23
12 | 24:25
13 | 26:27
14 | 28:29
15 | 30:31
(47) In this example implementation, a match value of 0x41 (8'b01000001) has been loaded into Logical Entry 11 and has had a mask of 0x28 applied to it as well. The mask of 0x28=8'b00101000 (physical data bit 23, striped between the most significant bit (MSb) at physical address 7 and the least significant bit (LSb) at physical address 0) indicates that any match value located in physical addresses 5 and 3 will be automatically matched regardless of the value held in those corresponding locations within the logical match column (physical data bit 22, striped between the MSb at physical address 7 and the LSb at physical address 0). The match value of 0x41=8'b01000001, when masked against 0x28, means that any key that holds the value 8'b01X0X001, where the X's are considered to be any value, will qualify as a match.
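This masked-match behavior can be checked numerically with a short sketch; the function name and the (value, mask) integer representation are illustrative assumptions:

```python
# Ternary match for a whole entry: a key matches when every
# unmasked bit of the key equals the stored value bit, and masked
# (don't-care) bits match regardless of the key.
def ternary_match(value, mask, key, width=8):
    for bit in range(width):
        dont_care = (mask >> bit) & 1
        if not dont_care and ((value >> bit) & 1) != ((key >> bit) & 1):
            return False
    return True

# With value 0x41 and mask 0x28, any key of the form 8'b01X0X001
# qualifies: 0x41, 0x49, 0x61, and 0x69 all match.
```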
(48) It will be appreciated that some exemplary embodiments described herein may include one or more generic or specialized processors (“one or more processors”) such as microprocessors, digital signal processors, customized processors, and field programmable gate arrays (FPGAs) and unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the methods and/or systems described herein. Alternatively, some or all functions may be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the aforementioned approaches may be used. Moreover, some exemplary embodiments may be implemented as a non-transitory computer-readable storage medium having computer readable code stored thereon for programming a computer, server, appliance, device, etc. each of which may include a processor to perform methods as described and claimed herein. Examples of such computer-readable storage mediums include, but are not limited to, a hard disk, an optical storage device, a magnetic storage device, a ROM (Read Only Memory), a PROM (Programmable Read Only Memory), an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), Flash memory, and the like. When stored in the non-transitory computer readable medium, software can include instructions executable by a processor that, in response to such execution, cause a processor or any other circuitry to perform a set of operations, steps, methods, processes, algorithms, etc. Specifically, in an exemplary embodiment, the TCAM 10 with the associated rows 12 and atoms 14 can be realized in an FPGA or an ASIC. 
Also, the TCAM 10 can be integrated with other circuitry such as a network processor or the like.
(49) Although the present disclosure has been illustrated and described herein with reference to preferred embodiments and specific examples thereof, it will be readily apparent to those of ordinary skill in the art that other embodiments and examples may perform similar functions and/or achieve like results. All such equivalent embodiments and examples are within the spirit and scope of the present disclosure, are contemplated thereby, and are intended to be covered by the following claims.