System and Method for Concurrent Indexing and Searching of Data in Working Memory
20170337003 · 2017-11-23
Inventors
Cpc classification
International classification
Abstract
Systems and methods are described herein for concurrently storing and searching an index in the working memory of a computing system. The present approach has multiple levels of storage block pools, each level made up of one or more storage block subdivided into slices, the slices being larger in size at higher pool levels, where additional storage pool blocks are allocated at a given level when there are no more slices available at that level. Further, the index is encoded as straight integer values, rather than using delta encoding or variable integer compression. The stored values can therefore be directly searchable without first having to flush the index from working memory into long term storage.
Claims
1. A method for storing an index in working memory of a computing system that can concurrently be searched, the method comprising: allocating, by the computing system, a set of storage blocks in the working memory of the computing system, the allocated storage blocks defined as being in a hierarchy with each storage block subdivided into storage slices of an increasing size at each higher level in the defined storage block hierarchy; receiving, by the computing system, a request to store in the index a postings list comprising a set of integer index values; requesting, by the computing system, a storage slice at a lowest level in the defined storage block hierarchy not already containing any integer index values; storing, by the computing system, the set of integer index values in the requested storage slice at the lowest level in the defined storage block hierarchy; storing, by the computing system, under an index value equal to a term identifier for the postings list, an offset value of the requested storage slice; and, if there are additional integer index values, from the set of integer index values, that did not fit in the requested storage slice at the lowest level in the defined storage block hierarchy, then: requesting, by the computing system, a storage slice at a next higher level in the defined storage block hierarchy not already containing any integer index values; storing, by the computer system, in the requested storage slice at the lowest level in the defined storage block hierarchy, a pointer from the requested storage slice at the lowest level in the defined storage block hierarchy to the requested storage slice at the next higher level in the defined storage block hierarchy; and, storing, by the computing system, the additional integer index values in the requested slice at the next higher level in the defined storage block hierarchy.
2. The method of claim 1 wherein: if the request, by the computing system, for the storage slice at the lowest level in the defined storage block hierarchy not already containing any integer index values failed because there were no more storage slices at the lowest level in the defined storage block hierarchy not already containing any integer values, then further comprising: allocating, by the computing system, an additional storage block in the working memory of the computing system, the allocated additional storage block defined as being at the lowest level in the defined storage block hierarchy; requesting, by the computing system, a storage slice in the allocated additional storage block defined as being at a lowest level in the defined storage block hierarchy; and wherein storing the set of integer index values in the requested storage slice at the lowest level in the defined storage block hierarchy instead stores the set of integer index values in the requested storage slice in the allocated additional storage block defined as being at the lowest level in the defined storage block hierarchy.
3. The method of claim 1, further comprising: reading, by the computing system, at the index value equal to the term identifier, the stored offset value of the requested storage slice; and, reading, by the computing system, at the read offset value, the set of integer index values in the requested storage slice.
4. The method of claim 3 further comprising: receiving, by the computing system, a request to store in the index another postings list comprising another set of integer index values; requesting, by the computing system, another storage slice at the lowest level in the defined storage block hierarchy not already containing any integer index values; and, storing, by the computing system, the another set of integer index values in the requested another storage slice at the lowest level in the defined storage block hierarchy essentially concurrently with the operation of reading the set of integer index values in the requested storage slice is occurring.
5. A non-transitory computer-readable storage medium having embodied thereon a program, the program being executable by a processor to perform a method for storing an index in working memory of a computing system that can concurrently be searched, the method comprising the steps of: allocating a set of storage blocks in the working memory of the computing system, the allocated storage blocks defined as being in a hierarchy with each storage block subdivided into storage slices of an increasing size at each higher level in the defined storage block hierarchy; receiving a request to store in the index a postings list comprising a set of integer index values; requesting a storage slice at a lowest level in the defined storage block hierarchy not already containing any integer index values; storing the set of integer index values in the requested storage slice at the lowest level in the defined storage block hierarchy; storing under an index value equal to a term identifier for the postings list, an offset value of the requested storage slice; and, if there are additional integer index values, from the set of integer index values, that did not fit in the requested storage slice at the lowest level in the defined storage block hierarchy, then: requesting a storage slice at a next higher level in the defined storage block hierarchy not already containing any integer index values; storing in the requested storage slice at the lowest level in the defined storage block hierarchy, a pointer from the requested storage slice at the lowest level in the defined storage block hierarchy to the requested storage slice at the next higher level in the defined storage block hierarchy; and, storing the additional integer index values in the requested slice at the next higher level in the defined storage block hierarchy.
6. The non-transitory computer readable medium of claim 5, wherein: if the request for the storage slice at the lowest level in the defined storage block hierarchy not already containing any integer index values failed because there were no more storage slices at the lowest level in the defined storage block hierarchy not already containing any integer values, then further comprising: allocating an additional storage block in the working memory of the computing system, the allocated additional storage block defined as being at the lowest level in the defined storage block hierarchy; requesting a storage slice in the allocated additional storage block defined as being at a lowest level in the defined storage block hierarchy: and wherein storing the set of integer index values in the requested storage slice at the lowest level in the defined storage block hierarchy instead stores the set of integer index values in the requested storage slice in the allocated additional storage block defined as being at the lowest level in the defined storage block hierarchy.
7. The non-transitory computer readable medium of claim 5, wherein the method further comprises the steps: reading at the index value equal to the term identifier, the stored offset value of the requested storage slice; and, reading at the read offset value, the set of integer index values in the requested storage slice.
8. The non-transitory computer readable medium of claim 7, wherein the method further comprises the steps: receiving a request to store in the index another postings list comprising another set of integer index values; requesting another storage slice at the lowest level in the defined storage block hierarchy not already containing any integer index values; and, storing the another set of integer index values in the requested another storage slice at the lowest level in the defined storage block hierarchy essentially concurrently with the operation of reading the set of integer index values in the requested storage slice is occurring.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0018]
[0019]
[0020]
[0021]
[0022]
DETAILED DESCRIPTION OF THE INVENTION
[0023] A method and apparatus is disclosed for creating and storing an inverted index as straight integer values into multiple levels of expandable pools that can be searched while being created and stored in a RAM buffer. This approach, referred to herein as “live indexing” due to the simultaneous searching capability, supports faster search operations by avoiding having to wait until after the RAM buffer has been flushed to long term memory.
[0024] In an embodiment of live indexing, as shown in
[0025] In an embodiment, as shown in
[0026] In an embodiment, as explained further elsewhere herein, for each posting list, a pool starting offset (which is the offset into the pool, as will be explained) is stored in an offset table (or other data structure) under an index value equal to a term identifier to which the given posting list belongs. As such, the offset in the pool of a given term is obtained by performing a lookup in the offset table using the term identifier. Further, because the term identifier (which is given at the time the posting list is updated) is also an index into the offset table, the time to perform such table lookup remains constant.
[0027] In an embodiment, live indexing maintains four separate memory pools for holding postings, as will now be explained. Conceptually, each pool can be viewed as an unbounded integer array. In practice, pools are large integer arrays of 128 kiloByte blocks allocated as 2.sup.15 positions of 4 bytes each, and if a pool fills up then another block is allocated thereby growing that pool. In each pool, slices are allocated and used to hold individual postings belonging to a term. In each pool, the slice sizes are fixed, as follows: slice size for pool level zero is 2.sup.3×4 bytes (i.e., 32 bytes, capable of holding 8 four-byte straight integer values), slice size for pool level one is 2.sup.5×4 bytes (i.e., 128 bytes, capable of holding 32 four-byte straight integer values), slice size for pool level two is 2.sup.7×4 bytes (i.e., 512 bytes, capable of holding 128 four-byte straight integer values), and slice size for pool level three is 2.sup.11×4 bytes (i.e., 8192 bytes, capable of holding 2048 four-byte straight integer values).
[0028] Referring now to
[0029] A request is made for a Pool Level Zero slice, which as has been explained can contain 2.sup.3 (8) integers. The first seven (2.sup.3−1) document IDs, along with corresponding term frequencies, are stored in the Pool Level Zero slice and the offset of the Pool Level Zero slice is stored in an offset table under an index equal to a term identifier for the postings list. Because there is more than one additional item in the posting list left to be stored (in this example, there are 2000−7=1993 additional items), another slice is needed. So, a request for a slice from Pool Level One is made, and a pointer from the Pool Level Zero slice to the Pool Level One slice is stored in the last position of the Pool Level Zero slice.
[0030] As has been explained, the Pool Level One slice can contain 2.sup.5 (32) integers. The next 31 (2.sup.5−1) document IDs, along with corresponding term frequencies, are stored in the Pool Level One slice. Because there is more than one additional item in the posting list left to be stored (in this example, there are 2000−7−31=1962 additional items), another slice is needed. So, a request for a slice from Pool Level Two is made, and a pointer from the Pool Level One slice to the Pool Level Two slice is stored in the last position of the Pool Level One slice.
[0031] As has been explained, the Pool Level Two slice can contain 2.sup.7 (128) integers. The next 127 document IDs, along with corresponding term frequencies, are stored in the Pool Level Two slice. Because there is more than one additional item in the posting list left to be stored (in this example, there are 2000−7−31−127=1835 additional items), another slice is needed. So, a request for a slice from Pool Level Three is made, and a pointer from the Pool Level Two slice to the Pool Level Three slice is stored in the last position of the Pool Level Two slice.
[0032] As has been explained, the Pool Level Three slice can contain 2.sup.11 (2048) integers. The remaining 1835 document IDs, along with corresponding term frequencies, are stored in the Pool Level Three slice. Because this Pool Level Three slice is larger enough to contain all of the remaining document IDS, along with term frequencies, no additional slices are needed to store this posting list. However, if one or more additional slices were needed in order to store more document IDs, along with term frequencies, one or more requests would then be made for additional Pool Level Three slices to contain them.
[0033] Referring now to
[0034] As can also seen by reference to the example shown in
[0035] The disclosed system and method has been explained above with reference to several embodiments. Other embodiments will be apparent to those skilled in the art in light of this disclosure. Certain aspects of the described method and apparatus may readily be implemented using configurations or steps other than those described in the embodiments above, or in conjunction with elements other than or in addition to those described above. It will also be apparent that in some instances the order of steps described herein may be altered without changing the result of performance of all of the described steps.
[0036] Further, it should also be appreciated that the described method and apparatus can be implemented in numerous ways, including as a process, an apparatus, or a system. The methods described herein may be implemented by program instructions for instructing a processor to perform such methods, and such instructions recorded on a non-transitory computer readable storage medium such as a hard disk drive, floppy disk, optical disc such as a compact disc (CD) or digital versatile disc (DVD), flash memory, etc., or communicated over a computer network wherein the program instructions are sent over optical or electronic communication links. It should be noted that the order of the steps of the methods described herein may be altered and still be within the scope of the disclosure.
[0037] These and other variations upon the embodiments described and shown herein are intended to be covered by the present disclosure, which is limited only by the appended claims.
[0038] In the foregoing specification, the invention is described with reference to specific embodiments thereof, but those skilled in the art will recognize that the invention is not limited thereto. Various features and aspects of the above-described invention may be used individually or jointly. Further, the invention can be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. It will be recognized that the terms “comprising,” “including,” and “having,” as used herein, are specifically intended to be read as open-ended terms of art.