METHOD FOR COMPRESSING SEQUENTIAL RECORDS OF INTERRELATED DATA FIELDS
20220393699 · 2022-12-08
Inventors
Cpc classification
H03M7/46
ELECTRICITY
International classification
H03M7/30
ELECTRICITY
Abstract
A method for encoding a sequence of records, each record of said sequence of records comprising a plurality of different fields, said different fields being identical for each record of said sequence of records, said method comprising selecting an encoding algorithm for each field of said plurality of fields such that said each field is associated with a selected encoding algorithm; encoding data of said each field using said selected encoding algorithm to determine encoded field data for said each field for said each record; and for said each record, interleaving said encoded field data for said each field to produce an encoded sequence of said records wherein said encoded field data are interleaved for said each record.
Claims
1. A method for encoding a sequence of records, each record of said sequence of records comprising a plurality of different fields, said different fields being identical for each record of said sequence of records, said method comprising: selecting an encoding algorithm for each field of said plurality of fields such that said each field is associated with a selected encoding algorithm; encoding data of said each field using said selected encoding algorithm to determine encoded field data for said each field for said each record; and for said each record, interleaving said encoded field data for said each field to produce an encoded sequence of said records wherein said encoded field data are interleaved for said each record.
2. The method of claim 1, wherein said plurality of different fields comprises fields having different data types.
3. The method of claim 2, wherein said different data types comprise at least two of integers, floating-point numbers, fixed-point numbers, character, Boolean, money, or date.
4. The method of claim 1, wherein said each record comprises a tuple.
5. The method of claim 4, wherein said each record comprises different measurements of an event at a given time or location, and said plurality of different fields of said each record comprises said different measurements at said given time or said location.
6. The method of claim 5, wherein said each record comprises said measurements at a given time.
7. The method of claim 6, wherein said each record is a record of an object in motion.
8. The method of claim 7, wherein said different measurements comprises at least two or more of velocity, yawl, pitch, latitude, longitude, and time stamp.
9. The method of claim 1, wherein said plurality of different fields is timeseries data.
10. The method of claim 9, wherein said selected encoding algorithm is a run-length algorithm.
11. The method of claim 10, wherein said encoding algorithms comprise at least two of varbit, varbitLT, varbit L, XOR, or delta of delta.
12. The method of claim 1, wherein said selecting a run-length encoding algorithm is performed automatically.
13. The method of claim 12, wherein said selecting a run-length encoding algorithm is performed empirically using an optimizer.
14. The method for encoding timeseries data of claim 13, wherein said selecting a run-length encoding algorithm is performed by testing different run-length encoding algorithms on a portion of said different data types to optimize run-length encoding of said each of said plurality of different data types.
15. A system for constructing histograms comprising: one or more processors for executing a plurality of instructions; a display device in communication with the one or more processors; and a storage device in communication with the one or more processors, the storage device holding the plurality of instructions, the plurality of instructions including instructions for: selecting an encoding algorithm for each field of said plurality of fields such that said each field is associated with a selected encoding algorithm; encoding data of said each field using said selected encoding algorithm to determine encoded field data for said each field for said each record; and for said each record, interleaving said encoded field data for said each field to produce an encoded sequence of said records wherein said encoded field data are interleaved for said each record.
16. A non-transitory computer-readable medium comprising instructions, which when executed by one or more processors causes said one or more processors to perform the steps comprising: selecting an encoding algorithm for each field of said plurality of fields such that said each field is associated with a selected encoding algorithm; encoding data of said each field using said selected encoding algorithm to determine encoded field data for said each field for said each record; and for said each record, interleaving said encoded field data for said each field to produce an encoded sequence of said records wherein said encoded field data are interleaved for said each record.
Description
BRIEF DESCRIPTION OF FIGURES
[0014]
DETAILED DESCRIPTION
[0015] In one embodiment, the invention relates to a method for encoding a sequence of records, each record of the sequence of records comprising a plurality of different fields, the method comprising: (a) selecting an encoding algorithm for each field of the plurality of fields such that the each field is associated with a selected encoding algorithm; (b) encoding data of the each field using the selected encoding algorithm to determine encoded field data for the each field for the each record; and (c) for the each record, interleaving the encoded field data for the each field to produce an encoded sequence of the records, wherein the encoded field data are interleaved for the each record. These steps, along with selected alternative embodiments, are described in greater detail below.
[0016] An important feature of the present invention is the interleaving of encoded field data for each record. As each record arrives to be appended to the compressed data, each field is considered, compressed independently and then encoded (i.e. interleaved) into the compressed result. By interleaving the encoded field data for each record, the interrelationship of the field data is maintained by virtue of the interrelated fields being proximate to one another. For example, assuming each record [ ] has the same fields in the same order—e.g. ABCD—then the encoded data is [A′B′C′D′][A′B′C′D′][A′B′C′D′][A′B′C′D′][A′B′C′D′] . . . . Thus, when the data is unpacked, interrelated field data are proximate to each other. Keeping interrelated field data proximate is important because of the way hierarchical computer memory works. For examples, a user can load an entire record into an L1 cache and work with it without more expensive subsequent memory accesses to L2 or higher.
[0017] Interleaving the encoded field data can be performed in various ways. In one embodiment, the interleaving uses a bit packing to minimize storage. Below is one example which describes the mechanics of interleaving encoded field data derived from different compression techniques based on reasonable presumed varbit function bit encoding lengths.
[0018] Assume a series of records with the following fields: [0019] timestamp (64 bit integer) [0020] Temperature (32 bit IEEE float) [0021] Humidity (32 bit integer)
[0022] Assume the following 4 records: [0023] 1000, 78.34, 57% [0024] 1010, 78.21, 55% [0025] 1020, 78.15, 55% [0026] 1030, 78.10, 54%
[0027] Applying the delta-of-delta+varbit run-length compression to the two integer fields and xor+varbit to the float field: [0028] 1000 . . . 78.34 . . . 57 [0029] Varbit(1010-1000) Varbit(78.23 XOR 78.21) . . . Varbit(55-57) [0030] Varbit((1020-1010)-(1010-1000)) Varbit(78.21 XOR 78.15) . . . Varbit((55-55)-(55-57)) [0031] Varbit((1030-1020)-(1020-1010)) Varbit(78.15 XOR 78.10) . . . Varbit((54-55)-(55-55))
[0032] Therefore, the first record is encoded to 64 bits+32 bits+32 bits; the second record is encoded to 7 bits+14 bits+7 bits; the third is encoded to: 1 bit+15 bits+7 bits; and the fourth is encoded to 1 bit+14 bits+7 bits. Thus, the coded series would be 128+28+23+22=201 bits, which amounts to just 26 bytes (with 7 bits of the last byte unused). Therefore, using the bit packing when interleaving the fields reduces considerably the bits used.
[0033] In one embodiment, the sequence of records have uniformly-structured fields. In other words, each record of the sequence of records has the same fields in the same order. Having records of uniformly structured fields simplifies the encoding/interleaving and eliminates the need for additional/complex algorithms to compensate for variation in fields among records.
[0034] In one embodiment, two or more of the fields of a record may have different datatypes. For example, the datatypes may comprise integers, floating-point numbers, fixed-point numbers, character, Boolean, money, or date, just to name a few. For example, a “timed position” recode may be expressed: {timestamp unsigned 64 bit integer, longitude IEEE double, latitude IEEE double}.
[0035] As is known, the type of encoding/compression used tends to depend on the datatype. Accordingly, in one embodiment, the system of the present invention comprises a library of different encoding algorithms which can be selected for a particular field to optimize the encoding of the datatype of that field. Examples of different encoding algorithms include varbit, varbitLT, varbit L, XOR, delta of delta, just to name a few. Referring back to the “timed position” example above, the compression algorithm for the timestamp field might be delta of delta using varbitLT and the longitude and latitude fields might be compressed using XOR with varbitL.
[0036] Selecting the encoding algorithm for each field may be performed in different ways. For example, in one embodiment, the selection is done manually, in which a user determines which algorithm encodes the data of a particular field most effectively and then assigns that algorithm to that field. One of skill in the art will understand how to determine the optimum algorithm for a datatype. For example, in one embodiment, this can be done by running different algorithms on a portion of the data from a particular field to determine which algorithm performs the best or otherwise provides suitable results. In another embodiment, one of skill in the art may be able to determine a suitable algorithm by observing the datatype.
[0037] In another embodiment, selecting the algorithm for a particular field is performed automatically by the system. Again, as described above, there are different ways for doing this. For example, in one embodiment, the system, comprises an optimizer for testing different algorithms on the data of a particular field to determine which algorithm performs the best or otherwise meets a threshold level of suitability.
[0038]
[0039] The computer system 100 may include one or more processors, such as, e.g., but not limited to, processor(s) 104. The processor(s) 104 may be connected to a communication infrastructure 106 (e.g., but not limited to, a communications bus, cross-over bar, interconnect, or network, etc.). Processor 104 may include any type of processor, microprocessor, or processing logic that may interpret and execute instructions (e.g., for example, a field programmable gate array (FPGA)). Processor 104 may comprise a single device (e.g., for example, a single core) and/or a group of devices (e.g., multi-core). The processor 104 may include logic configured to execute computer-executable instructions configured to implement one or more embodiments. The instructions may reside in main memory 108 or secondary memory 110. Processors 104 may also include multiple independent cores, such as a dual-core processor or a multi-core processor. Processors 104 may also include one or more graphics processing units (GPU) which may be in the form of a dedicated graphics card, an integrated graphics solution, and/or a hybrid graphics solution. Various illustrative software embodiments may be described in terms of this illustrative computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement the invention and/or parts of the invention using other computer systems and/or architectures.
[0040] Computer system 100 may include a display interface 102 (e.g., the HMI) that may forward, e.g., but not limited to, graphics, text, and other data, etc., from the communication infrastructure 106 (or from a frame buffer, etc., not shown) for display on the display unit 101. The display unit 101 may be, for example, a television, a computer monitor, a touch sensitive display device, or a mobile phone screen. The output may also be provided as sound through a speaker.
[0041] The computer system 100 may also include, e.g., but is not limited to, a main memory 108, random access memory (RAM), and a secondary memory 110, etc. Main memory 108, random access memory (RAM), and a secondary memory 110, etc., may be a computer-readable medium that may be configured to store instructions configured to implement one or more embodiments and may comprise a random-access memory (RAM) that may include RAM devices, such as Dynamic RAM (DRAM) devices, flash memory devices, Static RAM (SRAM) devices, etc.
[0042] The secondary memory 110 may include, for example, (but is not limited to) a hard disk drive 112 and/or a removable storage drive 114, representing a floppy diskette drive, a magnetic tape drive, an optical disk drive, a compact disk drive CD-ROM, flash memory, etc. The removable storage drive 114 may, e.g., but is not limited to, read from and/or write to a removable storage unit 118 in a well-known manner. Removable storage unit 118, also called a program storage device or a computer program product, may represent, e.g., but is not limited to, a floppy disk, magnetic tape, optical disk, compact disk, etc. which may be read from and written to removable storage drive 114. As will be appreciated, the removable storage unit 118 may include a computer usable storage medium having stored therein computer software and/or data.
[0043] In alternative illustrative embodiments, secondary memory 110 may include other similar devices for allowing computer programs or other instructions to be loaded into computer system 100. Such devices may include, for example, a removable storage unit 122 and an interface 120. Examples of such may include a program cartridge and cartridge interface (such as, e.g., but not limited to, those found in video game devices), a removable memory chip (such as, e.g., but not limited to, an erasable programmable read only memory (EPROM), or programmable read only memory (PROM) and associated socket, and other removable storage units 122 and interfaces 120, which may allow software and data to be transferred from the removable storage unit 122 to computer system 100.
[0044] Computer 100 may also include an input device 103 which may include any mechanism or combination of mechanisms that may permit information to be input into computer system 100 from, e.g., a user or operator. Input device 103 may include logic configured to receive information for computer system 100 from, e.g. a user or operator. Examples of input device 103 may include, e.g., but not limited to, a mouse, pen-based pointing device, or other pointing device such as a digitizer, a touch sensitive display device, and/or a keyboard or other data entry device (none of which are labeled). Other input devices 103 may include, e.g., but not limited to, a biometric input device, a video source, an audio source, a microphone, a web cam, a video camera, and/or other camera.
[0045] Computer 100 may also include output devices 115 which may include any mechanism or combination of mechanisms that may output information from computer system 100. Output device 115 may include logic configured to output information from computer system 100. Embodiments of output device 115 may include, e.g., but not limited to, display 101, and display interface 102, including displays, printers, speakers, cathode ray tubes (CRTs), plasma displays, light-emitting diode (LED) displays, liquid crystal displays (LCDs), printers, vacuum fluorescent displays (VFDs), surface-conduction electron-emitter displays (SEDs), field emission displays (FEDs), etc. Computer 100 may include input/output (I/O) devices such as, e.g., (but not limited to) input device 103, communications interface 124, connection 128 and communications path 126, etc. These devices may include, e.g., but are not limited to, a network interface card, onboard network interface components, and/or modems.
[0046] Communications interface 124 may allow software and data to be transferred between computer system 100 and external devices or other computer systems. Computer system 100 may connect to other devices or computer systems via wired or wireless connections. Wireless connections may include, for example, WiFi, satellite, mobile connections using, for example, TCP/IP, 802.15.4, high rate WPAN, low rate WPAN, 61oWPAN, ISA100.11a, 802.11.1, WiFi, 3G, WiMAX, 4G and/or other communication protocols.
[0047] In this document, the terms “computer program medium” and “computer readable medium” may be used to generally refer to media such as, e.g., but not limited to, removable storage drive 114, a hard disk installed in hard disk drive 112, flash memories, removable discs, non-removable discs, etc. In addition, it should be noted that various electromagnetic radiation, such as wireless communication, electrical communication carried over an electrically conductive wire (e.g., but not limited to twisted pair, CATS, etc.) or an optical medium (e.g., but not limited to, optical fiber) and the like may be encoded to carry computer-executable instructions and/or computer data that embodiments of the invention on e.g., a communication network. These computer program products may provide software to computer system 100. It should be noted that a computer-readable medium that comprises computer-executable instructions for execution in a processor may be configured to store various embodiments of the present invention. References to “one embodiment,” “an embodiment,” “example embodiment,” “various embodiments,” etc., may indicate that the embodiment(s) of the invention so described may include a particular feature, structure, or characteristic, but not every embodiment necessarily includes the particular feature, structure, or characteristic.
[0048] Having thus described a few particular embodiments of the invention, various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements as are made obvious by this disclosure are intended to be part of this description though not expressly stated herein, and are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only, and not limiting. The invention is limited only as defined in the following claims and equivalents thereto.