H03M7/3088

K-MER BASED GENOMIC REFERENCE DATA COMPRESSION

A computer-implemented method includes receiving genomic data associated with a plurality of genomes and identifying k-mer sets within the genomic data. The method includes constructing a k-mer subset tree according to the following process: performing iterative pairwise comparisons on the k-mer sets, wherein the iterative pairwise comparisons identify fragments with the most shared k-mers, merging the identified fragments into non-leaf nodes of the k-mer subset tree, and placing each remaining k-mer into a leaf node of the k-mer subset tree. The method includes storing the k-mer subset tree. A computer program product for data compression includes a computer readable storage medium having program instructions embodied therewith. The program instructions are executable by a computer to cause the computer to perform the foregoing method. A system includes a processor and logic. The logic is configured to perform the foregoing method.
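The tree construction described above can be illustrated with a minimal sketch. It assumes a greedy strategy in which the two nodes sharing the most k-mers are merged first; the function names (`kmer_set`, `build_subset_tree`, `leaf_kmers`) and the node layout are hypothetical and are not taken from the patent.

```python
from itertools import combinations


def kmer_set(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return frozenset(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_subset_tree(kmer_sets):
    """Greedily merge the two nodes that share the most k-mers (assumed strategy).

    Shared k-mers move up into a new non-leaf node; each child keeps
    only the k-mers that were not shared.
    """
    nodes = [{"name": name, "kmers": set(kmers), "children": []}
             for name, kmers in sorted(kmer_sets.items())]
    while len(nodes) > 1:
        a, b = max(combinations(nodes, 2),
                   key=lambda pair: len(pair[0]["kmers"] & pair[1]["kmers"]))
        shared = a["kmers"] & b["kmers"]
        a["kmers"] -= shared
        b["kmers"] -= shared
        parent = {"name": f"({a['name']},{b['name']})",
                  "kmers": shared, "children": [a, b]}
        nodes = [n for n in nodes if n is not a and n is not b] + [parent]
    return nodes[0]


def leaf_kmers(tree, name, inherited=frozenset()):
    """Rebuild one genome's k-mer set as the union along the root-to-leaf path."""
    current = inherited | tree["kmers"]
    if not tree["children"]:
        return current if tree["name"] == name else None
    for child in tree["children"]:
        found = leaf_kmers(child, name, current)
        if found is not None:
            return found
    return None
```

In this sketch each shared k-mer is stored once at the node where the merge happened, and a genome's full set is recovered by walking from the root down to its leaf.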

Data-driven reduction of log message data

Techniques are provided for data-driven reduction of log message data. An exemplary method comprises: obtaining log files and user-specified configuration parameters, wherein the log files each comprise one or more log messages; generating an event count matrix indicating a number of times each of a plurality of unique messages appeared in a given log file of the log files; generating a correlation graph by inserting similar messages with a mutual undirected edge, wherein similar messages are identified based on a predefined similarity measure; extracting redundant messages from the correlation graph by selecting log messages for inclusion in an uninformative log message filter from sub-graphs of the correlation graph in which any two nodes are connected together, except those log messages satisfying a predefined message frequency criterion; and identifying one or more redundant messages using the uninformative log message filter. The uninformative log message filter is optionally applied to real-time log messages and/or existing file systems.
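The pipeline above (count matrix, correlation graph, filter) can be sketched as follows. Several choices here are assumptions made only for illustration: cosine similarity stands in for the "predefined similarity measure", only connected components that are fully connected are considered, and the frequency criterion is taken to mean "keep the most frequent message of each group". All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def event_count_matrix(log_files):
    """Map each unique message to its per-file count vector (files in sorted order)."""
    names = sorted(log_files)
    counts = {name: Counter(log_files[name]) for name in names}
    messages = sorted({m for c in counts.values() for m in c})
    return {m: [counts[name][m] for name in names] for m in messages}


def cosine(u, v):
    """Cosine similarity of two count vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correlation_graph(matrix, threshold):
    """Connect every pair of messages whose similarity reaches the threshold."""
    graph = {m: set() for m in matrix}
    for a, b in combinations(matrix, 2):
        if cosine(matrix[a], matrix[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def uninformative_filter(matrix, graph):
    """From each fully connected component, filter all but the most frequent message."""
    filtered, seen = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        is_clique = all(graph[n] >= component - {n} for n in component)
        if is_clique and len(component) > 1:
            # Assumed frequency criterion: the most frequent message is kept.
            keep = max(sorted(component), key=lambda m: sum(matrix[m]))
            filtered |= component - {keep}
    return filtered
```

Messages in the returned set would then be dropped from incoming or stored logs, while the retained representative still signals that the group occurred.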

Dynamic dictionary-based network compression
11863207 · 2024-01-02

Methods and systems for providing dynamic dictionary-based compression and decompression are described herein. A computing device may receive, during a currently running session with a client device, a plurality of messages. The computing device may determine, based on the plurality of messages, one or more frames. The computing device may determine, based on the one or more frames, data samples. The computing device may compress the one or more frames based on a compression dictionary. The computing device may train, during the currently running session, the compression dictionary based on the determined data samples, to create a new compression dictionary. The computing device may determine, during the currently running session and based on receiving additional messages, one or more additional frames. In addition, the computing device may compress the one or more additional frames based on the new compression dictionary.
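A rough sketch of the session flow, using the preset-dictionary support in Python's `zlib` module. The "training" step here is a deliberately naive assumption (append recent samples and keep the last 32 KiB); the patent does not specify this, and the helper names are hypothetical.

```python
import zlib

MAX_DICT = 32 * 1024  # DEFLATE can only reference a 32 KiB window


def train_dictionary(old_dict, samples):
    """Naive stand-in for training: append samples, keep the newest 32 KiB."""
    return (old_dict + b"".join(samples))[-MAX_DICT:]


def compress_frame(frame, dictionary):
    """Compress one frame, priming the compressor with the dictionary if present."""
    comp = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
    return comp.compress(frame) + comp.flush()


def decompress_frame(blob, dictionary):
    """Decompress one frame; the same dictionary must be supplied."""
    dec = zlib.decompressobj(zdict=dictionary) if dictionary else zlib.decompressobj()
    return dec.decompress(blob) + dec.flush()
```

Both endpoints must hold the same dictionary at the same point in the session, so a real protocol would also need to signal when the new dictionary takes effect.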

Page filtering via compression dictionary filtering

Page filtering in a database using a compression dictionary. A page of a database table is compressed, creating a compression dictionary. The compression dictionary includes entries with a byte sequence from the page and a compression symbol associated with the byte sequence. A part of the compressed page, the compression dictionary, and a page symbol list with compression symbols from the dictionary present in the part of the page, are received. A query having a predicate with a predicate value is received. A predicate symbol list, including symbols in the dictionary whose byte sequences at least partially match the predicate value, is generated. Based on the predicate symbol list and the page symbol list, it is determined that at least one symbol from the predicate symbol list is also present in the part of the page. The query is performed by evaluating the predicate on the part of the page.
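The symbol-list comparison can be sketched as below. Two points are assumptions for illustration: "at least partially match" is interpreted as containment in either direction, and a page is skipped when the two symbol lists are disjoint, which is only safe if every stored value is dictionary-encoded. The names `predicate_symbols` and `should_evaluate` are hypothetical.

```python
def predicate_symbols(dictionary, value):
    """Symbols whose byte sequence contains, or is contained in, the predicate value."""
    return {symbol for symbol, seq in dictionary.items()
            if seq in value or value in seq}


def should_evaluate(page_symbols, dictionary, value):
    """True when the page part uses at least one symbol from the predicate symbol list."""
    return not predicate_symbols(dictionary, value).isdisjoint(page_symbols)
```

For example, with `dictionary = {1: b"Berlin", 2: b"Paris", 3: b"London"}` and a page part that uses symbols `{2, 3}`, a predicate on `b"Paris"` triggers evaluation, while a predicate on `b"Berlin"` does not.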

Dynamic dictionary-based data symbol encoding
10897270 · 2021-01-19

A dynamic dictionary-based data symbol encoder. A dynamic dictionary data structure is populated with evictable dictionary entries. The evictable dictionary entries are encoded with a dictionary index that is shorter than an original representation of the input symbols. A reference count evicts dictionary indices when eligible for eviction. Through building a dynamic symbol dictionary which is much smaller than (global) alphabet size, locally repetitive symbols can be effectively compressed using the dictionary. The dictionary is also dynamically built along with the compression/decompression process and therefore does not carry overhead. However, tables/trees might be appended to enable entropy decoding. The method is also readily combined with the popular LZ77 and its variant encoding methods into composite one-pass encoding algorithms to achieve superior performance.
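A minimal sketch of a dictionary that encoder and decoder build in lockstep, so it never has to be transmitted. Least-recently-used eviction is an assumption that stands in for the reference-count scheme the abstract mentions; the class and function names are hypothetical.

```python
from collections import OrderedDict


class DynamicDictionary:
    """Fixed-capacity symbol dictionary with LRU eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity   # index -> symbol
        self.index_of = {}               # symbol -> index
        self.recency = OrderedDict()     # indices, least recently used first

    def touch(self, index):
        self.recency.move_to_end(index)

    def insert(self, symbol):
        if len(self.index_of) < self.capacity:
            index = len(self.index_of)
        else:
            # Evict the least recently used entry and reuse its index.
            index, _ = self.recency.popitem(last=False)
            del self.index_of[self.slots[index]]
        self.slots[index] = symbol
        self.index_of[symbol] = index
        self.recency[index] = None
        return index


def encode(symbols, capacity=4):
    """Emit ('idx', i) for symbols in the dictionary, ('lit', s) otherwise."""
    d, out = DynamicDictionary(capacity), []
    for s in symbols:
        index = d.index_of.get(s)
        if index is None:
            d.insert(s)
            out.append(("lit", s))
        else:
            d.touch(index)
            out.append(("idx", index))
    return out


def decode(tokens, capacity=4):
    """Mirror the encoder's dictionary updates to recover the symbols."""
    d, out = DynamicDictionary(capacity), []
    for kind, value in tokens:
        if kind == "lit":
            d.insert(value)
            out.append(value)
        else:
            d.touch(value)
            out.append(d.slots[value])
    return out
```

Because the index space is limited to `capacity` entries, each `('idx', i)` token can be written with far fewer bits than a literal drawn from the full alphabet.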

FAST EVALUATION OF PREDICATES AGAINST COMPRESSED DATA
20210006262 · 2021-01-07

Evaluating LIKE predicates against compressed data. An alphabet, a LIKE predicate, a compressed string, and a compression dictionary for the compressed string are received. Entries in the compression dictionary include a character string and an associated token. The LIKE predicate is converted to an equivalent pattern matching form, involving a search pattern of length m. For each character of the alphabet that appears in a string associated with a token, a mask of predetermined length k is created. For each entry in the compression dictionary a cumulative mask of length k is computed. A bit vector of length k is initialized, based on the search pattern. Successive tokens in the compressed string are processed using a logical shift of the bit vector and a bitwise operation of the bit vector with the cumulative mask associated with the token.

Block compression of tables with repeated values
10884987 · 2021-01-05

Methods and apparatus, including computer program products, for block compression of tables with repeated values. In general, value identifiers representing a compressed column of data may be sorted to render repeated values contiguous, and block dictionaries may be generated. A block dictionary may be generated for each block of value identifiers. Each block dictionary may include a list of block identifiers, where each block identifier is associated with a value identifier and there is a block identifier for each unique value in a block. Blocks may have standard sizes and block dictionaries may be reused for multiple blocks.
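The block-dictionary idea can be sketched as follows, assuming fixed-size blocks and block identifiers that are simply positions in each block's dictionary. The function names are hypothetical, and the sketch does not track the row reordering that sorting implies.

```python
def block_compress(value_ids, block_size):
    """Sort value identifiers, then encode each block against its own small dictionary."""
    ordered = sorted(value_ids)  # sorting makes repeated values contiguous
    blocks = []
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        block_dict = sorted(set(block))                     # unique value ids in this block
        position = {v: i for i, v in enumerate(block_dict)}
        blocks.append((block_dict, [position[v] for v in block]))
    return blocks


def block_decompress(blocks):
    """Expand block identifiers back into value identifiers."""
    return [block_dict[b] for block_dict, ids in blocks for b in ids]
```

A block holding only two distinct values needs one bit per entry regardless of how large the underlying value identifiers are.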

DETERMINING A STATE OF A NETWORK
20200401695 · 2020-12-24

Certain examples described herein relate to components of a network computer system. These components may be one or more of a client computing device and a server computing device communicatively coupled to each other over a network. An example client computing device has a data storage device storing a plurality of files and a system agent. The system agent operates to apply a hash function to binary data read from the plurality of files to generate a set of data signatures. An example server computing device has a database interface to access a database representing a state of the network and data storage to store a set of exemplar data signatures resulting from a scan of one or more exemplar computing devices, each data signature being generated by applying a hash function to binary data representing a file. The client computing device is configured to receive the set of exemplar data signatures and compare these with the generated set of data signatures. The client computing device is also configured to transmit data to the server computing device based on the comparison. The server computing device is configured to obtain data received from the client computing device and update records in the database.
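The client-side comparison can be sketched as below. SHA-256 is an assumption, since the abstract only says "a hash function", and file contents are passed in as bytes to keep the example self-contained. The function names are hypothetical.

```python
import hashlib


def signature(data):
    """Data signature of a file's binary contents (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def scan(files):
    """Map each file name to the signature of its contents."""
    return {name: signature(data) for name, data in files.items()}


def unmatched(client_signatures, exemplar_signatures):
    """Files whose signature does not appear in the exemplar set."""
    return {name: sig for name, sig in client_signatures.items()
            if sig not in exemplar_signatures}
```

Only the entries returned by `unmatched` would need to be transmitted to the server, which keeps the report small when most files match the exemplar scan.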

ADVANCED DATABASE COMPRESSION
20200403633 · 2020-12-24

A method, a system, and a computer program product for executing a database compression. A compressed string dictionary having a block size and a front coding bucket size is generated from a dataset. Front coding is applied to one or more buckets of strings in the dictionary having the front coding bucket size to generate one or more front coded buckets of strings. One or more portions of the generated front coded buckets of strings are concatenated to form one or more blocks having the block size. Each block is compressed. A set of compressed blocks is stored. The set of the compressed blocks stores all strings in the dataset.
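Front coding followed by block compression can be sketched as follows. The serialization (JSON), the compressor (`zlib`), and measuring block size in buckets rather than bytes are all assumptions made for illustration; the function names are hypothetical.

```python
import json
import zlib


def front_code(bucket):
    """Encode sorted strings as (shared prefix length, remaining suffix) pairs."""
    coded, prev = [], ""
    for s in bucket:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        coded.append((n, s[n:]))
        prev = s
    return coded


def front_decode(coded):
    """Invert front_code by reusing the prefix of the previous string."""
    out, prev = [], ""
    for n, suffix in coded:
        prev = prev[:n] + suffix
        out.append(prev)
    return out


def build_dictionary(strings, bucket_size=4, buckets_per_block=2):
    """Front-code fixed-size buckets, group them into blocks, compress each block."""
    ordered = sorted(set(strings))
    buckets = [front_code(ordered[i:i + bucket_size])
               for i in range(0, len(ordered), bucket_size)]
    blocks = []
    for i in range(0, len(buckets), buckets_per_block):
        payload = json.dumps(buckets[i:i + buckets_per_block]).encode("utf-8")
        blocks.append(zlib.compress(payload))
    return blocks


def read_dictionary(blocks):
    """Decompress every block and decode its buckets back into strings."""
    strings = []
    for block in blocks:
        for coded in json.loads(zlib.decompress(block)):
            strings.extend(front_decode(coded))
    return strings
```

Because each bucket starts with a full string, a lookup only has to decompress one block and decode one bucket rather than the whole dictionary.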

DEVICES AND METHODS FOR COMPRESSION AND DECOMPRESSION
20200395953 · 2020-12-17

A device for compressing first data which are to be compressed comprises a control unit configured to compress the first data based upon further data to obtain compressed data. The control unit is configured to provide memory area information indicative of a memory location of the further data.