Systems and methods for compressing structured data via latent variable estimation
12531575 ยท 2026-01-20
Assignee
Inventors
- Andrea Montanari (Stanford, CA, US)
- Joseph Gardi (Mountain View, CA, US)
- Eric Mclaughlin Weiner (Los Angeles, CA, US)
Cpc classification
International classification
Abstract
A computer-implemented method for compressing structured data or semi-structured data is provided. The method comprises: (a) estimating one or more latent variables associated with rows or columns of the structured data or semi-structured data; (b) partitioning the structured data or semi-structured data in one or more blocks according to the one or more one or more latent variables; (c) applying a sequential encoding algorithm to each of the blocks; and (d) appending a compressed encoding of the one or more latent variables.
Claims
1. A computer-implemented method for compressing structured data or semi-structured data, comprising: (a) estimating latent variables associated with rows or columns of the structured data or semi-structured data; (b) partitioning the structured data or the semi-structured data into one or more blocks according to the latent variables; (c) applying a sequential encoding algorithm to each of the one or more blocks; and (d) appending a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.
2. The computer-implemented method of claim 1, wherein the latent variables comprise row latent variables associated with the rows and column latent variables associated with the columns.
3. The computer-implemented method of claim 1, wherein the latent variables are estimated utilizing a spectral clustering algorithm.
4. The computer-implemented method of claim 1, wherein the latent variables are estimated using side information.
5. The computer-implemented method of claim 4, wherein the side information comprises a column datatype, a column name, or a row name.
6. The computer-implemented method of claim 1, further comprising, prior to (b), reordering the rows and columns of the structured data or semi-structured data based at least in part on the latent variables.
7. The computer-implemented method of claim 1, further comprising, prior to (c), generating a serialized block vector for each of the one or more blocks.
8. Computer-implemented method of claim 7, wherein the sequential encoding algorithm is applied to each serialized block vector.
9. The computer-implemented method of claim 8, wherein applying the sequential encoding algorithm comprises applying a first base compressor to each serialized block vector and a second base compressor to a latent variables of each serialized block vector.
10. The computer-implemented method of claim 9, wherein the first base compressor and the second base compressor are different.
11. The computer-implemented method of claim 1, wherein the structured data or semi-structured data is associated with hyperspectral imaging, image processing, quantum chemistry, or large language models.
12. The computer-implemented method of claim 1, wherein the structured data or semi-structured data is associated with tabular data.
13. The computer-implemented method of claim 1, wherein the method achieves an optimal compression rate.
14. The computer-implemented method of claim 1, wherein the method reduces a compression rate by at least about 5% compared to frequency-based entropy encoders (ANS), Lempel-Ziv encoders, or finite-state encoders.
15. A non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a lossless compression algorithm, comprising: (a) estimating latent variables associated with rows or columns of a structured data or semi-structured data; (b) partitioning the structured data or semi-structured data into one or more blocks according to the latent variables; (c) applying a sequential encoding algorithm to each of the one or more blocks; and (d) appending a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.
16. The non-transitory computer readable medium of claim 15, wherein the latent variables comprise row latent variables associated with the rows and column latent variables associated with the columns.
17. The non-transitory computer readable medium of claim 15, wherein the latent variables are estimated utilizing a spectral clustering algorithm.
18. The non-transitory computer readable medium of claim 15, wherein the latent variables are estimated using side information.
19. The non-transitory computer readable medium of claim 18, wherein the side information comprises a column datatype, a column name, or a row name.
20. A computer program product for compressing structured data or semi-structured data, the computer program product comprising at least one non-transitory computer-readable medium having computer-readable program code portions embodied therein, the computer-readable program code portions comprising: an executable portion configured to estimate variables associated with rows or columns of the structured data or semi-structured data; an executable portion configured to partition the structured data or the semi-structured data into one or more blocks according to the latent variables; an executable portion configured to apply a sequential encoding algorithm to each of the one or more blocks; and an executable portion configured to append a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also figure and FIG. herein), of which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
DETAILED DESCRIPTION
(9) The present disclosure provides improved methods for compressing structured data or semi-structured data. In some embodiments, the method comprises (a) estimating latent variables associated with rows or columns of the structured data or semi-structured data. In some embodiments, the method comprises (b) partitioning the structured data or the semi-structured data into one or more blocks according to the latent variables. In some embodiments, the method comprises (c) applying a sequential encoding algorithm to each of the one or more blocks. In some embodiments, the method comprises (d) appending a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.
(10) Overview
(11) Data in structured or semi-structured format, which can be used for analytics and machine learning, may take the form of tables with, e.g., categorical or numerical entries. An unmet need exists for an improved lossless compression algorithm particularly for structured data or semi-structured data. The present disclosure provides systems and methods, which can improve the lossless compression performance over other methods or algorithms.
(12) For example, the present disclosure provides a probabilistic model to identify optimal compression for data, e.g., structured or tabular data. In some cases, latent values can be independent and table entries can be conditionally independent given the latent values. The probabilistic model can be characterized with a well-defined entropy rate to determine the ideal compression rate and to satisfy an asymptotic equipartition property. By applying the probabilistic model to different datasets, the latent estimation approach disclosed herein can achieve the optimal rate. In contrast, other compression methods, e.g., Lempel-Ziv or finite-state encoders, may not be able to achieve the optimal rate.
(13) In some cases, latent variables can be used in data compression methods based on machine learning and probabilistic modeling. Generally, latent variables in statistical models can include random variables that are unmeasured but may be unmeasurable. Latent variables can be included in statistical models for, in some cases, (1) including in the model features of interest that may not be directly measurable or were not measured; (2) constructing estimators that may be more efficient than those constructed from nonlatent variables models; and (3) constructing estimators of manipulation statistics that are unbiased.
(14) For example, stochastically generated codewords (e.g., random latents) can lead to minimum description lengths via bits back coding. This method can be explicitly applied to lossless data compression using arithmetic coding or ANS coding. For example, data compression via low-rank approximation or latents-based approaches can be applied to numerical analysis applications, hyperspectral imaging, image processing (2D or 3D), quantum chemistry, large language or machine learning models, artificial intelligence (AI) models, audio processing, cryptography, genetics or genome data, and the like. However, some latents-based approaches may mainly center on lossy compression and do not precisely quantify compression rate, e.g., they do not count bits.
(15) The present disclosure can provide an improvement in lossless compression rate over other methods. For example, the present disclosure provides a family of lossless compression algorithms for data, e.g., structured or semi-structured data. In some embodiments, the methods herein may comprise: (i) estimating latent variables associated to rows and columns of the table; (ii) partitioning the table in blocks according to the row or column latents; (iii) applyig a sequential encoding algorithm (e.g., Lempel-Ziv compression or entropy coding) to each of the blocks; and (iv) appending a compressed encoding of the latent. Evaluation of the methods on several bench-mark datasets show that latent estimation and row or column reordering can improve compression rate.
(16) In some cases, methods disclosed herein can improve upon mathematical algorithms, e.g., in compressing tabular data, by providing an asymptotically optimal algorithm that overcomes technical problems of other methods or algorithms. The lossless compression algorithms disclosed herein may be applied to various types of documents or data such as documents or data having comma-separated values (CSV) data, tab-separated values (TSV) data, Apache Parquet data, Apache optimized row columnar (ORC) data, Apache Feather data, Microsoft Excel (XLS) data, and the like. In some cases, the lossless compression algorithms disclosed herein may be applied to a document comprising semi-structured data (e.g., XML documents, JSON files, NoSQL databases, HTML code, graphs and tables, emails, etc.) or structured data (e.g., tabular data).
(17) 1 Introduction
(18) Some methods of lossless compression may assume that data takes the form of a random vector X.sup.N=(X.sub.1, X.sub.2, . . . , X.sub.N) of length N with entries in a finite alphabet X. Under suitable ergodicity assumptions and using the Shannon-McMillan-Breiman theorem, the entropy per letter may converge to a limit as h:=lim.sub.N.fwdarw.(X.sup.N)/N. Universal coding schemes (e.g., Lempel-Ziv coding) may not require knowledge of the distribution of X.sup.N and can encode such a sequence without information loss using (e.g., asymptotically) h bits per symbol. While this theory is mathematically useful, its modeling assumptions (e.g., stationarity, ergodicity, asymptotia, and the like) may not be satisfied by actual data in many applications.
(19) For example, consider a data table with m rows and n columns and entries in ,
:=
. An approach to such data can include: (i) serializing, e.g., in row-first order, to form a vector of length N=
, X.sup.N=(X.sub.11, X.sub.12, . . . , X.sub.in, X.sub.21, . . . ,
) and (ii) applying a standard compressor (e.g., Lempel-Ziv) to this vector.
(20) It can be shown, both empirically and mathematically, that such an approach can be suboptimal in the sense of not achieving the optimal compression rate, e.g., data reduction rate (DRR). This can happen even in the limit of large tables, as long as the number of columns and rows are polynomially related (e.g., ) for some small constant & and large constant M).
(21) The present disclosure provides improved methods for compressing structured data or semi-structured data. In some cases, the method may use the following general scheme of operations 1, 2, and 3:
(22) 1. Estimating row latents =(u.sub.1, . . . , u.sub.m)
and column latents
=(v.sub.1, . . . , v.sub.n)
, with
, a finite alphabet. For example,
(23) 2. Partitioning the table in blocks according to the row or column latents. For example, for each pair of latents values ,
, let R(u) be the subset of rows with latent value
=u and C(v) be the subset of columns with latent value
=v. Construct the table X_M (u, v) comprising the rows R (u) and columns C (v), in the same order as they appear in the original table. Then construct the vector or sequence X (u, v) by serializing X_M (u, v). In some cases, serializing the table may comprise scanning the entries of X_M (u, v) in row-first order or in column-first order. This operation can be performed by:
(24)
where vec (M) denote the serialization of matrix M (either row-wise or column-wise).
(25) As illustrated in
(26) 3. Applying a base compressor, which can be generically denoted by Zx: .fwdarw.{0, 1}*, to each block X (u, v):
(27)
(28) 4. Encoding the row latents and column latents using a possibly different compressor :
*.fwdarw.{0,1} *, to get z.sub.row=Z.sub.L(
), Z.sub.col=
(
). For example, as illustrated in
(29)
where denotes concatenation, and header is a header that contains encodings of the lengths of subsequent segments (e.g., ||.sup.2+2 integers). The compressed vectors 660 may then be serialized 661 or concatenated to form a final vector 670.
(30) In some cases, encoding the latents can lead to a suboptimal compression rate. For example, techniques such as bits-back coding can merely yield limited improvement towards obtaining an optimal compression rate. It can be shown that the compression rate improvement achieved by such bits-back coding may only significant in certain special regimes, which is described herein elsewhere.
(31) In some cases, methods described herein can leave several design choices undefined. For example, design choices can include: (1) the latents estimation procedure; (2) the base compressor Z.sub.x for the blocks X(,
); or (3) the base compressor
, for the latents. Implementation alongside empirical evaluation is described herein elsewhere.
(32) 2 Implementation
(33) 2.1 Base compressors
(34) Dictionary-based compression (LZ). As an example, Zstandard (ZSTD) can be used in its Python implementation. ZSTD can implement a Lempel-Ziv style algorithm.
(35) Frequency-based entropy coding (ANS). For each data portion (e.g., each block X(,
) and each of the row latents
and column latents
), compute empirical frequencies of the corresponding symbols. For example, for all
,
,
, compute:
(36)
where N(,
) is the number of im, jn such that
=u,
=v. Apply ANS coding to each block X (u, v), modeling its entries as independent with distribution {circumflex over (Q)}(.Math.|u, v), to the row latents
using the distribution {circumflex over (r)} (.Math.), and to the column latents
using the distribution (.Math.). Separately encode these counts as long integers and prepend them to the file. In some cases, they can be a negligible fraction of the file size.
2.2 Latent Estimation
(37) In some cases, the method implements latents estimation using a spectral clustering algorithm. In some cases, the latents may be estimated using side information. Side information can include columns datatype, column names, row names, and the like.
(38) In some cases, the algorithm may encode the data matrix as an
real-valued matrix
using a map :
.fwdarw.
. In some cases, this map may not be optimized. In some cases, the elements of
can be arbitrarily encoded as 0, 1, . . . |
|1.
(39) In some cases, the singular vector calculation can be the most time consuming part of the algorithm. Computing approximate singular vectors via power iteration can require at least log (m{circumflex over ()}n) matrix vector multiplications for each of k vectors. This can amount to mnk log (m{circumflex over ()}n) operations, which can be larger than the time needed to compress the blocks or to run KMeans. The methods disclosed herein can improve this to (mVn) klog (mVn) n) via subsampling operations, which are described herein elsewhere.
(40) In some cases, for the spectral clustering operations, the method may use KMeans with k clusters, initialized randomly. For example, the algorithm may use the scikit-learn implementation via sklearn.cluster.KMeans. In some cases, the overall latent estimation approach may not estimate or make use of the model Q(.Math.|u, v). In some cases, an alternative approach based on an expectation-maximization (EM) algorithm may use such an estimation.
(41) 3 Empirical Evaluation
(42) Methods disclosed herein were evaluated on tabular datasets with different origins described below. The evaluation can be used to assess the impact of using latents in reordering columns and rows. In some cases, the evaluation may not attempt to achieve the best possible data reduction rate (DRR) on each dataset but rather to compare compression with latents and without latents. In some cases, the evaluation processed categorical variables by preprocessing the data to fit in this setting as described herein elsewhere. In some cases, the preprocessing included dropping some of the columns of the original table. The number of columns before preprocessing is denoted by , and after by
. Differences between the implementation used in the evaluation and the methods described herein elsewhere are conceptually minor but can be practically important. Differences included: using different sizes for rows latent alphabet and column latent alphabet |
||
| and choosing |
|, |
| by optimizing compression size.
(43) 3.1 Datasets
(44) The following datasets were utilized in the evaluation: taxicab, network, card transactions, business price index, forest, US census, and jokes.
(45) Taxicab. This table has =62, 495 rows,
=20 columns comprising data for taxi rides in New York City during January 2022. After preprocessing, this table had
=18 columns. For the LZ (ZSTD) compressor, 9 row latents were used. For the ANS compressor, 8 row latents were used. Both methods used
column latents with each column compressed separately.
(46) Network. Four social networks from Stanford Network Analysis Platform (SNAP) Datasets, representing either friends as undirected edges for Facebook (FB) or directed following relationships on Google Plus (GP). It regards these as four distinct tables with 0-1 entries, with dimensions, respectively =
{333, 747, 786, 1187}. For each table, the evaluation used 5 row latents and 5 column latents.
(47) Card transactions. A table of simulated credit card transactions containing information like card ID, merchant city, zip code, etc. This table has =24,386,900 rows and
=15 columns. After preprocessing, the table had
=12 columns. For this table, the evaluation used 3 row latents and n column latents.
(48) Business price index. A table of the values of the consumer price index of various goods in New Zealand between 1996 and 2022. This table has =72,750 rows and
=12 columns from the Business price indexes: March 2022 quarter-CSV file. After preprocessing, this table had n=10 columns. Due to the highly correlated nature of consecutive rows, the data was shuffled before compressing. For the LZ method, the evaluation used 4 row latents. For the ANS method, the evaluation used 7 row latents. Both methods used
column latents with each column compressed separately.
(49) Forest. This table from the UC Irvine (UCI) Machine Learning Repository has =581,011 cartographic measurements with
=55 attributes, to predict forest cover type based on information gathered from US Geological Survey. The data contained binary qualitative variables and some continuous values like elevation and slope. After preprocessing, this data had
=55 columns. For the LZ method, the evaluation used 9 row latents. For the ANS method, the evaluation used 8 row latents. Both methods used n column latents with each column compressed separately
(50) US Census. This table from the UCI Machine Learning Repository has =2,458,285 and
=68 categorical attributes related to demographic information, income, and occupation information. After preprocessing, this data had
=68 columns. For this data, the evaluation used 9 row latents and n column latents.
(51) Jokes. A table containing ratings of a series of jokes by 24,983 users collected between April 1999-May 2003. These ratings are real numbers on a scale from 10 to 10, and a value of 99 is given to jokes that were not rated. This table has =23,983 rows and
=101. The first column identifies how many jokes were rated by a user, and the rest of the columns contain the ratings. After preprocessing, this data had
=101 columns, all quantized. For the LZ method, the evaluation used 2 row latents. For the ANS method, the evaluation used 5 row latents. Both methods used n column latents.
(52) 3.2 Preprocessing
(53) Methods disclosed herein may preprocess different columns as follows: if a column comprises K256 unique values, then it maps the values to {0, . . . ,K1}; if a column is numerical and comprises more than 256 unique values, the method may calculate the quartiles for the data and map each entry to its quartile membership (e.g., 0 for the lowest quartile, 1 for the next largest, 2 for the next, and 3 for the largest); and if a column does not meet either of the above criteria, it may be discarded.
(54) In some evaluations, the data was randomly permuted before compression because some of the above datasets had rows already ordered in a way that can make nearby rows highly correlated.
(55) 3.3 Results
(56) Given a lossless encoder : .fwdarw.{0,1} *, its compression rate and data reduction rate (DRR) can be defined as:
(57)
where larger DRR generally means better compression.
(58) The DRR of each algorithm is illustrated in
(59) The results in
(60) 4 a Probabilistic Model
(61) In order to better understand the technical problems of other approaches and the improved optimality of latent-based compression, the presented disclosure provides a probabilistic model for the table . The model assumes the true latents (
).sub.im,
to be independent random variables with:
(62)
(63) Also, the model assumes that the entries , are conditionally independent given
=(
).sub.im,
=(
).sub.jn, with:
(64)
(65) The distributions r, c, and conditional distribution Q are parameters of the model with a total of 2(||1)+|
|.sup.2(|
|1) real parameters. In some cases, (X.sup.m,n,
,
)
(Q, r, c;
,
) can indicate that the triple (
,
,
) is distributed according to the probabilistic model, occasionally omits a subset of the variables, and thus results as (
)
(Q,r,c;
,
). In some cases, the statements may be non-asymptotic, in which case
,
,
,
, Q, r, c can be fixed. In some cases, the statements may be asymptotic, in which case a sequence of problems can be indexed by
.
(66)
(67) As an example of Symmetric Binary Model (SBM), the following SBM can be used, which parallels the symmetric stochastic block model for community detection. For example, take =[k]: ={1, . . . , k},
={0,1}, r=c=Unif ([k]), the uniform distribution over [k], and
(68) ,
)
(
,
, k;
,
) when this distribution is used.
(69) =
=1000, k=3, and DRR values are averaged over 4 realizations. In some cases, the use of latents can be irrelevant along the line p.sub.1p.sub.0, which may not impact the distribution of X.sub.ij. In some cases, the use of latents can become important when p.sub.1 and p.sub.0 are significantly different.
(70) 5 Theoretical analysis
(71) 5.1 Ideal Compression
(72) Lemma 5.1. A first theorem or lemma can provide upper and lower bounds on the entropy per symbol H ()/
. The first lemma 5.1 can be shown as:
(73)
(74) Further, for any estimators :
.fwdarw.
,
:
.fwdarw.
let
(75)
(the minimum is over permutations of the letters in ). Letting .sub.U: =
.sub.U, .sub.V:=
.sub.V, the formula becomes:
(76)
(77) Corollary 5.2. Recall the definition of compression rate from Eq. 3.1. Then, there may exist a lossless compressor such that:
(78)
(79) Further, for any lossless compressor , R.sub.(
)H(X|U,V)+H(U)/
)+H(
)/m.sub.m,n2 log.sub.2(
)/
.
(80) The simpler bound of Eq. 5.1 may imply that the entropy per entry is H(X|U, V)+0(1/)). The operational interpretation of this result can be that it should be able to achieve the same compression rate per symbol as if the latents were given.
(81) The additional terms
(82)
in Eq. 5.2 can account for the additional memory required for the latents. The lower bound in Eq. 5.2 may imply that, if the latents can be accurately estimated from the data (e.g., if .sub.U, .sub.v are small), then this overhead can be essentially or substantially unavoidable.
(83) The nearly ideal compression rate in Eq. 5.3 can be achieved by Huffmann or arithmetic coding, which may require knowledge of the probability distribution of . Under these schemes, the length of the codeword associated to
is within a constant number of bits from log.sub.2P(
), where P(X.sub.0):=
(
=X.sub.0) is the probability mass function of the random table
. The next lemma 5.3 may imply that the length concentrates tightly around the entropy.
(84) Lemma 5.3. Asymptotic Equipartition Property. For X.sub.0, let P(X.sub.0)=
(X.sub.0) the probability of X.sup.m,n=X, under model X.sup.m,n
(Q,r, c;
,
). Assume there exists a constant c>0 such that Q (x|u, v)c. Then, there can exist a constant C, which can depend on c, such that the following can happen: for
(Q,r, c;
,
) and any t0 with probability at least 12e.sup.t:
(85)
(86) For simplicity, in the last statement, assumptions can be made that are appropriate to the case, in which Q, r, c may be independent of ,
. A more general statement with stronger, e.g., more complicate, bounds is described herein elsewhere.
(87) 5.2 Failure of Classical Compression Schemes
(88) Two types of codes are analyzed herein: finite-state encoders and Lempel-Ziv codes. Both can operate on the serialized data X.sup.N=vec(X.sup.m,n), N=, which can be obtained by scanning the table in row-first order where column-first can yield symmetric results.
(89) 5.2.1 Finite state encoders. A finite state (FS) encoder can take the form of a triple (, f, g) with a finite set of cardinality M=|| and
(90)
(91) Assume that contains a special initialization symbol s.sub.init. Starting from state s.sub.0=s.sub.init, the encoder can scan the input X.sup.N sequentially. Assume after the first input symbols, the state is in state
and produced encoding
. Given input symbol
, the encoder appends f(
(
) to the codeword and updates its state to
=g(
,
).
(92) Denote (
, s.sub.init){0,1} *, the binary sequence obtained by applying the finite state encoder to the vector
=(X.sub.1, . . . ,
). The FS encoder can be defined as information lossless if for any
,
(
, s.sub.init) is injective.
(93) Theorem 1. Let (Q, r, c;
,
) and (, f, g) be an information lossless finite state encoder. Define the corresponding compression rate R.sub.,f,g(
), as per Eq. 3.1. Assuming
>10, ||
|, and log.sub.2||
|X |/9, then,
(94)
where the leading term of the above lower bound is H(X|U)/log.sub.2|X|.
(95) Since conditioning can reduce entropy, this is strictly larger than the ideal rate which is roughly H(X|U,V)/log.sub.2|X|, e.g., Eq. 5.3. The next term can be negligible provided log||<<n log|X|, e.g., modulo logarithmic factors which may be a proof artifact. This condition can be easy to interpret. For example, the finite state machine may not have enough states to memorize a row.
(96) 5.2.2 Lempel-Ziv
(97) The pseudocode of the Lempel-Ziv algorithm is shown in
(98)
which may appear in the past. The encoder can then encode a pointer to the position of the earlier k appearance of the string T.sub.k and its length L.sub.k.
(99) The method can then encode the pointer T.sub.k in plain binary using [log.sub.2 (N+|x|)]bite. In some cases, T.sub.k {|X|+1, . . . , 1, . . . , N}), and Ly can use an instantaneous prefix-free code, e.g., Elias -code, taking 2 [log.sub.2L.sub.k]+1. This can be sub-optimal, but the space taken by the encoding of L.sub.k can be of lower order with respect to the one of T.sub.k
(100) Assumption 1. The following can hold:
(101) 1. There exist a constant c.sub.o>0 such that:
(102)
(103) 2. Consider sequences of instances with Q r, c, X, fixed and m, n=.fwdarw. so that:
(104)
equivalently =
(105) As mentioned herein, consider sequences of instances with ,
.fwdarw., which can mean the sequence to be indexed by
, and let
=
, which can depend on
such that Eq. 5.8 holds.
(106) Theorem 2. Define the asymptotic Lempel-Ziv rate as:
(107)
(108)
(109) The asymptotics of the Lempel-Ziv rate can be given by the minimum of two expressions, which can correspond to different behaviors of the encoder. For example, for , define (
): =H(X|U=
, V)/(H(X|U=
)H(X|U=
, V)) with (
)= if H(X|U=
)=H(X|U=
,V)).
(110) Then, if <.sub.*(), then it may be a skinny table regime. For example, the algorithm may mostly deduplicate segments in rows with latent u by using strings in different rows but aligned in the same columns. If <(
), then it may be a fat table regime. For example, the algorithm may mostly deduplicate segments on rows with latent u by using rows and columns that are not the same as the current segment.
(111) Example. Symmetric Binary Model (SBM), dense regime. Under the SBM, (
,
,k;
,
) example, the optimal compression rate of Corollary 5.2, the finite state compression rate of Theorem 1, the Lempel-Ziv rate of Theorem 2 can be computed.
(112) Assume ,
of order one, and
=
as
,
.fwdarw.. In some cases, the resulting tales can be dense in the sense of containing a constant fraction of non-zeros. Then,
(113)
where the right hand sides correspond to the contour lines in
5.3 Practical Latent-Based Compression
(114) Achieving the ideal compression rate of Corollary 5.2 via arithmetic or Huffmann coding may require, a priori, computing the probability P () of the table
. The method herein can achieve a compression rate that is close to the ideal rate via latents estimation. For example, consider general latents estimators
:
.fwdarw.
,
:
.fwdarw.
. Accuracy can be measured by the overlaps as:
(115)
where the minimization is over the set of permutations of the latents alphabet
.
(116) Any estimators ,
can be used to reorder rows and columns and compress the table
according to the methods described herein. Denote by R.sub.lat (
), the compression rate achieved by such a procedure. From Eq. 1.3, the following can be obtained:
(117)
where {circumflex over (X)}(,
)=)=vec(X.sub.ij:
(X)=
(X)=v are the estimated blocks of X. This rate can depend on the base compressors
,
but can be omitted these from notations. The first result can imply that, if the latent estimators are consistent, e.g., they recover the true latents with high probability up to permutations, then the resulting rate may be close to the ideal one.
(118) Lemma 5.4. Assume data distributed according to model (Q, r, c;
,
), with
,
log.sub.2|
|. Further, assume r(
), c(
)>c for all
,
. Let R.sub.lat (
) be the rate achieved by the latent-based scheme with latents estimators
,
and base encoders
=
=Z. Then,
(119)
where .sub.z (N; *) can be the worst case overhead of encoder Z over sources with distributions in
*.
(120) For example, the overheads of Lempel-Ziv, frequency-based arithmetic coding, and frequency-based ANS coding can be upper bounded respectively as (e.g., where the last bound requires Q,r,c to be independent of N):
(121)
(122) The proof of this lemma and the main content of the lemma is in the general bound of Eq. 5.15. More explicitly, .sub.z (N; *) can be defined by:
(123)
where ([k]): ={(
).sub.ik
:
0i.sub.ik
=1} is the set of probability distributions over [k], and Y.sup.N is a vector with entries Y.sub.i
. The upper bound (5.16) may be closely related to classical results.
A. Notations, Definition, and Basic Facts
(124) The data take form the form
(125)
Where X is a finite alphabet. Consider lossless encoders E: X.sup.n.fwdarw.{0,1}.sup.*, and denote by
(126)
the length of the encoded sequence on input .
(127) For ab, it uses
(128)
to denote the substring of x. Given a sequence , it denotes by
(129)
the empirical distribution of sequences of length k. Namely, for wcX.sup.k, let,
(130)
(131) The Shannon entropy of a probability distribution p over a finite or countable set Z is given by,
(132)
(133) Denote by h() the entropy of a Bernoulli random variable Z with (Z=1)=1
(Z=0)= by,
(134)
(135) Lemma Let A be a finite set and F: A.fwdarw.{0, 1} * be an injective map. Then, for any probability distribution p over A
(136)
(137) Proof. Assume without loss of generality that A={1, . . . , M}, with |A|=K, and that the elements of A have all non-vanishing probability and are ordered by decreasing probability p.sub.1p.sub.2 . . . p.sub.k>0. Let : =2+4+ . . . +
=
+1-2. Then the expected length is minimized any map F such that len(F())=
for
.sub.1a
with the maximum length
being defined by
.sub.1<K
. For A
, L: =len(F(A)), then:
(138)
where (a) is the chain rule of entropy and (b) follows because by injectivity, given len(F(A))=, A can take at most
values.
B Proofs of results on ideal compression
Proof of Lemma 5.1
(139) Claiming that:
(140)
by the definition of mutual information, obtain H()=H(
|
,
)+I(
|
,
). Equation B.1 follows by noting that:
(141)
where (a) follows from the fact that the (X.sub.i,j) are conditionally independent given U.sup.m, V.sup.n, ad since the conditional distribution of X.sub.i,j only depends on U.sup.m, V.sup.n via U.sub.i, V.sub.j; (b) holds because the triples (X.sub.i,j, U.sub.i, V.sub.j) are identically distributed.
(142) The lower bound in Eq. 5.1 holds because mutual information is non-negative, and the upper bound because I (,
,
)H (
,
)=
+
.
(143) Finally, Eq. 5.2 holds because
(144)
Proof of Lemma 5.3
(145) Lemma B.1. Let =, =
=
be collections of mutually independent random variables taking values in a measurable space Z.x.Z.sup.3.fwdarw.Z, F:
.fwdarw.
. Define x(, , )
via x(, , ).sub.ij=x (.sub.ij, .sub.i, .sub.j).
(146) Given a vector of independent random variables z, let Var.sub.zj(f(z)):=[(f(z)
f(z)).sup.2]Define the quantities:
(147)
(148) Then, for any t0, the following holds with probability at least 18e.sup.t:
|F(x(,,)F(x(,,)2 max(2V.sub.*t+2V.sub.1t+2V.sub.2t;(B.sub.*+B.sub.1+B.sub.2)t). (Eq. B.8)
(149) Proof. Let zZ.sup.N be a vector of independent random variables and f: Z.sup.N.fwdarw.. Define the Martingale X.sub.k: =
[f(z)|
] (where
:=(Z.sub.1, . . . , Z.sub.k)). Then it follows:
(150)
(151) By Freedman's inequality, with probability at least 1-2e.sup.t, it follows:
(152)
(153) Define E(,):=F(x(, , )), L():=
(x(, , )). Applying the above inequality, each of the following holds with probability at least 1-2e.sup.t:
(154)
and the claim follows by union bound.
(155) Following proves a more stronger version of Lemma 5.3.
(156) Lemma B.2. For X, let P(X)=
(X) the probability applied by the model
(Q, r, c;
,
) to matrix X, i.e.,
P(X)=.sub.(i,j)[m][n]Q(X.sub.ij
,
)
r(
).sub.j[n]c(
.sub.)(Eq. B.16)
(157) Define the following quantities:
(158)
(159) Then, for X(Q, r, c; m, n) and any t0 the following bound holds with probability at least 2e.sup.t:
(160)
(161) Proof. Let
(162)
be such that x(.sub.ij, .sub.i, .sub.i)|.sub.i, .sub.iQ (.Math.|.sub.i; .sub.i). It defines F(x)=log P(x), and will apply Lemma B.1 to this function. Using the notation from that lemma, it claims that B.sub.*M.sub.*, B.sub.1M.sub.1, B.sub.2M.sub.2
, and V
, V.sub.1
s.sub.1, V.sub.2
s.sub.2.
(163) Note that, if (x.sub.ij), (x.sub.ij) differ only for entry i, j, then:
(164)
where denotes expectation with respect to the posterior measure P(
,
|X=x). This immediately implies BM.
(165) Next consider the constant B.sub.1 defined in Eq. (B.3). Using the exchangeability of the (.sub.i,., .sub.i), it gets:
(166)
where the bound B.sub.2M.sub.2m is proved analogously.
(167) Consider now the quantity Va of Eq. (B.5). Denote by .sub.(ij)(t) the array obtained by replacing entry (i, j) in {by t, and by x (t)=x (.sub.(ij)(t), , ). Then it follows:
(168)
(169) Next, as claimed:
(170)
(171) Finally, consider the quantity V.sub.1 of Eq. (B.6) (the argument is similar for V.sub.2). Denote by .sub.(i)(t) the vector obtained by replacing entry i in by t. Proceeding as above, it follows:
(172)
(173) Therefore,
(174)
C Proofs for Finite State Encoders
(175) It shows above that a finite-state encoder is defined by a triple (, f, g). Formally, it can define the action of f, g on
(176)
recursively via
(177)
and the encoder is thus given by
(178)
(179) The state space is non-degenerate if, for each s.sub.1 there exists
(180)
such that
(181)
Notice, that if state space is degenerate, it can remove one or more symbols from without changing the encoder and making the state-space non-degenerate. The method may assume non-degeneracy without mentioning it.
(182) The FS encoder is information lossless (IL) if for any
(183)
is injective.
(184) Remark C.1. An information-lossless encoder satisfies a stronger condition: for any m and any s, , the map
(185)
is injective
(186) Assume this were not the case, then there may exist two distinct inputs
(187)
and a state s, such that
(188)
By non-degeneracy, there exists
(189)
such that
(190)
Defining
(191)
is not easy to check that these inputs are distinct but
(192)
(193) Proposition C.1. Define the compression rate on input
(194)
Then for any 1, the following holds (where n:=n2
and recall that M:=||):
(195)
(196) Proof. The method denotes by L(x.sub.1.sup.m; s.sub.*) the length of the encoding of x.sub.1.sup.m when starting in state s.sub.*:
(197)
(198) It then follows, for any b {0, . . . , 1}, and setting by convention s.sub.0=S.sub.init, it may get:
(199)
(200) By averaging over b, and introducing the shorthand : =
2
, it gets:
(201)
where (a) holds by Lemma A.1. By the chain rule of entropy (recalling that M:=[]), it follows:
(202)
where the claim (C.3) follows by using the last inequality in Eq. C.8.
(203) Theorem 3. Let (Q, r, c,
,
) and (, f, g) be an information lossless finite state encoder. With an abuse of notation, denote f.sub.mn (X.sup.mn, S.sub.init){0,1} * the binary sequence obtained by applying the finite state encoder to the vector vec (X.sup.mn)
obtained by scanning X.sup.mn in row-first order. Define the compression rate by:
(204)
(205) Assuming n>10, ||||, and log .sub.2<nlog.sub.2|
|/9, the expected compression rate is lower bounded as follows:
(206)
(207) Proof. Let N:=, N:=
2
where
/3 will be selected later. Let X.sup.N:=vec (
) for the vectorization
, X.sup.N, for the vector comprising its first N entries. Recall the definition of empirical distribution. For any fixed
:
(208)
(209) Let S:={i[N+1]: [i, i+
2] n
=0}. These are the subset of blocks of length
that do not cross the end of a line in the table. Since for each line break there are at most
1 such blocks, it follows |S|N
+1(
1)(
1). Consider the following modified empirical distribution:
(210)
(211) Then, by construction:
(212)
where is the empirical distribution of blocks that do cross the line. By concavity of the entropy, it follows:
(213)
(214) Further, since /3:
(215)
(216) Now, let the row latents : =(
).sub.im be fixed, and denote by
their weighted empirical distribution, defined as follows:
(217)
where is the empirical distribution of the latents (
).sub.im where row i is weighted by its contribution to S. Note that all the weights are equal to (n2(
1))/|S| except, potentially, for the last one.
(218) It follows:
(219)
(220) Using Eq. (C.11), (C.12), and concavity of the entropy, it gets:
(221)
(222) By Proposition C.1, it gets:
(223)
where in the last inequality it used the fact that H()
log.sub.2|X|. It is chosen:
(224)
(225) Substituting and simplifying, it gets:
(226)
(227) Finally, letting (W.sub.1, . . . , , U)
be random variables with joint distribution
(228)
Then,
(229)
and therefore
(230)
finishing the proof.
D Proofs for Lempel-Ziv coding
(231) It is useful to define for each kN,
(232)
Proof of Theorem 2
(233) Lemma D.1. Under Assumption 1, there exists a constant C such that the following holds with probability at least 1N.sup.10:
(234)
(235) Proof. Consider a slightly different setting, and it then shows that the question reduces to this setting. Let (Z.sub.i).sub.i1 be independent random variables with Z.sub.iq.sub.i a probability distribution over . Further assume max.sub.xXqi(x)1c for all i1. Then it claims that, for any t,
1, it follows:
(236)
(237) Condition on the event
(238)
for some x.sub.1, . . . , x.sub.t , then the event
(239)
implies that, for i {t+1, . . . , t+1}, Z.sub.i=xp (i) where It (i)=i mod t, n (i) [1, t]. Then,
(240)
proving claim (D.4).
(241) Reconsider the original setting:
(242)
where the last inequality follows from claim (D.4), since the (X.sub.i).sub.iN are conditionally independent given the latents ,
, with probability mass function upper bounded by 1c. The thesis follows by taking
=12 log N/log (1/(1c)).
(243) For i [], j[
], it defines
ij
: =(i1)n+j. In words, k=
ij
is the of entry at row i column j when the table
is scanned in row first order. For
1, define the events:
(244)
(245) Then, it follows:
(246)
(247) The next two lemmas control the probabilities of these events.
(248) Lemma D.2. Let (, u):=[(1+)(log N)/H(X|U=u)], n=n
(, u), and m.sub.0=m.sup.1-on(1). Under Assumption 1, for any >0, there exist constants C, >0 independent of u
, such that the following hold:
(249)
(250) Lemma D.3. Let (, u)=[(1+)(log m)/H(X|U=u, V)], n=n
(, u), and m.sub.0=m.sup.1-on(1). Under Assumption 1, for any >0, there exist constants C, >0 independent of u
, such that the following hold:
(251)
and is ready to prove Theorem 2.
(252) Proof of Theorem 2. It denotes by (k(1), . . . , k(M)) the values taken by k in the while loop of the Lempel-Ziv pseudocode. In particular,
(253)
(254) Therefore, the total length of the code is:
(255)
(256) By Lemma D.1 (and recalling that len(elias(L))2 log.sub.2L+1), it has, with high probability, len(elias(
))2 log.sub.2 (C log N). Letting G denote the good event that this bound holds, it has on G:
(257)
(258) Since || is a constant, this means that for any >0, there exists N.sub.0() such that, for all NN.sub.0(), with probability at least 1:
(259)
where on the right len(LZ))N log .sub.2|
| by construction. It follows:
(260)
that is,
(261)
(262) It is left with the task of bounding {M.Math.1.sub.G}.
(263) Begin by the lower bound. Define the set of bad indices B(, )[
][
],
(264)
(265) The method drops the arguments , for economy of notation and writes B:=B(
, ). It further defines:
(266)
where S() is the set of positions (i, j) of the table
where words in the LZ parsing begin.
(267) It also writes N()=
. |{i[
]:
=
}| for the total number of rows in
with row latent equal to
and L.sub.i.sup. for the length of the first segment in row i initiated in row i1:
(268)
where the last inequality holds on event G. By taking expectation on this event, it gets:
(269)
by Lemmas D.2 and D.2,
(270)
(271) Recalling the definition of (
; ),
(
; ) and the fact that is arbitrary, n the last inequality yields:
(272)
where summing over , noting that
|S(
)|=M, and substituting in Eq. (D.19) yields the lower bound on the rate in Eq. (5.10).
(273) Finally, the upper bound is proved by a similar strategy as for the lower bound. Define the set of bad indices B_=B_(, )[
][
],
(274)
(275) It also denotes by Lt the length of the last segment in row i. It then has:
(276)
where the last inequality holds on event G. By taking expectation on this event, it gets:
(277)
(278) By Lemmas D.2 and D.2,
(279)
(280) The proof is completed exactly as for the lower bound.
(281) Proof of Lemma D.2
(282) The following standard lemmas are used.
(283) Lemma D.4. Let X be a centered random variable with (Xx.sub.0)=1, x.sub.0>0. Then, lettings
(284)
it has:
(285)
(286) Proof. This simply follows from exp(x)1+x+c(x.sub.0)x.sup.2 for xx.sub.0.
(287) Lemma D.5. Let (P.sub.i).sub.i1, (q.sub.i).sub.i1, be probability distributions on with supi.sub.i1 max.sub.xx P.sub.i (x)1c, and sup.sub.i1 .sub.xXP.sub.i(x).sup.2 (logp; (x))C for constants c, C.
(288) Let be be independent random variables with X.sub.ip.sub.i, and set X=(X.sub.1, . . .
). Let Y(j)
, j1 be a sequence of i. i. d. random vectors, with
independent and (Y.sub.i(j)q.sub.i. Finally, let T:=min {t1: Y(t)=X}.
(289) Then, for any >0, there exists =(, c, C)>0 such that (letting):
(290)
(291)
(292) Further, the same bound holds (with a different (, c, C))(Y(j)).sub.j1 are independent not identically distributed, if there exist a finite set
(293)
and a map b: =.fwdarw.[K] such that
(294)
(295) Proof. It denotes by Y a vector distributed as Y (i). Conditional on X=, T is a geometric random variables with mean 1/(1
(Y=x)). Hence, for the
(): =
,
(296)
(297) Hence,
(298)
(299) By Chernoff bound, for any 0, (
)exp{l(,
)}, where:
(300)
(301) By Holder inequality, for [0, 1] we have [q.sub.i(X.sub.i).sup.](.sub.xp(x).sup.).sup.1/ where =1(1). Therefore,
(302)
(303) Consider the random variable
(304)
where X.sub.iP.sub.i. Under the assumptions of the lemma, for [0, 1/2] it has (Z.sub.i)=0:
(305)
(306) Using Lemma D.4, it gets:
(307)
whence
(308)
(309) By maximizing this expression over , we find that (/2)exp(.sub.0()
) which completes the proof for the case of i.i.d. vectors Y(j).
(310) The case of non-identically distributed vectors follows by union bound over [K]
(311) Lemma D.6. Let (p.sub.i).sub.i1, be probability distributions on x, with sup.sub.i1max.sub.xXP.sub.i(x)1c, and sup.sub.i21 ExEx P.sub.i (x) (log pi (x))C for constants c, C.
(312) Let (X.sub.i).sub.il be independent random variables with X.sub.i, X=(X.sub.1, . . . ,
). Let Y(j)
, j1 be a sequence of i.i.d. copies of X. Finally, let T=min {t1: Y(t)=X}.
(313) Then, for any >0, there exists = (, c, C)>0, such that
(314)
(315) Proof. The proof follows the same argument as for Lemma D.5. Denote by Y a vector distributed as Y(i). and define t.sub.l():=
(316)
(317) Hence,
(318)
(319) It claims that, for each >0,
(u)
for some .sub.0(u)>0. Using again Chernoff's bound, it gets, for any >0,
(u)e.sup.l(,u), where:
(320)
where in the last line X.sub.iP.sub.i. Under the assumptions of the lemma, W.sub.iC almost surely and applying again Lemma D.4, it gets {tilde over ()}(; p.sub.i)log(1+c,.sup.2) for 1. The proof is completed by selecting for each u>0, >0 so that ulog (1+c,.sup.2>0.
(321) It is ready to prove Lemma D.2.
(322) Proof of Lemma D.2. It begin by proving the bound (D.8).
(323) Fix i, j
, u
, >0, and write
=
(, u.sub.i). Define R.sub.ij:={i,j): max(1, j1)jj1} and S.sub.ij={, j): {<i or i=i, j<jl}. Finally, for t{0, . . . , l1}, let S.sub.ij(t)=S.sub.ij n {i, j): {ij>=t mod l}.
(324) By union bound,
(325)
(326) Now, by the bound of Eq. (D.4),
(327)
for suitable constants C,.sub..
(328) Next, for any t{0, . . . , l1}, the vectors
(329)
are mutually independent and independent of
(330)
Conditional on u, the coordinates of
(331)
are independent with marginal distributions (note that independence of the coordinates holds because l<m/2 and therefore
(332)
does not include two entries in the same column). Note that the collection of marginal distributions (.Math.|
), u
satisfies the conditions of Lemma D.5 by assumption. Further, the vector
(333)
can have at most one of K=||.sup.2(
+1) distributions (depending on the latents value and the occurrence of a line break in the block.)
(334) Applying Lemma D.5, it obtains:
(335)
(336) Summing over t{0, . . . , 1} and adjusting the constants yields the claim (D.8).
(337) Next consider the bound (D.9). Fix u, i
, j
, and write
=
(,
) for brevity below:
(338)
(339) Here, t{0, . . . , 1} can be chosen arbitrarily. Let S.sub.ij (t;u)={i, j) S.sub.ij(t)s. t.
=
, j<n} Conditional on u, the vectors
(340)
are i.i.d. and independent of
(341)
Further, they are distributed as
(342)
Finally,
(343)
where (u) is the number rows i<i such that
=u. Since ii and
(
=u)
(
)>0, by Chernoff bound there exist constants C, c.sub.0 such that, for all m, n large enough (since im.sub.0):
(344)
(345) Further, for any >0 we can choose positive constants .sub.0, .sub.1>0 such that the following holds for all ,
, large enough:
(346)
(347) Let T.sub.ij be the rank of the first (i, j) in the set defined above such that
(348)
and T.sub.ij= if no such vector exists. It can continue from Eq. (D.45) to get:
(349)
where in () it used Lemma D.6. This completes the proof of Eq. (D.9).
Proof of Lemma D.3
(350) It begins by considering the bound (D.10).
(351) Fix i, j
,
, v
, >0, and write
=
(,
),
. By union bound:
(352)
(353) Note that for a fixed s, and conditional on u, v, the vectors
(354)
are mutually independent and independent of
(355)
Further,
(356)
has independent coordinates with marginals Qx|
(.Math.|
v.sub.s, (recall that we are conditioning both on
and v). In particular, the marginal distributions satisfy the assumption of Lemma D.5 and the law of
(357)
can take one Of K=||.sup.2 (
+1) possible values. Letting i-T(s) the last row at which
(358)
(with T(s)if no such row exists), it has, for some constants C, c.sub.0>0,
(359)
where in (a) it used Lemma D.5, and it defined
(360)
(361) Taking expectation with respect to v, it gets:
(362)
where in () it used Chernoff bound. This completes the proof of Eq. (D.10).
(363) Finally, the proof Eq. (D.11) is similar to the one of Eq. (D.9). It fixes , i
, j
, and writes
=
(,
).
(364)
(365) Let
(366)
(367) Conditional on , v, the vectors
(368)
are i.i.d. and independent copies of
(369)
Finally,
(370)
(371) is the number rows i<i such that =
. By Chernoff bound there exist constants C, c0 such that, for all m, n large enough (recalling it needs only to consider im.sub.0):
(372)
(373) Since m.sub.0m.sup.1-on(1), for any >0 it can choose constants .sub.0, .sub.1>0 so that:
(374)
(375) Recall the definition
(376)
By an an application of Chernoff bound:
(377)
(378) Let T.sub.i be the rank of the first i in the set defined above such that
(379)
and T.sub.i= if no such vector exists. From Eq. (D.48) it gets:
(380)
where in (a) it used Lemma D.6.
E Proofs for Latent-Based Encoders
Proof of Lemma 5.4
(381) General bound (5.15)
(382) Define the ideal expected compression rate (e.g., the rate achieved by a compressor that is given the latents):
(383)
(384) Since R.sub.lat(X)1 by construction, it has:
(385)
where in step (*) we bounded [len(Z(
))
]=
[len(Z(
))
]
[len(Z(
))], because, on the event {.sub.U(X;
)=1},
coincides with
up to relabelings, and the compressed length is invariant under relabelings. Similar arguments were applied to len(Z(v)) and len (Z(X(
, v))).
(386) It follows, by the definition of .sub.z(N; k) in Eq. (5.19),
(387)
where in the last line is the empirical distribution of the row latents and is the empirical distribution of the column latents. By taking expectation in the last expression, it gets:
(388)
(389) Finally, the header contains ||.sup.2+2 integers of maximum size
, whence len(header)4 log.sub.2(mn). It concludes that:
(390)
(391) The claim (5.15) follows from the first bound in Eq. (5.2) noticing that, under the stated assumptions on ,
,
(392)
(393) Overheads of specific encoders: Eqs. (5.16)-(5.5).
(394) LZ coding. Let X.sup.N=(X.sub.1, . . . , X.sub.N) be a vector with i.i.d. symbols X.sub.iq with q a probability distribution over . There are two important differences with the analysis above: data are i.i.d. (not matrix structured) and it is desirable to derive a sharper estimate (not just the entropy term but bounding the overhead as well).
(395) It defines L.sub.k(X.sup.N), T.sub.k(X.sup.N), and lets (k(1), . . . , k(M)) be the values taken by k in the while loop of the Lempel-Ziv pseudocode. In particular,
(396)
(397) Therefore, the total length of the code is:
(398)
where the last step follows by Jensen's inequality. By one more application of Jensen,
(399)
(400) For , define the set of bad positions as:
(401)
whence,
(402)
and therefore, for any c(0, 1):
(403)
Certain Definitions
(404) While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
(405) Whenever the term at least, greater than, or greater than or equal to precedes the first numerical value in a series of two or more numerical values, the term at least, greater than or greater than or equal to applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.
(406) Whenever the term no more than, less than, or less than or equal to precedes the first numerical value in a series of two or more numerical values, the term no more than, less than, or less than or equal to applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.
(407) The present disclosure uses boldface for vectors and uppercase boldface for matrices, without making any typographic distinction between numbers and random variables. The dimensions of a matrix or a vector is indicated by subscripts. For example, u.sup.m is a vector of length m, and X.sup.m,n is a matrix of dimensions mn.
(408) If X, Y are random variables on a common probability space (, F, P), their entropies are denoted by H(X), H(Y), H(X, Y) denotes their joint entropy, H(X|Y) denotes the conditional entropy of X given Y.
(409) Computer Systems
(410) The present disclosure provides computer systems that are programmed to implement methods of the disclosure. In some cases, the data objects to be processed may be stored in a cloud via services such as Amazon AWS or Google GCP. For instance, cloud compute instances obtained through the same providers may host the data reducers that process data according to the methods described herein.
(411) The computer system 501 includes a central processing unit (CPU, also processor and computer processor herein) 505, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 501 also includes memory or memory location 510 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 515 (e.g., hard disk), communication interface 520 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 525, such as cache, other memory, data storage and/or electronic display adapters. The memory 510, storage unit 515, interface 520 and peripheral devices 525 are in communication with the CPU 505 through a communication bus (solid lines), such as a motherboard. The storage unit 515 can be a data storage unit (or data repository) for storing data. The computer system 501 can be operatively coupled to a computer network (network) 530 with the aid of the communication interface 520. The network 530 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 530 in some cases is a telecommunication and/or data network. The network 530 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 530, in some cases with the aid of the computer system 501, can implement a peer-to-peer network, which may enable devices coupled to the computer system 501 to behave as a client or a server.
(412) The CPU 505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 510. The instructions can be directed to the CPU 505, which can subsequently program or otherwise configure the CPU 505 to implement methods of the present disclosure. Examples of operations performed by the CPU 505 can include fetch, decode, execute, and writeback.
(413) The CPU 505 can be part of a circuit, such as an integrated circuit. One or more other components of the system 501 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
(414) The storage unit 515 can store files, such as drivers, libraries and saved programs. The storage unit 515 can store user data, e.g., user preferences and user programs. The computer system 501 in some cases can include one or more additional data storage units that are external to the computer system 501, such as located on a remote server that is in communication with the computer system 501 through an intranet or the Internet.
(415) The computer system 501 can communicate with one or more remote computer systems through the network 530. For instance, the computer system 501 can communicate with a remote computer system of a user (e.g., laptop). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC's (e.g., Apple iPad, Samsung Galaxy Tab), telephones, Smart phones (e.g., Apple iphone, Android-enabled device, Blackberry), or personal digital assistants. The user can access the computer system 501 via the network 530.
(416) Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 501, such as, for example, on the memory 510 or electronic storage unit 515. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 505. In some cases, the code can be retrieved from the storage unit 515 and stored on the memory 510 for ready access by the processor 505. In some situations, the electronic storage unit 515 can be precluded, and machine-executable instructions are stored on memory 510.
(417) The code can be pre-compiled and configured for use with a machine having a processer adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
(418) Aspects of the systems and methods provided herein, such as the computer system 501, can be embodied in programming. Various aspects of the technology may be thought of as products or articles of manufacture typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. Storage type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible storage media, terms such as computer or machine readable medium refer to any medium that participates in providing instructions to a processor for execution.
(419) Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
(420) The computer system 501 can include or be in communication with an electronic display 535 that comprises a user interface (UI) 540 for providing, for example, compression results and analytics. Examples of UI's include, without limitation, a graphical user interface (GUI) and web-based user interface.
(421) Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 505.
(422) While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
(423) While preferred embodiments of the present subject matter have been shown and described herein, it can be understood that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now without departing from the present subject matter. It can be understood that various alternatives to the embodiments of the present subject matter described herein may be employed in practicing the present subject matter.