Systems and methods for compressing structured data via latent variable estimation

12531575 · 2026-01-20

Assignee

GRANICA COMPUTING, INC. (Mountain View, CA, US)

Inventors

Cpc classification

International classification

Abstract

A computer-implemented method for compressing structured data or semi-structured data is provided. The method comprises: (a) estimating one or more latent variables associated with rows or columns of the structured data or semi-structured data; (b) partitioning the structured data or semi-structured data into one or more blocks according to the one or more latent variables; (c) applying a sequential encoding algorithm to each of the blocks; and (d) appending a compressed encoding of the one or more latent variables.

Claims

1. A computer-implemented method for compressing structured data or semi-structured data, comprising: (a) estimating latent variables associated with rows or columns of the structured data or semi-structured data; (b) partitioning the structured data or the semi-structured data into one or more blocks according to the latent variables; (c) applying a sequential encoding algorithm to each of the one or more blocks; and (d) appending a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.

2. The computer-implemented method of claim 1, wherein the latent variables comprise row latent variables associated with the rows and column latent variables associated with the columns.

3. The computer-implemented method of claim 1, wherein the latent variables are estimated utilizing a spectral clustering algorithm.

4. The computer-implemented method of claim 1, wherein the latent variables are estimated using side information.

5. The computer-implemented method of claim 4, wherein the side information comprises a column datatype, a column name, or a row name.

6. The computer-implemented method of claim 1, further comprising, prior to (b), reordering the rows and columns of the structured data or semi-structured data based at least in part on the latent variables.

7. The computer-implemented method of claim 1, further comprising, prior to (c), generating a serialized block vector for each of the one or more blocks.

8. The computer-implemented method of claim 7, wherein the sequential encoding algorithm is applied to each serialized block vector.

9. The computer-implemented method of claim 8, wherein applying the sequential encoding algorithm comprises applying a first base compressor to each serialized block vector and a second base compressor to a latent variables of each serialized block vector.

10. The computer-implemented method of claim 9, wherein the first base compressor and the second base compressor are different.

11. The computer-implemented method of claim 1, wherein the structured data or semi-structured data is associated with hyperspectral imaging, image processing, quantum chemistry, or large language models.

12. The computer-implemented method of claim 1, wherein the structured data or semi-structured data is associated with tabular data.

13. The computer-implemented method of claim 1, wherein the method achieves an optimal compression rate.

14. The computer-implemented method of claim 1, wherein the method reduces a compression rate by at least about 5% compared to frequency-based entropy encoders (ANS), Lempel-Ziv encoders, or finite-state encoders.

15. A non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements a lossless compression algorithm, comprising: (a) estimating latent variables associated with rows or columns of a structured data or semi-structured data; (b) partitioning the structured data or semi-structured data into one or more blocks according to the latent variables; (c) applying a sequential encoding algorithm to each of the one or more blocks; and (d) appending a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.

16. The non-transitory computer readable medium of claim 15, wherein the latent variables comprise row latent variables associated with the rows and column latent variables associated with the columns.

17. The non-transitory computer readable medium of claim 15, wherein the latent variables are estimated utilizing a spectral clustering algorithm.

18. The non-transitory computer readable medium of claim 15, wherein the latent variables are estimated using side information.

19. The non-transitory computer readable medium of claim 18, wherein the side information comprises a column datatype, a column name, or a row name.

20. A computer program product for compressing structured data or semi-structured data, the computer program product comprising at least one non-transitory computer-readable medium having computer-readable program code portions embodied therein, the computer-readable program code portions comprising: an executable portion configured to estimate variables associated with rows or columns of the structured data or semi-structured data; an executable portion configured to partition the structured data or the semi-structured data into one or more blocks according to the latent variables; an executable portion configured to apply a sequential encoding algorithm to each of the one or more blocks; and an executable portion configured to append a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also figure and FIG. herein), of which:

(2) FIG. 1 shows an example of a spectral clustering algorithm in the pseudo-code for estimating spectral latents, in accordance with some embodiments;

(3) FIG. 2 shows an example of data reduction rate (DRR) using different datasets achieved by classical and latent-based compressors on real tabular data, in accordance with some embodiments. Datasets included Facebook (FB) Networks 1, 2, and 3, GP Network 1, Forest, Card Transactions, Business price index, US census, and Jokes. LZ refers to row-major order ZSTD, and LZ (c) refers to column-major order ZSTD. KMeans was run on the data 5 times with random initializations, finding the DRR each time and reporting the average. Datasets marked with (s) had rows randomly permuted (e.g., shuffled) before compressing. ZSTD generally refers to a Zstandard fast compression algorithm or method;

(4) FIG. 3A shows a comparison of data reduction rate (DRR) of naive ZSTD coding and latent-based ZSTD coding for synthetically generated data, in accordance with some embodiments. ZSTD generally refers to a Zstandard fast compression algorithm or method;

(5) FIG. 3B shows comparison of data reduction rate (DRR) of naive ANS coding and latent-based asymmetric numeral systems (ANS) coding for synthetically generated data. Contour lines correspond to the information-theoretically optimal compression rate, in accordance with some embodiments;

(6) FIG. 4 shows an example of Lempel-Ziv algorithm, in accordance with some embodiments;

(7) FIG. 5 shows a computer system that is programmed or otherwise configured to implement methods and algorithms provided herein, in accordance with some embodiments; and

(8) FIGS. 6-7 show an example of partitioning a table into blocks according to the latent variables, in accordance with some embodiments.

DETAILED DESCRIPTION

(9) The present disclosure provides improved methods for compressing structured data or semi-structured data. In some embodiments, the method comprises (a) estimating latent variables associated with rows or columns of the structured data or semi-structured data. In some embodiments, the method comprises (b) partitioning the structured data or the semi-structured data into one or more blocks according to the latent variables. In some embodiments, the method comprises (c) applying a sequential encoding algorithm to each of the one or more blocks. In some embodiments, the method comprises (d) appending a compressed encoding of the latent variables to the compressed encoding of each of the one or more blocks.

(10) Overview

(11) Data in structured or semi-structured format, which can be used for analytics and machine learning, may take the form of tables with, e.g., categorical or numerical entries. An unmet need exists for an improved lossless compression algorithm particularly for structured data or semi-structured data. The present disclosure provides systems and methods, which can improve the lossless compression performance over other methods or algorithms.

(12) For example, the present disclosure provides a probabilistic model to identify optimal compression for data, e.g., structured or tabular data. In some cases, latent values can be independent and table entries can be conditionally independent given the latent values. The probabilistic model can be characterized with a well-defined entropy rate to determine the ideal compression rate and to satisfy an asymptotic equipartition property. By applying the probabilistic model to different datasets, the latent estimation approach disclosed herein can achieve the optimal rate. In contrast, other compression methods, e.g., Lempel-Ziv or finite-state encoders, may not be able to achieve the optimal rate.

(13) In some cases, latent variables can be used in data compression methods based on machine learning and probabilistic modeling. Generally, latent variables in statistical models can include random variables that are unmeasured or may be unmeasurable. Latent variables can be included in statistical models for, in some cases, (1) including in the model features of interest that may not be directly measurable or were not measured; (2) constructing estimators that may be more efficient than those constructed from models without latent variables; and (3) constructing estimators of manipulation statistics that are unbiased.

(14) For example, stochastically generated codewords (e.g., random latents) can lead to minimum description lengths via bits back coding. This method can be explicitly applied to lossless data compression using arithmetic coding or ANS coding. For example, data compression via low-rank approximation or latents-based approaches can be applied to numerical analysis applications, hyperspectral imaging, image processing (2D or 3D), quantum chemistry, large language or machine learning models, artificial intelligence (AI) models, audio processing, cryptography, genetics or genome data, and the like. However, some latents-based approaches may mainly center on lossy compression and do not precisely quantify compression rate, e.g., they do not count bits.

(15) The present disclosure can provide an improvement in lossless compression rate over other methods. For example, the present disclosure provides a family of lossless compression algorithms for data, e.g., structured or semi-structured data. In some embodiments, the methods herein may comprise: (i) estimating latent variables associated with rows and columns of the table; (ii) partitioning the table into blocks according to the row or column latents; (iii) applying a sequential encoding algorithm (e.g., Lempel-Ziv compression or entropy coding) to each of the blocks; and (iv) appending a compressed encoding of the latents. Evaluation of the methods on several benchmark datasets shows that latent estimation and row or column reordering can improve compression rate.

(16) In some cases, methods disclosed herein can improve upon mathematical algorithms, e.g., in compressing tabular data, by providing an asymptotically optimal algorithm that overcomes technical problems of other methods or algorithms. The lossless compression algorithms disclosed herein may be applied to various types of documents or data such as documents or data having comma-separated values (CSV) data, tab-separated values (TSV) data, Apache Parquet data, Apache optimized row columnar (ORC) data, Apache Feather data, Microsoft Excel (XLS) data, and the like. In some cases, the lossless compression algorithms disclosed herein may be applied to a document comprising semi-structured data (e.g., XML documents, JSON files, NoSQL databases, HTML code, graphs and tables, emails, etc.) or structured data (e.g., tabular data).

(17) 1 Introduction

(18) Some methods of lossless compression may assume that data takes the form of a random vector $X^N = (X_1, X_2, \dots, X_N)$ of length $N$ with entries in a finite alphabet $\mathcal{X}$. Under suitable ergodicity assumptions and using the Shannon-McMillan-Breiman theorem, the entropy per letter may converge to a limit $h := \lim_{N \to \infty} H(X^N)/N$. Universal coding schemes (e.g., Lempel-Ziv coding) may not require knowledge of the distribution of $X^N$ and can encode such a sequence without information loss using (e.g., asymptotically) $h$ bits per symbol. While this theory is mathematically useful, its modeling assumptions (e.g., stationarity, ergodicity, asymptotics, and the like) may not be satisfied by actual data in many applications.

(19) For example, consider a data table with $m$ rows and $n$ columns and entries in a finite alphabet $\mathcal{X}$, $X^{m,n} := (X_{ij})_{i \le m, j \le n}$. An approach to such data can include: (i) serializing, e.g., in row-first order, to form a vector of length $N = mn$, $X^N = (X_{11}, X_{12}, \dots, X_{1n}, X_{21}, \dots, X_{mn})$ and (ii) applying a standard compressor (e.g., Lempel-Ziv) to this vector.
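
The baseline just described can be sketched as follows. This is a minimal illustration, assuming a table stored as a list of rows with integer entries in 0..255, and using Python's standard `zlib` module as a stand-in for a Lempel-Ziv-style compressor. The helper names `serialize` and `naive_compress` are hypothetical and do not come from the disclosure.

```python
import zlib

def serialize(table, order="row"):
    """Flatten a table (list of equal-length rows) into one byte string.

    order="row" scans row-first; order="col" scans column-first.
    Assumes every entry is an integer in 0..255 (one byte per symbol).
    """
    if order == "row":
        return bytes(x for row in table for x in row)
    if order == "col":
        return bytes(x for col in zip(*table) for x in col)
    raise ValueError("order must be 'row' or 'col'")

def naive_compress(table, order="row"):
    """Serialize the whole table and apply a single standard compressor."""
    return zlib.compress(serialize(table, order), 9)

table = [[0, 1, 2],
         [3, 4, 5]]
row_major = serialize(table, "row")   # X_11, X_12, X_13, X_21, X_22, X_23
col_major = serialize(table, "col")   # X_11, X_21, X_12, X_22, X_13, X_23
```

The two scan orders correspond to the row-major and column-major baselines that the evaluation labels LZ and LZ (c).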

(20) It can be shown, both empirically and mathematically, that such an approach can be suboptimal in the sense of not achieving the optimal compression rate, e.g., data reduction rate (DRR). This can happen even in the limit of large tables, as long as the number of columns and rows are polynomially related (e.g., $n^{\varepsilon} \le m \le n^{M}$ for some small constant $\varepsilon$ and large constant $M$).

(21) The present disclosure provides improved methods for compressing structured data or semi-structured data. In some cases, the method may use the following general scheme of operations 1 through 4:

(22) 1. Estimating row latents $u^m = (u_1, \dots, u_m) \in \mathcal{L}^m$ and column latents $v^n = (v_1, \dots, v_n) \in \mathcal{L}^n$, with $\mathcal{L}$ a finite alphabet. For example, FIGS. 6-7 illustrate an example of applying the method to structured data in, e.g., a table. As illustrated in FIG. 6, latents 620, including the row latents and column latents, are computed 611 for the structured data, e.g., table 610. In some cases, the method may implement latents estimation using a spectral clustering algorithm or using side information, described herein elsewhere. Side information can include: column datatypes, column names, row names, and the like.

(23) 2. Partitioning the table into blocks according to the row or column latents. For example, for each pair of latent values $u, v \in \mathcal{L}$, let $R(u)$ be the subset of rows with latent value $u_i = u$ and $C(v)$ be the subset of columns with latent value $v_j = v$. Construct the table $X_M(u, v)$ comprising the rows $R(u)$ and columns $C(v)$, in the same order as they appear in the original table. Then construct the vector or sequence $X(u, v)$ by serializing $X_M(u, v)$. In some cases, serializing the table may comprise scanning the entries of $X_M(u, v)$ in row-first order or in column-first order. This operation can be performed by:

(24) $\begin{matrix} X (u, v) = vec (X_{i j} : u_{i} = u, v_{j} = v) & (Eq . 1.1) \end{matrix}$
where vec (M) denotes the serialization of matrix M (either row-wise or column-wise).

(25) As illustrated in FIG. 6, constructing the table X_M (u, v) may include reordering rows 621 and columns 621 of the table 620 based on the latents to obtain a reordered table 630. Then, the reordered table 630 can be partitioned into blocks 631 based on the latents. For example, as illustrated in FIG. 7, the blocks may be partitioned based on the latent values. In some cases, the blocks may have variable sizes based on the reordered values of the latent variables. Next, the partitioned blocks in the table 640 may be serialized 641 to generate a plurality of vectors 650. Each vector may correspond to a block or be referred to as a serialized block vector X (u, v).

(26) 3. Applying a base compressor, which can be generically denoted by $Z_{\mathcal{X}}: \mathcal{X}^{*} \to \{0, 1\}^{*}$, to each block $X(u, v)$:

(27) $\begin{matrix} z(u, v) = Z_{\mathcal{X}}(X(u, v)), \quad \forall u, v \in \mathcal{L} & (Eq. 1.2) \end{matrix}$

(28) 4. Encoding the row latents and column latents using a possibly different compressor $Z_{\mathcal{L}}: \mathcal{L}^{*} \to \{0, 1\}^{*}$, to get $z_{\mathrm{row}} = Z_{\mathcal{L}}(u^m)$, $z_{\mathrm{col}} = Z_{\mathcal{L}}(v^n)$. For example, as illustrated in FIG. 7, in the sequence compression 651, a base compressor such as sequence compression can be applied to the block vectors 651, and a compressor, which can be different than the base compressor, can be applied to the latents to obtain a plurality of compressed vectors 660. Considerations for selecting the base compressors are described herein elsewhere. Then, output the concatenation of all the above as:

(29) $\begin{matrix} \mathrm{Enc}(X^{m,n}) = \mathrm{header} \oplus z_{\mathrm{row}} \oplus z_{\mathrm{col}} \oplus \bigoplus_{u, v \in \mathcal{L}} z(u, v) & (Eq. 1.3) \end{matrix}$
where $\oplus$ denotes concatenation, and header is a header that contains encodings of the lengths of subsequent segments (e.g., $|\mathcal{L}|^{2} + 2$ integers). The compressed vectors 660 may then be serialized 661 or concatenated to form a final vector 670.
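
Operations 1 through 4 and the concatenation in Eq. 1.3 can be sketched end to end as follows. This is an illustrative sketch only, under several assumptions that are not part of the disclosure: the latents are already given, table entries and latents are integers in 0..255, Python's standard `zlib` stands in for both base compressors, and the header additionally stores the table dimensions and the latent alphabet size so that the example can be decoded. The function names `encode_table` and `decode_table` are hypothetical.

```python
import struct
import zlib

def encode_table(table, u, v, num_latents):
    """Latent-based encoding sketch: header + z_row + z_col + all z(a, b).

    table: list of m rows, each a list of n integers in 0..255
    u, v:  row and column latents, integers in 0..num_latents-1
    """
    segments = [zlib.compress(bytes(u)), zlib.compress(bytes(v))]
    for a in range(num_latents):
        rows = [i for i, ui in enumerate(u) if ui == a]
        for b in range(num_latents):
            cols = [j for j, vj in enumerate(v) if vj == b]
            # X(a, b): row-first serialization of the block, original order kept
            block = bytes(table[i][j] for i in rows for j in cols)
            segments.append(zlib.compress(block))
    # Header: |L|^2 + 2 segment lengths, preceded here (an assumption of this
    # sketch) by m, n and |L| so that the decoder is self-contained.
    header = struct.pack(f"<3I{len(segments)}I", len(u), len(v), num_latents,
                         *(len(s) for s in segments))
    return header + b"".join(segments)

def decode_table(blob):
    """Invert encode_table, showing that the scheme is lossless."""
    m, n, k = struct.unpack_from("<3I", blob, 0)
    count = k * k + 2
    lengths = struct.unpack_from(f"<{count}I", blob, 12)
    pos = 12 + 4 * count
    parts = []
    for length in lengths:
        parts.append(zlib.decompress(blob[pos:pos + length]))
        pos += length
    u, v = list(parts[0]), list(parts[1])
    table = [[0] * n for _ in range(m)]
    blocks = iter(parts[2:])
    for a in range(k):
        rows = [i for i in range(m) if u[i] == a]
        for b in range(k):
            cols = [j for j in range(n) if v[j] == b]
            data = iter(next(blocks))
            for i in rows:          # same scan order as the encoder
                for j in cols:
                    table[i][j] = next(data)
    return table

example = [[1, 0, 1, 0],
           [0, 1, 0, 1],
           [1, 0, 1, 0]]
blob = encode_table(example, u=[0, 1, 0], v=[0, 1, 0, 1], num_latents=2)
restored = decode_table(blob)
```

Because the decoder recovers the latents first, it can rebuild the row and column index sets of every block and place each decoded entry back at its original position, which is why no explicit reordering permutation needs to be stored.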

(30) In some cases, encoding the latents can lead to a suboptimal compression rate. For example, techniques such as bits-back coding can yield only limited improvement towards obtaining an optimal compression rate. It can be shown that the compression rate improvement achieved by such bits-back coding may only be significant in certain special regimes, which are described herein elsewhere.

(31) In some cases, methods described herein can leave several design choices undefined. For example, design choices can include: (1) the latents estimation procedure; (2) the base compressor $Z_{\mathcal{X}}$ for the blocks $X(u, v)$; or (3) the base compressor $Z_{\mathcal{L}}$ for the latents. Implementation alongside empirical evaluation is described herein elsewhere.

(32) 2 Implementation

(33) 2.1 Base compressors

(34) Dictionary-based compression (LZ). As an example, Zstandard (ZSTD) can be used in its Python implementation. ZSTD can implement a Lempel-Ziv style algorithm.

(35) Frequency-based entropy coding (ANS). For each data portion (e.g., each block $X(u, v)$ and each of the row latents $u^m$ and column latents $v^n$), compute empirical frequencies of the corresponding symbols. For example, for all $x \in \mathcal{X}$ and $u, v \in \mathcal{L}$, compute:

(36) $\hat{Q}(x \mid u, v) := \frac{1}{N(u, v)} \sum_{i : u_{i} = u} \sum_{j : v_{j} = v} 1_{x_{ij} = x}, \quad \hat{r}(u) := \frac{1}{m} \sum_{i=1}^{m} 1_{u_{i} = u}, \quad \text{and} \quad \hat{c}(v) := \frac{1}{n} \sum_{j=1}^{n} 1_{v_{j} = v},$
where $N(u, v)$ is the number of pairs $i \le m$, $j \le n$ such that $u_i = u$, $v_j = v$. Apply ANS coding to each block $X(u, v)$, modeling its entries as independent with distribution $\hat{Q}(\cdot \mid u, v)$, to the row latents using the distribution $\hat{r}(\cdot)$, and to the column latents using the distribution $\hat{c}(\cdot)$. Separately encode these counts as long integers and prepend them to the file. In some cases, they can be a negligible fraction of the file size.
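
The empirical frequencies above can be computed as in the following sketch. It does not implement ANS itself. Instead, as an assumption made for illustration, it reports the ideal code length in bits under the fitted model, which is the quantity an entropy coder such as ANS approaches. The function names `empirical_model` and `ideal_code_length_bits` are hypothetical.

```python
import math
from collections import Counter

def empirical_model(table, u, v):
    """Return (q_hat, r_hat, c_hat) as dictionaries of empirical frequencies.

    q_hat[(x, a, b)] estimates Q(x | u=a, v=b); r_hat[a] and c_hat[b] are the
    empirical distributions of the row and column latents.
    """
    m, n = len(u), len(v)
    counts = Counter((u[i], v[j], table[i][j]) for i in range(m) for j in range(n))
    block_sizes = Counter((u[i], v[j]) for i in range(m) for j in range(n))
    q_hat = {(x, a, b): c / block_sizes[(a, b)] for (a, b, x), c in counts.items()}
    r_hat = {a: c / m for a, c in Counter(u).items()}
    c_hat = {b: c / n for b, c in Counter(v).items()}
    return q_hat, r_hat, c_hat

def ideal_code_length_bits(table, u, v):
    """Sum of -log2(probability) over latents and entries under the fitted model."""
    q_hat, r_hat, c_hat = empirical_model(table, u, v)
    bits = -sum(math.log2(r_hat[a]) for a in u)
    bits -= sum(math.log2(c_hat[b]) for b in v)
    bits -= sum(math.log2(q_hat[(table[i][j], u[i], v[j])])
                for i in range(len(u)) for j in range(len(v)))
    return bits

table = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 1]]
u = [0, 0, 1]
v = [0, 0, 1]
bits = ideal_code_length_bits(table, u, v)
```

In this small example every block is constant, so the blocks contribute zero bits and the whole cost comes from encoding the latents.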

2.2 Latent Estimation

(37) In some cases, the method implements latents estimation using a spectral clustering algorithm. In some cases, the latents may be estimated using side information. Side information can include column datatypes, column names, row names, and the like. FIG. 1 shows an example of a spectral clustering algorithm in pseudo-code.

(38) In some cases, the algorithm may encode the data matrix $X^{m,n} \in \mathcal{X}^{m \times n}$ as an $m \times n$ real-valued matrix using a map from $\mathcal{X}$ to $\mathbb{R}$. In some cases, this map may not be optimized. In some cases, the elements of $\mathcal{X}$ can be arbitrarily encoded as $0, 1, \dots, |\mathcal{X}| - 1$.

(39) In some cases, the singular vector calculation can be the most time-consuming part of the algorithm. Computing approximate singular vectors via power iteration can require at least $\log(m \wedge n)$ matrix-vector multiplications for each of $k$ vectors. This can amount to $mnk \log(m \wedge n)$ operations, which can be larger than the time needed to compress the blocks or to run KMeans. The methods disclosed herein can improve this to $(m \vee n)\, k \log(m \vee n)$ via subsampling operations, which are described herein elsewhere.

(40) In some cases, for the spectral clustering operations, the method may use KMeans with $k$ clusters, initialized randomly. For example, the algorithm may use the scikit-learn implementation via sklearn.cluster.KMeans. In some cases, the overall latent estimation approach may not estimate or make use of the model $Q(\cdot \mid u, v)$. In some cases, an alternative approach based on an expectation-maximization (EM) algorithm may use such an estimation.
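
One way to realize the latent estimation described in this section is sketched below. It is not the pseudo-code of FIG. 1, which is not reproduced here, and it makes several simplifying assumptions: an exact singular value decomposition from NumPy replaces power iteration, the matrix is centered by its global mean, and a small self-contained Lloyd's algorithm with deterministic farthest-point initialization stands in for `sklearn.cluster.KMeans` with random initialization. The function names `kmeans` and `estimate_latents` are hypothetical.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns one integer label per row of `points`.

    Assumption: deterministic farthest-point initialization is used here for
    reproducibility, whereas the text describes random initialization.
    """
    centers = [points[0]]
    for _ in range(1, k):
        # squared distance of every point to its nearest chosen center
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # an empty cluster keeps its old center
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def estimate_latents(table, k):
    """Estimate row and column latents from the top-k singular vectors."""
    M = np.asarray(table, dtype=float)   # symbols encoded as 0, 1, ..., |X|-1
    M = M - M.mean()                     # centering is an assumption of this sketch
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    row_latents = kmeans(U[:, :k] * s[:k], k)    # cluster rows in spectral space
    col_latents = kmeans(Vt[:k].T * s[:k], k)    # cluster columns likewise
    return row_latents, col_latents
```

Scaling the singular vectors by their singular values keeps directions with negligible signal from influencing the clustering. The estimated labels are only defined up to a relabeling of the clusters, which does not matter for compression because the latents themselves are stored.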

(41) 3 Empirical Evaluation

(42) Methods disclosed herein were evaluated on tabular datasets with different origins described below. The evaluation can be used to assess the impact of using latents in reordering columns and rows. In some cases, the evaluation may not attempt to achieve the best possible data reduction rate (DRR) on each dataset but rather to compare compression with latents and without latents. In some cases, the evaluation processed categorical variables by preprocessing the data to fit in this setting as described herein elsewhere. In some cases, the preprocessing included dropping some of the columns of the original table. The number of columns before preprocessing is denoted by $n_0$, and after by $n$. Differences between the implementation used in the evaluation and the methods described herein elsewhere are conceptually minor but can be practically important. Differences included: using different sizes for the row latent alphabet and the column latent alphabet, $|\mathcal{L}_r| \neq |\mathcal{L}_c|$, and choosing $|\mathcal{L}_r|$, $|\mathcal{L}_c|$ by optimizing compression size.

(43) 3.1 Datasets

(44) The following datasets were utilized in the evaluation: taxicab, network, card transactions, business price index, forest, US census, and jokes.

(45) Taxicab. This table has $m = 62{,}495$ rows and $n_0 = 20$ columns comprising data for taxi rides in New York City during January 2022. After preprocessing, this table had $n = 18$ columns. For the LZ (ZSTD) compressor, 9 row latents were used. For the ANS compressor, 8 row latents were used. Both methods used $n$ column latents with each column compressed separately.

(46) Network. Four social networks from the Stanford Network Analysis Platform (SNAP) Datasets, representing either friends as undirected edges for Facebook (FB) or directed following relationships on Google Plus (GP). The evaluation regards these as four distinct tables with 0-1 entries, with dimensions, respectively, $m = n \in \{333, 747, 786, 1187\}$. For each table, the evaluation used 5 row latents and 5 column latents.

(47) Card transactions. A table of simulated credit card transactions containing information like card ID, merchant city, zip code, etc. This table has $m = 24{,}386{,}900$ rows and $n_0 = 15$ columns. After preprocessing, the table had $n = 12$ columns. For this table, the evaluation used 3 row latents and $n$ column latents.

(48) Business price index. A table of the values of the consumer price index of various goods in New Zealand between 1996 and 2022. This table has $m = 72{,}750$ rows and $n_0 = 12$ columns from the Business price indexes: March 2022 quarter-CSV file. After preprocessing, this table had $n = 10$ columns. Due to the highly correlated nature of consecutive rows, the data was shuffled before compressing. For the LZ method, the evaluation used 4 row latents. For the ANS method, the evaluation used 7 row latents. Both methods used $n$ column latents with each column compressed separately.

(49) Forest. This table from the UC Irvine (UCI) Machine Learning Repository has $m = 581{,}011$ cartographic measurements with $n_0 = 55$ attributes, to predict forest cover type based on information gathered from the US Geological Survey. The data contained binary qualitative variables and some continuous values like elevation and slope. After preprocessing, this data had $n = 55$ columns. For the LZ method, the evaluation used 9 row latents. For the ANS method, the evaluation used 8 row latents. Both methods used $n$ column latents with each column compressed separately.

(50) US Census. This table from the UCI Machine Learning Repository has $m = 2{,}458{,}285$ rows and $n_0 = 68$ categorical attributes related to demographic information, income, and occupation information. After preprocessing, this data had $n = 68$ columns. For this data, the evaluation used 9 row latents and $n$ column latents.

(51) Jokes. A table containing ratings of a series of jokes by 24,983 users collected between April 1999-May 2003. These ratings are real numbers on a scale from $-10$ to $10$, and a value of 99 is given to jokes that were not rated. This table has $m = 23{,}983$ rows and $n_0 = 101$ columns. The first column identifies how many jokes were rated by a user, and the rest of the columns contain the ratings. After preprocessing, this data had $n = 101$ columns, all quantized. For the LZ method, the evaluation used 2 row latents. For the ANS method, the evaluation used 5 row latents. Both methods used $n$ column latents.

(52) 3.2 Preprocessing

(53) Methods disclosed herein may preprocess different columns as follows: if a column comprises $K \le 256$ unique values, then it maps the values to $\{0, \dots, K-1\}$; if a column is numerical and comprises more than 256 unique values, the method may calculate the quartiles for the data and map each entry to its quartile membership (e.g., 0 for the lowest quartile, 1 for the next largest, 2 for the next, and 3 for the largest); and if a column does not meet either of the above criteria, it may be discarded.
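
The three preprocessing rules can be sketched as a single function. This is an illustrative sketch using only the Python standard library. The function name `preprocess_column`, the choice to assign codes in order of first appearance, and the use of `statistics.quantiles` with its default method to find the quartile cut points are assumptions, since the text does not fix these details.

```python
import bisect
import statistics

def preprocess_column(values, max_unique=256):
    """Return integer codes for one column, or None if the column is discarded."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))   # codes in order of first appearance
    if len(codes) <= max_unique:
        # Rule 1: at most 256 unique values are mapped to 0..K-1
        return [codes[value] for value in values]
    numeric = all(isinstance(x, (int, float)) and not isinstance(x, bool)
                  for x in values)
    if numeric:
        # Rule 2: numerical columns are mapped to quartile membership 0..3
        cuts = statistics.quantiles(values, n=4)   # three quartile cut points
        return [bisect.bisect_left(cuts, x) for x in values]
    # Rule 3: anything else is dropped
    return None

categorical = preprocess_column(["a", "b", "a", "c"])
quantized = preprocess_column(list(range(1000)))
dropped = preprocess_column([str(i) for i in range(300)])
```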

(54) In some evaluations, the data was randomly permuted before compression because some of the above datasets had rows already ordered in a way that can make nearby rows highly correlated.

(55) 3.3 Results

(56) Given a lossless encoder $\phi: \mathcal{X}^{m \times n} \to \{0, 1\}^{*}$, its compression rate and data reduction rate (DRR) can be defined as:

(57) $\begin{matrix} R_{\phi}(X^{m,n}) := \frac{\mathrm{len}(\phi(X^{m,n}))}{mn \log_{2} |\mathcal{X}|}, \quad \mathrm{DRR}_{\phi}(X^{m,n}) := 1 - R_{\phi}(X^{m,n}) & (Eq. 3.1) \end{matrix}$
where larger DRR generally means better compression.
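
As a small worked example of Eq. 3.1, the following sketch computes the compression rate and the DRR from an encoded length measured in bits. The function names are hypothetical.

```python
import math

def compression_rate(encoded_bits, m, n, alphabet_size):
    """Encoded length divided by the size of the uncompressed table in bits."""
    return encoded_bits / (m * n * math.log2(alphabet_size))

def data_reduction_rate(encoded_bits, m, n, alphabet_size):
    """DRR = 1 - compression rate; larger values mean better compression."""
    return 1.0 - compression_rate(encoded_bits, m, n, alphabet_size)

# A 100 x 10 binary table occupies 1000 bits uncompressed.
rate = compression_rate(250, m=100, n=10, alphabet_size=2)
drr = data_reduction_rate(250, m=100, n=10, alphabet_size=2)
```

When the encoder output is a byte string, its length in bytes would be multiplied by 8 before being passed in as `encoded_bits`.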

(58) The DRR of each algorithm is illustrated in FIG. 2. Datasets included datasets described herein elsewhere, e.g., taxicab, network, card transactions, business price index, forest, US census, and jokes. For the table of results in FIG. 2, LZ refers to row-major order ZSTD, LZ (c) refers to column-major order ZSTD. KMeans was run on the data 5 times, with random initializations finding the DRR each time, and reporting the average. Data marked with (s) had rows randomly permuted (e.g., shuffled) before compressing. ZSTD generally refers to a Zstandard fast compression algorithm or method.

(59) The results in FIG. 2 show that: (1) Latent+ANS encoders can achieve systematically the best DRR; (2) the use of latents in several cases can yield a DRR improvement of at least about 5% of the uncompressed size; and (3) this improvement is larger for data with a large number of columns, e.g., the network data of FB Network 1, FB Network 2, FB Network 3, and GP Network 1.

(60) 4 A Probabilistic Model

(61) In order to better understand the technical problems of other approaches and the improved optimality of latent-based compression, the present disclosure provides a probabilistic model for the table $X^{m,n}$. The model assumes the true latents $(u_i)_{i \le m}$, $(v_j)_{j \le n}$ to be independent random variables with:

(62) $\begin{matrix} \mathbb{P}(u_{i} = u) = r(u), \quad \mathbb{P}(v_{j} = v) = c(v) & (Eq. 4.1) \end{matrix}$

(63) Also, the model assumes that the entries $(X_{ij})_{i \le m, j \le n}$ are conditionally independent given $u^m = (u_i)_{i \le m}$, $v^n = (v_j)_{j \le n}$, with:

(64) $\begin{matrix} \mathbb{P}(X_{ij} = x \mid u^{m}, v^{n}) = Q(x \mid u_{i}, v_{j}) & (Eq. 4.2) \end{matrix}$

(65) The distributions $r$, $c$, and the conditional distribution $Q$ are parameters of the model, with a total of $2(|\mathcal{L}| - 1) + |\mathcal{L}|^{2}(|\mathcal{X}| - 1)$ real parameters. In some cases, $(X^{m,n}, u^m, v^n) \sim \mathcal{T}(Q, r, c; m, n)$ can indicate that the triple $(X^{m,n}, u^m, v^n)$ is distributed according to the probabilistic model; a subset of the variables is occasionally omitted, as in $X^{m,n} \sim \mathcal{T}(Q, r, c; m, n)$. In some cases, the statements may be non-asymptotic, in which case $m$, $n$, $\mathcal{X}$, $\mathcal{L}$, $Q$, $r$, $c$ can be fixed. In some cases, the statements may be asymptotic, in which case a sequence of problems can be indexed by $n$.

(66) FIG. 3A shows a comparison of data reduction rate (DRR) of naive ZSTD coding and latent-based ZSTD coding for synthetically generated data. Contour lines correspond to the information-theoretically optimal compression rate.

(67) As an example, the following Symmetric Binary Model (SBM) can be used, which parallels the symmetric stochastic block model for community detection. For example, take $\mathcal{L} = [k] := \{1, \dots, k\}$, $\mathcal{X} = \{0, 1\}$, $r = c = \mathrm{Unif}([k])$, the uniform distribution over $[k]$, and

(68) $\begin{matrix} Q(1 \mid u, v) = \begin{cases} p_{1} & \text{if } u = v \\ p_{0} & \text{if } u \neq v \end{cases} & (Eq. 4.3) \end{matrix}$
where $(X^{m,n}, u^m, v^n) \sim \mathcal{T}_{\mathrm{SBM}}(p_0, p_1, k; m, n)$ is written when this distribution is used.
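
Synthetic tables from the Symmetric Binary Model can be drawn as in the following sketch, which uses only Python's standard `random` module. The function name `sample_sbm` is hypothetical, and the latents are indexed from 0 to k-1 rather than from 1 to k purely for programming convenience.

```python
import random

def sample_sbm(m, n, k, p0, p1, seed=None):
    """Draw (table, u, v) from the Symmetric Binary Model.

    Row and column latents are uniform on {0, ..., k-1}; entry (i, j) equals 1
    with probability p1 if u[i] == v[j] and with probability p0 otherwise.
    """
    rng = random.Random(seed)
    u = [rng.randrange(k) for _ in range(m)]
    v = [rng.randrange(k) for _ in range(n)]
    table = [[int(rng.random() < (p1 if ui == vj else p0)) for vj in v]
             for ui in u]
    return table, u, v

table, u, v = sample_sbm(m=6, n=8, k=3, p0=0.1, p1=0.9, seed=0)
```

Setting `p0` equal to `p1` makes the entries independent of the latents, which is the regime in which latent-based coding offers no advantage.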

(69) FIG. 3A and FIG. 3B show the results of simulations within this model, respectively for ZSTD-based and ANS-based coding. In this case $m = n = 1000$, $k = 3$, and DRR values are averaged over 4 realizations. In some cases, the use of latents can be irrelevant along the line $p_1 = p_0$, where the latents may not impact the distribution of $X_{ij}$. In some cases, the use of latents can become important when $p_1$ and $p_0$ are significantly different. FIG. 3B shows a comparison of data reduction rate (DRR) of naive ANS coding and latent-based ANS coding for synthetically generated data. Contour lines correspond to the information-theoretically optimal compression rate.

(70) 5 Theoretical analysis

(71) 5.1 Ideal Compression

(72) Lemma 5.1. A first theorem or lemma can provide upper and lower bounds on the entropy per symbol $H(X^{m,n})/mn$. The first lemma 5.1 can be shown as:

(73) $\begin{matrix} H(X \mid U, V) \le \frac{1}{mn} H(X^{m,n}) \le H(X \mid U, V) + \frac{1}{n} H(U) + \frac{1}{m} H(V) & (Eq. 5.1) \end{matrix}$

(74) Further, for any estimators $\hat{u}: \mathcal{X}^{m \times n} \to \mathcal{L}^{m}$, $\hat{v}: \mathcal{X}^{m \times n} \to \mathcal{L}^{n}$, let

(75) $\hat{A}_{U} := \min_{\pi} \frac{1}{m} \sum_{i=1}^{m} 1_{\hat{u}_{i} \neq \pi(u_{i})}, \quad \hat{A}_{V} := \min_{\pi} \frac{1}{n} \sum_{i=1}^{n} 1_{\hat{v}_{i} \neq \pi(v_{i})}$
(the minimum is over permutations $\pi$ of the letters in $\mathcal{L}$). Letting $\varepsilon_{U} := \mathbb{E}\hat{A}_{U}$, $\varepsilon_{V} := \mathbb{E}\hat{A}_{V}$, the formula becomes:

(76) $\begin{matrix} H(X \mid U, V) + \frac{1}{n} H(U) + \frac{1}{m} H(V) - \delta_{m,n} \le \frac{1}{mn} H(X^{m,n}) \le H(X \mid U, V) + \frac{1}{n} H(U) + \frac{1}{m} H(V), \\ \delta_{m,n} := \frac{1}{n} \left[ h(\varepsilon_{U}) + \varepsilon_{U} \log(|\mathcal{L}| - 1) \right] + \frac{1}{m} \left[ h(\varepsilon_{V}) + \varepsilon_{V} \log(|\mathcal{L}| - 1) \right] & (Eq. 5.2) \end{matrix}$
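
To make the bounds of Eq. 5.1 concrete, the following sketch evaluates them for the Symmetric Binary Model, with entropies measured in bits. It relies on two facts that follow from that model: the latents are uniform over k values, so H(U) = H(V) = log2(k), and independent uniform latents agree with probability 1/k, so H(X | U, V) = h(p1)/k + (1 - 1/k) h(p0), where h is the binary entropy function. The function names are hypothetical.

```python
import math

def binary_entropy(p):
    """h(p) in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sbm_entropy_bounds(m, n, k, p0, p1):
    """Lower and upper bounds of Eq. 5.1 on H(X^{m,n})/(mn), in bits per entry."""
    # u and v are independent and uniform on k values, so P(u = v) = 1/k
    h_cond = binary_entropy(p1) / k + (1 - 1 / k) * binary_entropy(p0)
    h_latent = math.log2(k)              # H(U) = H(V) for the uniform distribution
    return h_cond, h_cond + h_latent / n + h_latent / m

lower, upper = sbm_entropy_bounds(m=1000, n=1000, k=3, p0=0.1, p1=0.9)
```

For tables of this size the gap between the two bounds is 2 log2(3)/1000, roughly 0.003 bits per entry, which illustrates how small the cost of storing the latents becomes for large tables.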

(77) Corollary 5.2. Recall the definition of compression rate from Eq. 3.1. Then, there may exist a lossless compressor $\phi$ such that:

(78) $\begin{matrix} \mathbb{E}\, R_{\phi}(X^{m,n}) \le \frac{1}{\log_{2} |\mathcal{X}|} \left\{ H(X \mid U, V) + \frac{1}{n} H(U) + \frac{1}{m} H(V) + \frac{1}{mn} \right\} & (Eq. 5.3) \end{matrix}$

(79) Further, for any lossless compressor $\phi$, $\mathbb{E}\, R_{\phi}(X^{m,n}) \ge \frac{1}{\log_{2} |\mathcal{X}|} \left\{ H(X \mid U, V) + H(U)/n + H(V)/m - \delta_{m,n} - 2 \log_{2}(mn)/(mn) \right\}$.

(80) The simpler bound of Eq. 5.1 may imply that the entropy per entry is $H(X \mid U, V) + O(1/(m \wedge n))$. The operational interpretation of this result can be that it should be possible to achieve the same compression rate per symbol as if the latents were given.

(81) The additional terms

(82) $\frac{1}{n} H (U) + \frac{1}{m} H (V)$
in Eq. 5.2 can account for the additional memory required for the latents. The lower bound in Eq. 5.2 may imply that, if the latents can be accurately estimated from the data X.sup.m,n (e.g., if ε.sub.U, ε.sub.V are small), then this overhead can be essentially or substantially unavoidable.
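The two sides of Eq. 5.1 can be evaluated numerically for a small model. The following is a minimal sketch, not the patented implementation: the function names and the example values of Q, r, c are hypothetical, and entropies are measured in bits.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def per_entry_entropy_bounds(Q, r, c, m, n):
    """Lower and upper bounds of Eq. 5.1 on H(X^{m,n})/(mn), in bits.

    Q[u][v] is the distribution of one entry given row latent u and
    column latent v; r and c are the row and column latent distributions.
    """
    h_cond = sum(r[u] * c[v] * entropy(Q[u][v])
                 for u in range(len(r)) for v in range(len(c)))
    return h_cond, h_cond + entropy(r) / n + entropy(c) / m

# Hypothetical example: binary entries, two latent classes on each side.
Q = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.5, 0.5], [0.9, 0.1]]]
r = [0.5, 0.5]
c = [0.5, 0.5]
lower, upper = per_entry_entropy_bounds(Q, r, c, m=1000, n=1000)
```

In this example the gap between the two bounds is H(U)/n+H(V)/m, which shrinks as the table grows.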

(83) The nearly ideal compression rate in Eq. 5.3 can be achieved by Huffman or arithmetic coding, which may require knowledge of the probability distribution of X.sup.m,n. Under these schemes, the length of the codeword associated with X.sup.m,n is within a constant number of bits of −log.sub.2 P(X.sup.m,n), where P(X.sub.0):=ℙ(X.sup.m,n=X.sub.0) is the probability mass function of the random table X.sup.m,n. The next lemma 5.3 may imply that the length concentrates tightly around the entropy.

(84) Lemma 5.3. Asymptotic Equipartition Property. For X.sub.0∈𝒳.sup.m×n, let P(X.sub.0)=ℙ(X.sup.m,n=X.sub.0) be the probability of X.sup.m,n=X.sub.0 under the model X.sup.m,n~𝒯(Q, r, c; m, n). Assume there exists a constant c>0 such that Q(x|u, v)≥c. Then, there can exist a constant C, which can depend on c, such that the following can happen: for X.sup.m,n~𝒯(Q, r, c; m, n) and any t≥0, with probability at least 1−2e.sup.−t:

(85) $\begin{matrix} | - \log P(X^{m,n}) - H(X^{m,n}) | \le C \sqrt{mn(m + n)} \, t & (Eq . 5.4) \end{matrix}$

(86) For simplicity, the last statement makes assumptions appropriate to the case in which Q, r, c may be independent of m, n. A more general statement with stronger, e.g., more complicated, bounds is described elsewhere herein.

(87) 5.2 Failure of Classical Compression Schemes

(88) Two types of codes are analyzed herein: finite-state encoders and Lempel-Ziv codes. Both can operate on the serialized data X.sup.N=vec(X.sup.m,n), N=mn, which can be obtained by scanning the table in row-first order (column-first order can yield symmetric results).

(89) 5.2.1 Finite state encoders. A finite state (FS) encoder can take the form of a triple (Σ, f, g) with Σ a finite set of cardinality M=|Σ| and

(90) $\begin{matrix} f : \mathcal{X} \times \Sigma \to \{0, 1\}^{*}, \quad g : \mathcal{X} \times \Sigma \to \Sigma & (Eq . 5.5) \end{matrix}$

(91) Assume that Σ contains a special initialization symbol s.sub.init. Starting from state s.sub.0=s.sub.init, the encoder can scan the input X.sup.N sequentially. Assume that, after the first ℓ input symbols, the encoder is in state s.sub.ℓ and has produced a partial encoding. Given input symbol X.sub.ℓ+1, the encoder appends f(X.sub.ℓ+1, s.sub.ℓ) to the codeword and updates its state to s.sub.ℓ+1=g(X.sub.ℓ+1, s.sub.ℓ).

(92) Denote by f.sub.ℓ(X.sup.ℓ, s.sub.init)∈{0,1}* the binary sequence obtained by applying the finite state encoder to the vector X.sup.ℓ=(X.sub.1, . . . , X.sub.ℓ). The FS encoder can be defined as information lossless if, for any ℓ, the map X.sup.ℓ↦f.sub.ℓ(X.sup.ℓ, s.sub.init) is injective.
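The scan described above (emit bits from the current symbol and state, then update the state) can be sketched as follows. This is an illustrative toy, not an encoder from the disclosure: the two-state XOR encoder, `fs_encode`, and `is_information_lossless` are hypothetical names chosen for the example.

```python
from itertools import product

def fs_encode(xs, f, g, s_init):
    """Run a finite-state encoder (f, g) over the symbols xs, starting from s_init."""
    state, pieces = s_init, []
    for x in xs:
        pieces.append(f(x, state))  # emit bits for (symbol, current state)
        state = g(x, state)         # then move to the next state
    return "".join(pieces)

# Hypothetical two-state example on the binary alphabet: the state remembers
# the previous symbol, and the encoder emits the XOR of symbol and state.
f = lambda x, s: str(x ^ s)
g = lambda x, s: x
S_INIT = 0

def is_information_lossless(length):
    """Brute-force injectivity check over all binary inputs of a given length."""
    codes = {fs_encode(xs, f, g, S_INIT) for xs in product((0, 1), repeat=length)}
    return len(codes) == 2 ** length

code = fs_encode([1, 1, 0, 1], f, g, S_INIT)
```

The brute-force check only covers one input length at a time; it illustrates the injectivity requirement rather than proving it for all lengths.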

(93) Theorem 1. Let X.sup.m,n~𝒯(Q, r, c; m, n) and (Σ, f, g) be an information lossless finite state encoder. Define the corresponding compression rate R.sub.Σ,f,g(X.sup.m,n), as per Eq. 3.1. Assuming n>10, |Σ|≥|𝒳|, and log.sub.2|Σ|≤n log.sub.2|𝒳|/9, then,

(94) $\begin{matrix} \mathbb{E} R_{\Sigma, f, g}(X^{m,n}) \ge \frac{H(X | U)}{\log_{2} |\mathcal{X}|} - 10 \sqrt{\frac{\log |\Sigma|}{n \log |\mathcal{X}|}} \log(n \log |\mathcal{X}|) & (Eq . 5.6) \end{matrix}$
where the leading term of the above lower bound is H(X|U)/log.sub.2|𝒳|.

(95) Since conditioning can reduce entropy, this is strictly larger than the ideal rate, which is roughly H(X|U,V)/log.sub.2|𝒳|, e.g., Eq. 5.3. The next term can be negligible provided log|Σ|≪n log|𝒳|, e.g., modulo logarithmic factors which may be a proof artifact. This condition can be easy to interpret. For example, the finite state machine may not have enough states to memorize a row.

(96) 5.2.2 Lempel-Ziv

(97) The pseudocode of the Lempel-Ziv algorithm is shown in FIG. 4. FIG. 4 shows that, after the first k characters of the input have been parsed, the encoder can find the longest string

(98) $X_{k}^{k + \ell - 1},$
which may appear in the past. The encoder can then encode a pointer T.sub.k to the position of the earlier appearance of the string, and its length L.sub.k.

(99) The method can then encode the pointer T.sub.k in plain binary using ⌈log.sub.2(N+|𝒳|)⌉ bits. In some cases, T.sub.k∈{−|𝒳|+1, . . . , 1, . . . , N}, and L.sub.k can use an instantaneous prefix-free code, e.g., an Elias γ-code, taking 2⌊log.sub.2 L.sub.k⌋+1 bits. This can be sub-optimal, but the space taken by the encoding of L.sub.k can be of lower order with respect to that of T.sub.k.
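The per-phrase cost described above can be illustrated with a short sketch. It assumes the Elias gamma code as the prefix-free code for the length, whose codeword length 2⌊log2 L⌋+1 matches the count stated in the text; the helper names are hypothetical.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma codeword of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers")
    binary = bin(n)[2:]                      # floor(log2 n) + 1 bits
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the value

def pointer_bits(N: int, alphabet_size: int) -> int:
    """ceil(log2(N + alphabet_size)), computed with exact integer arithmetic."""
    return (N + alphabet_size - 1).bit_length()

def phrase_cost_bits(N: int, alphabet_size: int, match_length: int) -> int:
    """Bits spent on one phrase: plain-binary pointer plus gamma-coded length."""
    return pointer_bits(N, alphabet_size) + len(elias_gamma(match_length))
```

For example, with N=1000 and a binary alphabet the pointer takes 10 bits, so a match of length 5 costs 10+5=15 bits in this sketch.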

(100) Assumption 1. The following can hold:

(101) 1. There exists a constant c.sub.0>0 such that:

(102) $\begin{matrix} \max_{x \in \mathcal{X}} \max_{u, v \in \mathcal{L}} Q(x | u, v) \le 1 - c_{0} & (Eq . 5.7) \end{matrix}$

(103) 2. Consider sequences of instances with Q, r, c, 𝒳, ℒ fixed and m, n→∞ so that:

(104) $\begin{matrix} \lim_{n \to \infty} \frac{\log m}{\log n} = \alpha \in (0, \infty) & (Eq . 5.8) \end{matrix}$
equivalently m=n.sup.α+o(1).

(105) As mentioned herein, consider sequences of instances with m, n→∞, which can mean that the sequence is indexed by n, with m=m.sub.n depending on n such that Eq. 5.8 holds.

(106) Theorem 2. Define the asymptotic Lempel-Ziv rate as:

(107) $\begin{matrix} R_{LZ}^{\infty} := \frac{1}{\log_{2} |\mathcal{X}|} \sum_{u \in \mathcal{L}} r(u) \left[ H(X | U = u) \wedge \left( \frac{1 + \alpha}{\alpha} \right) H(X | U = u, V) \right] & (Eq . 5.9) \end{matrix}$ Then, under Assumption 1,

(108) $\begin{matrix} \lim_{m, n \to \infty} R_{LZ}(X^{m,n}) = R_{LZ}^{\infty} & (Eq . 5.10) \end{matrix}$

(109) The asymptotics of the Lempel-Ziv rate can be given by the minimum of two expressions, which can correspond to different behaviors of the encoder. For example, for u∈ℒ, define α.sub.*(u):=H(X|U=u, V)/(H(X|U=u)−H(X|U=u, V)), with α.sub.*(u)=∞ if H(X|U=u)=H(X|U=u, V).

(110) Then, if α>α.sub.*(u), it may be a skinny table regime. For example, the algorithm may mostly deduplicate segments in rows with latent u by using strings in different rows but aligned in the same columns. If α<α.sub.*(u), it may be a fat table regime. For example, the algorithm may mostly deduplicate segments in rows with latent u by using rows and columns that are not the same as the current segment.

(111) Example. Symmetric Binary Model (SBM), dense regime. Under the SBM with parameters (p.sub.0, p.sub.1, k; m, n), the optimal compression rate of Corollary 5.2, the finite state compression rate of Theorem 1, and the Lempel-Ziv rate of Theorem 2 can be computed.

(112) Assume p.sub.0, p.sub.1 of order one, and m=n.sup.α+o(1) as m, n→∞. In some cases, the resulting tables can be dense in the sense of containing a constant fraction of non-zeros. Then,

(113) $\begin{matrix} R_{opt}(X^{m,n}) = (1 - \frac{1}{k}) h(p_{0}) + \frac{1}{k} h(p_{1}) + o_{n}(1) & (Eq . 5.11) \end{matrix}$ $\begin{matrix} R_{\Sigma, f, g}(X^{m,n}) \ge h(\bar{p}) + o_{n}(1), \quad \bar{p} := (1 - \frac{1}{k}) p_{0} + \frac{1}{k} p_{1} & (Eq . 5.12) \end{matrix}$ $\begin{matrix} R_{LZ}(X^{m,n}) = h(\bar{p}) \wedge \left( \frac{1 + \alpha}{\alpha} \right) \left( (1 - \frac{1}{k}) h(p_{0}) + \frac{1}{k} h(p_{1}) \right) + o_{n}(1) & (Eq . 5.13) \end{matrix}$
where the right hand sides correspond to the contour lines in FIG. 3A and FIG. 3B and where ANS is a finite-state encoder.
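The leading terms of the three rates for the symmetric binary model can be tabulated with a few lines of code. This is a sketch under the reading that the model has k equiprobable latent classes, entry probability p1 when row and column latents agree and p0 otherwise, and m growing like n to the power alpha; the function name `sbm_rates` is hypothetical and the o(1) terms are ignored.

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sbm_rates(p0, p1, k, alpha):
    """Leading terms of the optimal, finite-state, and Lempel-Ziv rates."""
    h_given_latents = (1 - 1 / k) * h(p0) + h(p1) / k   # optimal rate
    p_bar = (1 - 1 / k) * p0 + p1 / k
    r_fs = h(p_bar)                                     # finite-state lower bound
    r_lz = min(h(p_bar), (1 + alpha) / alpha * h_given_latents)
    return h_given_latents, r_fs, r_lz

r_opt, r_fs, r_lz = sbm_rates(p0=0.1, p1=0.9, k=3, alpha=1.0)
```

When p0 equals p1 the three values coincide, which is consistent with latents being irrelevant on that line; otherwise the optimal rate is the smallest of the three.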
5.3 Practical Latent-Based Compression

(114) Achieving the ideal compression rate of Corollary 5.2 via arithmetic or Huffman coding may require, a priori, computing the probability P(X.sup.m,n) of the table X.sup.m,n. The method herein can achieve a compression rate that is close to the ideal rate via latents estimation. For example, consider general latents estimators û: 𝒳.sup.m×n→ℒ.sup.m, v̂: 𝒳.sup.m×n→ℒ.sup.n. Accuracy can be measured by the overlaps as:

(115) $_{U} (X; \hat{u}) := \frac{1}{m} \min_{} {.Math.}_{i = 1}^{m} 1_{{\hat{u}}_{i}} (X) (u_{i})$ $_{V} (X) : \overset{}{v}) : = \frac{1}{n} \min_{} {.Math.}_{i = 1}^{n} 1_{{\hat{v}}_{i}} (X) (v_{i})$
where the minimization is over the set custom character of permutations of the latents alphabet .
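Because latent labels are only identifiable up to relabeling, accuracy has to be evaluated over permutations of the latent alphabet. The sketch below does this by brute force for small alphabets; `best_agreement` is a hypothetical helper that returns the largest fraction of agreeing labels over all relabelings, and one minus that value is the corresponding misclassification rate.

```python
from itertools import permutations

def best_agreement(estimated, truth, alphabet):
    """Largest fraction of positions where estimated[i] == pi(truth[i]),
    maximized over permutations pi of the latent alphabet."""
    best = 0
    for perm in permutations(alphabet):
        relabel = dict(zip(alphabet, perm))
        agree = sum(1 for e, t in zip(estimated, truth) if e == relabel[t])
        best = max(best, agree)
    return best / len(truth)

# The first estimate is perfect up to swapping the two labels.
perfect = best_agreement([1, 1, 0, 0], [0, 0, 1, 1], alphabet=[0, 1])
partial = best_agreement([0, 0, 0, 1], [0, 0, 1, 1], alphabet=[0, 1])
error_rate = 1 - partial
```

The enumeration costs factorial time in the alphabet size, so for larger alphabets an assignment solver would be the natural replacement.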

(116) Any estimators û, v̂ can be used to reorder rows and columns and compress the table according to the methods described herein. Denote by R.sub.lat(X.sup.m,n) the compression rate achieved by such a procedure. From Eq. 1.3, the following can be obtained:

(117) $\begin{matrix} R_{lat}(X^{m,n}) = \frac{1}{mn \log_{2} |\mathcal{X}|} \left\{ len(header) + len(Z_{\mathcal{L}}(\hat{u})) + len(Z_{\mathcal{L}}(\hat{v})) + \sum_{u, v \in \mathcal{L}} len(Z_{\mathcal{X}}(\hat{X}(u, v))) \right\} & (Eq . 5.14) \end{matrix}$
where $\hat{X}(u, v) = vec(X_{ij} : \hat{u}_{i}(X) = u, \hat{v}_{j}(X) = v)$ are the estimated blocks of X. This rate can depend on the base compressors $Z_{\mathcal{X}}$, $Z_{\mathcal{L}}$, but these can be omitted from the notation. The first result can imply that, if the latent estimators are consistent, e.g., they recover the true latents with high probability up to permutations, then the resulting rate may be close to the ideal one.
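The accounting in Eq. 5.14 can be sketched with a general-purpose compressor. The example below uses Python's `zlib` (DEFLATE) purely as a stand-in base compressor for both latents and blocks, omits the header term, and assumes table entries and latent labels are integers in 0..255; `latent_based_rate` is a hypothetical name.

```python
import math
import zlib
from collections import defaultdict

def latent_based_rate(table, u_hat, v_hat, alphabet_size):
    """Rate of Eq. 5.14 without the header term, using zlib as base compressor."""
    m, n = len(table), len(table[0])
    blocks = defaultdict(bytearray)
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            # Entry (i, j) goes to the block indexed by its estimated latents.
            blocks[(u_hat[i], v_hat[j])].append(x)
    total_bytes = len(zlib.compress(bytes(u_hat))) + len(zlib.compress(bytes(v_hat)))
    total_bytes += sum(len(zlib.compress(bytes(b))) for b in blocks.values())
    rate = 8 * total_bytes / (m * n * math.log2(alphabet_size))
    return rate, dict(blocks)

# Toy table: an entry is 1 exactly when row and column latents agree.
u_hat = [0, 1, 0, 1]
v_hat = [0, 0, 1, 1, 0]
table = [[int(ui == vj) for vj in v_hat] for ui in u_hat]
rate, blocks = latent_based_rate(table, u_hat, v_hat, alphabet_size=2)
```

On a table this small the fixed overhead of the stand-in compressor dominates, so the computed rate is not meaningful in itself; the point is the partition into blocks and the sum of the encoded lengths.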

(118) Lemma 5.4. Assume data distributed according to model custom character (Q, r, c; , ), with , log.sub.2||. Further, assume r(), c()>c for all , . Let R.sub.lat () be the rate achieved by the latent-based scheme with latents estimators , and base encoders ==Z. Then,

(119) $\begin{matrix} R_{l a t} (X^{m, n}) \frac{H (X^{m, n})}{mn \log_{2} .Math. X .Math.} + 2 ({\hat{A}}_{U} (X^{m, n}; \hat{u}) < 1) + 2 ({\hat{A}}_{V} (X^{m, n}; \overset{}{v}) < 1) + \frac{4 \log (m n)}{m n} + {.Math. .Math.}^{2}_{Z} (c .Math. mn; {Q (.Math. .Math. u, v)}_{u, v} + 2_{Z} (m n; {r, c}) & (Eq . 5.15) \end{matrix}$
where .sub.z (N; custom character *) can be the worst case overhead of encoder Z over sources with distributions in *.

(120) For example, the overheads of Lempel-Ziv, frequency-based arithmetic coding, and frequency-based ANS coding can be upper bounded respectively as (e.g., where the last bound requires Q,r,c to be independent of N):

(121) $\begin{matrix} _{LZ} (N; *) \frac{C (\log {.Math. .Math.}^{2})}{\log N} & (Eq . 5.16) \end{matrix}$ $\begin{matrix} _{AC} (N; *) \frac{2 .Math. .Math. (\log N)}{N} & (Eq . 5.17) \end{matrix}$ $\begin{matrix} _{ANS} (N; *) \frac{2 .Math. .Math. (\log N + C_{| |}}{N} & (Eq . 5.18) \end{matrix}$

(122) The proof of this lemma and the main content of the lemma is in the general bound of Eq. 5.15. More explicitly, .sub.z (N; custom character *) can be defined by:

(123) $\begin{matrix} _{Z} (N; *) := \frac{1}{N \log_{2} k}, \max_{p *} {_{p} len (Z (Y^{N})) - H (Y^{N})} & (Eq . 5.19) \end{matrix}$
where custom character ([k]): ={().sub.ik :0i.sub.ik =1} is the set of probability distributions over [k], and Y.sup.N is a vector with entries Y.sub.i. The upper bound (5.16) may be closely related to classical results.
A. Notations, Definition, and Basic Facts

(124) The data take the form

(125) $X_{1}^{n} \in \mathcal{X}^{n},$
where 𝒳 is a finite alphabet. Consider lossless encoders E: 𝒳.sup.n→{0,1}.sup.*, and denote by

(126) $k(x_{1}^{n}; E) = len(E(x_{1}^{n}))$
the length of the encoded sequence on input x.sub.1.sup.n.

(127) For a≤b, the notation

(128) $x_{a}^{b} = (x_{a}, \ldots, x_{b})$
can denote the substring of x. Given a sequence x.sub.1.sup.n, denote by

(129) $\hat{p}_{k}(\cdot; x_{1}^{n})$
the empirical distribution of sequences of length k. Namely, for w∈𝒳.sup.k, let,

(130) $\hat{p}_{k}(w; x_{1}^{n}) := \frac{1}{n - k + 1} \sum_{i=1}^{n-k+1} 1_{x_{i}^{i+k-1} = w}$
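The empirical distribution of length-k windows can be computed directly from its definition. This is a minimal sketch; the function name is hypothetical and windows are represented as tuples.

```python
from collections import Counter

def empirical_kgram_distribution(x, k):
    """Empirical distribution of the n-k+1 sliding windows of length k in x."""
    total = len(x) - k + 1
    counts = Counter(tuple(x[i:i + k]) for i in range(total))
    return {w: cnt / total for w, cnt in counts.items()}

dist = empirical_kgram_distribution("abab", 2)
```

For the input "abab" with k=2 there are three windows (ab, ba, ab), so the window ab receives weight 2/3 and ba receives 1/3.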

(131) The Shannon entropy of a probability distribution p over a finite or countable set 𝒵 is given by,

(132) $H(p) = - \sum_{z \in \mathcal{Z}} p(z) \log_{2} p(z)$

(133) Denote by h(ε) the entropy of a Bernoulli random variable Z with ℙ(Z=1)=1−ℙ(Z=0)=ε, given by,

(134) $h(\varepsilon) = - \varepsilon \log_{2} \varepsilon - (1 - \varepsilon) \log_{2}(1 - \varepsilon)$
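Both entropy functions translate directly into code. The sketch below uses base-2 logarithms and the usual convention that terms with zero probability contribute zero; the function names are hypothetical.

```python
import math

def shannon_entropy(p):
    """Entropy in bits of a probability vector, with 0*log(0) taken as 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def binary_entropy(eps):
    """Entropy in bits of a Bernoulli(eps) random variable."""
    return shannon_entropy([eps, 1 - eps])
```

As a sanity check, a fair coin has one bit of entropy and the uniform distribution on four symbols has two.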

(135) Lemma A.1. Let A be a finite set and F: A→{0, 1}* be an injective map. Then, for any probability distribution p over A

(136) $\sum_{a \in A} p(a) \, len(F(a)) \ge H(p) - \log_{2} \log_{2}(|A| + 2)$

(137) Proof. Assume without loss of generality that A={1, . . . , K}, with |A|=K, and that the elements of A all have non-vanishing probability and are ordered by decreasing probability $p_{1} \ge p_{2} \ge \ldots \ge p_{K} > 0$. Let $N_{\ell} := 2 + 4 + \ldots + 2^{\ell} = 2^{\ell + 1} - 2$. Then the expected length is minimized by any map F such that $len(F(a)) = \ell$ for $N_{\ell - 1} < a \le N_{\ell}$, with the maximum length $\ell_{K}$ being defined by $N_{\ell_{K} - 1} < K \le N_{\ell_{K}}$. For A~p and L:=len(F(A)), then:

(138) $H(p) = H(A) \overset{(a)}{=} H(L) + H(A | L) \le \log_{2} \ell_{K} + \sum_{\ell = 1}^{\ell_{K}} \mathbb{P}(L = \ell) H(A | L = \ell) \overset{(b)}{\le} \log_{2} \ell_{K} + \sum_{\ell = 1}^{\ell_{K}} \mathbb{P}(L = \ell) \, \ell \le \log_{2} \log_{2}(K + 2) + \sum_{a \in A} p(a) \, len(F(a))$
where (a) is the chain rule of entropy and (b) follows because, by injectivity, given len(F(A))=ℓ, A can take at most 2.sup.ℓ values.
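The inequality of the lemma can be checked numerically against the length assignment used in the proof, where the most probable elements receive the shortest non-empty binary strings. This is a sketch for illustration only; the helper names are hypothetical and the random distributions are arbitrary test inputs.

```python
import math
import random

def optimal_injective_lengths(K):
    """Lengths of the K shortest non-empty binary strings, shortest first.

    The a-th string (1-indexed) has the smallest length l with 2**(l+1) - 2 >= a.
    """
    return [(a + 1).bit_length() - 1 for a in range(1, K + 1)]

def expected_length_and_bound(p):
    """Expected length of the best injective map, and the lemma's lower bound."""
    p_sorted = sorted(p, reverse=True)
    lengths = optimal_injective_lengths(len(p_sorted))
    expected_len = sum(q * l for q, l in zip(p_sorted, lengths))
    entropy = -sum(q * math.log2(q) for q in p_sorted if q > 0)
    bound = entropy - math.log2(math.log2(len(p_sorted) + 2))
    return expected_len, bound

def random_distribution(K, rng):
    weights = [rng.random() + 1e-9 for _ in range(K)]
    total = sum(weights)
    return [w / total for w in weights]
```

For the uniform distribution on six elements the lengths are 1, 1, 2, 2, 2, 2, giving an expected length of 5/3 against a bound of log2(6)−log2(3)=1.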
B Proofs of results on ideal compression
Proof of Lemma 5.1

(139) The claim is that:

(140) $\begin{matrix} \frac{1}{mn} H(X^{m,n}) = H(X | U, V) + \frac{1}{mn} I(X^{m,n}; U^{m}, V^{n}) & (B .1) \end{matrix}$
By the definition of mutual information, H(X.sup.m,n)=H(X.sup.m,n|U.sup.m, V.sup.n)+I(X.sup.m,n; U.sup.m, V.sup.n). Equation B.1 follows by noting that:

(141) $H(X^{m,n} | U^{m}, V^{n}) = \sum_{u \in \mathcal{L}^{m}} \sum_{v \in \mathcal{L}^{n}} \mathbb{P}(U^{m} = u, V^{n} = v) H(X^{m,n} | U^{m} = u, V^{n} = v) \overset{(a)}{=} \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{u \in \mathcal{L}^{m}} \sum_{v \in \mathcal{L}^{n}} \mathbb{P}(U^{m} = u, V^{n} = v) H(X_{i,j} | U_{i} = u_{i}, V_{j} = v_{j}) = \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{u_{i} \in \mathcal{L}} \sum_{v_{j} \in \mathcal{L}} \mathbb{P}(U_{i} = u_{i}, V_{j} = v_{j}) H(X_{i,j} | U_{i} = u_{i}, V_{j} = v_{j}) = \sum_{i=1}^{m} \sum_{j=1}^{n} H(X_{i,j} | U_{i}, V_{j}) \overset{(b)}{=} mn H(X_{1,1} | U_{1}, V_{1})$
where (a) follows from the fact that the (X.sub.i,j) are conditionally independent given U.sup.m, V.sup.n, and since the conditional distribution of X.sub.i,j only depends on U.sup.m, V.sup.n via U.sub.i, V.sub.j; (b) holds because the triples (X.sub.i,j, U.sub.i, V.sub.j) are identically distributed.

(142) The lower bound in Eq. 5.1 holds because mutual information is non-negative, and the upper bound because I(X.sup.m,n; U.sup.m, V.sup.n)≤H(U.sup.m, V.sup.n)=mH(U.sub.1)+nH(V.sub.1).

(143) Finally, Eq. 5.2 holds because

(144) $I(X^{m,n}; U^{m}, V^{n}) = H(U^{m}, V^{n}) - H(U^{m}, V^{n} | X^{m,n}) \ge m H(U_{1}) + n H(V_{1}) - \sum_{i=1}^{m} H(U_{i} | X^{m,n}) - \sum_{j=1}^{n} H(V_{j} | X^{m,n}) = m H(U_{1}) + n H(V_{1}) - m H(U_{1} | X^{m,n}) - n H(V_{1} | X^{m,n})$
Proof of Lemma 5.3

(145) Lemma B.1. Let = custom character , == be collections of mutually independent random variables taking values in a measurable space Z.x.Z.sup.3.fwdarw.Z, F: .fwdarw.. Define x(, , )via x(, , ).sub.ij=x (.sub.ij, .sub.i, .sub.j).

(146) Given a vector of independent random variables z, let Var.sub.zj(f(z)):= custom character [(f(z)f(z)).sup.2]Define the quantities:

(147) 0 $\begin{matrix} B_{*} := \max .Math. F (x) - F (x^{}) .Math. & (Eq . B .2) \end{matrix}$ $x, x^{}^{m n}$ $d (x, x^{}) 1$ $B_{1} := \max \max |_{} F (x (,,)) -_{} F (x (,^{},)) |$ $\begin{matrix} ,^{}^{m}^{n} & (Eq . B .3) \end{matrix}$ $d (,^{}) 1$ $\begin{matrix} B_{2} := \max \max |_{} F (x (,,)) -_{} F (x (,,^{})) | & (Eq . B .4) \end{matrix}$ $^{m},^{}^{n}$ $d (,^{}) 1$ $\begin{matrix} V_{*} := \sup_{,,} {.Math.}_{i m, j n} {Var}_{_{i j}} {F (x (,, T))} & (Eq . B .5) \end{matrix}$ $\begin{matrix} V_{1} := \sup_{,} {.Math.}_{i m} {Var}_{_{i}} {_{} F (x (,,))}, & (Eq . B .6) \end{matrix}$ $\begin{matrix} V_{2} := \sup_{,} {.Math.}_{j n} {Var}_{_{j}} {_{} F (x (,,))} & (Eq . B .7) \end{matrix}$

(148) Then, for any t0, the following holds with probability at least 18e.sup.t:
|F(x(,,) custom character F(x(,,)2 max(2V.sub.*t+2V.sub.1t+2V.sub.2t;(B.sub.*+B.sub.1+B.sub.2)t). (Eq. B.8)

(149) Proof. Let z∈𝒵.sup.N be a vector of independent random variables and f: 𝒵.sup.N→ℝ. Define the martingale X.sub.k:=𝔼[f(z)|ℱ.sub.k] (where ℱ.sub.k:=σ(z.sub.1, . . . , z.sub.k)). Then it follows:

(150) $\begin{matrix} ess \sup | X_{k} - X_{k-1} | \le B_{0} := \sup_{d(z, z') \le 1} | f(z) - f(z') |, & (Eq . B .9) \end{matrix}$ $\begin{matrix} \sum_{k=1}^{N} \mathbb{E}[(X_{k} - X_{k-1})^{2} | \mathcal{F}_{k-1}] = \sum_{k=1}^{N} \mathbb{E}[(\mathbb{E}[f | z_{<k}, z_{k}] - \mathbb{E}_{z'_{k}}[f | z_{<k}, z'_{k}])^{2} | z_{<k}] & (Eq . B .10) \end{matrix}$ $\begin{matrix} \le V_{0} := \sup_{z \in \mathcal{Z}^{N}} \sum_{k=1}^{N} Var_{z_{k}}(f(z)) & (Eq . B .11) \end{matrix}$

(151) By Freedman's inequality, with probability at least 1−2e.sup.−t, it follows:

(152) $\begin{matrix} | f(z) - \mathbb{E} f(z) | \le \max \left( \sqrt{2 V_{0} t} ; \frac{2 B_{0} t}{3} \right) & (Eq . B .12) \end{matrix}$

(153) Define E(,):= custom character F(x(, , )), L():=(x(, , )). Applying the above inequality, each of the following holds with probability at least 1-2e.sup.t:

(154) $\begin{matrix} .Math. F (x (,,)) - E (,) .Math. \max (\sqrt{2 V_{*} t} : \frac{2 B_{*} t}{3}) & (Eq . B .13) \end{matrix}$ $\begin{matrix} .Math. E (,) - L () .Math. \max (\sqrt{2 V_{1} t} : \frac{2 B_{1} t}{3}) & (Eq . B .14) \end{matrix}$ $\begin{matrix} .Math. L () - F (x) .Math. \max (\sqrt{2 V_{2} t} : \frac{2 B_{2} t}{3}) & (Eq . B .15) \end{matrix}$
and the claim follows by union bound.

(155) The following proves a stronger version of Lemma 5.3.

(156) Lemma B.2. For X custom character , let P(X)=(X) the probability applied by the model (Q, r, c; , ) to matrix X, i.e.,
P(X)=.sub.(i,j)[m][n]Q(X.sub.ij,)r().sub.j[n]c(.sub.)(Eq. B.16)

(157) Define the following quantities:

(158) $\begin{matrix} M_{*} := \max_{x, x^{} X} \max_{u, v} .Math. \log \frac{Q (x | u, v)}{Q (x | u, v)} .Math. & (Eq . B .17) \end{matrix}$ $\begin{matrix} M_{1} := \max_{,,^{}} {.Math. Q (.Math. |,) - Q (.Math. |^{},) .Math.}_{T V} \max_{u, v, x, x^{}} .Math. \log \frac{Q (x | u, v)}{Q (x | u, v)} .Math. & (Eq . B .18) \end{matrix}$ $\begin{matrix} M_{2} := \max_{,^{} .} {.Math. Q (.Math. |,) - Q (.Math. |,^{}) .Math.}_{T V} \max_{u, v, x, x^{}} .Math. \log \frac{Q (x | u, v)}{Q (x | u, v)} .Math. & (Eq . B .19) \end{matrix}$ $\begin{matrix} (Eq . B .20) \end{matrix}$ $s_{*} := \frac{1}{2} \max_{u_{0}, v_{0}} {.Math.}_{x, x^{} X} Q (x | u_{0} v_{0}) Q (x^{} | u_{0} v_{0}) {\max_{u, v} (\log \frac{Q (x | u, v)}{Q (x | u, v)})}^{2}$ $\begin{matrix} (Eq . B .21) \end{matrix}$ $s_{1} := \frac{1}{2} \max_{u_{0}, u_{0}^{} v_{0}} {.Math. Q (.Math. | u_{0}, v_{0}) - Q (.Math. | u_{0}, v_{0}) .Math.}_{T V} \max_{x, x^{}} {\max_{u, v} (\log \frac{Q (x | u, v)}{Q (x | u, v)})}^{2}$ $\begin{matrix} (Eq . B .22) \end{matrix}$ $s_{2} := \frac{1}{2} \max_{u_{0}, v_{0}, v_{0}} {.Math. Q (.Math. | u_{0}, v_{0}) - Q (.Math. | u_{0}, v_{0}^{}) .Math.}_{TV} \max_{x, x^{}} {\max_{u, v} (\log \frac{Q (x | u, v)}{Q (x | u, v)})}^{2}$

(159) Then, for X custom character (Q, r, c; m, n) and any t0 the following bound holds with probability at least 2e.sup.t:

(160) $\begin{matrix} .Math. - \log P (X) - H (X) .Math. 3 \max (\sqrt{s_{*} m n t} + \sqrt{s_{1} m n^{2} t} + \sqrt{s_{2} m^{2} n t}, M_{*} + M_{1} n + M_{2} m) & (Eq . B .23) \end{matrix}$

(161) Proof. Let

(162) $= {(_{i})}_{i m}_{i i d} r, = {(_{i})}_{i n}_{i i d} c, = {(_{i j})}_{i m, j n}_{i i d} Unif ([0, 1]), and x : [0, 1] .fwdarw.$
be such that x(.sub.ij, .sub.i, .sub.i)|.sub.i, .sub.iQ (.Math.|.sub.i; .sub.i). It defines F(x)=log P(x), and will apply Lemma B.1 to this function. Using the notation from that lemma, it claims that B.sub.*M.sub.*, B.sub.1M.sub.1 custom character , B.sub.2M.sub.2, and V , V.sub.1s.sub.1, V.sub.2s.sub.2.

(163) Note that, if (x.sub.ij), (x.sub.ij) differ only for entry i, j, then:

(164) $\begin{matrix} F (x) - F (x^{}) = - \log E_{u, v | x} {\frac{Q (x_{i j}^{} | u_{i}, v_{j})}{Q (x_{i j} | u_{i}, v_{j})}} & (Eq . B .24) \end{matrix}$
where custom character denotes expectation with respect to the posterior measure P(, |X=x). This immediately implies BM.

(165) Next consider the constant B.sub.1 defined in Eq. (B.3). Using the exchangeability of the (.sub.i,., .sub.i), it gets:

(166) $B_{1} = \max_{} .Math._{} \log_{u, v | x} {\overset{n}{\underset{j = 1}{.Math.}} \frac{Q (x (_{1, j,}_{1}^{},_{j}) | u_{1}, v_{j})}{Q (x (_{1, j,}_{1},_{j} (| u_{1}, v_{j})}} .Math. \max_{}_{} \max_{u, v} .Math. \log {{.Math.}_{j = 1}^{n} \frac{Q (x (_{1, j,}_{1}^{},_{j}) | u_{1}, v_{j})}{Q (x (_{1, j,}_{1},_{j}) | u_{1}, v_{j})}} .Math. \max_{} {.Math.}_{j = 1}^{n}_{} \max_{u, v} .Math. \log {\frac{Q (x (_{1, j,}_{1}^{},_{j}) | u_{1}, v_{j})}{Q (x (_{1, j,}_{1},_{j}) | u_{1}, v_{j})}} .Math. n \max_{,^{}}_{} \max_{u, v} .Math. \log \frac{Q (x (_{1},_{1}^{},_{j}) | u_{1}, v_{j})}{Q (x (_{1},_{1},_{j}) | u_{1}, v_{j})} .Math. n \max_{,^{}} {.Math. Q (.Math. |,) - Q (.Math. |^{},) .Math.}_{TV} \max_{u, v, x, x^{}} .Math. \log \frac{Q (x | u, v)}{Q (x^{} | u, v)} .Math. = M_{1}$
where the bound B.sub.2M.sub.2m is proved analogously.

(167) Consider now the quantity Va of Eq. (B.5). Denote by .sub.(ij)(t) the array obtained by replacing entry (i, j) in {by t, and by x (t)=x (.sub.(ij)(t), , ). Then it follows:

(168) ${Var}_{_{i j}} (F (x)) = \frac{1}{2}_{^{},^{}} {{(F (x (_{(i j)} (^{}),,)) - F (x (_{(i j)} (^{}),,)))}^{2}} = \frac{1}{2}_{^{},^{}} {{(\log E_{u, v | x (^{})} {\frac{Q (x (^{},_{i},_{j}) | u_{i}, v_{j})}{Q (x (^{},_{i},_{j}) | u_{i}, v_{j})}})}^{2}} \frac{1}{2}_{^{},^{}} {\max_{u, v} (\log {\frac{Q (x (^{},_{i},_{j}) | u_{i}, v_{j})}{Q (x (^{},_{i},_{j}) | u_{i}, v_{j})}})}^{2} = \frac{1}{2} \underset{x, x^{}}{.Math.} Q (x |,) Q (x^{} |,) {\max_{u, v} (\log {\frac{Q (x | u, v)}{Q (x^{} | u, v)}})}^{2}$

(169) Next, as claimed:

(170) 0 $V_{*} \max_{,,} \underset{i m, j n}{.Math.} {Var}_{_{ij}} {F (x)} mn \max_{,,} {Var}_{_{ij}} {F (x)} {mns}_{*}$

(171) Finally, consider the quantity V.sub.1 of Eq. (B.6) (the argument is similar for V.sub.2). Denote by .sub.(i)(t) the vector obtained by replacing entry i in by t. Proceeding as above, it follows:

(172) ${Var}_{_{i}} (_{} F (x)) = \frac{1}{2}_{^{},^{}} {{(_{} F (x (,_{(i)} (^{}),)) -_{} F (x (,_{(i)},)))}^{2}} = \frac{1}{2}_{^{},^{}} {{(_{} \log E_{u, v | x (^{})} {{.Math.}_{j = 1}^{n} \frac{Q (x (_{ij},^{},_{j}) | u_{i}, v_{j})}{Q (x (_{ij},^{},_{j}) | u_{i}, v_{j})}})}^{2}} \frac{1}{2}_{^{},^{}} {{(_{} \log {{.Math.}_{j = 1}^{n} \max_{u, v} \frac{Q (x (_{ij},^{},_{j}) | u_{i}, v_{j})}{Q (x (_{ij},^{},_{j}) | u_{i}, v_{j})}})}^{2}} \frac{1}{2}_{^{},^{}} {{({.Math.}_{j = 1}^{n}_{} \log {\max_{u, v} \frac{Q (x (,^{},_{j}) | u, v)}{Q (x (,^{},_{j}) | u, v)}})}^{2}} \frac{n^{2}}{2} \max_{}_{^{},^{}} {{(_{} \log {\max_{u, v} \frac{Q (x (,^{},_{j}) | u, v)}{Q (x (,^{},_{j}) | u, v)}})}^{2}} \frac{n^{2}}{2} \max_{,^{}} {.Math. Q (.Math. |,) - Q (.Math. |^{},) .Math.}_{TV} {\max_{x, x^{1}} (_{} \log {\max_{u, v} \frac{Q (x^{} | u, v)}{Q (x | u, v)}})}^{2} = n^{2} s_{1}$

(173) Therefore,

(174) $V_{1} = \max_{,} {.Math.}_{i = 1}^{m} {Var}_{_{i}} (_{} F (x)) m n^{2} s_{1}$
C Proofs for Finite State Encoders

(175) As described above, a finite-state encoder is defined by a triple (Σ, f, g). Formally, the action of f, g on

(176) $x_{1}^{n} \in \mathcal{X}^{n}$
can be defined recursively via

(177) $\begin{matrix} f_{m+1}(x_{1}^{m+1}, s_{0}) = f_{m}(x_{1}^{m}, s_{0}) \oplus f(x_{m+1}, g_{m}(x_{1}^{m}, s_{0})) & (Eq . C .1) \end{matrix}$ $\begin{matrix} g_{m+1}(x_{1}^{m+1}, s_{0}) = g(x_{m+1}, g_{m}(x_{1}^{m}, s_{0})) & (Eq . C .2) \end{matrix}$
where ⊕ denotes concatenation, and the encoder is thus given by

(178) $E(x_{1}^{n}) = f_{n}(x_{1}^{n}, s_{init}) .$

(179) The state space is non-degenerate if, for each s.sub.1∈Σ, there exists

(180) $m, \; x_{1}^{m} \in \mathcal{X}^{m}$
such that

(181) $g_{m}(x_{1}^{m}, s_{init}) = s_{1} .$
Notice that, if the state space is degenerate, one or more symbols can be removed from Σ without changing the encoder, making the state space non-degenerate. Non-degeneracy may be assumed hereafter without mentioning it.

(182) The FS encoder is information lossless (IL) if for any

(183) $n \in \mathbb{N}, \; x_{1}^{n} \mapsto f_{n}(x_{1}^{n}, s_{init})$
is injective.

(184) Remark C.1. An information-lossless encoder satisfies a stronger condition: for any m∈ℕ and any s.sub.*∈Σ, the map

(185) $x_{1}^{m} \mapsto f_{m}(x_{1}^{m}, s_{*})$
is injective.

(186) Assume this were not the case; then there would exist two distinct inputs

(187) $x_{1}^{m}, {\tilde{x}}_{1}^{m} \in \mathcal{X}^{m}$
and a state s.sub.*, such that

(188) $f_{m}(x_{1}^{m}, s_{*}) = f_{m}({\tilde{x}}_{1}^{m}, s_{*}) .$
By non-degeneracy, there exists

(189) $a_{1}^{\ell} \in \mathcal{X}^{\ell}$
such that

(190) $s_{*} = g_{\ell}(a_{1}^{\ell}, s_{init}).$
Defining the concatenated inputs

(191) $n = \ell + m, \; y_{1}^{n} = a_{1}^{\ell} x_{1}^{m}, \; {\tilde{y}}_{1}^{n} = a_{1}^{\ell} {\tilde{x}}_{1}^{m},$
it is easy to check that these inputs are distinct but

(192) $f_{n}(y_{1}^{n}, s_{init}) = f_{n}({\tilde{y}}_{1}^{n}, s_{init}) .$

(193) Proposition C.1. Define the compression rate on input

(194) $x_{1}^{n} \text{ as } R(x_{1}^{n}) = len(f_{n}(x_{1}^{n}, s_{init})) / (n \log_{2} |\mathcal{X}|) .$
Then for any ℓ≥1, the following holds (where n′:=n−2ℓ and recall that M:=|Σ|):

(195) $\begin{matrix} R(x_{1}^{n}) \ge \frac{n - 2\ell}{n \ell \log_{2} |\mathcal{X}|} H(\hat{p}^{\ell}_{x_{1}^{n'}}) - \frac{1}{\ell \log_{2} |\mathcal{X}|} \left( \log_{2}(|\Sigma| \cdot \ell) + \log_{2} \log_{2} |\mathcal{X}| \right) & (Eq . C .3) \end{matrix}$

(196) Proof. Denote by L(x.sub.1.sup.m; s.sub.*) the length of the encoding of x.sub.1.sup.m when starting in state s.sub.*:

(197) $\begin{matrix} L(x_{1}^{m}; s_{*}) := len(f_{m}(x_{1}^{m}, s_{*})) & (Eq . C .4) \end{matrix}$

(198) It then follows that, for any b∈{0, . . . , ℓ−1}, and setting by convention s.sub.0=s.sub.init:

(199) $\begin{matrix} R(x_{1}^{n}) \ge \frac{1}{n \log_{2} |\mathcal{X}|} \sum_{k=0}^{\lfloor n / \ell \rfloor - 2} L(x_{k\ell + b + 1}^{(k+1)\ell + b}; s_{k\ell + b}) & (Eq . C .5) \end{matrix}$

(200) By averaging over b, and introducing the shorthand n′:=n−2ℓ, it gets:

(201) $\begin{matrix} R(x_{1}^{n}) \ge \frac{1}{n \ell \log_{2} |\mathcal{X}|} \sum_{m=1}^{\ell (\lfloor n / \ell \rfloor - 1)} L(x_{m}^{m + \ell - 1}; s_{m-1}) & (Eq . C .6) \end{matrix}$ $\begin{matrix} \ge \frac{n - 2\ell}{n \ell \log_{2} |\mathcal{X}|} \sum_{s \in \Sigma} \sum_{u_{1}^{\ell} \in \mathcal{X}^{\ell}} \hat{p}^{\ell}_{x_{1}^{n'}}(u_{1}^{\ell}, s) L(u_{1}^{\ell}; s) & (Eq . C .7) \end{matrix}$ $\begin{matrix} \overset{(a)}{\ge} \frac{n - 2\ell}{n \ell \log_{2} |\mathcal{X}|} \left\{ \sum_{s \in \Sigma} \hat{p}^{\ell}_{x_{1}^{n'}}(s) H(\hat{p}^{\ell}_{x_{1}^{n'}}(\cdot | s)) - \log_{2} \log_{2}(|\mathcal{X}|^{\ell}) \right\} & (Eq . C .8) \end{matrix}$
where (a) holds by Lemma A.1. By the chain rule of entropy (recalling that M:=|Σ|), it follows:

(202) $\sum_{s \in \Sigma} \hat{p}^{\ell}_{x_{1}^{n'}}(s) H(\hat{p}^{\ell}_{x_{1}^{n'}}(\cdot | s)) = H(X_{1}^{\ell} | S) = H(X_{1}^{\ell}) + H(S | X_{1}^{\ell}) - H(S) \ge H(X_{1}^{\ell}) - \log_{2} M = H(\hat{p}^{\ell}_{x_{1}^{n'}}) - \log_{2} M$
where the claim (C.3) follows by using the last inequality in Eq. C.8.

(203) Theorem 3. Let X.sup.m,n~𝒯(Q, r, c; m, n) and (Σ, f, g) be an information lossless finite state encoder. With an abuse of notation, denote by f.sub.mn(X.sup.m,n, s.sub.init)∈{0,1}* the binary sequence obtained by applying the finite state encoder to the vector vec(X.sup.m,n) obtained by scanning X.sup.m,n in row-first order. Define the compression rate by:

(204) $\begin{matrix} R(X^{m,n}) := \frac{len(f_{mn}(X^{m,n}, s_{init}))}{mn \log_{2} |\mathcal{X}|} & (Eq . C .9) \end{matrix}$

(205) Assuming n>10, |Σ|≥|𝒳|, and log.sub.2|Σ|<n log.sub.2|𝒳|/9, the expected compression rate is lower bounded as follows:

(206) $\begin{matrix} \mathbb{E} R(X^{m,n}) \ge \frac{H(X | U)}{\log_{2} |\mathcal{X}|} - 10 \sqrt{\frac{\log |\Sigma|}{n \log |\mathcal{X}|}} \cdot \log(n \log |\mathcal{X}|) & (Eq . C .10) \end{matrix}$

(207) Proof. Let N:=mn, N′:=N−2ℓ where ℓ≤n/3 will be selected later. Let X.sup.N:=vec(X.sup.m,n) be the vectorization of X.sup.m,n, and X.sup.N′ the vector comprising its first N′ entries. Recall the definition of empirical distribution. For any fixed w∈𝒳.sup.ℓ:

(208) $\hat{p}^{\ell}_{X^{N'}}(w) := \frac{1}{N' - \ell + 1} \sum_{i=1}^{N' - \ell + 1} 1_{X_{i}^{i + \ell - 1} = w}$

(209) Let S:={i∈[N′−ℓ+1]: [i, i+ℓ−2]∩nℕ=∅}. These are the blocks of length ℓ that do not cross the end of a line in the table. Since for each line break there are at most ℓ−1 blocks that cross it, it follows that |S|≥N′−ℓ+1−(m−1)(ℓ−1). Consider the following modified empirical distribution:

(210) $\bar{p}^{\ell}_{X^{N'}}(w) := \frac{1}{|S|} \sum_{i \in S} 1_{X_{i}^{i + \ell - 1} = w}$
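The counting argument for the set S of windows that stay inside a single row can be checked by direct enumeration. The sketch below assumes a row-first scan of an m×n table with 1-indexed positions and takes the truncated length N′ as an input; the function name is hypothetical.

```python
def non_crossing_starts(n, ell, n_prime):
    """Start positions i in [1, n_prime - ell + 1] (1-indexed) whose window
    i .. i+ell-1 lies entirely within one row of length n."""
    return [i for i in range(1, n_prime - ell + 2)
            if (i - 1) // n == (i + ell - 2) // n]

# Toy table with m=3 rows and n=5 columns, windows of length 2.
m, n, ell = 3, 5, 2
N = m * n
N_prime = N - 2 * ell
S = non_crossing_starts(n, ell, N_prime)
lower_bound = N_prime - ell + 1 - (m - 1) * (ell - 1)
```

In this toy case the only excluded starts are 5 and 10, the last positions of the first two rows, so the lower bound on |S| is attained with equality.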

(211) Then, by construction:

(212) $\hat{p}^{\ell}_{X^{N'}}(w) = (1 - \varepsilon_{\ell}) \, \bar{p}^{\ell}_{X^{N'}}(w) + \varepsilon_{\ell} \, q^{\ell}_{X^{N'}}(w), \quad \varepsilon_{\ell} := 1 - \frac{|S|}{N' - \ell + 1} = \frac{(m - 1)(\ell - 1)}{N' - \ell + 1}$
where $q^{\ell}_{X^{N'}}$ is the empirical distribution of blocks that do cross the line. By concavity of the entropy, it follows:

(213) $\begin{matrix} H(\hat{p}^{\ell}_{X^{N'}}) \ge (1 - \varepsilon_{\ell}) H(\bar{p}^{\ell}_{X^{N'}}) + \varepsilon_{\ell} H(q^{\ell}_{X^{N'}}) \ge (1 - \varepsilon_{\ell}) H(\bar{p}^{\ell}_{X^{N'}}) & (Eq . C .11) \end{matrix}$

(214) Further, since ℓ≤n/3:

(215) $\begin{matrix} \varepsilon_{\ell} = \frac{(m - 1)(\ell - 1)}{mn - 3\ell + 1} \le \frac{(m - 1) \ell}{(m - 1) n} = \frac{\ell}{n} & (Eq . C .12) \end{matrix}$

(216) Now, let the row latents u:=(u.sub.i).sub.i≤m be fixed, and denote by $\hat{r}_{u}^{S}$ their weighted empirical distribution, defined as follows:

(217) $\hat{r}_{u}^{S}(s) := \sum_{i=1}^{m} \frac{| S \cap [(i - 1) n + 1, in] |}{|S|} 1_{u_{i} = s}$
where custom character is the empirical distribution of the latents ().sub.im where row i is weighted by its contribution to S. Note that all the weights are equal to (n2(1))/|S| except, potentially, for the last one.

(218) It follows:

(219) $p_{*}^{\ell}(w) := \mathbb{E}[\bar{p}^{\ell}_{X^{N'}}(w) | u] = \sum_{u \in \mathcal{L}} \hat{r}_{u}^{S}(u) \prod_{i=1}^{\ell} Q_{x|u}(w_{i} | u), \quad Q_{x|u}(x | u) := \sum_{v \in \mathcal{L}} Q(x | u, v) \, c(v)$

(220) Using Eq. (C.11), (C.12), and concavity of the entropy, it gets:

(221) $\begin{matrix} \mathbb{E}[H(\hat{p}^{\ell}_{X^{N'}}) | u] \ge (1 - \frac{\ell}{n}) H(p_{*}^{\ell}) & (Eq . C .13) \end{matrix}$

(222) By Proposition C.1, it gets:

(223) $\mathbb{E}[R(X^{m,n}) | u] \ge \frac{mn - 2\ell}{mn \ell \log_{2} |\mathcal{X}|} (1 - \frac{\ell}{n}) H(p_{*}^{\ell}) - \frac{1}{\ell \log_{2} |\mathcal{X}|} \left( \log_{2}(|\Sigma| \cdot \ell) + \log_{2} \log_{2} |\mathcal{X}| \right) \ge \frac{1}{\ell \log_{2} |\mathcal{X}|} H(p_{*}^{\ell}) - \frac{2\ell}{n} - \frac{1}{\ell \log_{2} |\mathcal{X}|} \left( \log_{2}(|\Sigma| \cdot \ell) + \log_{2} \log_{2} |\mathcal{X}| \right)$
where the last inequality uses the fact that $H(p_{*}^{\ell}) \le \ell \log_{2} |\mathcal{X}|$. The choice of ℓ is:

(224) $\begin{matrix} \ell = \sqrt{\frac{n \log_{2} |\Sigma|}{\log_{2} |\mathcal{X}|}} \le \frac{n}{3} & (Eq . C .14) \end{matrix}$

(225) Substituting and simplifying, it gets:

(226) $\begin{matrix} [R (X^{m, n}) | u] \frac{H (p_{*}^{})}{\log_{2} .Math. .Math.} - \frac{1 0}{\log_{2} .Math. .Math.} .Math. \sqrt{\frac{\log .Math. .Math. .Math.}{\log .Math. .Math.}} .Math. \log (n \log .Math. .Math. .Math.) & (Eq . C . 15) \end{matrix}$

(227) Finally, let (W.sub.1, . . . , W.sub.ℓ, U) be random variables with joint distribution

(228) $\hat{r}_{u}^{S}(u) \prod_{i=1}^{\ell} Q_{x|u}(w_{i} | u) .$
Then,

(229) $\begin{matrix} H(p_{*}^{\ell}) \ge \sum_{u} \hat{r}_{u}^{S}(u) H(Q_{x|u}^{\otimes \ell}(\cdot | u)) & (Eq . C .16) \end{matrix}$ $\begin{matrix} = \ell \sum_{u} \hat{r}_{u}^{S}(u) H(X | U = u) & (Eq . C . 17) \end{matrix}$
and therefore

(230) $\mathbb{E} \, H(p_{*}^{\ell}) \ge \ell \, H(X | U),$
finishing the proof.
D Proofs for Lempel-Ziv coding

(231) It is useful to define for each k≤N,

(232) $\begin{matrix} L_{k}(X^{N}) := \max \{ \ell \ge 1 : \exists j \in \{1, \ldots, k - 1\} \; s.t. \; X_{j}^{j + \ell - 1} = X_{k}^{k + \ell - 1} \} & (Eq . D .1) \end{matrix}$ $\begin{matrix} T_{k}(X^{N}) := \max \{ j \in \{1, \ldots, k - 1\} \; s.t. \; X_{j}^{j + L_{k} - 1} = X_{k}^{k + L_{k} - 1} \} & (Eq . D .2) \end{matrix}$
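The longest-match length and pointer defined above can be computed by brute force. The following quadratic-time sketch is for illustration only and is not the parsing routine of FIG. 4; positions are 1-indexed to match the text, matches are allowed to overlap the current position, and `longest_match` is a hypothetical name that returns (0, None) when no earlier match exists.

```python
def longest_match(x, k):
    """Return (L_k, T_k) for the 1-indexed position k of the sequence x.

    L_k is the longest length l such that x[j..j+l-1] == x[k..k+l-1] for some
    earlier start j < k; T_k is the largest such j for that length.
    """
    n = len(x)
    best_len, best_pos = 0, None
    for j in range(1, k):
        ell = 0
        # Extend the match while the window starting at k stays inside x.
        while k + ell <= n and x[j - 1 + ell] == x[k - 1 + ell]:
            ell += 1
        if ell > 0 and ell >= best_len:
            best_len, best_pos = ell, j   # later j wins ties, giving the max j
    return best_len, best_pos

match = longest_match("abcabcab", 4)
```

For the input "abcabcab" at k=4 the match starting at position 1 extends to the end of the sequence, so the sketch returns length 5 with pointer 1.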
Proof of Theorem 2

(233) Lemma D.1. Under Assumption 1, there exists a constant C such that the following holds with probability at least $1 - N^{-10}$:

(234) $\begin{matrix} \max_{k \le N} L_{k}(X^{N}) \le C \log N & \text{(Eq. D.3)} \end{matrix}$

(235) Proof. A slightly different setting is considered first, and the original question is then shown to reduce to it. Let $(Z_{i})_{i \ge 1}$ be independent random variables with $Z_{i} \sim q_{i}$, where each $q_{i}$ is a probability distribution over $\mathcal{X}$. Further assume $\max_{x \in \mathcal{X}} q_{i}(x) \le 1 - c$ for all $i \ge 1$. The claim is that, for any $t, \ell \ge 1$:

(236) $\begin{matrix} \mathbb{P}(Z_{1}^{\ell} = Z_{t+1}^{t+\ell}) \le (1-c)^{\ell} & \text{(Eq. D.4)} \end{matrix}$

(237) Condition on the event

(238) $Z_{1}^{t} = x_{1}^{t}$
for some $x_{1}, \dots, x_{t} \in \mathcal{X}$. Then the event

(239) $Z_{1}^{\ell} = Z_{t+1}^{t+\ell}$
implies that, for $i \in \{t+1, \dots, t+\ell\}$, $Z_{i} = x_{\pi(i)}$, where $\pi(i) = i \bmod t$ with $\pi(i) \in [1, t]$. Then,

(240) $\mathbb{P}(Z_{1}^{\ell} = Z_{t+1}^{t+\ell}) \le \max_{x_{1}^{t} \in \mathcal{X}^{t}} \mathbb{P}(Z_{1}^{\ell} = Z_{t+1}^{t+\ell} \mid Z_{1}^{t} = x_{1}^{t}) \le \max_{x_{1}^{t} \in \mathcal{X}^{t}} \mathbb{P}(Z_{i} = x_{\pi(i)} \ \forall i \in \{t+1, \dots, t+\ell\} \mid Z_{1}^{t} = x_{1}^{t}) \le \max_{x_{1}^{t} \in \mathcal{X}^{t}} \prod_{i=t+1}^{t+\ell} \mathbb{P}(Z_{i} = x_{\pi(i)}) \le (1-c)^{\ell}$
proving claim (D.4).
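Claim (D.4) can be checked exactly on small cases by enumerating all sequences. The sketch below restricts to i.i.d. symbols with a common distribution `q` (the claim itself only requires independence), and the function name `prob_block_repeat` is a hypothetical helper introduced for illustration.

```python
from itertools import product
from typing import Dict


def prob_block_repeat(q: Dict[str, float], t: int, l: int) -> float:
    """Exact P(Z_1^l = Z_{t+1}^{t+l}) for i.i.d. Z_i ~ q, by full enumeration.

    The cost grows as |alphabet|^(t + l), so this is only meant for tiny cases.
    """
    symbols = list(q)
    total = 0.0
    for z in product(symbols, repeat=t + l):
        # z[:l] is Z_1..Z_l and z[t:t+l] is Z_{t+1}..Z_{t+l} (0-indexed slices).
        if z[:l] == z[t:t + l]:
            p = 1.0
            for s in z:
                p *= q[s]
            total += p
    return total
```

With `q = {"a": 0.7, "b": 0.3}` the largest point mass is 0.7, so c = 0.3 and the claim predicts `prob_block_repeat(q, t, l) <= 0.7 ** l` for every t and l.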

(241) Returning to the original setting:

(242) $\mathbb{P}(\max_{k \le N} L_{k}(X^{N}) \ge \ell) = \mathbb{P}(\exists i < j \le N : X_{i}^{i+\ell-1} = X_{j}^{j+\ell-1}) \le N^{2} \max_{i < j \le N} \mathbb{P}(X_{i}^{i+\ell-1} = X_{j}^{j+\ell-1}) \le N^{2} \max_{u^{m} \in \mathcal{L}^{m}, v^{n} \in \mathcal{L}^{n}} \max_{i < j \le N} \mathbb{P}(X_{i}^{i+\ell-1} = X_{j}^{j+\ell-1} \mid u^{m}, v^{n}) \le N^{2}(1-c)^{\ell}$
where the last inequality follows from claim (D.4), since the $(X_{i})_{i \le N}$ are conditionally independent given the latents $u^{m}$, $v^{n}$, with probability mass function upper bounded by $1-c$. The thesis follows by taking $\ell = 12 \log N / \log(1/(1-c))$.
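As a check on this choice of the block length, writing $\ell = 12 \log N / \log(1/(1-c))$ gives $N^{2}(1-c)^{\ell} = N^{2} \exp\{-\ell \log(1/(1-c))\} = N^{2} \cdot N^{-12} = N^{-10}$, which matches the failure probability $N^{-10}$ in Lemma D.1 with the constant $C = 12/\log(1/(1-c))$.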

(243) For $i \in [m]$, $j \in [n]$, define $\langle ij \rangle := (i-1)n + j$. In words, $k = \langle ij \rangle$ is the index of the entry at row i, column j when the table is scanned in row-first order. For $\ell \ge 1$, define the events:

(244) $\begin{matrix} _{i, j} () := {i^{} [m], j^{} [n] : .Math. i^{} j^{} .Math. < .Math. ij .Math., .Math. j^{} - j .Math., X_{.Math. i^{} j^{} .Math.}^{.Math. i^{} j^{} .Math. + - 1} = X_{.Math. i^{} j^{} .Math.}^{.Math. i^{} j^{} .Math. + - 1}} & (Eq . D .5) \end{matrix}$ $\begin{matrix} _{i, j} () := {i^{} [m], j^{} [n] : .Math. i^{} j^{} .Math. < .Math. ij .Math., .Math. j^{} - j .Math., X_{.Math. i^{} j^{} .Math.}^{.Math. i^{} j^{} .Math. + - 1} = X_{.Math. ij .Math.}^{.Math. ij .Math. + - 1}} & (Eq . D .6) \end{matrix}$

(245) Then, it follows:

(246) $\begin{matrix} (L_{.Math. ij .Math.} (X^{N})) (_{i, j} ()) + (_{i, j} ()) & (Eq . D .7) \end{matrix}$

(247) The next two lemmas control the probabilities of these events.

(248) Lemma D.2. Let custom character (, u):=[(1+)(log N)/H(X|U=u)], n=n(, u), and m.sub.0=m.sup.1-on(1). Under Assumption 1, for any >0, there exist constants C, >0 independent of u, such that the following hold:

(249) $\begin{matrix} \max_{i m,, j n^{}} (_{i, j} ((, u_{i}))) C N^{-} & (Eq . D .8) \end{matrix}$ $\begin{matrix} \min_{m_{0} i m, j n^{}} (_{i, j} ((-, u_{i}))) 1 - C N^{-} & (Eq . D .9) \end{matrix}$

(250) Lemma D.3. Let custom character (, u)=[(1+)(log m)/H(X|U=u, V)], n=n(, u), and m.sub.0=m.sup.1-on(1). Under Assumption 1, for any >0, there exist constants C, >0 independent of u, such that the following hold:

(251) $\begin{matrix} \max_{i m, j n_{c}^{}} (_{i, j} (_{c} (-, u_{i}))) {Cm}^{-} & (Eq . D .10) \end{matrix}$ $\begin{matrix} \min_{m_{0} i m, j n_{c}^{}} (_{i, j} (_{c} (-, u_{i}))) 1 - {Cm}^{-} & (Eq . D .11) \end{matrix}$
With these two lemmas, Theorem 2 can now be proved.

(252) Proof of Theorem 2. Denote by $(k(1), \dots, k(M))$ the values taken by k in the while loop of the Lempel-Ziv pseudocode. In particular,

(253) $\begin{matrix} k(1) = 1 & \text{(Eq. D.12)} \end{matrix}$ $\begin{matrix} k(\ell+1) = k(\ell) + L_{k(\ell)}(X^{N}) & \text{(Eq. D.13)} \end{matrix}$ $\begin{matrix} k(M) = N & \text{(Eq. D.14)} \end{matrix}$

(254) Therefore, the total length of the code is:

(255) $\begin{matrix} \mathrm{len}(LZ(X^{m,n})) = M \lceil \log_{2}(N + |\mathcal{X}|) \rceil + \sum_{\ell=1}^{M} \mathrm{len}(\mathrm{elias}(L_{k(\ell)})) & \text{(Eq. D.15)} \end{matrix}$
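The code length in Eq. (D.15) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pseudocode of the disclosure: the parse is greedy with a brute-force match search, an unmatched symbol is treated as a phrase of length 1, and `elias` is taken to be the Elias gamma code, whose length $2\lfloor \log_{2} L \rfloor + 1$ is consistent with the bound $\mathrm{len}(\mathrm{elias}(L)) \le 2\log_{2} L + 1$ used in the proof. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def elias_gamma_len(v: int) -> int:
    """Bit length of the Elias gamma code of a positive integer v."""
    return 2 * (v.bit_length() - 1) + 1


def lz_phrase_lengths(x: Sequence) -> List[int]:
    """Greedy parse: at each position advance by the longest earlier match."""
    n, k, phrases = len(x), 0, []  # k is 0-indexed here
    while k < n:
        best = 0
        for j in range(k):  # brute-force search over earlier start positions
            l = 0
            while k + l < n and x[j + l] == x[k + l]:
                l += 1
            best = max(best, l)
        step = max(best, 1)  # assumed convention: no match gives a phrase of length 1
        phrases.append(step)
        k += step
    return phrases


def lz_code_length(x: Sequence, alphabet_size: int) -> int:
    """M * ceil(log2(N + |alphabet|)) plus the Elias gamma lengths, as in Eq. (D.15)."""
    phrases = lz_phrase_lengths(x)
    pointer_bits = math.ceil(math.log2(len(x) + alphabet_size))
    return len(phrases) * pointer_bits + sum(elias_gamma_len(l) for l in phrases)
```

For `"abababab"` over a binary alphabet the parse has M = 3 phrases of lengths 1, 1, and 6, each pointer costs 4 bits, and the length fields cost 1 + 1 + 5 bits, for a total of 19 bits.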

(256) By Lemma D.1 (and recalling that $\mathrm{len}(\mathrm{elias}(L)) \le 2\log_{2} L + 1$), with high probability $\max_{\ell \le M} \mathrm{len}(\mathrm{elias}(L_{k(\ell)})) \le 2\log_{2}(C \log N)$. Letting $\mathcal{G}$ denote the good event that this bound holds, on $\mathcal{G}$:

(257) $\begin{matrix} M \log_{2} N \le \mathrm{len}(LZ(X^{m,n})) \le M \lceil \log_{2}(N + |\mathcal{X}|) \rceil + 2M \log_{2}(C \log N) & \text{(Eq. D.16)} \end{matrix}$

(258) Since $|\mathcal{X}|$ is a constant, this means that for any $\varepsilon > 0$, there exists $N_{0}(\varepsilon)$ such that, for all $N \ge N_{0}(\varepsilon)$, with probability at least $1 - \varepsilon$:

(259) $\begin{matrix} M \cdot 1_{\mathcal{G}} \log_{2} N \le \mathrm{len}(LZ(X^{m,n})) \le (1+\varepsilon) M \cdot 1_{\mathcal{G}} \log_{2} N + N \cdot 1_{\mathcal{G}^{c}} \log_{2} |\mathcal{X}| & \text{(Eq. D.17)} \end{matrix}$
where on the right-hand side the bound $\mathrm{len}(LZ(X^{m,n})) \le N \log_{2} |\mathcal{X}|$ was used, which holds by construction. It follows:

(260) $\begin{matrix} \mathbb{E}\{M \cdot 1_{\mathcal{G}}\} \frac{\log_{2} N}{N \log_{2} |\mathcal{X}|} \le \mathbb{E} R_{LZ}(X^{m,n}) \le (1+\varepsilon)\, \mathbb{E}\{M \cdot 1_{\mathcal{G}}\} \frac{\log_{2} N}{N \log_{2} |\mathcal{X}|} + \varepsilon & \text{(Eq. D.18)} \end{matrix}$
that is,

(261) $\begin{matrix} \liminf_{m, n \to \infty} \mathbb{E} R_{LZ}(X^{m,n}) \ge \liminf_{m, n \to \infty} \mathbb{E}\{M \cdot 1_{\mathcal{G}}\} \cdot \frac{\log_{2} N}{N \log_{2} |\mathcal{X}|} & \text{(Eq. D.19)} \end{matrix}$ $\begin{matrix} \limsup_{m, n \to \infty} \mathbb{E} R_{LZ}(X^{m,n}) \le \limsup_{m, n \to \infty} \mathbb{E}\{M \cdot 1_{\mathcal{G}}\} \cdot \frac{\log_{2} N}{N \log_{2} |\mathcal{X}|} & \text{(Eq. D.20)} \end{matrix}$

(262) It remains to bound $\mathbb{E}\{M \cdot 1_{\mathcal{G}}\}$.

(263) Begin by the lower bound. Define the set of bad indices B( custom character , )[][],

(264) 05 $\begin{matrix} B (X^{m, n},) := {(i, j) [m] [n] :_{i j} ((, u_{i})) or_{i, j} ((, u_{i}))} & (Eq . D .21) \end{matrix}$

(265) The method drops the arguments custom character , for economy of notation and writes B:=B(, ). It further defines:

(266) $\begin{matrix} S(u) = S(u; X^{m,n}) := \{(i, j) \in [m] \times [n] : u_{i} = u \text{ and } \exists \ell \le M : \langle ij \rangle = k(\ell)\} & \text{(Eq. D.22)} \end{matrix}$
where $S(u)$ is the set of positions (i, j) of the table, in rows with row latent $u$, where words in the LZ parsing begin.

(267) Also write $N(u) = n \cdot |\{i \in [m] : u_{i} = u\}|$ for the total number of entries in rows with row latent equal to $u$, and $L_{i}^{-}$ for the length of the first segment in row i initiated in row i−1:

(268) 07 $N (u) \underset{(i, j) S (u)}{.Math.} L_{.Math. i j .Math.} + \underset{i m : u_{i} = u}{.Math.} L_{i}^{-} \underset{(i, j) S (u) B^{c}}{.Math.} L_{.Math. i j .Math.} + \underset{(i, j)}{.Math.} L_{.Math. i j .Math.} + \underset{i m : u_{i = u}}{.Math.} L_{i}^{-} \underset{(i, j) S (u)}{.Math.} (u;)_{c} (u;) + (.Math. B .Math. + m) .Math. C \log N .Math. S (u) .Math. (u;)_{c} (u;) + (.Math. B .Math. + m) .Math. C \log N,$
where the last inequality holds on event G. By taking expectation on this event, it gets:

(269) 08 ${N (u) .Math. 1_{}} {.Math. S (u) .Math. .Math. 1_{}}} .Math. (u;)_{c} (u;) + (.Math. B .Math. + m) .Math. C \log N$
By Lemmas D.2 and D.3,

(270) 09 $\begin{matrix} (.Math. B .Math.) m_{0} n + \underset{m_{0} i m, j n^{}}{.Math.} (_{i, j} ((, u_{i}))_{i, j} (_{c} ((, u_{i}))) + C_{m} \log N m_{o} n + C m^{1 -} n + Cm \log N \frac{C N}{{(\log N)}^{2}} (.Math. B .Math.) {Cm}^{1 -} n + Cm \log n . & (Eq . D . 23) \end{matrix}$ $Further {N (u)} = N r (u) and {N (u) .Math. 1_{}} {N (u)} - N (^{C}),$ $whence \lim \inf_{m, n .fwdarw.} \frac{1}{N} {.Math. S (u) .Math. .Math. 1_{}} .Math. (u;)_{c} (u;) r (u)$

(271) Recalling the definitions of $\ell(u; \varepsilon)$ and $\ell_{c}(u; \varepsilon)$ and the fact that $\varepsilon$ is arbitrary, the last inequality yields:

(272) 0 $\begin{matrix} \lim \inf_{m, n .fwdarw.} {.Math. S (u) .Math. 1_{}} \frac{\log_{2} N}{N} r (u) [H (X .Math. U = u) (\frac{1 +}{}) H (X .Math. U = u, V)] & (Eq . D .24) \end{matrix}$
where summing over custom character , noting that |S()|=M, and substituting in Eq. (D.19) yields the lower bound on the rate in Eq. (5.10).

(273) Finally, the upper bound is proved by a similar strategy as for the lower bound. Define the set of bad indices B_=B_( custom character , )[][],

(274) $\begin{matrix} B_{-} (X^{m, n},) := {(i, j) [m] [n] :_{i, j}^{c} ((-, u_{i})) or_{i, j}^{c} (_{c} (-, u_{i}))} & (Eq . D .25) \end{matrix}$

(275) Denote also by $L_{i}^{+}$ the length of the last segment in row i. Then:

(276) $N (u) \underset{(i, j) S (u)}{.Math.} L_{.Math. i j .Math.} - \underset{i m : u_{i} = (u)}{.Math.} L_{i}^{+} \underset{(i, j) S (u) B^{\underline{c}}}{.Math.} L_{.Math. i j .Math.} - \underset{i m : u_{i} = (u)}{.Math.} L_{i}^{+} \underset{(i, j) S (u) B^{\underline{c}}}{.Math.} (u; -)_{c} (u; -) - \underset{i m : u_{i} = (u)}{.Math.} L_{i}^{+} .Math. S (u) .Math. (u;)_{c} (u;) - (.Math. B_{-} .Math. + m) .Math. C \log N$
where the last inequality holds on event G. By taking expectation on this event, it gets:

(277) ${N (u) .Math. 1_{}} {.Math. S (u) .Math. .Math. 1_{}}} .Math. (-, u)_{c} (-, u) - (.Math. B_{-} .Math. + m) .Math. C \log N$

(278) By Lemmas D.2 and D.3,

(279) $(.Math. B_{-} .Math.) m_{0} n + \underset{m o i m, j n^{}}{.Math.} (_{i, j}^{c} ((-, u_{i}))_{i, j}^{c} (_{c} (-, u_{i}))) + C_{m} \log N m_{0} n + {Cm}^{1 -_{n}} + C_{m} \log N \frac{C N}{{(\log N)}^{2}}$

(280) The proof is completed exactly as for the lower bound.

(281) Proof of Lemma D.2

(282) The following standard lemmas are used.

(283) Lemma D.4. Let X be a centered random variable with $\mathbb{P}(X \le x_{0}) = 1$, $x_{0} > 0$. Then, letting

(284) $c(x_{0}) = (e^{x_{0}} - 1 - x_{0}) / x_{0}^{2},$
the following holds:

(285) $\begin{matrix} \mathbb{E}(e^{X}) \le 1 + c(x_{0})\, \mathbb{E}(X^{2}) & \text{(Eq. D.26)} \end{matrix}$

(286) Proof. This follows from $\exp(x) \le 1 + x + c(x_{0}) x^{2}$ for $x \le x_{0}$, after taking expectations and using $\mathbb{E}(X) = 0$.
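Lemma D.4 can be sanity-checked numerically. The sketch below tests the pointwise inequality on a grid and the resulting moment bound for a small centered discrete distribution; the grid range, the tolerance, and the function names are illustrative assumptions rather than part of the lemma.

```python
import math
from typing import Dict


def c_of(x0: float) -> float:
    """The constant c(x0) = (e^{x0} - 1 - x0) / x0^2 of Lemma D.4."""
    return (math.exp(x0) - 1.0 - x0) / x0 ** 2


def pointwise_bound_holds(x0: float, lo: float = -20.0, steps: int = 2000) -> bool:
    """Check exp(x) <= 1 + x + c(x0) x^2 on a grid of points in [lo, x0]."""
    c = c_of(x0)
    for i in range(steps + 1):
        x = lo + (x0 - lo) * i / steps
        # Small tolerance: equality holds at x = x0 up to rounding.
        if math.exp(x) > 1.0 + x + c * x * x + 1e-9:
            return False
    return True


def mgf_bound_holds(dist: Dict[float, float]) -> bool:
    """Check E[e^X] <= 1 + c(x0) E[X^2] for a centered discrete X, x0 = max support."""
    mean = sum(x * p for x, p in dist.items())
    assert abs(mean) < 1e-12, "X must be centered"
    x0 = max(dist)
    mgf = sum(math.exp(x) * p for x, p in dist.items())
    second = sum(x * x * p for x, p in dist.items())
    return mgf <= 1.0 + c_of(x0) * second
```

For example, the two-point variable taking the value 1 with probability 1/3 and −0.5 with probability 2/3 is centered with $x_{0} = 1$, so `mgf_bound_holds({1.0: 1/3, -0.5: 2/3})` exercises Eq. (D.26) with $c(1) = e - 2$.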

(287) Lemma D.5. Let $(p_{i})_{i \ge 1}$, $(q_{i})_{i \ge 1}$ be probability distributions on $\mathcal{X}$ with $\sup_{i \ge 1} \max_{x \in \mathcal{X}} p_{i}(x) \le 1 - c$ and $\sup_{i \ge 1} \sum_{x \in \mathcal{X}} p_{i}(x) (\log p_{i}(x))^{2} \le C$ for constants c, C.

(288) Let $(X_{i})_{i \le \ell}$ be independent random variables with $X_{i} \sim p_{i}$, and set $X = (X_{1}, \dots, X_{\ell})$. Let $Y(j)$, $j \ge 1$, be a sequence of i.i.d. random vectors, with $(Y_{i}(j))_{i \le \ell}$ independent and $Y_{i}(j) \sim q_{i}$. Finally, let $T := \min\{t \ge 1 : Y(t) = X\}$.

(289) Then, for any $\varepsilon > 0$, there exists $\delta = \delta(\varepsilon, c, C) > 0$ such that, letting

(290) $\overline{H}(p) := \ell^{-1} \sum_{i=1}^{\ell} H(p_{i}),$
the following holds:

(291) $\begin{matrix} \mathbb{P}(T \le e^{\ell[\overline{H}(p) - \varepsilon]}) \le e^{-\delta \ell} & \text{(Eq. D.27)} \end{matrix}$

(292) Further, the same bound holds (with a different $\delta(\varepsilon, c, C)$) if the $(Y(j))_{j \ge 1}$ are independent but not identically distributed, provided there exist a finite set

(293) $(q_{i}^{a})_{i \ge 1, a \in [K]}, \quad K \le \ell^{C_{0}},$
and a map $b : \mathbb{N} \to [K]$ such that

(294) $Y(j) \sim q_{1}^{b(j)} \otimes \cdots \otimes q_{\ell}^{b(j)}.$

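(295) Proof. Denote by Y a vector distributed as each $Y(j)$. Conditional on $X = x$, T is a geometric random variable with mean $1/\mathbb{P}(Y = x)$. Hence, for $t_{\ell}(\varepsilon) := e^{\ell[\overline{H}(p) - \varepsilon]}$,

(296) $\mathbb{P}(T \le t_{\ell}(\varepsilon) \mid X = x) = 1 - (1 - \mathbb{P}(Y = x))^{t_{\ell}(\varepsilon)} \le t_{\ell}(\varepsilon)\, \mathbb{P}(Y = x)$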

(297) Hence,

(298) $\begin{matrix} (T t_{} ()) e^{- / 2} + ((Y = X .Math. X) {t_{} (/ 2)}^{- 1}) & (Eq . D .28) \end{matrix}$ $\begin{matrix} = e^{- / 2} + P_{} (/ 2) & (Eq . D .29) \end{matrix}$ $\begin{matrix} P_{} (u) := ({.Math.}_{i = 1}^{} \log \frac{1}{q_{i} (X_{i})} < {.Math.}_{i = 1}^{} H (p i) -) & (Eq . D .30) \end{matrix}$

(299) By Chernoff bound, for any 0, custom character ()exp{l(,)}, where:

(300) $\begin{matrix} (, u) := u - \frac{1}{} {.Math.}_{i = 1}^{} [H (p_{i}) + \log [{q_{i} (X_{i})}^{}] & (Eq . D .31) \end{matrix}$

(301) By Holder inequality, for [0, 1] we have custom character [q.sub.i(X.sub.i).sup.](.sub.xp(x).sup.).sup.1/ where =1(1). Therefore,

(302) $(; p) := H (p) + (1 -) \log (\underset{x}{.Math.} {p (x)}^{1 / (1 -)}) = (1 -) \log_{X ~ p} \exp (\frac{}{1 -} (\log p (X) + H (p)))$

(303) Consider the random variable

(304) $Z_{i} := \frac{}{1 -} (\log p_{i} (X_{i}) + H (p))$
where X.sub.iP.sub.i. Under the assumptions of the lemma, for [0, 1/2] it has custom character (Z.sub.i)=0:

(305) $\begin{matrix} Z_{i} \log (1 - c) + H (p) \log [.Math. .Math. (1 - c)] & (Eq . D .32) \end{matrix}$ $\begin{matrix} [Z_{i}^{2}] {(\frac{}{1 -})}^{2} {.Math.}_{x} p_{i} (x) {(\log p_{i} (x))}^{2} 4 C^{2} & (Eq . D .33) \end{matrix}$

(306) Using Lemma D.4, it gets:

(307) $\begin{matrix} (; p_{i}) = (1 -) \log e^{Z_{i}} & (Eq . D .34) \end{matrix}$ $\begin{matrix} (1 -) \log (1 + c_{0} (Z_{i}^{2})) & (Eq . D .35) \end{matrix}$ $\begin{matrix} \log (1 + c_{*}^{2}) & (Eq . D .36) \end{matrix}$
whence

(308) $(, u) u - \log (1 + c_{*}^{2})$

(309) By maximizing this expression over , we find that custom character (/2)exp(.sub.0()) which completes the proof for the case of i.i.d. vectors Y(j).

(310) The case of non-identically distributed vectors follows by a union bound over $a \in [K]$.

(311) Lemma D.6. Let $(p_{i})_{i \ge 1}$ be probability distributions on $\mathcal{X}$, with $\sup_{i \ge 1} \max_{x \in \mathcal{X}} p_{i}(x) \le 1 - c$ and $\sup_{i \ge 1} \sum_{x \in \mathcal{X}} p_{i}(x) (\log p_{i}(x))^{2} \le C$ for constants c, C.

(312) Let $(X_{i})_{i \le \ell}$ be independent random variables with $X_{i} \sim p_{i}$, and set $X = (X_{1}, \dots, X_{\ell})$. Let $Y(j)$, $j \ge 1$, be a sequence of i.i.d. copies of X. Finally, let $T = \min\{t \ge 1 : Y(t) = X\}$.

(313) Then, for any $\varepsilon > 0$, there exists $\delta = \delta(\varepsilon, c, C) > 0$ such that

(314) $\begin{matrix} (\text{letting } \overline{H}(p) := \ell^{-1} \sum_{i=1}^{\ell} H(p_{i})): \quad \mathbb{P}(T \ge e^{\ell[\overline{H}(p) + \varepsilon]}) \le e^{-\delta \ell} & \text{(Eq. D.37)} \end{matrix}$
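Lemmas D.5 (with $q_{i} = p_{i}$) and D.6 together say that the waiting time T is, on the exponential scale, close to $e^{\ell \overline{H}(p)}$. The following Monte Carlo sketch illustrates this for i.i.d. Bernoulli coordinates. It is an illustration under assumptions, not part of the proof: T is sampled from its conditional geometric law by inverse-CDF sampling instead of generating the copies Y(j), the integer rounding of T is ignored, and the function names are hypothetical.

```python
import math
import random


def log_waiting_time_rate(p: float, l: int, rng: random.Random) -> float:
    """Sample (1/l) * log T (natural log) when X has l i.i.d. Bernoulli(p) coordinates.

    Conditional on X = x, T is geometric with success probability P(Y = x).
    Assumes l is small enough that P(Y = x) does not underflow to zero.
    """
    log_prob = 0.0
    for _ in range(l):
        bit = rng.random() < p
        log_prob += math.log(p if bit else 1.0 - p)
    success = math.exp(log_prob)   # P(Y = x) for the sampled x
    u = 1.0 - rng.random()         # uniform on (0, 1]
    # Inverse-CDF sample of a geometric variable, kept real-valued and >= 1.
    t = max(1.0, math.log(u) / math.log1p(-success))
    return math.log(t) / l


def fraction_within(p: float, l: int, eps: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials with |(1/l) log T - H(p)| <= eps, with H in nats."""
    rng = random.Random(seed)
    h = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    hits = sum(abs(log_waiting_time_rate(p, l, rng) - h) <= eps for _ in range(trials))
    return hits / trials
```

With p = 0.3 and block length 400, nearly all sampled values of $\ell^{-1} \log T$ are expected to fall within 0.1 of the entropy, consistent with exponentially small probabilities for both tails.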

(315) Proof. The proof follows the same argument as for Lemma D.5. Denote by Y a vector distributed as Y(i). and define t.sub.l():= custom character

(316) 0 $(T t_{} () .Math. X = x) = {(1 - (Y = x))}^{t ()} \exp (- t_{l ()} (Y = x))$

(317) Hence,

(318) $\begin{matrix} (T t_{} ()) \exp {e^{/ 2}} + ((Y = X .Math. X) {t_{} (/ 2)}^{- 1}) & (Eq . D .38) \end{matrix}$ $\begin{matrix} e^{/ 2} + {\tilde{P}}_{} (/ 2) & (Eq . D .39) \end{matrix}$ $\begin{matrix} {\tilde{P}}_{} (u) := ({.Math.}_{i = 1}^{} \log \frac{1}{p_{i} (X_{i})} {.Math.}_{i = 1}^{} H (p_{i}) + u) & (Eq . D .40) \end{matrix}$

(319) It claims that, for each custom character >0, (u)for some .sub.0(u)>0. Using again Chernoff's bound, it gets, for any >0, (u)e.sup.l(,u), where:

(320) $\begin{matrix} \tilde{} (, u) := u - \frac{1}{} {.Math.}_{i = 1}^{} \tilde{} (; p_{i}) & (Eq . D .41) \end{matrix}$ $\begin{matrix} \tilde{} (; p_{i}) := \log \exp (W_{i}), W_{i} := \log \frac{1}{p_{i} (X_{i})} - H (p_{i}) & (Eq . D .42) \end{matrix}$
where in the last line X.sub.iP.sub.i. Under the assumptions of the lemma, W.sub.iC almost surely and applying again Lemma D.4, it gets {tilde over ()}(; p.sub.i)log(1+c,.sup.2) for 1. The proof is completed by selecting for each u>0, >0 so that ulog (1+c,.sup.2>0.

(321) Lemma D.2 can now be proved.

(322) Proof of Lemma D.2. The proof begins with the bound (D.8).

(323) Fix i custom character , j, u, >0, and write =(, u.sub.i). Define R.sub.ij:={i,j): max(1, j1)jj1} and S.sub.ij={, j): {<i or i=i, j<jl}. Finally, for t{0, . . . , l1}, let S.sub.ij(t)=S.sub.ij n {i, j): {ij>=t mod l}.

(324) By union bound,

(325) $(_{i, j} () .Math. u) A + {.Math.}_{t = 0}^{- 1} B (t) A := \underset{(rs) R_{ij}}{.Math.} (X_{.Math. rs .Math.}^{.Math. rs .Math. + - 1} = X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1} .Math. u) B (t) := ((r, s) S_{ij} (t) : X_{.Math. rs .Math.}^{.Math. rs .Math. + - 1} = X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1} .Math. u)$

(326) Now, by the bound of Eq. (D.4),

(327) $\begin{matrix} A . {(1 - c)}^{} C N^{-} & (Eq . D .43) \end{matrix}$
for suitable constants C,.sub..

(328) Next, for any t{0, . . . , l1}, the vectors

(329) ${X_{.Math. rs .Math.}^{.Math. rs .Math. + - 1}}$
are mutually independent and independent of

(330) ${X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1}} .$
Conditional on u, the coordinates of

(331) $X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1} = (x_{.Math. rs .Math.}, .Math., X_{.Math. rs .Math. + - 1})$
are independent with marginal distributions custom character (note that independence of the coordinates holds because l<m/2 and therefore

(332) $X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1}$
does not include two entries in the same column). Note that the collection of marginal distributions custom character (.Math.|), usatisfies the conditions of Lemma D.5 by assumption. Further, the vector

(333) $X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1}$
can have one of at most $K = |\mathcal{L}|^{2}(\ell+1)$ distributions (depending on the values of the latents and the occurrence of a line break in the block).

(334) Applying Lemma D.5, it obtains:

(335) 0 $\begin{matrix} B (t) e^{- 0^{}} C N^{-} & (Eq . D .44) \end{matrix}$

(336) Summing over t{0, . . . , custom character 1} and adjusting the constants yields the claim (D.8).

(337) Next consider the bound (D.9). Fix u custom character , i, j, and write =(, ) for brevity below:

(338) $\begin{matrix} (_{i, j}^{C} () .Math. u) ((i^{}, j^{}) S_{i j} (t) s . t . u_{i^{}} = u_{i}, j^{} < n^{} : X_{.Math. i^{} j^{} .Math.}^{.Math. i j .Math. + - 1} X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1} .Math. u) & (Eq . D .45) \end{matrix}$

(339) Here, t{0, . . . , custom character 1} can be chosen arbitrarily. Let S.sub.ij (t;u)={i, j) S.sub.ij(t)s. t. =, j<n} Conditional on u, the vectors

(340) ${(X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1})}_{(i, j) S_{i j} (t; u)}$
are i.i.d. and independent of

(341) $X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1} .$
Further, they are distributed as

(342) $X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1}$
Finally,

(343) $N_{i j} (u) := .Math. S_{i j} (t; u) .Math. \frac{n (m_{i} (u) - C \log N}{} \frac{m_{i} (u) n}{C \log N} - C^{} n$
where custom character (u) is the number rows i<i such that =u. Since ii and (=u)()>0, by Chernoff bound there exist constants C, c.sub.0 such that, for all m, n large enough (since im.sub.0):

(344) $\begin{matrix} (N_{i j} (u) \frac{c_{0} m_{0} n}{\log N}) 1 - C e^{- m_{0 / C}} & (Eq . D .46) \end{matrix}$

(345) Further, for any >0 we can choose positive constants .sub.0, .sub.1>0 such that the following holds for all custom character , , large enough:

(346) $\begin{matrix} \frac{c_{0} m_{0} n}{\log N} N^{1 -_{1}} e^{[H (X | U = u_{i)} +_{0}]} & (Eq . D .47) \end{matrix}$

(347) Let T.sub.ij be the rank of the first (i, j) in the set defined above such that

(348) $X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1} = X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1},$
and $T_{ij} = \infty$ if no such vector exists. Continuing from Eq. (D.45):

(349) $(_{i,}^{c} ()) (T_{i j} N_{i j} (u)) (T_{i j} N_{i j} (u); N_{i j} (u) e^{[H (X | U = u_{i)} +_{0}]}) + C e^{- m_{0} / C} (a) \exp {-_{0} \min_{u} ((-; u))} + C e^{- m_{0} / C} {CN}^{-}$
where in () it used Lemma D.6. This completes the proof of Eq. (D.9).
Proof of Lemma D.3

(350) It begins by considering the bound (D.10).

(351) Fix i custom character , j, , v, >0, and write =(, ), . By union bound:

(352) 0 $(_{i, j} () | u,) = ({.Math.}_{s [n], | j - s | <} B (s) | u,) B (s) := {r < i : X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1} = X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1}}$

(353) Note that for a fixed s, and conditional on u, v, the vectors

(354) ${(X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1})}_{1 s i - 1}$
are mutually independent and independent of

(355) $X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1} .$
Further,

(356) $X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1}$
has independent coordinates with marginals custom character Qx|(.Math.|v.sub.s, (recall that we are conditioning both on and v). In particular, the marginal distributions satisfy the assumption of Lemma D.5 and the law of

(357) $X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1}$
can take one of $K = |\mathcal{L}|^{2}(\ell+1)$ possible values. Letting $i - T(s)$ denote the last row at which

(358) $X_{.Math. r s .Math.}^{.Math. r s .Math. + - 1} = X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1}$
(with $T(s) = \infty$ if no such row exists), it follows that, for some constants $C, c_{0} > 0$,

(359) $(_{i, j} () .Math. u, v) ({.Math.}_{s [n], | j - s | <} {T (s) i - 1} {i - 1 e^{[\overset{}{H} - E]}} u, v) + 1 (i - 1 e^{[\overset{}{H} - E]}) \overset{(a)}{} 2 e^{-} + 1 (m > {e^{}}^{[\overset{}{H} -]}) {Cm}^{- c_{0}} + 1 (m > e^{[\overset{}{H} -]})$
where in (a) it used Lemma D.5, and it defined

(360) $\overset{}{H} :=^{- 1} {.Math.}_{k = j}^{j + - 1} H (X | U = u_{i}, V = v_{k} .$

(361) Taking expectation with respect to v, it gets:

(362) $(_{i, j} () | u) C m^{- c_{0}} + (\frac{1}{} {.Math.}_{k = j}^{j + - 1} H (X | U = u_{i}, V = v_{k}) < \frac{1}{1 +} (H (X | U = u_{i}, V) +)) \overset{(a)}{} {Cm}^{- c_{0}} + e^{-} C^{} m^{- c_{0}}$
where in () it used Chernoff bound. This completes the proof of Eq. (D.10).

(363) Finally, the proof of Eq. (D.11) is similar to that of Eq. (D.9). Fix $u$, i, j, and write $\ell = \ell_{c}(-\varepsilon, u_{i})$.

(364) $\begin{matrix} (_{i, j}^{c} () | u) (_{i}^{} < {iu}_{i^{}} = u_{i} : X_{.Math. i^{} j .Math.}^{.Math. i^{} j .Math. + - 1} X_{.Math. ij .Math.}^{.Math. ij .Math. + - 1} | u) & (Eq . D . 48) \end{matrix}$

(365) Let

(366) 0 $S_{i j}^{c} (u) := {(i^{}, j) s . t . u_{i^{}}, = u_{i}, i^{} < i} .$

(367) Conditional on custom character , v, the vectors

(368) ${(X_{.Math. i^{} j^{} .Math.}^{.Math. i^{} j^{} .Math. + - 1})}_{(i^{}, j^{}) S_{ij}^{c} (u)}$
are i.i.d. and independent copies of

(369) $X_{.Math. ij .Math.}^{.Math. ij .Math. + - 1} .$
Finally,

(370) $N_{i}^{c} (u) := .Math. S_{i j}^{c} (u) .Math.$

(371) is the number of rows $i' < i$ such that $u_{i'} = u_{i}$. By the Chernoff bound, there exist constants $C$, $c_{0}$ such that, for all m, n large enough (recalling that only $i \ge m_{0}$ needs to be considered):

(372) $\begin{matrix} (N_{i}^{c} (u) c_{0} m_{0}) 1 - C e^{- m 0 / C} & (Eq . D .49) \end{matrix}$

(373) Since m.sub.0m.sup.1-on(1), for any >0 it can choose constants .sub.0, custom character .sub.1>0 so that:

(374) $\begin{matrix} c_{0} m_{0} m^{1 -_{1}} e^{[H (X | U = u_{i}, V) + 2_{0}]} & (Eq . D .50) \end{matrix}$

(375) Recall the definition

(376) $\overset{}{H} :=^{- 1} {.Math.}_{k = j}^{j + - 1} H (X | U = u_{i}, V = v_{k}) .$
By an application of the Chernoff bound:

(377) $\begin{matrix} (N_{i}^{c} (u) e^{[H (X .Math. U = u_{i,} V) +_{0}]}) 1 - C m^{-} - {Ce}^{- m 0 / C} & (Eq . D .51) \end{matrix}$

(378) Let T.sub.i be the rank of the first i in the set defined above such that

(379) $X_{.Math. i^{} j .Math.}^{.Math. i^{} j .Math. + - 1} = X_{.Math. i j .Math.}^{.Math. i j .Math. + - 1},$
and $T_{i} = \infty$ if no such vector exists. From Eq. (D.48), it follows that:

(380) $(_{ij}^{c} ()) (T_{i} N_{i} (u)) (T_{i} N_{i} (u); N_{i} (u) e^{[H (X .Math. U = u_{i,} V) +_{0}]}) + C m^{-} (a) \exp {-_{u}^{\min} (_{c} (-; u))} + C m^{-} 2 C m^{-}$
where in (a) it used Lemma D.6.
E Proofs for Latent-Based Encoders
Proof of Lemma 5.4

(381) General bound (5.15)

(382) Define the ideal expected compression rate (e.g., the rate achieved by a compressor that is given the latents):

(383) $R_{\#} := \frac{1}{mn \log_{2} |\mathcal{X}|} \left\{ \mathbb{E}[\mathrm{len}(\mathrm{header})] + \mathbb{E}[\mathrm{len}(Z(u))] + \mathbb{E}[\mathrm{len}(Z(v))] + \sum_{u, v \in \mathcal{L}} \mathbb{E}[\mathrm{len}(Z(X(u, v)))] \right\}$

(384) Since R.sub.lat(X)1 by construction, it has:

(385) $R_{lat} (X) {R_{lat} (X) 1_{{\hat{A}}_{U (X; \hat{u}) = 1}} 1_{{\hat{A}}_{V (X; \hat{v}) = 1}}} + ({\hat{A}}_{U} (X, \hat{u}) < 1) + (A_{V} (X; v) < 1) (*) R_{#} + ({\hat{A}}_{V} (X; v) < 1)$
where in step (*) we bounded custom character [len(Z())]=[len(Z())][len(Z())], because, on the event {.sub.U(X;)=1}, coincides with up to relabelings, and the compressed length is invariant under relabelings. Similar arguments were applied to len(Z(v)) and len (Z(X(, v))).

(386) It follows, by the definition of .sub.z(N; k) in Eq. (5.19),

(387) $\begin{matrix} \frac{[len (Z (u))]}{mn \log_{2} .Math. .Math.} \frac{H (U)}{n \log_{2} .Math. .Math.} + + \frac{1}{n}_{z} (m n; {r, c}) & (Eq . E .1) \end{matrix}$ $\begin{matrix} \frac{[len (Z (v))]}{mn \log_{2} .Math. .Math.} \frac{H (V)}{n \log_{2} .Math. .Math.} + + \frac{1}{m}_{z} (m n; {r, c}) & (Eq . E .2) \end{matrix}$ $\begin{matrix} \frac{[len (Z (x (u, v))) .Math. u, v]}{mn \log_{2} .Math. .Math.} \hat{r} (u) \hat{c} (v) \frac{H (X .Math. U = u, V = v)}{\log_{2} .Math. .Math.} +_{z} (c .Math. mn; {Q (.Math. .Math. u, v)}_{i, v}) & (Eq . E .3) \end{matrix}$
where in the last line is the empirical distribution of the row latents and is the empirical distribution of the column latents. By taking expectation in the last expression, it gets:

(388) $\begin{matrix} {.Math.}_{u, v} \frac{[len (Z (x (u, v))) .Math. u, v]}{mn \log_{2} .Math. .Math.} \frac{H (X .Math. U, V)}{\log_{2} .Math. .Math.} + {.Math. .Math.}^{2}_{2} (c .Math. mn; {Q (.Math. u, v)}_{u, v}) & (Eq . E .4) \end{matrix}$

(389) Finally, the header contains | custom character |.sup.2+2 integers of maximum size , whence len(header)4 log.sub.2(mn). It concludes that:

(390) $R_{#} \frac{1}{\log_{2} .Math. x .Math.} {H (X .Math. U, V) + \frac{1}{n} H (U) + \frac{1}{n} H (V)} + \frac{2 \log_{2} (mn)}{mn} + {.Math. .Math.}^{2}_{Z} (c .Math. mn; {Q (.Math. .Math. u, v)}_{u, v}) + 2_{Z} (m n; {r, c})$

(391) The claim (5.15) follows from the first bound in Eq. (5.2) noticing that, under the stated assumptions on custom character , ,

(392) $\begin{matrix} \frac{1}{n} [h (_{U}) +_{u} \log (.Math. .Math. - 1)]_{U} ({\hat{A}}_{U} (X^{m, n}; \hat{u}) < 1) & (Eq . E .5) \end{matrix}$

(393) Overheads of specific encoders: Eqs. (5.16)-(5.5).

(394) LZ coding. Let $X^{N} = (X_{1}, \dots, X_{N})$ be a vector with i.i.d. symbols $X_{i} \sim q$, with q a probability distribution over $\mathcal{X}$. There are two important differences with the analysis above: the data are i.i.d. (not matrix-structured), and a sharper estimate is desired (bounding the overhead as well, not just the entropy term).

(395) With $L_{k}(X^{N})$ and $T_{k}(X^{N})$ defined as in Eqs. (D.1) and (D.2), let $(k(1), \dots, k(M))$ be the values taken by k in the while loop of the Lempel-Ziv pseudocode. In particular,

(396) $\begin{matrix} k(1) = 1 & \text{(Eq. E.6)} \end{matrix}$ $\begin{matrix} k(\ell+1) = k(\ell) + L_{k(\ell)}(X^{N}) & \text{(Eq. E.7)} \end{matrix}$ $\begin{matrix} k(M) = N & \text{(Eq. E.8)} \end{matrix}$

(397) Therefore, the total length of the code is:

(398) $\begin{matrix} \mathrm{len}(LZ(X^{N})) = M \lceil \log_{2}(N + |\mathcal{X}|) \rceil + \sum_{\ell=1}^{M} \mathrm{len}(\mathrm{elias}(L_{k(\ell)})) \le M \lceil \log_{2}(N + |\mathcal{X}|) \rceil + 2 \sum_{\ell=1}^{M} \log_{2}(L_{k(\ell)}) & \text{(Eq. E.9)} \end{matrix}$ $\begin{matrix} \le M \lceil \log_{2}(N + |\mathcal{X}|) \rceil + 2M \log_{2}(N/M) & \text{(Eq. E.10)} \end{matrix}$
where the last step follows by Jensen's inequality. By one more application of Jensen's inequality,

(399) $\begin{matrix} \mathbb{E} R_{LZ}(X^{N}) \le \frac{1}{\log_{2} |\mathcal{X}|} \cdot \frac{\mathbb{E} M}{N} \cdot \left\{ \log_{2}(N + |\mathcal{X}|) + 2 \log_{2}(N / \mathbb{E} M) \right\} & \text{(Eq. E.11)} \end{matrix}$
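The Jensen step behind Eq. (E.10) only uses the concavity of the logarithm: for phrase lengths summing to N, the sum of their base-2 logarithms is at most $M \log_{2}(N/M)$. The short sketch below checks this on concrete phrase lengths; the function name `jensen_gap` is a hypothetical helper.

```python
import math
from typing import Sequence


def jensen_gap(lengths: Sequence[int]) -> float:
    """Return M * log2(N / M) - sum(log2 L_j), which is >= 0 by concavity of log.

    `lengths` are positive phrase lengths; N is their sum and M their count.
    """
    m, n = len(lengths), sum(lengths)
    lhs = sum(math.log2(l) for l in lengths)
    rhs = m * math.log2(n / m)
    return rhs - lhs
```

The gap is zero when all phrases have the same length and grows as the lengths become more uneven, for example for lengths 1, 1, and 6.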

(400) For $\ell \in \mathbb{N}$, define the set of bad positions as:

(401) $\begin{matrix} B(\ell) := \{k \in [N] : L_{k}(X^{N}) < \ell\} & \text{(Eq. E.12)} \end{matrix}$
whence,

(402) $\begin{matrix} N = \sum_{j=1}^{M} L_{k(j)} \ge (M - |B(\ell)|)\, \ell & \text{(Eq. E.13)} \end{matrix}$
and therefore, for any $c \in (0, 1)$:

(403) $\begin{matrix} \frac{\mathbb{E} M}{N} \le \frac{1}{\ell} + \frac{1}{N} \sum_{k=1}^{N} \mathbb{P}(L_{k}(X^{N}) < \ell) & \text{(Eq. E.14)} \end{matrix}$ $\begin{matrix} \le \frac{1}{\ell} + N^{-1+c} + \frac{1}{N} \sum_{k=N^{c}}^{N} \mathbb{P}(L_{k}(X^{N}) < \ell) & \text{(Eq. E.15)} \end{matrix}$
Certain Definitions

(404) While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

(405) Whenever the term "at least," "greater than," or "greater than or equal to" precedes the first numerical value in a series of two or more numerical values, the term "at least," "greater than," or "greater than or equal to" applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.

(406) Whenever the term "no more than," "less than," or "less than or equal to" precedes the first numerical value in a series of two or more numerical values, the term "no more than," "less than," or "less than or equal to" applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.

(407) The present disclosure uses boldface for vectors and uppercase boldface for matrices, without making any typographic distinction between numbers and random variables. The dimensions of a matrix or a vector are indicated by superscripts. For example, $u^{m}$ is a vector of length m, and $X^{m,n}$ is a matrix of dimensions $m \times n$.

(408) If X, Y are random variables on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$, their entropies are denoted by H(X) and H(Y), H(X, Y) denotes their joint entropy, and H(X|Y) denotes the conditional entropy of X given Y.
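As an illustration of these quantities, the sketch below computes plug-in (empirical) estimates of H(X), H(X, Y), and H(X|Y) from a list of observed pairs, using the chain rule H(X|Y) = H(X, Y) − H(Y). Entropies are reported in bits, and the function names are hypothetical helpers; the disclosure defines the entropies of the underlying distributions, not these sample estimates.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def entropy(samples: Iterable[Hashable]) -> float:
    """Plug-in entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Plug-in H(X|Y) = H(X, Y) - H(Y) for observed pairs (x, y)."""
    pairs = list(pairs)
    joint = entropy(pairs)                  # H(X, Y)
    marginal = entropy(y for _, y in pairs)  # H(Y)
    return joint - marginal
```

When X is a deterministic function of Y the conditional entropy is zero, and when X and Y are independent fair bits it equals one bit.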

(409) Computer Systems

(410) The present disclosure provides computer systems that are programmed to implement methods of the disclosure. In some cases, the data objects to be processed may be stored in a cloud via services such as Amazon AWS or Google GCP. For instance, cloud compute instances obtained through the same providers may host the data reducers that process data according to the methods described herein. FIG. 5 shows a computer system 501 that is programmed or otherwise configured to implement the lossless compression algorithm as described herein. The computer system 501 can regulate various aspects of processing structured or semi-structured data. For example, the computer system may execute algorithms to: (i) estimate latent variables associated with rows and columns of the table; (ii) partition the table into blocks according to the row/column latents; (iii) apply a sequential encoder (e.g., Lempel-Ziv compression or entropy coding) to each of the blocks; and (iv) append a compressed encoding of the latents. The computer system 501 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
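Steps (i) through (iv) can be sketched end to end as follows. This is a toy illustration under several assumptions that are not taken from the disclosure: only row latents are used, the latent of a row is simply derived from its first cell (a hypothetical stand-in for a real estimator such as spectral clustering), `zlib` (DEFLATE) stands in for the sequential encoder, cells are assumed not to contain the separator characters, and fewer than 256 distinct latent values are assumed.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Table = List[List[str]]
CELL_SEP, ROW_SEP = "\x1f", "\x1e"  # assumed not to occur inside any cell


def estimate_row_latents(table: Table) -> List[int]:
    """Step (i), hypothetical stand-in: rows sharing a first cell share a latent."""
    labels: Dict[str, int] = {}
    return [labels.setdefault(row[0], len(labels)) for row in table]


def serialize(rows: Table) -> bytes:
    return ROW_SEP.join(CELL_SEP.join(row) for row in rows).encode("utf-8")


def deserialize(blob: bytes) -> Table:
    return [r.split(CELL_SEP) for r in blob.decode("utf-8").split(ROW_SEP)]


def compress_table(table: Table) -> Tuple[bytes, Dict[int, bytes]]:
    latents = estimate_row_latents(table)              # (i) estimate latents
    blocks: Dict[int, Table] = defaultdict(list)
    for row, u in zip(table, latents):                 # (ii) partition into blocks
        blocks[u].append(row)
    encoded = {u: zlib.compress(serialize(rows))       # (iii) encode each block
               for u, rows in blocks.items()}
    header = zlib.compress(bytes(latents))             # (iv) compressed latents
    return header, encoded


def decompress_table(header: bytes, encoded: Dict[int, bytes]) -> Table:
    latents = list(zlib.decompress(header))
    blocks = {u: deserialize(zlib.decompress(blob)) for u, blob in encoded.items()}
    cursor: Dict[int, int] = defaultdict(int)
    table: Table = []
    for u in latents:  # the latent sequence restores the original row order
        table.append(blocks[u][cursor[u]])
        cursor[u] += 1
    return table
```

Storing the latent sequence is what makes the scheme lossless in this sketch: the decoder uses it to interleave the decoded blocks back into the original row order.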

(411) The computer system 501 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 505, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 501 also includes memory or memory location 510 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 515 (e.g., hard disk), communication interface 520 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 525, such as cache, other memory, data storage and/or electronic display adapters. The memory 510, storage unit 515, interface 520 and peripheral devices 525 are in communication with the CPU 505 through a communication bus (solid lines), such as a motherboard. The storage unit 515 can be a data storage unit (or data repository) for storing data. The computer system 501 can be operatively coupled to a computer network ("network") 530 with the aid of the communication interface 520. The network 530 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 530 in some cases is a telecommunication and/or data network. The network 530 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 530, in some cases with the aid of the computer system 501, can implement a peer-to-peer network, which may enable devices coupled to the computer system 501 to behave as a client or a server.

(412) The CPU 505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 510. The instructions can be directed to the CPU 505, which can subsequently program or otherwise configure the CPU 505 to implement methods of the present disclosure. Examples of operations performed by the CPU 505 can include fetch, decode, execute, and writeback.

(413) The CPU 505 can be part of a circuit, such as an integrated circuit. One or more other components of the system 501 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

(414) The storage unit 515 can store files, such as drivers, libraries and saved programs. The storage unit 515 can store user data, e.g., user preferences and user programs. The computer system 501 in some cases can include one or more additional data storage units that are external to the computer system 501, such as located on a remote server that is in communication with the computer system 501 through an intranet or the Internet.

(415) The computer system 501 can communicate with one or more remote computer systems through the network 530. For instance, the computer system 501 can communicate with a remote computer system of a user (e.g., laptop). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple iPad, Samsung Galaxy Tab), telephones, smartphones (e.g., Apple iPhone, Android-enabled device, Blackberry), or personal digital assistants. The user can access the computer system 501 via the network 530.

(416) Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 501, such as, for example, on the memory 510 or electronic storage unit 515. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 505. In some cases, the code can be retrieved from the storage unit 515 and stored on the memory 510 for ready access by the processor 505. In some situations, the electronic storage unit 515 can be precluded, and machine-executable instructions are stored on memory 510.

(417) The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

(418) Aspects of the systems and methods provided herein, such as the computer system 501, can be embodied in programming. Various aspects of the technology may be thought of as products or articles of manufacture typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. Storage type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible storage media, terms such as computer or machine readable medium refer to any medium that participates in providing instructions to a processor for execution.

(419) Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

(420) The computer system 501 can include or be in communication with an electronic display 535 that comprises a user interface (UI) 540 for providing, for example, compression results and analytics. Examples of UIs include, without limitation, a graphical user interface (GUI) and a web-based user interface.

(421) Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 505.

(422) While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

(423) While preferred embodiments of the present subject matter have been shown and described herein, it can be understood that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the present subject matter. It can be understood that various alternatives to the embodiments of the present subject matter described herein may be employed in practicing the present subject matter.


Cpc classification: H03M7/6011 (ELECTRICITY); H03M7/3064 (ELECTRICITY)

International classification: H03M7/30 (ELECTRICITY)
