Systems and methods for model-free compression and model-based decompression

09543980 · 2017-01-10

Abstract

An encoder generates a compressed data sequence from an original data sequence using many-to-one mapping independently of a source model associated with the original data sequence and without extracting the source model. A decoder uses both the source model associated with the original data sequence and the mapping applied during compression that is devoid of, in substance, the source model, to regenerate, at least in part, the original uncompressed data sequence from the compressed data sequence that does not include a significant portion of the source model.

Claims

1. A computer implemented method of decompressing a compressed data sequence comprising a set of k symbols and being devoid of a significant portion of a source model associated with a data sequence from which the compressed data sequence was derived, the method comprising: generating by a processor an uncompressed data sequence comprising n symbols, wherein n is greater than k, by applying by the processor to the set of k symbols of the compressed data sequence: a plurality of constraints devoid of a significant portion of the source model; and at least one rule derived from the source model.

2. The method of claim 1, wherein the compressed data sequence was derived from the uncompressed data sequence by applying the plurality of constraints to the uncompressed data sequence.

3. The method of claim 2, wherein: the uncompressed data sequence comprises an encrypted data sequence; derivation of the compressed data sequence was performed without any information about any encryption key used in generating the encrypted data sequence; and the processor uses an encryption key used in generating the encrypted data sequence for generating the uncompressed data sequence.

4. The method of claim 1, wherein the plurality of constraints comprises at least one of a projection matrix corresponding to a low density parity check (LDPC) code; a linear code; a code on a graph comprising at least one of an LDPC code, a turbo code, a repeat-accumulate code, a convolutional code, and a trellis code; and a polar code.

5. The method of claim 1, wherein each constraint of the plurality of constraints is independent of any rule derived from the source model.

6. The method of claim 1, wherein generating the uncompressed data sequence comprises: (a) applying by the processor to at least one of the k symbols at least one constraint from the plurality of constraints, to obtain at least one intermediate result; and (b) applying by the processor the at least one rule to the at least one intermediate result.

7. The method of claim 6, wherein at least one of steps (a) and (b) is repeated at least once.

8. The method of claim 7, wherein, in one iteration, step (b) is performed before step (a).

9. The method of claim 1, wherein generating the uncompressed data sequence comprises: (i) applying by the processor to at least one of the k symbols at least one rule, to obtain at least one intermediate result; and (ii) applying by the processor at least one constraint from the plurality of constraints to the at least one intermediate result.

10. The method of claim 9, wherein at least one of steps (i) and (ii) is repeated at least once.

11. The method of claim 10, wherein, in one iteration, step (ii) is performed before step (i).

12. The method of claim 1, wherein: the source model is partially unknown and is parameterized; and generating the uncompressed data sequence comprises iteratively applying the plurality of constraints and at least one rule, and iteratively refining at least one parameter of the source model.

13. The method of claim 1, wherein the source model is represented as at least one of a graph, a codebook, an enumeration, a list of examples, a list of features, an algorithm, and a data structure.

14. The method of claim 1, wherein the source model is partially unknown and is parameterized, the method further comprising: representing the source model as a graph; and augmenting the graph using one or more parameters of the source model.

15. The method of claim 1, wherein generating the uncompressed data sequence comprises representing the n symbols of the uncompressed data sequence in a binary form.

16. The method of claim 1, wherein: the compressed data sequence is completely devoid of the source model; and the plurality of constraints are completely devoid of the source model.

17. The method of claim 1, further comprising applying by the processor to the set of k symbols of the compressed data sequence another plurality of constraints representing a quantization of the set of n symbols according to at least one of a distortion measure and semantic information.

18. A computer implemented method of compressing a data sequence comprising a set of n symbols, the method comprising: receiving by a processor a data sequence comprising a set of n symbols and corresponding to a source model; and mapping by the processor the set of n symbols into a set of k symbols, wherein k is less than n, the mapping being independent of the source model and being performed without extracting the source model from the data sequence.

19. The method of claim 18, wherein the set of k symbols is devoid of a significant portion of the source model.

20. The method of claim 18, wherein the mapping comprises applying to the data sequence at least one of a projection matrix corresponding to a low density parity check (LDPC) code; a linear code; a code on a graph comprising at least one of an LDPC code, a turbo code, a repeat-accumulate code, a convolutional code, and a trellis code; and a polar code.

21. The method of claim 18, wherein: the data sequence comprises an encrypted data sequence; and the mapping is performed without any information about any encryption key used in generating the encrypted data sequence.

22. The method of claim 18, further comprising quantizing the set of n symbols according to at least one of a distortion measure and semantic information, prior to the mapping.

23. The method of claim 18, wherein the source model is represented as at least one of a graph, a codebook, an enumeration, a list of examples, a list of features, an algorithm, and a data structure.

24. A system for decompressing a compressed data sequence comprising a set of k symbols and being devoid of a significant portion of a source model associated with a data sequence from which the compressed data sequence was derived, the system comprising: a first processor; and a first memory in electrical communication with the first processor, the first memory comprising instructions which, when executed by a processing unit comprising at least one of the first processor and a second processor, program the processing unit to: generate, from the compressed data sequence, an uncompressed data sequence comprising n symbols, where n is greater than k, by applying to the set of k symbols of the compressed data sequence: a plurality of constraints devoid of a significant portion of the source model; and at least one rule derived from the source model.

25. The system of claim 24, wherein generation of the uncompressed data sequence comprises at least one of an accumulation operation and a filtration operation, and the processing unit is configured to provide atomically at least one of the accumulation operation and the filtration operation.

26. The system of claim 24, wherein: the compressed data sequence is completely devoid of the source model; and the plurality of constraints are completely devoid of the source model.

27. A system for compressing a data sequence comprising a set of n symbols, the system comprising: a first processor; and a first memory in electrical communication with the first processor, the first memory comprising instructions which, when executed by a processing unit comprising at least one of the first processor and a second processor, program the processing unit to: map the set of n symbols into a set of k symbols, where k is less than n, the mapping being independent of a source model corresponding to the data sequence and being performed without extracting the source model from the data sequence.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the present invention are described with reference to the following drawings, in which:

(2) FIG. 1 shows a table listing various notations used throughout this disclosure;

(3) FIG. 2 schematically depicts an encoder-decoder system, according to one embodiment;

(4) FIG. 3A schematically depicts a decoder, according to one embodiment;

(5) FIG. 3B schematically depicts the decoder depicted in FIG. 3A where the source graph is depicted as a factor graph, according to one embodiment;

(6) FIG. 3C schematically depicts a particular implementation of the decoder depicted in FIGS. 3A and 3B, according to one embodiment;

(7) FIGS. 3D-3G schematically depict message passing in the decoder depicted in FIG. 3B, according to various embodiments;

(8) FIG. 4 shows a table listing various computations performed by the decoder depicted in FIG. 3C;

(9) FIG. 5 schematically depicts a decoder for uncompressing a compressed data sequence derived from a non-binary source, according to one embodiment;

(10) FIG. 6 schematically depicts a decoder for uncompressing a compressed encrypted data sequence, according to one embodiment;

(11) FIG. 7 is a flow chart depicting decompression of a compressed data sequence, according to some embodiments; and

(12) FIG. 8 is a flow chart depicting compression of an uncompressed data sequence, according to some embodiments.

DETAILED DESCRIPTION

(13) The table depicted in FIG. 1 describes the notation used generally in this disclosure. With reference to FIG. 2, an encoder-decoder system 200 includes a model-free encoder 202 that can compress an input data sequence s^n and output a compressed data sequence x^k. The encoder 202 can be implemented by one or more hardware processors and/or as one or more software components. The encoder 202 does not receive in substance a source model for the data sequence s^n, and does not derive a substantial portion, if any, of such a model during the process of encoding (as typical universal encoders do). As a consequence, the compressed data sequence x^k does not include a substantial portion of the source model. As used herein, a substantial portion generally means information that can identify the class of possible data, highly likely data, or data of compression interest, e.g., an exponentially smaller set of symbol sequences than the set of all possible n-symbol sequences.

(14) The compressed data sequence x^k is typically transmitted wirelessly, via a cable (electrically conductive and/or optical), and/or using a non-transitory medium such as flash memory, compact disc, hard drive, solid-state drive, etc., to a decoder 204. The decoder 204 can be implemented by one or more hardware processors and/or as one or more software components. The decoder 204 receives a source model (also called a data model) p_{s^n} and generates the original input sequence s^n. A source model can be represented as a graph, a codebook, an enumeration, a list of examples, a list of features, an algorithm, or a data structure, or as a combination of two or more of these representations.

(15) In one embodiment, all data objects, compressed and uncompressed, are based on a finite-field alphabet S = GF(q). Let s^n ∈ S^n be an n-vector source data sequence, presumed to be a random sample of the source. Such a sequence can be compressed using a code: a collection ℋ(n, k) of k×n projection matrices of a rate-k/n linear source code ensemble, where k < n. Examples of such ensembles include, but are not limited to, the low-density parity check (LDPC) code ensemble. The decompression (also called decoding) requires a data model: p_{s^n} in some factored form or in a probabilistic graphical model (PGM) representation. In some embodiments, the PGM is a pairwise model.

(16) With reference to FIG. 8, to encode, i.e., compress data, k is set to target r_code = k/n as the nominal compression rate. If LDPC codes are used for compression, a projection matrix H is selected from the collection ℋ(n, k) as the encoding matrix. For an input data sequence s^n of length n, a bin label sequence x^k of length k, where k < n, is produced by the encoder 202 as x^k = H s^n. The sequence x^k is the compressed output data. In some embodiments, the encoder may produce additional outputs such as values of doped symbols, doping locations, and/or a random seed used to select H from the collection ℋ(n, k). Instead of applying a projection matrix, encoding, i.e., deriving k symbols from a sequence of n symbols, may be performed using binning, hashing, and/or decimation.
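As a toy illustration, the bin-labeling step x^k = H s^n over GF(2) can be sketched as follows; the sparse matrix H, the sequence s, and the rate are invented for the example and merely stand in for a matrix drawn from ℋ(n, k):

```python
import numpy as np

def model_free_encode(s, H):
    """Map n source symbols to k bin labels using only the projection
    matrix H; no source model is consulted or extracted."""
    return (H @ s) % 2

rng = np.random.default_rng(0)
n, k = 8, 4                                  # nominal rate r_code = k/n = 1/2
H = (rng.random((k, n)) < 0.4).astype(int)   # sparse 0/1 projection matrix
s = rng.integers(0, 2, size=n)               # uncompressed sequence s^n
x = model_free_encode(s, H)                  # compressed bin labels x^k
print(x.shape)                               # (4,)
```

The same call works unchanged whatever distribution produced s, which is the point of model-free encoding: only H determines the mapping.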

(17) With reference to FIG. 7, when the compressed data x^k is received at a receiver, it may be decoded by the decoder 204. To this end, the decoder 204 utilizes a code subgraph. Because H enforces k hard constraints of the form x_a = ∑_{i=1}^{n} H_{a,i} s_i, the constraints can be represented as a bipartite factor graph 𝒞 = (ℱ, 𝒮, ℰ), where there are k factor nodes ℱ ≜ {f_1, . . . , f_k} and n source nodes 𝒮 ≜ {s_1, . . . , s_n}. There is an edge between factor node f_a and source node s_i if and only if H_{a,i} ≠ 0, forming neighborhoods denoted ∂a ≜ {i : H_{a,i} ≠ 0}. The algorithmic meaning of 𝒞 is observed from the hash constraint function:

(18) c(s^n) = ∏_{a=1}^{k} f_a(s_{∂a}) = ∏_{a=1}^{k} 1{x_a = ∑_{i: H_{a,i} ≠ 0} H_{a,i} s_i}(s_{∂a})
where each factor is an indicator on one hard constraint, and c(s^n) = 1 if and only if all constraints are satisfied. c(s^n) is an (unnormalized) probability distribution, and various PGM algorithms can be applied thereto. In various embodiments, iterative local decoding of the linear source code with H is substantially the same as approximate marginalization of c(s^n) on 𝒞, which is called the code subgraph. In general, the constraints may represent linear codes, including LDPC codes; codes on graphs such as LDPC codes, turbo codes, repeat-accumulate codes, convolutional codes, and trellis codes; and polar codes.
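The hash constraint function can be sketched as an indicator over GF(2); the small H and the candidate sequences below are illustrative, not taken from the disclosure:

```python
import numpy as np

def hash_constraint(s, H, x):
    """c(s^n): 1 iff every hard constraint x_a = sum_i H_{a,i} s_i
    (mod 2) holds, i.e. the candidate s lies in the bin labeled x."""
    return int(np.all((H @ np.asarray(s)) % 2 == x))

H = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
s_true = np.array([1, 0, 1, 1])
x = (H @ s_true) % 2                        # bin label of the true sequence

print(hash_constraint(s_true, H, x))        # 1: all checks satisfied
print(hash_constraint([0, 0, 1, 1], H, x))  # 0: flipping the first symbol violates a check
```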

(19) The decoder 204 also utilizes a source subgraph. If the data model p_{s^n} is in a factored form, a factor graph can be formed directly therefrom. If the data model p_{s^n} is a directed or undirected PGM, it can be converted into a factor graph form suitable for belief propagation marginalization. Without loss of generality, in some embodiments, s^n ~ p_{s^n} is represented by an undirected graph 𝒢 = (𝒮, ℰ) over the n source nodes 𝒮 ≜ {s_1, . . . , s_n}, in the sense that p_{s^n} is factored over the maximal cliques of 𝒢 as:

(20) p_{s^n}(s^n) = (1/Z) ∏_{C ∈ cl(𝒢)} ψ_C(s_C)
This factored form can be expressed as a factor graph 𝒢̃ = (𝒮, Ψ, ℰ). In some embodiments, if 𝒢 is a pairwise model, no conversion is needed, and the computation can be performed directly on 𝒢. Either one of 𝒢 and 𝒢̃ may be called a source subgraph.
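A pairwise source model of this kind can be sketched as a product of edge potentials; the 3-node chain and the potential values are illustrative, not from the disclosure:

```python
import numpy as np

def pairwise_unnormalized(s, edges, psi):
    """Unnormalized p(s^n) for a pairwise model: the product of edge
    potentials psi[(i, j)][s_i, s_j] over the graph's edges."""
    p = 1.0
    for (i, j) in edges:
        p *= psi[(i, j)][s[i], s[j]]
    return p

edges = [(0, 1), (1, 2)]                     # chain s_0 - s_1 - s_2
smooth = np.array([[2.0, 1.0], [1.0, 2.0]])  # favors equal neighbors
psi = {e: smooth for e in edges}

print(pairwise_unnormalized([0, 0, 0], edges, psi))  # 4.0: both edges agree
print(pairwise_unnormalized([0, 1, 0], edges, psi))  # 1.0: both edges disagree
```

Dividing by the partition function Z (the sum of this product over all 2^3 assignments) would recover the normalized distribution of (20).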

(21) With reference to FIGS. 3A and 3B, in one embodiment, 𝒰 ≜ 𝒞 ∪ 𝒢̃ = (ℱ ∪ Ψ, 𝒮, ℰ) represents a combined source-code graph 300 in factor graph form. In the graph 300, the source nodes 𝒮 are shared between the source subgraph 302a and the code subgraph 304. In the embodiment depicted in FIG. 3A, the source model is described as a pairwise model in the subgraph 302a; in this subgraph, Ψ is elided. In general, however, either or both of the source and code subgraphs can be factor graphs (also called Tanner graphs). For example, FIG. 3B shows the same source model as a source factor graph 302b. A source graph can also be a directed or undirected PGM. The source nodes divide 𝒰 into a source side and a code side. The decoder 204 (shown in FIG. 2) performs belief propagation (BP) on the combined graph 300, representing approximate marginalization on the joint objective:
u(s^n) ≜ c(s^n) p_{s^n}(s^n)

(22) In BP, each node in a graph corresponds to a local computation device accepting inputs and emitting outputs on edges, to connected ports. For a graph 𝒢 = (𝒮, ℰ) represented as a bipartite factor graph 𝒢̃ = (𝒮, Ψ, ℰ), the nodes 𝒮 may be called variable nodes and Ψ may be called factor nodes. In this representation, each factor node ψ_c ∈ Ψ corresponds to one unfactorizable clique c of 𝒢. ℰ = {(ψ_c, s_i) : ψ_c ∈ Ψ, s_i ∈ c} represents the edges between factor nodes and their corresponding clique variable nodes.

(23) In general, any function f(s_1, . . . , s_l) over l variables, f: S^l → R, can be represented by a tensor

(24) f_{i_1, . . . , i_{l′}}^{i_{l′+1}, . . . , i_l}

of total order l. In general, f represents a factor in a graph, with l edges to l variable nodes. Thus, f is a function over l variable nodes and, graphically, there are l edges to f and, correspondingly, l ports. Some of the l ports (e.g., l′ of them) are designated as input ports and the rest (e.g., l − l′ ports) are designated as output ports. For example, in decoding, in some embodiments, l − 1 ports are designated as inputs that emit one output. Such a function (also called a tensor) can be implemented as a machine with the numbers of input and output ports summing to l. For the special case of a (1, 0)-tensor m^i, representing a unary function m: S → R, the unary function is called a message. Two atomic operations, namely, accumulation and filtration, are associated with messages. The accumulation operation on a set of messages ℳ is described as ⊙_{m ∈ ℳ} m, where ⊙ is the Hadamard product. The filtration operation by a factor g: S^l → R on a set of messages ℳ is described as g(⊗_{m ∈ ℳ} m), where ⊗ is the tensor product, and the outermost operation is a tensor contraction. ψ and f represent factors in the source graph (represented as a factor graph) and the code graph (represented as another factor graph), respectively.
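The two atomic operations can be sketched as follows, assuming messages are length-q nonnegative vectors over the alphabet; the factor and message values are illustrative:

```python
import numpy as np

def accumulate(messages):
    """Accumulation: Hadamard (elementwise) product of a set of messages."""
    out = np.ones_like(messages[0], dtype=float)
    for m in messages:
        out = out * m
    return out

def filtrate(factor, messages):
    """Filtration by a factor g over l variables: contract the tensor
    product of the l-1 incoming messages against g, leaving a message
    on the single remaining (output) port, axis 0 of `factor`."""
    t = np.asarray(factor, dtype=float)
    for m in reversed(messages):      # contract trailing axes one by one
        t = t @ m
    return t

m1, m2 = np.array([0.3, 0.7]), np.array([0.6, 0.4])
print(accumulate([m1, m2]))                 # [0.18 0.28]

psi = np.array([[2.0, 1.0], [1.0, 2.0]])    # pairwise factor (q x q)
print(filtrate(psi, [m1]))                  # psi @ m1 = [1.3 1.7]
```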

(25) For convenience, in some embodiments, the input or output ports of a node are labeled by the adjacent node to which those ports are connected in 𝒢̃. A variable-node output message of the node s_i, on the port ψ_c, is denoted m^{i→c}(s_i). A factor-node output message of the node ψ_c, on the port s_i, is denoted m^{i←c}(s_i). Thus, two directional messages are associated with each edge of ℰ. Reciprocally, m^{i→c}(s_i) and m^{i←c}(s_i) represent a factor-node input message and a variable-node input message, respectively.

(26) Referring again to FIGS. 3A-3C, for each source node s_i of 𝒮 in the source-code graph 300, there are two sets of neighbor nodes, one on the source side and one on the code side. Only the source nodes interact with both sides; each factor node interacts only with nodes in its own corresponding subgraph. In the combined BP, messages are passed along all edges, and computations are performed on the three sets of nodes, ℱ, Ψ, and 𝒮.

(27) First, messages are computed as described below for the nodes that do not interact with the other side directly. In these computations, μ indicates source-side messages and m indicates code-side messages; wildcards (*), i.e., all nodes, only expand within the subgraph of their message. In Ψ node output computation, for every output port s_i of ψ_c, filter messages on ports other than s_i by ψ_c using:

(28) μ^{i←c} ← ψ_c[⊗_{i′ ≠ i} μ^{i′→c}]

In general, μ^{i←c} represents information received at ψ_c corresponding to s_i from all source nodes except s_i, as depicted in FIG. 3D. In ℱ node output computation, for every output port s_i of f_a, filter messages on ports other than s_i by f_a, as depicted in FIG. 3E, using:

(29) m^{i←a} ← f_a[⊗_{i′ ≠ i} m^{i′→a}]

(30) Next, messages for nodes that open ports to both sides of the joint graph are computed as source-side and code-side output computations. Specifically, in source-side s-node output computation, for every output port ψ_c of s_i, messages on ports other than ψ_c are accumulated as:

(31) μ^{i→c} ← ⊙_{c′ ≠ c} μ^{i←c′} ⊙ [m^{i←*}]

In code-side s-node output computation, for every output port f_a of s_i, messages on ports other than f_a are accumulated as:

(32) m^{i→a} ← ⊙_{a′ ≠ a} m^{i←a′} ⊙ [μ^{i←*}]

⊙_{c′ ≠ c} μ^{i←c′} is the aggregation of messages received at s_i from all source-side factor nodes except ψ_c, and [m^{i←*}] is the aggregate of the code-side messages, as depicted in FIG. 3F. Similarly, ⊙_{a′ ≠ a} m^{i←a′} is the aggregation of messages received at s_i from all code-side factor nodes except f_a, and [μ^{i←*}] is the aggregate of all source-side messages, as depicted in FIG. 3G.

(33) Finally, referring back to FIGS. 3D and 3E, total belief computation is performed for every s_i by accumulating messages on all ports as:

(34) b^i = [μ^{i←*}] ⊙ [m^{i←*}]
The computations associated with the subgraphs 302a, 302b, 304 and with the overall source-code graph 300 may be implemented by one or more hardware processors and/or as one or more software components. In one embodiment, each node of these graphs can execute the associated accumulation and filtration operations atomically, using a processor dedicated to that node. In other embodiments, one processor (implemented as a hardware and/or software processor) may perform the computations associated with only some of the nodes, the operations associated with the other nodes being performed by additional hardware and/or software processors.

(35) The bracketed terms, despite being Hadamard products of many messages, may be understood as single lumped messages that do not change with respect to the port on which the output is emitted. Therefore, in some embodiments, when a node of 𝒮 is emitting on one subgraph, the lumped message from the other subgraph can be pre-computed and considered to be an external message on a virtual port opened on the side of that subgraph. As such, BP on 𝒰 is the same as BP on each subgraph alone (e.g., subgraphs 302a, 302b, 304 in FIGS. 3A and 3B), with the addition of an external I/O port and a computation to determine the external message. Taking advantage of this property, in some embodiments, a decoder system 350 depicted in FIG. 3C is architecturally modular, with external messages and virtual ports 306 acting as the only interface between the graph-inferential components 302a, 302b, 304.

(36) In general, while each component can compute a total belief using external messages, referring again to FIG. 3C, in some embodiments, a main controller 308 is used to handle component-agnostic operations. The controller 308 is simply another component over the port-preserving intersection subgraph 𝒰* ≜ (𝒮, ∅, ℰ), which retains the source nodes and their ports but no factor nodes. All of the ports of this subgraph can be interpreted as virtual ports, i.e., the ports 306, that can receive external messages. Therefore, for the controller 308, the BP from which the original uncompressed data sequence may be derived can be expressed as:

(37) Total belief computation: b^i = [M^{i←*}]

where the wildcard is taken over all virtual ports 306. The computations for the pairwise model depicted in FIG. 3C are shown in FIG. 4.

(38) To perform the various computations described above according to a parallel schedule, each component (e.g., the subgraphs 302a, 302b, 304 depicted in FIGS. 3A-3C) may compute internal messages and may then exchange external messages with the other components simultaneously. As used herein, a simultaneous exchange of two messages does not require the start and/or end of the two exchanges to coincide. The exchange is considered simultaneous if the two exchanges overlap, i.e., at least a part of one exchange occurs while at least a part of the other exchange is in progress. In a serial schedule, one component is active at a time and presents its latest external messages to another component. A serial schedule may alternate between source message passing and code message passing. Within a component, the computations may be executed in parallel, whether the exchange of external messages occurs in parallel or serially.

(39) Referring back to FIGS. 2 and 7, in the decoder 204, belief propagation is typically performed until convergence is detected or failure is declared. The expression p̂_{s_i} = b^i/‖b^i‖_1 can be used to determine convergence and, if convergence is detected, the decompressed output of the decoder 204 may be expressed as:

(40) ŝ_i = arg max_{s_i} p̂_{s_i}(s_i)
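The normalization and hard-decision step can be sketched as follows; the belief values are illustrative:

```python
import numpy as np

def hard_decision(beliefs):
    """Normalize each total belief b^i to p_hat = b^i / ||b^i||_1 and
    take the argmax as the decoder's symbol estimate."""
    b = np.asarray(beliefs, dtype=float)
    p_hat = b / b.sum(axis=1, keepdims=True)   # per-symbol normalization
    return p_hat.argmax(axis=1), p_hat

b = [[0.2, 0.8],    # belief strongly favors s_0 = 1
     [0.9, 0.3],    # favors s_1 = 0
     [0.5, 0.5]]    # ambiguous; argmax breaks the tie toward 0
s_hat, p_hat = hard_decision(b)
print(s_hat)        # [1 0 0]
```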

(41) Some data models associated with a data sequence to be uncompressed may not be known fully. Parametric models represent one type of such partially specified models. If an uncompressed data sequence s is drawn from a distribution p(s; θ), where θ denotes unknown parameters, a checksum-correctness objective may be defined as:

(42) F(ŝ) = (1/k) ‖H ŝ − x‖

where ‖·‖ is some appropriate norm (e.g., l_0). At each iteration t of the decoding process described with reference to FIGS. 3A-3C, F(·) is evaluated on the contemporary source estimate ŝ^(t)(θ) obtained from the total belief update. The value θ* that minimizes F(ŝ^(t)(θ)) among choices in a neighborhood 𝒩_{ε^(t)}(θ^(t−1)), for some diminishing sequence ε^(1) > ε^(2) > . . . > 0 of bounded sum (e.g., ε^(t) = α/β^t for α > 0, β > 1), is chosen as the parameter estimate for that iteration. At the end of decoding, the sequence of estimates for θ has also converged within some neighborhood, to determine the parameters of the partially specified model. During the iterations, the model is revised using the estimated parameters.
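With the l_0 norm, the checksum-correctness objective reduces to the fraction of violated checks, which can be sketched as follows; H, x, and the candidate estimates are illustrative:

```python
import numpy as np

def checksum_objective(s_hat, H, x):
    """F(s_hat) = (1/k) * ||H s_hat - x||_0 over GF(2): the fraction of
    hard constraints violated by the current source estimate."""
    k = H.shape[0]
    return np.count_nonzero((H @ s_hat) % 2 != x) / k

H = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
s_true = np.array([1, 0, 1, 1])
x = (H @ s_true) % 2

print(checksum_objective(s_true, H, x))                  # 0.0: all checks pass
print(checksum_objective(np.array([1, 1, 1, 1]), H, x))  # > 0: some checks fail
```

A parameter search per iteration would evaluate this objective on ŝ^(t)(θ) for each candidate θ in the current neighborhood and keep the minimizer.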

(43) In some embodiments, the source model used in decoding is augmented with parameter nodes. To this end, while θ is deterministic, a Bayesian belief p_θ(θ) can be constructed to represent it. p_θ(θ) may be unknown, but the true θ can be treated as drawn from a random variable with distribution p_θ. Therefore, in some embodiments, the joint distribution p_{s^n,θ}(s^n, θ) is written as:

(44) p_{s^n,θ}(s^n, θ) = p_θ(θ) ∏_{i=1}^{n} p_{s_i|θ}(s_i | θ)
In the graph 𝒢 for the source s^n, the nodes 𝒮 = {s_1, . . . , s_n} may be augmented by a node θ, connected to all other nodes of 𝒢 via ψ(s_i, θ) as an edge potential replacing the node potential ψ(s_i), to obtain an augmented graphical model. Using this augmented model, the decoder can decode/uncompress a compressed data sequence using the objective:

(45) u(s^n, θ) = c(s^n) p_{s^n,θ}(s^n, θ)

(46) In various embodiments described herein, the source s (i.e., the data to be compressed) and the code as represented by H are based on a field S = GF(q). If q = 2, the source and the code are binary; otherwise, the source and the code are based on a non-binary alphabet. Such non-binary sources may have substantial complexity in their internal symbol structure, and source models are defined over those symbols. Consider, for example, floating-point numbers or relational database entries as source symbols. In real systems, even though the source s may be highly structured internally, s is often presented to the compression system after it has already been serialized into an opaque bit-stream z. In model-free encoding, it is therefore not realistic to assume knowledge of the actual field structure, or even the alphabet size, of s at the encoder.

(47) In order to use binary LDPC codes in the compression/decompression of non-binary data, i.e., when H is over GF(2) but s is over GF(M) with M > 2, or, in general, to use a code over a different alphabet than the source alphabet, a source with an alphabet larger than GF(2) may be modified by bit-planing, i.e., by treating the bits representing a letter in a larger alphabet as independent GF(2) sources. Bit-planing, however, may not achieve the best results in terms of compression efficiency.
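Bit-planing can be sketched as follows; the symbols and the width are illustrative:

```python
def bit_planes(symbols, width):
    """Split integer symbols from an alphabet of size M <= 2**width into
    `width` GF(2) planes (plane 0 = least-significant bits), which are
    then compressed as if they were independent binary sources."""
    return [[(s >> b) & 1 for s in symbols] for b in range(width)]

symbols = [5, 2, 7]           # binary: 101, 010, 111
planes = bit_planes(symbols, 3)
print(planes[0])              # LSB plane: [1, 0, 1]
print(planes[2])              # MSB plane: [1, 0, 1]
```

Treating the planes as independent discards the correlation between bits of the same symbol, which is one reason bit-planing can fall short of the best compression efficiency.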

(48) With reference to FIGS. 5 and 8, in one embodiment, a data sequence s^n = {s_1, . . . , s_n} is an n-symbol data sequence that, instead of bit-planing, is serialized by symbol-level representational maps. In this embodiment, all s_i, i = 1, . . . , n, belong to the same alphabet S of size M and, as such, there is one map for all n symbols, though in some situations different symbols may belong to alphabets of different sizes. The representation map is a bijective function t_{M→q}: S → GF(q)^B, where B ≜ ⌈log_q M⌉. For integer symbols s_i serialized into GF(2), the mapping can be a computer representation of the integers, or another binary expansion such as Gray coding. When messages are passed to or from source nodes, there are related messages on the serialized representations of the source nodes. As such, a pair of message translation functions that convert between a message m^(M) over S and a B-tuple of messages m^(q) = (m_1^(q), . . . , m_B^(q)) over GF(q) can be defined as:

T_{M→q}: (S → R⁺) → (GF(q) → R⁺)^B
T_{q→M}: (GF(q) → R⁺)^B → (S → R⁺)

In some embodiments, assuming messages are properly normalized probabilities, for β ∈ {1, . . . , B} and ζ ∈ GF(q),

(49) T_{M→q}(m^(M))_β(ζ) = ∑_{σ ∈ S} m^(M)(σ) 1{t_{M→q}(σ)_β = ζ}
and for σ ∈ S,

(50) T_{q→M}(m^(q))(σ) = ∏_{β=1}^{B} m_β^(q)(t_{M→q}(σ)_β)

During decoding, passing messages to or from the source nodes 𝒮 corresponds to passing related messages on the serialized representations of the source nodes.
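For q = 2, the translation pair of (49) and (50) can be sketched as follows; the LSB-first representation map is a choice made here for illustration, and the message values are invented:

```python
import numpy as np

def t_map(sigma, B):
    """Representation map t_{M->2}: symbol -> B-bit tuple (LSB first)."""
    return [(sigma >> b) & 1 for b in range(B)]

def T_M_to_2(m_M, B):
    """(49): marginalize one message over S onto each of the B bit positions."""
    out = np.zeros((B, 2))
    for sigma, p in enumerate(m_M):
        for b, bit in enumerate(t_map(sigma, B)):
            out[b, bit] += p            # sum of m(sigma) with t(sigma)_b = bit
    return out

def T_2_to_M(m_2, M):
    """(50): recombine B binary messages into one message over S by products."""
    B = len(m_2)
    return np.array([np.prod([m_2[b][bit]
                              for b, bit in enumerate(t_map(sigma, B))])
                     for sigma in range(M)])

m_M = np.array([0.5, 0.5, 0.0, 0.0])    # mass only on symbols 0 and 1 (M = 4)
m_2 = T_M_to_2(m_M, B=2)
print(m_2[1])                            # high-bit message: certainly 0
print(T_2_to_M(m_2, M=4))               # recovers m_M exactly in this case
```

The round trip is exact here because the bit marginals happen to be independent; in general T_{2→M} ∘ T_{M→2} is only an approximation of the original message.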

(51) In this embodiment, encoding of the source sequence s^n takes place in the representational domain of bit-streams, i.e., in a code domain. Given an LDPC projection matrix H ∈ GF(q)^{k×nB} and rate r_code = k/nB, the compressed output is:

(52) x^k = H z^{nB} = H t_{M→q}(s^n)
For an uncompressed data sequence s^n = {s_1, . . . , s_n} expressed in a graphical decoder, the serialized symbols Z via the representative map t_{M→q} can be expressed as:

(53) {z_{i,1}, z_{i,2}, . . . , z_{i,B}} ≜ t_{M→q}(s_i), with Z ≜ t_{M→q}(𝒮) = ∪_{s ∈ 𝒮} t_{M→q}(s)

(54) Without the representative map, the code subgraph and the source subgraph share the n source nodes in the source-code graph, as discussed above with reference to FIG. 3A. With the representative map, however, that is no longer the case. Instead, for 𝒢 = (𝒮, ℰ) represented as a factor graph 𝒢̃ = (𝒮, Ψ, ℰ), the code subgraph becomes 𝒞 = (ℱ, 𝒵, ℰ_c), where 𝒵 = t_{M→q}(𝒮). The corresponding combined source-code graph 500 is 𝒰 ≜ (ℱ ∪ Ψ, 𝒮 ∪ 𝒵, ℰ). The message passing strictly within each subgraph, i.e., the source subgraph 502 and the code subgraph 504, is similar to that described with reference to FIG. 3A. When messages cross alphabet/representation boundaries and are passed from one subgraph to another, however, the messages are translated according to t_{M→q}. The corresponding decoding can be called generalized decoding. In various embodiments, the source (i.e., the data to be compressed) and the messages are translated into a binary format using a mapping t_{M→2}.

(55) Sometimes privacy of the data to be compressed and/or transmitted must be retained. Therefore, in some instances, a version of the data that has already been encrypted, e.g., by adding a secret key to a bit-stream representation of the data, is presented to the compressor (e.g., the encoder 202 shown in FIG. 2). Thus, as discussed above in describing non-binary sources, the compressor/encoder receives t_{M→2}(s) ⊕ k, where k = {k_1, . . . , k_{nB}} are the nB bits of key associated with the nB bits representing the unencrypted data s. This stream is largely uncompressible using conventional encoders without the encoder also having the key. Various embodiments of model-free encoders described herein, however, can consider k as part of a representational map, treating t_{M→2}(s) ⊕ k just as they would t_{M→2}(s), and may apply coding to it directly. The bijective function t_{M→q}, where q = 2, and the corresponding message translation functions T_{M→q} and T_{q→M}, for q = 2, can be obtained as described above.
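The key-as-part-of-the-map idea can be sketched as follows, under the assumption that "adding a secret key" means bitwise XOR over GF(2); the matrix, key, and bit values are toy choices for illustration.

```python
# Compressing an already-encrypted bit-stream: the model-free encoder
# applies the same projection x = H z' with z' = t_{M->2}(s) XOR k,
# without ever seeing the key k. (Toy H and key, for illustration.)
def xor_key(z, k):
    """Bitwise XOR of the serialized bits with the key stream."""
    return [zi ^ ki for zi, ki in zip(z, k)]

def compress(z, H_rows):
    """x = H z over GF(2); each row of H is the set of bit positions it taps."""
    return [sum(z[j] for j in row) % 2 for row in H_rows]

H_rows = [{0, 2, 5}, {1, 4, 6}, {3, 5, 7}]
z  = [1, 0, 0, 0, 1, 1, 0, 1]          # t_{M->2}(s): unencrypted bits
k  = [1, 1, 0, 1, 0, 1, 1, 0]          # secret key, unknown to the encoder
zk = xor_key(z, k)                      # encrypted stream seen by the encoder
x  = compress(zk, H_rows)               # compressed without the key

# By linearity over GF(2), x = H(z XOR k) = Hz XOR Hk, so a decoder
# that knows k can account for the key's contribution Hk exactly.
```

The linearity noted in the final comment is what lets the decoder, which holds the key, fold k into its generalized decoding while the encoder remains oblivious to it.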

(56) With reference to FIG. 6, while decoding, the combined graph is represented as a source-code graph 600. The source subgraph 602 is unencrypted because s, the object of decoding, is unencrypted. The code subgraph 604, however, is encrypted because z = t_{M→2}(s) ⊕ k is encrypted and x = Hz is therefore also encrypted. The encryption boundary thus coincides with the representation boundary. As such, in one embodiment, a decoder applies generalized decoding, using t_{M→2} and the corresponding T_{M→2} and T_{2→M} pair. During each iteration of this decoder, a message m^{(M)}(s_i) passed into the source subgraph is obtained by message decryption followed by the translation T_{2→M}, and a message m^{(2)}(z_{i,ℓ}) passed into the code subgraph is obtained by the translation T_{M→2} followed by message encryption.

(57) In some situations, lossless compression is infeasible or is impractical and, hence, the data sequence must be compressed using a lossy compression technique. Lossy compression may be preferred over lossless compression, even when the latter is feasible, for performance and/or cost benefits. Lossy compression involves two sources of compression gain: (1) conflating multiple source sequences as one (i.e., quantization), and (2) exploiting the data model (i.e., entropy-based compression). These two sources of compression gain relate, respectively, to two sources of prior knowledge, namely: (1) a distortion measure and/or semantic information, and (2) the data model. In various embodiments, the encoder is aware of the quantization, i.e., the distortion measure and/or semantic information, but, as described with reference to FIG. 2, the compression is performed without relying on any significant information related to the data model. The quantization is also performed without relying on any significant information about the data model. As such, in some embodiments, the encoder includes a model-free quantization stage followed by a model-free coding stage. The corresponding decoder may use a model of the quantization together with the code subgraph and the source subgraph.
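The two-stage encoder described above (model-free quantization followed by model-free coding) might be sketched as below; the uniform scalar quantizer and the small projection matrix are assumed stand-ins, chosen only to satisfy the stated constraint that neither stage consults a statistical model of the source.

```python
# Two-stage model-free lossy encoder: (1) quantize using only a
# distortion measure (here, uniform scalar quantization), then
# (2) project the quantized bits with a sparse H over GF(2).
# Neither stage uses any statistical model of the source.
STEP = 0.5                              # quantizer step (distortion knob)
B = 2                                   # bits per quantized symbol

def quantize(samples):
    """Map each real sample to one of 2**B cells; model-free."""
    return [max(0, min(2**B - 1, int(v / STEP))) for v in samples]

def serialize(q):
    """t_{M->2} on the quantized symbols: B bits per symbol, MSB first."""
    return [bit for sym in q for bit in ((sym >> 1) & 1, sym & 1)]

def compress(z, H_rows):
    """Model-free coding stage: x = H z over GF(2)."""
    return [sum(z[j] for j in row) % 2 for row in H_rows]

H_rows = [{0, 2, 5}, {1, 4, 6}, {3, 5, 7}]  # toy sparse H, illustrative
samples = [0.2, 1.7, 0.9, 1.2]
q = quantize(samples)          # quantization: conflates nearby sequences
z = serialize(q)               # serialize the quantized symbols
x = compress(z, H_rows)        # model-free coding stage
```

A matching decoder would combine a model of this quantizer (its cell boundaries) with the code subgraph and a source subgraph, recovering the two deferred pieces of prior knowledge at decode time.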

(58) Designing a compression system for each data type, as the known systems generally require, can be a costly and time-consuming process, spanning years of development. As technology improves, it is valuable to be able to take incremental advantage of improvements without a full system overhaul. In various embodiments, compression with model-free encoding generates an open-ended compressed stream that is not tied to a particular source model, so that almost any change to this key piece of information can be made at the decoder as one sees fit. The described systems and methods thus present a low barrier not only to improvements but also to the initial design: a crude decoder can be built for many types of emerging data and subsequently modified without fear of losing compatibility.

(59) Sometimes the source model is not fully known at the encoder, for example, if the bit representation of the data is unknown, if the data is encrypted, or if it would help, for data retrieval reasons, to encode without incorporating certain complex source structure. In these cases, the technology described herein still allows compression as efficient as if this information were known at the encoder, while allowing these tasks to be deferred to the decoder. Ensuring end-to-end performance is also important when compressed data is transported across an unreliable network, such as a WiFi or cellular network, or over complex distributed network topologies. Traditionally, compression and error correction are jointly designed to some degree to ensure the best performance under such circumstances, because the compressed output has an intricate structure with critical and uncritical parts relating to the data type. In various embodiments, compression with model-free encoding and model-based decoding creates a compressed information stream in which the parts of the stream are interchangeable, so that network-specific issues can be left to the transport layer, and no special compression system needs to be designed for distributed/P2P streaming or for different network characteristics.

(60) In recent years, the computing landscape has moved towards a model of low-power mobile acquisition devices and powerful processing nodes at the server. This means the existing methods of compression, where a sophisticated encoder takes into account all source model information and a simple decoder reproduces the data, may not utilize resources in the right way. The described technology, with the model-free or model-independent encoder and a relatively more complex, model-based decoder, is well suited for this new landscape. It also allows the tradeoff of compression efficiency (hence network load) and computational load at the decoder to be dynamically adjusted, by choosing to use a crude or sophisticated source model, something that is very difficult if not impossible to achieve using known compression techniques.

(61) Applications for, and advantages of, the model-free encoding and model-based decoding technique for compression and decompression include compressing data stores of new and existing data types, improved end-to-end performance for stored and streaming network applications, more privacy and content protection at upstream sites, and many others.

(62) Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).

(63) The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

(64) The term data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.

(65) A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language resource), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

(66) The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

(67) Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

(68) To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending resources to and receiving resources from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

(69) Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).

(70) The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.

(71) A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.

(72) While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

(73) Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

(74) Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.