WATERMARK EMBEDDING AND EXTRACTING METHOD FOR PROTECTING DOCUMENTS
20210165860 · 2021-06-03
Inventors
Cpc classification
G06T1/0028
PHYSICS
G06T1/005
PHYSICS
G06T2201/0065
PHYSICS
International classification
Abstract
A method for watermarking documents comprising: identifying and locating spaces in a received original document (10) by a location analysis (110) module which distinguishes between intra-word spaces and inter-word spaces; minimizing an error probability of interference between intra-word spaces and inter-word spaces in the watermarked document (20) by an optimization module (111); encoding (113) the message (30) into codewords and encoding (114) the codewords into the watermark;
embedding the watermark to generate (115) the watermarked document (20) by modifying the intra-word spaces and inter-word spaces of the original document (10).
Claims
1. A method (11) for watermarking documents, receiving an original document (10) and a message (30) to obtain a watermarked document (20) embedding a watermark which encodes the received message (30), the method characterized by comprising: identifying and locating spaces in the original document (10) by a location analysis (110) module, the location analysis (110) distinguishing between intra-word spaces and inter-word spaces; minimizing an error probability of interference between intra-word spaces and inter-word spaces in the watermarked document (20) by an optimization module (111); encoding (113) the message (30) into codewords and encoding (114) the codewords into the watermark; embedding the watermark to generate (115) the watermarked document (20) by modifying the intra-word spaces and inter-word spaces of the original document (10).
2. The method according to any preceding claim, wherein the location analysis (110) comprises: checking (401) whether the received original document (10) is a bitmap format document and if the original document (10) is in a non-bitmap format, converting (402) the original document (10) into a bitmap format document; applying an image-based segmentation (403) to the bitmap format document to obtain a set of lengths and a set of locations of the spaces in the bitmap format document; classifying (405) the identified and located spaces into intra-word spaces and inter-word spaces to obtain sets (406) of lengths of the intra-word and inter-word spaces, S.sub.S and S.sub.B respectively, and sets of locations of the intra-word and inter-word spaces, L.sub.S and L.sub.B, of the original document (10).
3. The method according to claim 2, wherein the image-based segmentation (403) applies clustering or optical character recognition if the received original document (10) has a non-homogeneous text background.
4. The method according to claim 2, wherein the image-based segmentation (403) applies projections if the received original document (10) has a homogeneous text background.
5. The method according to any any of claims 2-4, wherein the optimization module (111) minimizes the error probability of interference between intra-word spaces and inter-word spaces by: obtaining a modified set of intra-word spaces, S*.sub.S={S*.sub.Si} where i denotes a line number of the original document (10) and S*.sub.Si is equal to:
μ(i)+εif S.sub.Si>μ(i)+ε, where ε is a host-rejection parameter; and S.sub.Si else; obtaining a modified set of inter-word spaces, S*.sub.B={S*.sub.Bi} where i denotes a line number of the original document (10), where S*.sub.Bi={S*.sub.Bi(k)}, S*.sub.Bi(k) being the modified length of the k-th inter-word space of the i-th line and S.sub.Bi(k) being the original length of the k-th inter-word space of the i-th line of the original document (10), S*.sub.S={S*.sub.Si(k)}, S*.sub.Si(k) being the modified length of the k-th intra-word space of the i-th line and S.sub.Si(k) being the original length of the k-th intra-word space of the i-th line of the original document (10), and
S*.sub.Bi(k)=S.sub.Bi(k)+(Σ.sub.kS.sub.Si(k)−Σ.sub.kS*.sub.Si(k))/N.sub.b(i) N.sub.b(i) being the number of inter-word spaces in the i-th line of the original document (10).
6. The method according to claim 5, wherein generating (115) the watermarked document (20) comprising: selecting a subset from the set of inter-word spaces, S*.sub.B, mapped to the set of inter-word spaces in the t-th line, S.sup.t.sub.B, through a secret key; selecting a codeword w.sub.t from the encoded message (30) embedding the selected codeword w.sub.t in the selected subset of inter-word spaces by calculating a modified inter-word space S.sub.BW of the watermarked document (20) using the expression:
S.sup.t.sub.BW(k)=S.sup.t.sub.B(k)+w.sub.t(k).Math.c.sub.t(k).Math.S.sup.t.sub.B(k)=S.sup.t.sub.B(k)(1+w.sub.t(k).Math.c.sub.t(k)) where w.sub.t(k) is the selected codeword, for the k-th inter-word space in the t-th line; S.sup.t.sub.B(k) is the k-th inter-word space in the t-th line of the inter-word space S.sub.B of the original document (10), S.sup.t.sub.BW(k) is the k-th inter-word space in the t-th line of the inter-word space S.sub.BW of the watermarked document (20) and c.sub.t(k) is a weighting factor; modifying the intra-word spaces and inter-word spaces of the original document (10) replacing the intra-word space, S.sub.S, and the inter-word space, S.sub.B, respectively by the modified set of intra-word spaces, S*.sub.S, and the modified inter-word space, S.sub.BW, in the watermarked document (20).
7. The method according to any preceding claim, wherein encoding (113) the message (30) of length K comprises adding a channel coding to the message (30) to obtain a modified message (m′) of length L>K, dividing the modified message (m′) into payload blocks (510, 510′) of length N and adding a synchronization block (520, 520′) of length T before each payload block (510, 510′) respectively to obtain a symbol sequence (m″) of length L.Math.(1+T/N).
8. The method according to any preceding claim, wherein the original document (10) is selected from a digital document and a digitalized document.
9. A method (12) for extracting watermarks from a received document (60) resulting from a watermarked document (20) which has a watermark embedded encoding a message (30) of a number Q of symbols, the method characterized by comprising: identifying and locating spaces in the received document (60) by a location analysis (630) module, the location analysis (630) distinguishing between intra-word spaces and inter-word spaces and obtaining a vector S′.sub.BW corresponding to the inter-word spaces of the watermarked document (20); mapping (640) the inter-word spaces of the vector S′.sub.BW to symbols according to an alphabet of codewords w={w.sub.i}, i=1, 2, . . . Q; Q being the number of symbols of the message (30) and the symbols being univocally mapped to the codewords w; de-packetizing (650) the mapped symbols to obtain synchronization blocks and payload blocks of symbols; channel decoding (660) the payload blocks to extract the message (30).
10. The method according to claim 9, wherein the received document (60) is a distorted copy of the watermarked document (20).
11. The method according to any of claims 9-10, further comprising applying noise reduction (610) and geometric correction (620) to the received document (60) before the location analysis (630).
12. A non-transitory computer-readable medium containing computer instructions stored therein for causing a computer processor to perform the method according to any of claims 1-8.
13. A non-transitory computer-readable medium containing computer instructions stored therein for causing a computer processor to perform the method according to any of claims 9-11.
Description
DESCRIPTION OF THE DRAWINGS
[0030] For the purpose of aiding the understanding of the characteristics of the invention, according to a preferred practical embodiment thereof and in order to complement this description, the following Figures are attached as an integral part thereof, having an illustrative and non-limiting character:
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
PREFERRED EMBODIMENT OF THE INVENTION
[0037] The matters defined in this detailed description are provided to assist in a comprehensive understanding of the invention. Accordingly, those of ordinary skill in the art will recognize that variation changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, description of well-known functions and elements are omitted for clarity and conciseness.
[0038] Of course, the embodiments of the invention can be implemented in a variety of architectural platforms, operating and server systems, devices, systems, or applications. Any particular architectural layout or implementation presented herein is provided for purposes of illustration and comprehension only and is not intended to limit aspects of the invention.
[0039]
[0040] The watermark extraction method (12), shown in
[0041] The watermark embedding method (11) is divided into more detailed sub-processes in
[0042] The location analysis (110) consists of different steps shown in further detail in
[0043] By binarizing the bitmap format document and summing the pixels in horizontal direction, the horizontal projection P.sub.h(r) is obtained as:
P.sub.h(r)=Σ.sub.cI(r,c)
where r is the r-th row, c is the c-th column, and I(r,c) is the binarized document image.
[0044] By setting a proper threshold value, the bitmap format document may be segmented in text lines, finding optimal binarization thresholds from the image histogram. Once the projections are computed, and text lines identified, each text line is segmented. For each line, its vertical projection P.sub.vi(c) corresponds to:
P.sub.vi(c)=Σ.sub.rI.sub.i(r,c)
where, I.sub.i(r,c) is the binarized document image cropped to the i-th line.
[0045] Applying the same technique used to isolate text line using the horizontal projection P.sub.h(r), spaces in each line can be identified and measured applying a proper threshold to the vertical projection P.sub.vi(c).
[0046] The output (404) of the segmentation (403) is a set of space lengths and a set of their corresponding locations in the document, denoted as set S and set L respectively. The identified spaces are then classified (405) into intra-word and inter-word spaces. As the process is dealing with a bitmap document, this space classification (405) may be based on OCR—Optical Character Recognition—techniques or on a clustering algorithm by analyzing the histogram of the space lengths, S. As a result of this classification (405), the original space lengths S and locations L sets are split respectively into sets (406), S.sub.S and S.sub.B, and L.sub.S and L.sub.B, being the sets of the length and location of the intra-word and inter-word spaces, respectively.
[0047] As a following step, the output sets (406) of the space classification (405) are fed to an optimization process, aiming at reducing the error probability of the future watermark extraction process. In particular, in order to minimize the possible interference between inter-word and intra-word spaces in the watermark extraction process (i.e., interpreting a space of one kind for one of the other kind), the distance between the lengths of the two kinds of spaces may be increased. This operation is referred to as “host rejection”. This operation introduces distortion in the document (10), referred to as D.sub.R, which is regulated by a proper constraint parameter received as input, referred to as D.sub.R_MAX. In particular, a possible solution minimizing the error probability while keeping the distortion below D.sub.R_MAX consists of taking a modified intra-word space S*.sub.si, for the i-th line, equal to:
where μ(i) is the average value of the length of the intra-word spaces S.sub.Si in the i-th line, and ε is the host-rejection parameter, to be adjusted in order to obtain D.sub.R=D.sub.R_MAX.
D.sub.R is an indicative value of the change in the lengths of the intra-word spaces for all the lines of the document (10), i.e., D.sub.R=Σ.sub.i|S*.sub.Si−S.sub.si|
[0048] Hence, D.sub.R=Σ.sub.i(S.sub.Si−(μ(i)+ϑ)) for S.sub.Si>μ(i)+ε
[0049] As S.sub.Si and μ(i) are known, the value of the host-rejection ε is extracted and used to determine D.sub.R=D.sub.R_MAX.
[0050] The effect of the above host rejection application is to reduce the largest intra-word spaces in each line, resulting hence in larger inter-word spaces. Generally, the goal is to keep the total length of each line unchanged. As such, the following condition must be satisfied:
Σ.sub.kS*.sub.Si(k)+Σ.sub.kS*.sub.Bi(k)=Σ.sub.kS.sub.Si(k)+Σ.sub.kS.sub.Bi(k)
being S.sub.Si(k) the length of the k-th intra-word space of the i-th line, S*.sub.Si(k) the modified length of the same space, and, correspondingly, S.sub.Bi(k) and S*.sub.Bi(k) the original and modified length of the k-th inter-word space of the i-th line.
[0051] Being N.sub.b(i) the number of inter-word spaces in the i-th line, the above equation may be solved using modified spaces equal to:
S*.sub.Bi(k)=S.sub.Bi(k)+(Σ.sub.kS.sub.Si(k)−Σ.sub.kS*.sub.Si(k))/N.sub.b(i)
resulting in a homogeneous redistribution of the extra space among the inter-word spaces, which in turn results from the reduction of the intra-word spaces.
[0052] The message (30) to be encoded in the watermark is a sequence of bits of length K. The message (30) is encoded by an encoding module (113). As a first step, the encoding module (113) adds a channel coding to the message (30) in order to make it robust to decoding errors. This is achieved by standard forward error correction codes, resulting in a modified message (m′) of length L>K, shown in
[0053] Once the encoded message has been generated, the watermarked document (20) is generated (115) through a two-step process: [0054] 1. A watermarking encoding module encodes each symbol of the encoded message into a sequence of space lengths. As a result, the document inter-word spaces S.sub.B* are modified to the watermarked inter-word spaces S.sub.BW. [0055] 2. The final watermarked document (20) is generated modifying the spaces in the previous document, according to the intra-word space S.sub.S* and inter-word space S.sub.BW vectors.
[0056] The watermark embedding is performed in a per-packet fashion according to a secret key, resulting in a sequence of codewords {w.sub.i, . . . w.sub.Q}, being w.sub.i=[w.sub.i(1), . . . w.sub.i(P)], i=1, 2, . . . Q, and w.sub.i(k)=±1. Being the number of codewords Q equal to the number of different symbols in the encoded and packetized message, each symbol may univocally be mapped to a codeword. Then, for each symbol in the packet p(i), the following operations are performed: [0057] 1. Select a subset of P elements in S.sub.B* of length N.Math.L, mapped to S.sup.t.sub.B through a secret key. The subset may include elements belonging to different lines of text. [0058] 2. Embed the selected codeword w.sub.t in the selected subset using the following formula:
S.sup.t.sub.BW(k)=S.sup.t.sub.B(k)+w.sub.t(k).Math.c.sub.t(k).Math.S.sup.t.sub.B(k)=S.sup.t.sub.B(k)(1+w.sub.t(k).Math.c.sub.t(k))
where c.sub.t(k) is a weighting factor. Depending on the sign of the selected codeword w.sub.t(k), for the k-th inter-word space in the t-th line, the inter-word space S.sup.t.sub.BW(k) of the watermarked document (20) may be longer or shorter than the original space S.sup.t.sub.B(k).
[0059] The role of the weighting factor c.sub.t(k) is to ensure that the first and last letters of each line remain in the same position (really important, for instance, in justified texts and to avoid layout modifications in the resulting document). The weighting factor c.sub.t(k) is necessary when the number of large and short inter-word spaces in a given line is not equal. In general, this is true when
Σ.sub.kS.sup.t.sub.BW(k)=Σ.sub.kS.sup.t.sub.B(k)
or, equivalently,
Σ.sub.kw.sub.i(k).Math.c.sub.i(k).Math.S.sub.Bi(k)=0
being c.sub.i(k) and w.sub.i(k) the weighting factor and the codeword component respectively for the k-th inter-word space in the i-th line, according to the mapping defined above.
[0060] As such, the above condition can be rewritten as:
Σ.sub.k∃wi(k)=1c.sub.i(k).Math.S.sub.Bi(k)=Σ.sub.k∃wi(k)=−1c.sub.i(k).Math.S.sub.Bi(k)
being the first term the sum of the spaces for which the coding elements is +1, weighted with the corresponding weighting factor, while the second term the sum of the spaces for which the coding elements is −1, weighted with the corresponding weighting factor. This is equivalent to:
c.sub.1/c.sub.−1=Σ.sub.k∃wi(k)=−1S.sub.Bi(k)/Σ.sub.k∃wi(k)=1S.sub.Bi(k)
[0061] Furthermore, if the inter-word spaces in each line are uniform, which is the common case, then S.sub.BI(k)=S.sub.B and the condition becomes: c.sub.1/c.sub.−1=N.sub.−1/N.sub.1, being N.sub.−1 the number of ‘−1’ coding elements and N.sub.1 the number of ‘1’ coding elements, in w.sub.i(k) for the i-th line.
[0062] The generation of the watermarked document (20) concludes hence by modifying the spaces of the original document according to the S.sub.S* and S.sub.BW sets.
[0063] The reverse method (12) for extracting the embedded message from the watermarked document (20) is shown in
ĉ=arg max.sub.C∈{1, . . . ,Q}f(S.sup.t.sub.BW|W.sub.c) [0065] being f(S.sup.t.sub.BW|w.sub.c) the probability density function of S.sup.t.sub.BW conditions on w.sub.c. In practice, a statistical estimation is hardly applicable since it implies the knowledge of the statistical distribution of S.sup.t.sub.BW and its parameters, which largely varies on the type of document and font, and requires a large P to have a statistically meaningful sample. As such, the mapping (640) of spaces into symbols is performed by relying only on the means of the observed sample. In fact, after embedding the codeword w.sub.c in S.sup.t.sub.B, the mean of the sample S.sup.t.sub.BW has the form μ.sub.k(1+λw.sub.c(k)), being μ.sub.k the mean of S.sup.t.sub.B(k). Hence, the watermark spaces are distributed around Q centroids, so the embedded symbol can be estimated by vector quantization techniques, relying on various existing methods for such operation, including scalar manner. Note that, if a clustering method has been used in the location analysis (630), such classification may naturally be provided as a secondary output of such process. [0066] As a next step, the mapping between S′.sub.BWi(k) and w.sub.i(k) is performed assigning to ŵ.sub.i(k) the values of +1 or −1 depending on the corresponding space classification as large or short, respectively. The estimated ŵ.sub.i(k) are arranged accordingly to the subset defined in the embedding phase and the resulting sequences are compared with the codewords in the alphabet. If w.sub.ĉ is the most similar codeword, the symbol ĉ∈{1, . . . , Q} is selected.
[0067] The estimated symbols resulting from the mapping (640) are finally given as input to a de-packetization (650) step, which looks for the synchronization symbols and extracts the payload symbols. Finally, channel decoding (650) is applied into the payload symbols to extract the original message (30) which were embedded in the received watermarked document (60).
[0068] Note that in this text, the term “comprises” and its derivations (such as “comprising”, etc.) should not be understood in an excluding sense, that is, these terms should not be interpreted as excluding the possibility that what is described and defined may include further elements, steps, etc.