Cross-modal image-watermark joint generation and detection device and method thereof

Abstract

The present disclosure discloses a cross-modal image-watermark joint generation and detection device and method thereof. The device includes a multimodal encoder, an image-watermark feature co-embedding module, an image-watermark feature fusion module, an up-sampling generator, a non-cooperative game decoupling module configured to decouple an unwatermarked image and a reconstructed watermark from a composite image through two decoders by developing allocation strategies according to a non-cooperative game theory and a Shannon information theory; a strategy allocation module configured to set an composite image discriminator, keeping the consistency between the composite image and input text by multi-specification down-sampling convolution kernels and set the objective functions to constrain watermark reconstruction and unwatermarked image decoding; and a post-processing attack module configured to simulate various attacks for ensuring the robustness of watermarks.

Claims

1. A cross-modal image-watermark joint generation and detection device, comprising: an image-watermark feature co-embedding module, configured to map an original image feature and a watermark feature to a unified feature space by a learnable parameter matrix; an image-watermark feature fusion module, configured to fuse the watermark feature and the original image feature at a channel level to acquire an image-watermark fusion feature and cascade the original image feature for a plurality of times; an up-sampling generator, configured to map the image-watermark fusion feature into pixels to acquire a composite image with a preset resolution; a non-cooperative game decoupling module, configured to allocate information of the composite image through two decoders by developing allocation strategies according to a non-cooperative game theory and a Shannon information theory to decouple an unwatermarked image and a reconstructed watermark; a strategy allocation module, configured to set an image joint discriminator, extract features of the composite image by multi-specification down-sampling convolution kernels to constrain image-text semantic consistency and fidelity, and set an objective function to constrain reconstruction of watermark and unwatermarked image; and a post-processing attack module, configured to simulate post-processing attacks and output a final image-watermark joint generated image; wherein, the original image feature is obtained by a multimodal encoder, the multimodal encoder configured to extract features from an input text, noise sampling and a digital watermark by pre-trained natural language encoding models, multilayer perceptrons, and visual encoding models, and acquire feature representations thereof to obtain the original image feature through affine transformation using text features and noise features.

2. The cross-modal image-watermark joint generation and detection device according to claim 1, wherein the device further comprises an image and watermark joint generation evaluation module, configured to evaluate image quality, watermark invisibility, watermark reconstruction quality and watermark robustness.

3. The cross-modal image-watermark joint generation and detection device according to claim 1, wherein the image-watermark feature co-embedding module is as follows:
f.sub.t,w=.sub.c(T.sub.tM.sub.t,T.sub.wM.sub.w) where, f.sub.t,w is an image and watermark splicing feature, .sub.c(.Math.) represents a channel-level splicing operation, M.sub.t represents the original image feature and M.sub.w represents the watermark feature, T.sub.t and T.sub.w are learnable corresponding dimension parameter matrices.

4. The cross-modal image-watermark joint generation and detection device according to claim 3, wherein the image-watermark feature fusion module is as follows: $M_{i}^{c} = {\begin{matrix} E_{U n e t} (f_{t, w}), & i = 1 \\ Y_{i} (M_{i - 1}^{c}, M_{t};_{i}), & i = 2, 3,, .Math., N \end{matrix}$ where, Y.sub.i(.Math.) is an ith-layer feature fusion module, which consists of a full connected network (FCN) with a parameter of .sub.i and a nearest neighbor interpolation algorithm; M.sub.i.sup.c is a composite visual feature map output from an ith layer, and E.sub.Unet (.Math.) is a Unet-based encoder, configured to couple a visual feature to a watermark feature.

5. The cross-modal image-watermark joint generation and detection device according to claim 3, wherein the non-cooperative game decoupling module comprises an image decoding unit and a watermark reconstruction unit, the image decoding unit being expressed as: $\underset{s_{x_{r}}}{\arg \max} M_{g a i n}^{P_{x_{r}}} (s_{x_{r}}) = \underset{_{x}}{\arg \min} L_{x_{c} .fwdarw. x_{r}} (_{x}) = \underset{_{x}}{\arg \min} MI (w_{h}, x_{r};_{x}) - M I (x_{c}, x_{r};_{x}) s . t ., M I (x_{c}, x_{r}) I (x_{c}) - I (w_{r})$ where, MI(.Math.) represents a mutual information function calculated by Kullback-Leibler divergence and used for optimizing a parameter .sub.x, s.sub.x.sub.r represents an unwatermarked image decoupling strategy, M.sub.gain.sup.P.sup.xr(.Math.) represents a gain from an execution of the unwatermarked image decoupling strategy, I(x.sub.c) represents an information quantity of x.sub.c, I(w.sub.r) represents an information quantity of w.sub.r, w.sub.h represents a hidden watermark; and a target L.sub.x.sub.c.sub..fwdarw.x.sub.r(.sub.x) aims to restore an unwatermarked image x.sub.r; x.sub.c represents the composite image with a resolution of 256256; the watermark reconstruction unit being expressed as: $\underset{s_{w_{r}}}{\arg \max} M_{g a i n}^{P_{w_{r}}} (s_{x_{r}}) = \underset{_{w}}{\arg \min} L_{x_{c} .fwdarw. w_{r}} (_{w}) = \underset{_{w}}{\arg \min} {[I (x_{c};_{w}) - M I (x_{c}, x_{r};_{w})] - I (w_{r};_{w})} s . t ., I (w_{r}) < I (x_{c})$ where, s.sub.w.sub.r represents a watermark reconstruction strategy, and M.sub.gain.sup.P.sup.wr(.Math.) represents a gain from an execution of the watermark reconstruction strategy; .sub.w represents network parameter; a target L.sub.x.sub.c.sub..fwdarw.x.sub.r(.sub.x) compels w.sub.r to keep away from the space x.sub.r and approach to the space x.sub.c.

6. A cross-modal image-watermark joint generation and detection method, comprising: mapping an original image feature and a watermark feature to a unified feature space by a learnable parameter matrix; fusing the watermark feature and the original image feature at a channel level to acquire an image-watermark fusion feature and cascading the original image feature for a plurality of times; mapping the image-watermark fusion feature into pixels to acquire a composite image with a preset resolution; allocating information of the composite image through two decoders by developing allocation strategies according to a non-cooperative game theory and a Shannon information theory to decouple an unwatermarked image and a reconstructed watermark; setting an image joint discriminator, extracting features of the composite image by multi-specification down-sampling convolution kernels to constrain image-text semantic consistency and fidelity, and setting an objective function to constrain reconstruction of watermark and unwatermarked image; and simulating post-processing attacks and outputting a final image-watermark joint generated image; wherein, the original image feature is obtained by a multimodal encoder, the multimodal encoder configured to extract features from an input text, noise sampling and a digital watermark by pre-trained natural language encoding models, multilayer perceptrons, and visual encoding models, and acquire feature representations thereof to obtain the original image feature through affine transformation using text features and noise features.

7. A non-transitory computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program comprises program instructions; and when the program instructions are executed by the processor, the method according to claim 6 is implemented by the processor.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a frame diagram of a cross-modal image-watermark joint generation and detection device;

(2) FIG. 2 is a schematic diagram of a network framework of a cross-modal image-watermark joint generation and detection device;

(3) FIG. 3 is a flow chart of a cross-modal image-watermark joint generation and detection method;

(4) FIG. 4 is a case exhibition diagram of the generated images;

(5) FIG. 5 is a case exhibition diagram of the watermarks under multiple post-process attacks; and

(6) FIG. 6 is a contrast effect diagram of performance quantification of watermark reconstruction under multiple post-process attacks.

(7) Wherein, in FIG. 2, 11 represents Image-watermark feature co-embedding module;

DETAILED DESCRIPTION OF THE PRESENT DISCLOSURE

(8) To make the objectives, technical solutions and advantages of the present disclosure clearer, the implementations of the present disclosure will be further described below in detail.

Embodiment 1

(9) A cross-modal image-watermark joint generation and detection device, referring to FIG. 1 and FIG. 2, the device includes:

(10) I. A multimodal encoder, configured to extract features from an input text, noise sampling and a digital watermark by pre-trained natural language encoding models, multilayer perceptrons, and visual encoding models, and acquire feature representations thereof to obtain an original image feature through affine transformation using text features and noise features.

(11) Specifically, the multimodal encoder further includes:

(12) 1) a BiLSTM (bidirectional long short-term memory) context-aware encoding unit, configured to sequentially encode word embedding features by pre-trained long short-term memory networks, such that the text features have context-aware information, and the sentence-level embedding representation is obtained.

(13) The text features are sequentially fed into BiLSTM for bidirectional coding, and a hidden state s custom character .sup.L.sup.1.sup.1 of a whole sentence is acquired as a context-aware feature, wherein L.sub.1 represents a sentence vector length.

(14) 2) A multilayer perceptron noise coding unit, configured to map noise obtained by random sampling in a standard Gaussian distribution into a feature vector by multilayer perceptron networks to increase the variety of generated images.

(15) The noise obtained by random sampling in the standard Gaussian distribution N(0,1) is fed into an MLP network and mapped into the feature vector z custom character .sup.L.sup.2.sup.1, wherein L.sub.2 represents a noise vector length.

(16) 3) A watermark generation unit, configured to map creation-related metadata into a single-channel binary watermark embedded into an image in a hidden manner, by which the purpose of traceability is achieved.

(17) Specifically, text, creative time, creator ID and other factors are entered into single-channel binary watermark pixels in a character form, wherein the creator ID is set to be 8 digits, with each digit being sampled from uniform distribution U(0,9).

(18) 4) A multilayer convolutional network watermark feature extraction unit, configured to extract binary watermark features by convolutional neural networks to obtain spatial-level feature representation.

(19) Wherein, the single-channel binary watermark is fed into multilayer convolutional neural networks (CNN) to obtain the watermark features M.sub.w custom character .sup.HW1. 4 layers of the multilayer convolutional neural networks are set, and numbers of output channels are set to 3, 6, 12 and 1 respectively; LeakyReLu activation functions are set among the layers, dimensions of a receptive field are set to 33, and a convolution step size is set to 1.

(20) 5) An image feature initialization unit, configured to generate image initial features by calculation through affine transformation based on noise sampling and text input.

(21) In order to enrich the visual effect of the generated image, affine transformation is introduced to fuse noise and text features, and a specific implementation process thereof may be expressed as:
Affine(z,s)=.sub.scale(s).Math.(z)+.sub.shift(s)(1)

(22) Where, Affine(.Math.) represents an affine transformation function, .sub.scale(.Math.), .sub.shift(.Math.) and (.Math.) represent translation, scaling and noise mapping functions. An output matrix of the affine transformation is expressed as M.sub.t custom character .sup.HWC, which is regarded as an original image feature, wherein H, W, C represent a height and a width, and channels of a feature matrix respectively, and represents a feature space.

(23) II. An image-watermark feature co-embedding module, configured to map an original image feature and a watermark feature to a unified feature space by a learnable parameter matrix to achieve the compatibility of the original image feature and the watermark feature.

(24) Specifically, in order to improve the representation ability of the image and optimize the invisibility of the watermark, it is necessary to find a feature co-embedding space for integrating the original image feature M.sub.t and the watermark feature M.sub.w. Learnable corresponding dimension parameter matrices T.sub.t and T.sub.w are initialized randomly, and the watermark feature and the image feature are enabled to be compatible in a training process, and a specific implementation process thereof is expressed as:
f.sub.t,w=.sub.c(T.sub.tM.sub.t,T.sub.wM.sub.w)(2)

(25) Where, f.sub.t,w represents an image and watermark splicing feature, and .sub.c(.Math.) represents a channel-level splicing operation, which aims to express the watermark and the image in the same feature space.

(26) III. an image-watermark feature fusion module, configured to fuse the watermark feature and the original image feature at a channel level to achieve the effects of hiding watermark signals and highlighting a high-quality image visual effect by cascading the original image feature for many times.

(27) Specifically, the splicing feature f.sub.t,w is compressed by a Unet network to obtain low-level key information, and correlation mining is conducted on different scale features by skip connection to learn multi-scale watermark and image information. Furthermore, in order to reduce watermark interference to the image feature, the original image feature M.sub.t is fused for many times, and a watermark signal ratio is reduced. The specific implementation process may be expressed as:

(28) $\begin{matrix} M_{i}^{c} = {\begin{matrix} E_{U n e t} (f_{t, w}), & i = 1 \\ Y_{i} (M_{i - 1}^{c}, M_{t};_{i}), & i = 2, 3,, .Math., N \end{matrix} & (3) \end{matrix}$

(29) Where, Y.sub.i(.Math.) is an ith-layer feature fusion module, which mainly consists of a full connected network (FCN) with a parameter of .sub.i and a nearest neighbor interpolation algorithm. In the embodiment of the present disclosure, the number of layers is set to 3. M.sub.i.sup.c is a composite visual feature map output from an ith layer. E.sub.Unet(.Math.) is a Unet-based encoder configured to couple the visual feature to the watermark feature. In the embodiment of the present disclosure, even in the case that the watermark interference to the image quality is minor, the watermark information can also be protected from being lost.

(30) IV. An up-sampling generator, configured to map watermark image fusion features into pixels, semantic information and watermark signals are included therein, and finally a composite image with a resolution of 256256 is obtained.

(31) In order to generate the composite image with a hidden watermark by using the fusion feature M.sub.i.sup.c, it is processed by the up-sampling generator, and the specific implementation process may be expressed as:
x.sub.c=F.sub.w(M.sub.i.sup.c;.sub.c)(4)

(32) Where, F.sub.w(.Math.) represents an up-sampling generation function of a parameter .sub.c. x.sub.c is the composite image with a resolution of 256256, which shows an excellent visual effect and fully hides the watermark information.

(33) V. A non-cooperative game decoupling module, configured to allocate information of the composite image through two decoders by developing allocation strategies according to a non-cooperative game theory and a Shannon information theory to decouple an unwatermarked image and a reconstructed watermark.

(34) Specifically, a non-cooperative game features that there is a lack of communication and negotiation among game participants who need to develop their own dominant strategies. Concretely speaking, the participants place a strong emphasis on independent decision making to maximize their own interests, the decision making is independent of strategies adopted by other participants in strategic environment, and the ultimate objective is to achieve balance among game players.

(35) For the game G=(s.sub.1, s.sub.2, . . . , s.sub.n; p, p.sub.2, . . . , p.sub.n), supposing that (s.sub.1, s.sub.2, . . . , s.sub.n) is an arbitrary strategy combination, when meeting the strategy (s.sub.1, . . . , s.sub.i1, s.sub.i+1, . . . , s.sub.n) of other participants, the strategy s.sub.i* is the optimum selection of the participant p.sub.i. The non-cooperative game is formulated as:

(36) $\begin{matrix} ^{*} = (s_{1}^{*}, s_{2}^{*}, .Math., s_{n}^{*}) = \underset{(s_{1}, s_{2}, .Math., s_{n})}{\arg \max} M_{g a i n}^{P} (s_{1}, s_{2}, .Math., s_{n}) & (5) \end{matrix}$ $\begin{matrix} M_{g a i n}^{p_{i}} (s_{1}^{}, .Math., s_{i}^{*}, .Math., s_{n}^{}) M_{g a i n}^{p_{i}} (s_{1}^{}, .Math., s_{i}^{}, .Math., s_{n}^{}), i & (6) \end{matrix}$

(37) Where, * represents a set of Nash equilibrium strategies, that is, no participants may increase gains by changing their own strategies alone. M.sub.gain.sup.P.sup.i(.Math.) represents the gain of the gain participant from the execution of the set of strategies, P represents a set of participants, M.sub.gain.sup.P(.Math.) represents a set of gains of all participants from the execution of the set of strategies, s.sub.n represents the strategy of the nth participant, s.sub.n represents any strategy of the nth participant, and p.sub.n represents the n-th participant. In the image and watermark joint generation, decoupling of the image and the watermark may be regarded as a non-cooperative game process. While the image feature and the watermark feature have contributions to the overall visual effect of the composite image to varying degrees, supposing that there are contribution factors c.sub.x.sub.r, c.sub.w.sub.r, and c.sub.x.sub.c, which reflect contribution degrees of x.sub.r, w.sub.r and x.sub.c respectively, and they are approximate to an allocation strategy with linear positive correlation:
c.sub.x.sub.c(1(x.sub.c)).Math.c.sub.x.sub.r+(x.sub.c).Math.c.sub.w.sub.r(7)

(38) Where, (.Math.) reflects allocation strategies of two contributors, the value range is [0,1], and represents a positive correlation. Theoretically, the watermark and the image participate in a non-cooperative game, and strive to achieve a Nash equilibrium state. Supposing that c.sub.x.sub.c reaches an optimal point c.sub.x.sub.c*, it has an optimal visual representation effect.

(39) $\begin{matrix} ^{*} (c_{x_{c}}) \frac{c_{x_{c}}^{*} - c_{x_{r}}}{c_{w_{r}} - c_{x_{r}}} & (8) \end{matrix}$

(40) Where, *(.Math.) represents an optimal allocation strategy. Formula (8) is simplified:

(41) $\begin{matrix} \frac{1}{^{*} (c_{x_{c}})} C^{*} - \frac{c_{x_{c}}^{*} - c_{w_{r}}}{c_{x_{c}}^{*} - c_{x_{r}}} & (9) \end{matrix}$

(42) Where, C* is a constant. The allocation strategies are dependent on c.sub.w.sub.r and c.sub.x.sub.r, that is, the strategies are developed by the watermark and the image respectively. Therefore, the Nash equilibrium state (x.sub.r*,w.sub.r*) may be calculated according to Formula (1):

(43) $\begin{matrix} ^{*} (x_{c}) = \underset{(s_{x_{r}}, s_{w_{r}})}{\arg \max} M_{g a i n}^{P} (s_{x_{r}}, s_{w_{r}}) = \lim_{_{x} .fwdarw._{x}^{*},_{w} .fwdarw._{w}^{*}} (x_{r}, w_{r};_{x},_{w}) & (10) \end{matrix}$

(44) Where, s.sub.x.sub.r and s.sub.w.sub.r are the strategies for decoding the image and reconstructing the watermark, .sub.x and .sub.w are corresponding network parameters, and .sub.x* and .sub.w* are optimal parameters. (x.sub.r, w.sub.r) are closer to (x.sub.r*, w.sub.r*), and then the Nash equilibrium is more approximately achieved. Ideally, s.sub.x.sub.r and s.sub.w.sub.r achieve the trade-off under the condition of no mutual interference, and decoupling is completed.

(45) Specifically, the non-cooperative game decoupling module further includes:

(46) 1) an image decoding unit, configured to decouple the unwatermarked image from the composite image.

(47) The embodiment of the present disclosure designs a cooperative decoupling method for a watermark and an image. Ideally, an unwatermarked image is expected to retain visual information equivalent to that of the composite image, to reduce the disparity therebetween. Furthermore, watermark signals should not be stored in the unwatermarked image. According to the Shannon information theory, this process aims to reduce the difference between the unwatermarked image and the composite image, and meanwhile expand a gap between the image with hidden watermark and the unwatermarked image. The specific implementation process of this strategy may be expressed as:

(48) $\begin{matrix} \underset{s_{x_{r}}}{\arg \max} M_{g a i n}^{P_{x_{r}}} (s_{x_{r}}) = \underset{_{x}}{\arg \min} L_{x_{c} .fwdarw. x_{r}} (_{x}) = \underset{_{x}}{\arg \min} [MI (w_{h}, x_{r};_{x}) - M I (x_{c}, x_{r};_{x})] s . t ., M I (x_{c}, x_{r}) I (x_{c}) - I (w_{r}) & (11) \end{matrix}$

(49) Where, MI(.Math.) represents a mutual information function calculated by Kullback-Leibler divergence and used for optimizing a parameter .sub.x, s.sub.x.sub.r represents an unwatermarked image decoupling strategy, M.sub.gain.sup.P.sup.xr(.Math.) represents a gain from the execution of the unwatermarked image decoupling strategy, I(x.sub.c) represents an information quantity of x.sub.c, I(w.sub.r) represents an information quantity of w.sub.r, w.sub.h represents a hidden watermark. A target L.sub.x.sub.c.sub..fwdarw.x.sub.r(.sub.x) aims to restore the unwatermarked image x.sub.r.

(50) In order to achieve the Formula (11), the composite image is firstly processed by the decoder, and the specific implementation process may be expressed as:
x.sub.r=R.sub.r(x.sub.c;x)(12)

(51) Where, R.sub.r(.Math.) is a Unet-based image decoder, configured to establish a pixel-level dependency between the composite image and an analytic image.

(52) 2) A watermark reconstruction unit, configured to reconstruct an approximately lossless high-quality watermark from the composite image.

(53) Specifically, the unwatermarked image x.sub.r is almost dependent of information of the reconstructed watermark w.sub.r ideally, while the composite image x.sub.c and the reconstructed watermark w.sub.r share hidden information. Thus, in the feature space, the self-information of the composite image is set to I(x.sub.c), and the mutual information thereof to the unwatermarked image is set to MI(x.sub.c,x.sub.r), while I(w.sub.r) aims to search the feature space of a supplementary set C.sub.I(x.sub.c.sub.)MI(x.sub.c, x.sub.r) thereof. The specific implementation process of this strategy may be expressed as:

(54) $\begin{matrix} \underset{s_{w_{r}}}{\arg \max} M_{g a i n}^{P_{w_{r}}} (s_{x_{r}}) = \underset{_{w}}{\arg \min} L_{x_{c} .fwdarw. w_{r}} (_{w}) = \underset{_{w}}{\arg \min} {[I (x_{c};_{w}) - M I (x_{c}, x_{r};_{w})] - I (w_{r};_{w})} s . t ., I (w_{r}) < I (x_{c}) & (13) \end{matrix}$

(55) Where, s.sub.w.sub.r represents a watermark reconstruction strategy, and M.sub.gain.sup.P.sup.wr(.Math.) represents a gain from the execution of the watermark reconstruction strategy. The target L.sub.x.sub.c.sub..fwdarw.w.sub.r(.sub.w) compels w.sub.r to keep away from the space x.sub.r and approach to the space x.sub.c. Therefore, information hidden in w.sub.r may be restored effectively. Furthermore, similar to the Formula (12), an encoder is required to obtain w.sub.r, and the specific implementation process may be expressed as:
w.sub.r=R.sub.w(x.sub.c;.sub.w)(14)

(56) Where, R.sub.w(.Math.) is a Unet-based watermark decoder, and thus, under the independent strategy and the decoder, the image and the watermark may be cooperatively decoupled while approaching the Nash equilibrium state.

(57) VI. A strategy allocation module, configured to set objective functions of the composite image, the unwatermarked image and reconstructed watermark, namely set an image discriminator, extract features of the composite image by multi-specification down-sampling convolution kernels to constrain image-text semantic consistency and fidelity, and meanwhile set an objective function to constrain the reconstruction of watermark and unwatermarked image.

(58) Specifically, the strategy allocation module further includes:

(59) 1) a composite image discrimination strategy, configured to set the discriminator to constrain the composite image.

(60) In the embodiment of the present disclosure, it is not only required to generate the composite image x.sub.c, but also the unwatermarked image x.sub.r is decoupled by the specific Unet decoder. Therefore, the discriminator is required to ensure the authenticity of the image. For the composite image, the initialized image feature M.sub.t is used for guiding the semantic expression of x.sub.c. The objective function of the composite image discriminator is defined as:

(61) $\begin{matrix} L_{x_{c}}^{D} = E_{x p_{r}} [\log (D_{x_{c}} (x, s))] + \frac{1}{2} E_{x_{c} p_{x_{c}}} [\log (1 - D_{x_{c}} (x_{c}, s))] + \frac{1}{2} E_{x p_{r}} [\log (1 - D_{x_{c}} (x, \overset{}{s}))] & (15) \end{matrix}$

(62) Where, corresponds to mismatched text description, x represents a real image, and p.sub.r and p.sub.x.sub.c represent real data distribution and composite image distribution. In order to ensure the visual effect of the generated image, the objective function of the composite image discriminator is defined as:
L.sub.x.sub.c.sup.G=E.sub.x.sub.c.sub.p.sub.g[log(D.sub.x.sub.c(x.sub.c,s))]+.sub.1.Math.(x.sub.c,M.sub.t)(16)

(63) Where, (.Math.) measures similarity between x.sub.c and the initial image feature M.sub.t by using MSE-L2 loss. .sub.1 is a proportional coefficient.

(64) 2) An unwatermarked image decoding strategy, configured to split the unwatermarked image from the composite image.

(65) As a supplement of target L.sub.x.sub.c.sub..fwdarw.w.sub.r, the present disclosure adopts smooth L1 loss .Math..sub.1 for further constraining the generation of analysis image:
L.sub.x.sub.c=L.sub.x.sub.c.sub..fwdarw.w.sub.r+.sub.2.Math.w.sub.h,w.sub.r.sub.1(17)

(66) Where, .sub.2 is a proportional coefficient. The objective function is used for eliminating the watermark while maintaining a similar visual appearance with the composite image by the smooth L/loss.

(67) 3) A reconstructed watermark strategy, configured to reconstruct watermark signal from the composite image.

(68) In the embodiment of the present disclosure, a powerful constraint (.Math.) is introduced to ensure the completeness of w.sub.r, and the objective function for restoring the watermark is as follows:
L.sub.w.sub.r=L.sub.x.sub.c.sub..fwdarw.w.sub.r+.sub.3.Math.p(w.sub.h,w.sub.r)(18)

(69) Where, .sub.3 is a proportional coefficient, and the objective function for keeping the reconstructed watermark consistent with the hidden watermark.

(70) VII. A post-process attack module, configured to simulate post-processing attacks of random cropping, space rotation, Gaussian noise, impulse noise, Gaussian blur, and brightness adjustment, such that the watermark adapts to robustness against common attacks.

(71) In real scenarios where post-processing attacks such as Gaussian noise, space rotation and random cropping may occur, the watermark should have strong robustness to protect information from being lost, and therefore the ultimate purpose of traceability and protection is achieved. In the embodiment of the present disclosure, the post-processing attacks are simulated in a training process, such that the encoder-decoder is more adaptive to an attack pattern. Specifically, after the module is disposed on the up-sampling generator, the post-processing attacks with different intensities are added to the composite image x.sub.c, which aims to train robust decoder parameters. In the training process, the generator parameters are fixed to ensure that the generation process of the composite image is not affected. The image x.sub.c is fed into the watermark decoder after being attacked, and finally a high-quality reconstructed watermark is obtained. The solution shows significant robustness against the post-processing attacks, and information storage is maintained in a reasonable and identifiable range.

(72) VIII. An image and watermark joint generation evaluation system, a set of specific evaluation indicators is provided to evaluate the image quality, watermark invisibility, watermark reconstruction quality, watermark robustness and the like.

(73) The embodiment of the present disclosure provides a set of evaluation modules suitable for image and watermark joint generation. The modules are configured to evaluate the image quality (namely IS (inception score) and FID (frechet inception distance)), watermark invisibility (namely PSNR (peak signal-to-noise ratio), SSIM (structural similarity index measure) and LPIPS (learned perceptual image patch similarity), watermark reconstruction quality (namely NC and CA), watermark robustness (namely NC and CA) and the like. As a supplementation for existing watermark space evaluation indicators NC, the embodiment of the present disclosure designs an indicator for measuring the character accuracy (CA) of the reconstructed watermark. The indicator is calculated by optical character recognition (OCR) and edit distance. Through calculation of the indicators of NC and CA, after simulation of post-processing attacks (such as rotation, cropping, Gaussian noise and impulse noise), it is proved that character data in the reconstructed watermark in the solution provided by the embodiment of the present disclosure may still be maintained and restored.

Embodiment 2

(74) The embodiment of the present disclosure provides a cross-modal image-watermark joint generation and detection method, and as shown in FIG. 3, the method includes the following steps: Step 101: features of text, noise sampling and a digital watermark are extracted based on a multimodal encoder; specifically, input text is sequentially coded by a BiLSTM model, noise sampling is encoded by MLP, digital watermark features are extracted by multilayer CNNs, and finally encoding features of three modal data are output; and furthermore, the noise feature and the text feature are calculated by affine transformation to obtain an initialized image feature. Step 102: Feature compatibility of the image and the watermark is achieved by an image-watermark feature co-embedding matrix; specifically, two learnable parameter matrices are set, by which an image and watermark co-embedding space is found; and on the basis of achieving feature alignment, channel-level tandem connection is conducted to achieve the feature compatibility of the image and the watermark. Step 103: The watermark and image features are fused by the image-watermark feature fusion module at a channel level; specifically, Unet serves as a fusion network of the image and watermark features. In order to ensure the visual effect of the final generated image, the image feature is integrated for many times to reduce watermark interference to fusion features, and the compatibility of watermark signals and image semantics can be ensured; and the visual effect of the final composite image is ensured. Step 104: A high-resolution composite image with an invisible watermark is synthesized by the up-sampling generator; specifically, the watermark and image fusion features are fed into the generator consisting of the up-sampling module to achieve generation of the high-quality composite image with a resolution of 256256, in which watermark signals are hidden. Step 105: An unwatermarked image and a reconstructed watermark are decoupled based on a non-cooperative game theory; specifically, the watermark reconstruction and image decoding are regarded as a non-cooperative game process, aiming to independently develop two sets of allocation strategies for decoupling from the composite image. In the embodiment of the present disclosure, the allocation of the watermark feature and the image feature is required to achieve a Nash equilibrium state in order to ensure the decoupling quality thereof. Step 106: The composite image, the unwatermarked image and the reconstructed watermark are constrained by a strategy allocation module; specifically, the authenticity and consistency of the composite image are identified by the image discriminator, and information distribution strategy are allocated to the reconstructed watermark and unwatermarked image, so as to ensure the output quality of the composite image, the unwatermarked image and the reconstructed watermark. Step 107: The composite image is attacked by the post-processing attack module to evaluate the robustness of the watermark; specifically, common post-processing attacks with multiple intensities are added to the composite image, and watermark decoupling is conducted under the condition that pixel spatial information and semantic information are destroyed, to obtain robust decoder parameters, such that the watermark has strong robustness. Step 108: The embedding and analysis effects of the image and the watermark are judged based on the image and watermark joint generation evaluation system.

(75) Specifically, a set of comprehensive evaluation systems is applied to quantify the image and watermark joint generation effects, that is, measurement is conducted in terms of image quality (namely IS and FID), watermark invisibility (namely PSNR, SSIM and LPIPS), watermark reconstruction quality (namely NC and CA), watermark robustness (namely NC and CA) and the like. The embodiment of the present disclosure achieves excellent performance indicators.

(76) To sum up, the digital watermark is embedded into a text-to-image process in the embodiment of the present disclosure, and influence on the visual effect of the generated image is reduced under the condition of the invisible watermark as far as possible; supervisory and traceability means are provided for visual generative artificial intelligence; and the security and reliability of the generated image are guaranteed. In the embodiment of the present disclosure, the information distribution strategy of the composite image can be developed by the non-cooperative game and Shannon information theory to achieve decoupling of the watermark and the image under quality trade-off. In the embodiment of the present disclosure, under the condition that the post-processing attacks are applied to the composite image, the watermark with a higher recognition degree Can still be reconstructed, which proves that the watermark technology provided by the embodiment of the present disclosure has robustness. The embodiment of the present disclosure is applicable for a method based on generative adversarial networks, and ensures that the generated image has the hidden watermark. The present disclosure has strong generalization. The embodiment of the present disclosure provides a set of evaluation systems combining text-to-image generation and the watermarking, which can evaluate the image quality, watermark invisibility, watermark reconstruction degree and watermark robustness. The above-mentioned technology provides a reliable technical support for the supervision and traceability of the generated image.

Embodiment 3

(77) The solutions in Embodiments 1 and 2 are validated for feasibility below in combination with specific calculating examples and experimental data, and the detailed description is given as follows:

(78) Table 1 to Table 3 are quantifiable results of an image and watermark joint generation device and method for visually generated content security. In the embodiment of the present disclosure, text-to-image models are selected from two paradigms of single-stage generation and multistage generation respectively, namely RAT-GAN (recurrent affine transformation-generative adversarial network) and AttnGAN (attentional generative adversarial network), to validate the generalization of the image and watermark joint generation provided.

(79) Table 1 listed three images for fidelity comparison: (1) an original image: synthesized by a baseline model; (2) a composite image: refers to an image with a hidden watermark acquired from the generator of the present disclosure; (3) an analytic image: is an unwatermarked image version decoupled from the composite image. In the embodiment of the present disclosure, IS and FID indicators are used for evaluating the fidelity of the image. Ideally, although minor pixel-level interference exists, the composite image, the unwatermarked image and the original image should show almost the same visual appearance, and also have slight performance fluctuation in quantitative indicators.

(80) TABLE-US-00001 TABLE 1 Oxford-102 MS- CUB-Birds Flowers COCO Model Image type IS FID IS FID FID RAT- Original 5.36 0.20 13.91 4.09 0.06 16.04 14.60 GAN image Composite 4.94 0.06 16.95 3.72 0.07 18.35 15.62 image Un- 4.98 0.06 17.32 3.81 0.07 19.16 15.28 watermarked image AttnGAN Original 4.36 0.03 23.98 35.49 image Composite 4.02 0.05 26.49 37.51 image Un- 4.09 0.05 26.01 38.29 watermarked image

(81) The invisibility of the hidden watermarks should be embodied such that when the watermarks reach the Nash equilibrium state, they are imperceptible to a human visual system, and the composite image should not leak information obviously. The similarity between the composite image and the analytic image is measured by PSNR. As shown in Table 2, on a CUB-Birds dataset, PSNR values of RAT-GAN and AttnGAN are 33.29 dB and 33.86 dB respectively, while on Oxford-102 Flowers and MS-COCO datasets, an equivalent PSNR is obtained. When the PSNR is 30 dB, an embedded signal may be regarded as having high invisibility, and the embodiment of the present disclosure exceeds the threshold obviously. Therefore, the PSNR verifies the height invisibility of various watermarks in the embodiment of the present disclosure. Human perceptual preferences are simulated from three perspectives, and further evaluation is conducted by SSIM; the SSIM of two models maintains a matching rate of more than 99%, which proves the compatibility of the hidden watermark and a restored image and the invisibility of the watermark. Finally, in order to focus on the intrinsic structure of the image feature, the LPIPS model is used for learning the perceived distance of the composite image and the unwatermarked image. The model is evaluated by depth features. The RAT-GAN and the AttnGAN reach 0.0219 and 0.0235 on MS-COCO, which are less than 0.0320 reached under the condition that the real image is processed by an existing watermark embedding method. This indicates that a secret watermark hidden in the composite image is almost imperceptible. Therefore, almost traceless watermark information hiding is achieved in the embodiment of the present disclosure.

(82) TABLE-US-00002 TABLE 2 CUB-Birds Oxford-102 Flowers MS-COCO Model PSNR(dB) SSIM(%) LPIPS PSNR(dB) SSIM(%) LPIPS PSNR(dB) SSIM(%) LPIPS RAT-GAN 33.29 98.46 0.0257 33.51 98.60 0.0231 33.97 98.77 0.0219 AttnGAN 33.86 98.15 0.0223 33.54 98.26 0.0235

(83) Table 3 listed the degree of watermark reconstruction evaluated from the perspective of space and characters in the absence of attacks. The spatial similarity is measured by NC in a pixel-by-pixel manner, and this indicates that there are fewer distorted pixels. The similarity between the reconstructed watermark and the hidden watermark exceeds 99%. Therefore, the embodiment of the present disclosure achieves extremely high level of reconstruction from the spatial perspective. The embodiment of the present disclosure then provides a CA (Character Accuracy) indicator, by which the character accuracy is measured semantically in combination with OCR and an edit distance. It can be observed that an average CA is less than 0.17, which shows that almost all characters are recognizable in the absence of attacks.

(84) TABLE-US-00003 TABLE 3 CUB-Birds Oxford-102 Flowers MS-COCO Model NC(%) CA NC(%) CA NC(%) CA RAT-GAN 99.75 0.21 99.69 0.19 99.81 0.19 AttnGAN 99.72 0.23 99.48 0.21

(85) FIG. 4 is a case exhibition diagram of a generated image, which shows five sets of examples, and each example includes a composite image, an unwatermarked image, a hidden watermark, a reconstructed watermark, and single-fold and tenfold image differences. The presence of the hidden watermark in the image does not significantly change the visual appearance of the pattern, and it shows subtle changes in chroma under the condition of no information leakage. In addition, the watermark achieves almost lossless reconstruction, and all characters are clearly visible. The embodiment of the present disclosure achieves a significant visual effect, and ensures information distribution trade-off between the image and the watermark.

(86) FIG. 5 is a case exhibition diagram of a watermark analysis effect under multiple post-process attacks; FIG. 6 shows performance quantification of watermark analysis under multiple post-process attacks, and exhibits that the embodiment of the present disclosure has significant robustness against common post-processing attacks and higher watermark reconstruction capability. Generally, the watermark obtained after attacks ensures a certain character recognizability, which is enough to support traceability. Therefore, the embodiment of the present disclosure can guarantee the reconstruction quality of the watermark under attacks, and ensure the excellent robustness thereof under post-processing attacks with various intensities.

(87) To sum up, the embodiment of the present disclosure proves the characteristics of high quality of the generated image, high invisibility of the embedded watermark, high reconstruction accuracy of the watermark, strong robustness of the watermark against post-processing attacks and the like by a set of comprehensive evaluation systems, and the technical requirements of text-to-image and watermark joint generation can be met fully. The embodiment of the present disclosure aims to enable supervision of the visual generative model, support traceability of the generated image, and guarantee the security and reliability of the generated visual content.

Embodiment 4

(88) An image-watermark joint generation and detection device includes a processor and a memory, wherein program instructions are stored in the memory, and the processor calls the program instructions stored in the memory to enable the device to implement the following steps of a method: achieving a feature compatibility of an image and a watermark by an image-watermark feature co-embedding matrix; fusing features of the watermark and the image by image-watermark feature fusion at a channel level; synthesizing a high-resolution composite image with an invisible watermark by an up-sampling generator; decoupling an unwatermarked image and a reconstructed watermark based on a non-cooperative game theory; constraining the composite image, the unwatermarked image and the reconstructed watermark by strategy allocation; attacking the composite image by post-processing attacks; and judging embedding and analysis effects of the image and the watermark.

(89) It should be noted here that the description of the device in the above embodiment corresponds to that of the method in the embodiment, which is not repeated in the embodiment of the present disclosure.

(90) An executive body of the processor and the memory may be a computer, a single-chip microcomputer, a microcontroller and other devices with calculating functions. The executive body is not limited to the embodiment of the present disclosure during specific implementation, which is selected according to requirements in actual application.

(91) Data signals are transmitted between the memory and the processor through a bus, which is not repeated in the embodiment of the present disclosure.

(92) Based on the same inventive concept, an embodiment of the present disclosure further provides a computer-readable storage medium including a stored program, and when running, the program controls equipment where the storage medium is located to implement the steps of the method in the above embodiment.

(93) The computer-readable storage medium includes but is not limited to a flash memory, a hard disk, a solid state disk and the like.

(94) It should be noted here that the description of the readable storage medium in the above embodiment corresponds to that of the method in the embodiment, which is not repeated in the embodiment of the present disclosure.

(95) In the above embodiment, the implementation may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When achieved by the software, it may be achieved in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, flows or functions of the embodiment of the present disclosure are generated in whole or in part.

(96) The computer may be a general-purpose computer, a special-purpose computer, a computer network or other programmable devices. The computer instructions may be stored in the computer-readable storage medium or transmitted through the computer-readable storage medium. The computer-readable storage medium may be any available medium capable of being accessed by the computer or data storage equipment such as a server and a data center integrated with one or more available media. The available medium may be a magnetic medium or a semiconductor medium or else.

(97) The embodiments of the present disclosure do not limit models of other devices except for those specifically stated, as long as the devices can complete the above functions.

(98) Those skilled in the art can understand that the drawing is only a schematic diagram of a preferred embodiment. The serial numbers of the above embodiments of the present disclosure are merely for description, and do not represent the advantages and disadvantages of the embodiments.

(99) The above descriptions are merely preferred embodiments of the present disclosure, which are not intended to limit the present disclosure. Any modification, equivalent replacement and improvement made within the spirit and principle of the present disclosure should fall within the protection scope of the present disclosure.

Cross-modal image-watermark joint generation and detection device and method thereof

Assignee

Inventors

Cpc classification

Classification Explorer

G06T1/0028

PHYSICS

Classification Explorer

G06T2207/20016

PHYSICS

Classification Explorer

G06T11/00

PHYSICS

Classification Explorer

Y02T10/40

GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS

Classification Explorer

G06T7/0002

PHYSICS

Classification Explorer

G06T2207/20084

PHYSICS

Classification Explorer

G06T1/0064

PHYSICS

Classification Explorer

G06T2207/20081

PHYSICS

Classification Explorer

G06T1/0078

PHYSICS

International classification

Classification Explorer

G06T1/00

PHYSICS

Classification Explorer

G06T11/00

PHYSICS

Classification Explorer

G06T7/00

PHYSICS

Abstract

Claims

Description