LIGHT-FIELD MESSAGING
20220261944 · 2022-08-18
CPC classification: G06T1/0028; G06T1/005; G06T2201/0083
Abstract
A method of light messaging, in which a hidden message is transmitted via a coded image emitted from a display device and retrieved using a camera, comprises training a camera-display transfer function (CDTF) model that receives images with hidden messages from an embedding model and generates modified coded images based on training data that accounts for properties of displays and cameras, the modified coded images being delivered to a recovery model that decodes the hidden messages and outputs hidden message determinations, and training both the embedding and recovery models using the CDTF model and the training data to minimize differences between the input hidden messages and the hidden message determinations. After the CDTF, embedding and recovery models are trained, the method further comprises embedding a hidden message in a carrier image using the embedding model, displaying the resulting coded image using the display device, receiving the coded image at the camera, and retrieving the hidden message using the recovery model.
Claims
1. A method of light messaging in which a hidden message is transmitted in a coded image emitted from a display device and retrieved from the coded image using a camera, comprising: training a camera-display transfer function (CDTF) model that receives images coded with hidden messages from an embedding model and generates modified coded images that simulate camera-display distortion based upon training image data, the modified coded images being delivered to a recovery model that is configured to decode the hidden messages from the modified coded images and to output hidden message determinations; training both the embedding and the recovery models using the prior-trained CDTF model and the training image data to minimize a difference between the input hidden messages and the hidden message determinations; and, after training the CDTF, embedding and recovery models: embedding a further hidden message in a carrier image using the trained embedding model, transforming the carrier image into a coded image; displaying the coded image using the display device; receiving the coded image at the camera; and retrieving the further hidden message embedded in the coded image using the trained recovery model.
2. The method of claim 1, wherein the CDTF model is implemented using a machine learning technique.
3. The method of claim 2, wherein the machine learning technique implementing the CDTF model includes a neural network system having at least one convolutional neural network.
4. The method of claim 2, wherein the CDTF model is trained using a loss function that includes a perceptual metric.
5. The method of claim 1, wherein the training image data includes photographs taken by a plurality of cameras.
6. The method of claim 1, wherein the training image data includes images displayed on a plurality of display devices.
7. The method of claim 1, wherein the embedding model includes a first processing path for the message and a second processing path for carrier images into which the hidden messages are embedded.
8. The method of claim 7, wherein features of the carrier images and messages are shared by the first and second processing paths.
9. The method of claim 1, wherein the coded image is displayed and captured using different display and camera devices than those from which the training image data is obtained.
10. The method of claim 1, wherein the coded images are encoded spatially by the embedding model such that time-based synchronization is not required to decode the coded images using the recovery model.
11. A method of light messaging using a camera-display distortion model, including an embedding model configured to embed a hidden message into a carrier image and transform the carrier image into a coded image, a camera-display transfer function (CDTF) model configured to receive the coded image from the embedding model and to generate a modified coded image that simulates camera-display distortion, and a recovery model configured to retrieve the hidden message from the modified coded image, the method comprising: receiving an image emitted by a display device in which a message has been embedded using the embedding model trained with the CDTF model; processing the received image through the recovery model trained with the CDTF model; and determining the hidden message.
12. The method of claim 11, wherein the embedding model, camera-display transfer model and recovery model are trained, using training image data, to minimize a difference between the hidden message embedded in the carrier image by the embedding model, and the message retrieved subject to the camera-display distortion simulated by the CDTF model.
13. A method of light messaging using a camera-display distortion model including an embedding model configured to embed a hidden message into a carrier image and transform the carrier image into a coded image, a camera-display transfer function (CDTF) model configured to receive the coded image from the embedding model and to generate modified coded images that simulate camera-display distortion based upon training image data, and a recovery model configured to retrieve the hidden message from the modified coded image, the method comprising: embedding a hidden message in a carrier image using the embedding model trained with the CDTF model, transforming the carrier image into a coded image; and displaying the coded image using a display device.
14. A method of light messaging in which a hidden message is transmitted in a coded image emitted from a display device and retrieved from the coded image using a camera, comprising: training a camera-display transfer function (CDTF) model configured to receive coded images from an embedding model and to generate modified coded images that simulate camera-display distortion based upon training image data, the modified coded images being delivered to a recovery model that is configured to decode the hidden messages from the modified coded images and to output hidden message determinations; and training both the embedding model and the recovery model using the prior-trained CDTF model and the training image data to minimize a difference between the input hidden messages and the hidden message determinations.
15. The method of claim 14, wherein the CDTF model is implemented using a machine learning technique.
16. The method of claim 15, wherein the machine learning technique implementing the CDTF model includes a neural network system having at least one convolutional neural network.
17. The method of claim 15, wherein the CDTF model is trained using a loss function that includes a perceptual metric.
18. The method of claim 14, wherein the coded images are encoded spatially by the embedding model such that time-based synchronization is not required to decode the coded images using the recovery model.
19. A method of enabling light messaging in which a hidden message is transmitted in a coded image emitted from a display device and retrieved from the coded image using a camera, comprising: obtaining training image data including images displayed by a plurality of display devices and captured by a plurality of camera devices; and training a camera-display transfer model configured to simulate camera-display distortion based upon the obtained training image data.
20. The method of claim 19, further comprising determining an inverse of the trained camera-display transfer model.
Description
DESCRIPTION OF CERTAIN EMBODIMENTS OF THE DISCLOSURE
[0024] As used herein, the term “hidden message” refers to the covertly communicated payload and can include any type of information without limitation. The term “carrier image” refers to the image that is used to hide the message, and the term “coded image” refers to the combination of the carrier image and the hidden message. The term “image” throughout is meant to include both static images and video frames.
[0025] Disclosed herein are methods of end-to-end photographic light field messaging. Embodiments of the method include provision of a camera-display transfer function (CDTF) that models the camera and display without radiometric calibration, an embedding model that optimally embeds the message within an image, and a message recovery model that retrieves the message on the camera side. The camera-display transfer function can be implemented in a number of different ways using machine learning techniques, including neural networks. In some embodiments, multiple camera-display transfer functions that work together can be implemented. Additionally, in certain embodiments the embedding model and the recovery model can be implemented using neural networks. Single-frame operation can be employed so that no temporal synchronization between camera and display is needed, greatly increasing the practical utility of the method. The properties of the camera hardware, display hardware, and radiometry need not be known beforehand. Instead, a training camera-display dataset (“CD dataset”) is compiled. In one implementation, a rich CD dataset with over one million images and 25 camera-display pairs has been used to train a neural network to learn the representative CDTF. The embedding and recovery models are trained after establishing a robust CDTF model. In some implementations, the CDTF is modeled using a neural network algorithm that learns which features are invariant to CDTF distortion while simultaneously preserving the perceptual quality of the carrier image.
[0027] It is noted that the embedding and recovery models need not be trained using a separately trained CDTF model. Rather, in such embodiments, the embedding and recovery models are trained using display/camera image pairs so that the embedding and recovery models themselves incorporate a CDTF. Furthermore, rather than using a model to simulate camera-display distortions, the embedding and recovery models can be trained inversely to incorporate an “inverse” function that effectively reverses the effects of camera-display distortion. Accordingly, the manner in which camera-display distortion is modeled and then reversed can be implemented in a number of different ways and in different orders, as would be appreciated by those of skill in the art.
[0029] The unaltered carrier image is denoted as $i_c$, the unaltered message as $i_m$, the coded image as $i'_c$, and the recovered message as $i'_m$. $L_c$ and $L_m$ represent generic norm functions used for image and message loss, respectively, that may be scaled to change their relative weighting. In theory, the objective is to learn functions $E(\cdot)$ and $R(\cdot)$ that minimize

$$L_c(i'_c - i_c) + L_m(i'_m - i_m)$$ (1)

subject to

$$E(i_c, i_m) = i'_c$$ (2)

and

$$R(i'_c) = i'_m$$ (3)
[0030] In other words, the objective is to simultaneously minimize the distortion to the carrier image and minimize the message recovery error. In practice, this simple formulation does not yield a trained solution. Instead, an additional function $T(\cdot)$ that simulates CDTF distortion is added. If $i''_c$ represents a coded image that has passed through a display-camera transfer $T(\cdot)$, then

$$T(i'_c) = i''_c$$ (4)
[0031] The conditions for minimizing the loss function from (1) are now:

$$E(i_c, i_m) = i'_c, \quad T(i'_c) = i''_c$$ (5)

and

$$R(i''_c) = i'_m$$ (6)
[0032] The CDTF function $T(\cdot)$ represents both the photometric and radiometric effects of camera-display transfer. $T(\cdot)$ is trained using a large dataset of images electronically displayed and camera-captured using several combinations of displays and cameras.
[0034] It is noted that different convolutional neural network architectures can also be used, and the particular convolutional network implementation described above is not to be taken as limiting. In addition, in some embodiments, the CDTF function can be implemented without an explicitly trained network, using one or more fixed functions. For example, an image could be modified using some combination of affine or perspective transformations, blurring, changes in brightness, contrast, color saturation or hue, rotation, flipping, scaling, stretching, cropping or translation, and other combinations of image processing and computer vision functions. In such embodiments, the untrained, fixed-function CDTF model can be used to train the embedding and/or recovery functions.
[0035] The CDTF model 230 receives coded images ($i'_c$) in an initial block 231. The output from the initial block 231 is passed to a layer of encoding blocks 232, 233, 234, then to a bottom block 235, and thereafter to a dense layer of decoding blocks 236, 237. The final decoding block 238 outputs a modified coded image ($i''_c$) that models the display-camera distortion. The initial, encoding, bottom and decoding blocks 231-238 can have the same architecture as those described above for the embedding model, although this is not required. Similarly, the recovery model 250 receives modified coded images ($i''_c$) in an initial block 251. The output from the initial block 251 is passed to a layer of encoding blocks 252, 253, 254, then to a bottom block 255, and thereafter to a dense layer of decoding blocks 256, 257. The final decoding block 258 outputs a recovered message ($i'_m$). In some implementations, the initial, encoding, bottom and decoding blocks 251-258 can have the same architecture as those described above for the embedding and CDTF models.
[0036] The light messaging method of the present disclosure has the combined goals of maximizing message recovery and minimizing carrier image distortion. For coded image fidelity, the loss function uses the L2-norm to measure the difference between $i_c$ and $i'_c$. Because an L2 term alone does not fully capture perceptual differences, photo-realistic image generation using neural networks can employ perceptual loss metrics in training. The perceptual loss metric can also include a quality loss. In some implementations, the quality loss can be calculated by passing $i_c$ and $i'_c$ through a neural network trained for object recognition, such as a VGG (Visual Geometry Group) model, and minimizing the difference of feature map responses at several depths.
[0037] As noted above, to train the CDTF model ($T(\cdot)$), a dataset including over 1 million images collected using 25 camera-display pairs was used. Images from the MSCOCO 2014 training and validation dataset were displayed on five different electronic displays and then photographed using five digital cameras. The chosen hardware represents a spectrum of common cameras and displays. To achieve a set of 1M images, 120,000 images from MSCOCO were chosen at random. Each camera-captured image was cropped, warped to frontal view, and aligned with its original. The measurement process was semi-automated and employed software control of all cameras and displays. The CDTF model was trained using the 1,000,000 image pairs (the “1M data set”); $i_{coco}$ represents the original images and $i_{CDTF}$ represents the same images displayed and camera-captured. The transfer function $T(\cdot)$ is trained to simulate CDTF distortion by inputting $i_{coco}$ and outputting $i_{CDTF}$. Thus, the loss function to minimize is:
$$T_{loss} = L_2(i_{coco} - i_{CDTF}) + \lambda_T \cdot L_1(\mathrm{VGG}(i_{coco}) - \mathrm{VGG}(i_{CDTF}))$$ (7)
[0038] A perceptual loss regularizer for $T(\cdot)$ is included to preserve the visual quality of the CDTF model output ($i''_c$). The perceptual loss weight used in training can vary depending on the training data set; a value of $\lambda_T = 0.001$ was used for training the CDTF model with the 1M data set. $T(\cdot)$ was trained for 2 epochs using the Adam optimizer with a learning rate of 0.001, betas equal to (0.9, 0.999), and no weight decay. The total training time was 7 days.
[0039] The embedding and recovery models were trained simultaneously using 123,287 images from MSCOCO for $i_c$ and 123,282 messages for $i_m$. The loss function for $E(\cdot)$ effectively minimized the difference between the coded image and the original image while encoding information from $i_m$ such that it is robust to CDTF distortion. The loss function for $R(\cdot)$ was minimized to recover all information in $i_m$ despite CDTF distortion.
$$E_{loss} = L_2(i_c - i'_c) + \lambda_E \cdot L_1(\mathrm{VGG}(i_c) - \mathrm{VGG}(i'_c))$$ (8)

$$R_{loss} = \phi \cdot L_1(i_m - i'_m)$$ (9)
[0040] A perceptual loss regularizer can also be included in the loss function for $E(\cdot)$ to preserve the visual quality of the embedding model output $i'_c$. The loss weight used in training ($\lambda_E$) was 0.001, and the message weight ($\phi$) was 128. Both the embedding and recovery models were trained for 3 epochs using the Adam optimizer with a learning rate of 0.001, betas equal to (0.9, 0.999), and no weight decay. The total training time for the embedding and recovery models was 8 hours. The embedding, CDTF and recovery models were all trained using PyTorch 0.3.0 with an Nvidia Titan X (Maxwell) computing card.
[0041] The efficacy of the training method was explored by experimentation. Tests were performed using a benchmark test data set with 1000 images from the MSCOCO image set, 1000 messages, and 5 camera-display pairs. Each message contained 1024 bits. Two videos were generated, each containing 1000 coded images embedded using the trained light field messaging (LFM) method. In one test, the messages were recovered using the full model trained with the CDTF model. In another test, the embedding and recovery models were used without prior training with the CDTF model. In a further test, the full embedding and recovery model trained with the CDTF model was tested with data captured by cameras at a 45° angle to the display. Table 1 shows the bit error rate (BER) results of the testing using the benchmark data. The LFM method trained with $T(\cdot)$ achieved a 7.3737% bit error rate, or 92.6263% correctly recovered bits on average, for frontally photographed displays. The same model achieved a 14.0809% BER when camera and display were aligned at a 45° angle. All BER results were generated without any error correcting codes or radiometric calibration between cameras and displays. Redundancy and error correction coding can be used to reduce the bit error rate further.
TABLE 1. Bit error rate (BER) on the benchmark test set, by camera-display pair.

Camera & display pair                 LFM, without T( ),  LFM, with T( ),  LFM, with T( ),
                                      frontal             45° angle        frontal
iPhone* & Samsung 2494SJ              49.961%             29.807%          10.051%
Basler acA2040-90uc & Acer S240ML     50.138%             15.229%          6.5809%
Pixel 2 & Insignia NS-40D40SNA14      50.047%             10.217%          10.333%
Logitech & Acer Predator XB271HU      50.108%             5.1415%          5.0732%
Basler acA1300-30uc & Dell 1707FPt    50.042%             10.01%           4.8305%
[0043] Further tests were performed to gain insight into the effects of varying different metrics on the disclosed LFM steganography model. In one test, the perceptual regularizer weight ($\lambda_T$) in the loss function for $T(\cdot)$ was varied.
[0044] Still another test was performed to determine how well $T(\cdot)$ generalizes to new camera-display pairs. Using the 1000-image, 1024-bit test dataset, two additional cameras and two additional displays were tested using: i) a discrete cosine transform (DCT) algorithm; ii) the Baluja algorithm described above; iii) the light field messaging method without the CDTF model (i.e., without $T(\cdot)$) and with frontal image capture; iv) the disclosed light field messaging method with $T(\cdot)$ and image capture at a 45° camera-display angle; and v) the disclosed light field messaging method with $T(\cdot)$ and frontal image capture. Table 2 shows the results of the tests in terms of bit error rate (BER), indicating that LFM trained with $T(\cdot)$ significantly outperforms existing methods on the new camera-display pairs, even when camera and display are at a 45° angle.
TABLE 2. Bit error rate (BER) for camera-display pairs not included in the training set.

                              Sony Cybershot    Sony Cybershot    Nikon Coolpix     Nikon Coolpix
                              DSC-RX100 &       DSC-RX100 &       S6000 &           S6000 &
                              Lenovo Thinkpad   Apple Macbook     Lenovo Thinkpad   Apple Macbook
                              X1 Carbon         Pro 13-inch,      X1 Carbon         Pro 13-inch,
                              3444-CUU          Early 2011        3444-CUU          Early 2011
DCT, frontal                  50.01%            50.127%           50.001%           49.949%
Baluja, frontal               40.372%           37.152%           48.497%           48.827%
LFM, without T( ), frontal    50.509%           49.948%           50.0005%          49.9975%
LFM, with T( ), 45° angle     12.974%           15.591%           27.434%           25.811%
LFM, with T( ), frontal       9.1688%           7.313%            20.454%           17.555%
[0045] The results above demonstrate that the disclosed method of light field messaging (LFM) with a CDTF model ($T(\cdot)$) significantly outperforms existing deep-learning and fixed-filter steganography approaches, yielding the best BER scores for every camera-display combination tested. The method is robust to camera exposure settings and camera-display angle, with LFM at a 45° angle outperforming all other methods at a 0° (frontal) camera-display viewing angle. The low error rate of the disclosed LFM method opens exciting avenues for new applications and learning-based approaches to photographic steganography. Moreover, the disclosed method can be implemented as a single-frame, synchronization-free methodology with ordinary display hardware and no high-frequency requirements. Accordingly, an important benefit of the disclosed method is that it employs spatial encoding and does not rely on detection of changes in images over time, removing the need to synchronize image production and image capture in the time domain. It is also noted that models trained using data from the specific set of cameras and displays generalize to cameras and displays not included in the training set.
[0046] It is to be understood that any structural and functional details disclosed herein are not to be interpreted as limiting the systems and methods, but rather are provided as a representative embodiment and/or arrangement for teaching one skilled in the art one or more ways to implement the methods.
[0047] It is to be further understood that like numerals in the drawings represent like elements through the several figures, and that not all components and/or steps described and illustrated with reference to the figures are required for all embodiments or arrangements.
[0048] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0049] Terms of orientation are used herein merely for purposes of convention and referencing and are not to be construed as limiting. However, it is recognized these terms could be used with reference to a viewer. Accordingly, no limitations are implied or to be inferred.
[0050] Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
[0051] While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications will be appreciated by those skilled in the art to adapt a particular instrument, situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.