Methods and Apparatuses of Contrastive Learning for Color Constancy

20220164601 · 2022-05-26

    Abstract

    A contrastive learning method for color constancy employs a fully-supervised construction of contrastive pairs, driven by a novel data augmentation. The contrastive learning method includes receiving two training images, constructing positive and negative contrastive pairs by the novel data augmentation, extracting representations by a feature extraction function, and training a color constancy model by contrastive learning so that representations in the positive contrastive pair are closer than representations in the negative contrastive pair. The positive contrastive pair contains images having an identical illuminant while the negative contrastive pair contains images having different illuminants. The contrastive learning method improves performance without additional computational cost. The constructed contrastive pairs allow the color constancy model to learn better illuminant features that are particularly robust to worst cases in data-sparse regions.

    Claims

    1. A contrastive learning method for color constancy in an image or video processing system, comprising: receiving input data associated with a first training image captured in a first scene under a first illuminant, and a second training image captured in a second scene under a second illuminant; constructing at least a positive contrastive pair and at least a negative contrastive pair by applying a data augmentation to the first and second training images, wherein each positive contrastive pair contains two images having an identical illuminant and each negative contrastive pair contains two images having different illuminants; extracting representations of the images in the positive and negative contrastive pairs by a feature extraction function; and training a color constancy model by contrastive learning, wherein the color constancy model is trained by learning that representations in each positive contrastive pair are closer than the representations in each negative contrastive pair.

    2. The method of claim 1, wherein the step of training a color constancy model by contrastive learning further comprises: mapping each representation to a projection in a latent projection space by a feature projection function; measuring a similarity between projections of the positive contrastive pair and a similarity between projections of the negative contrastive pair; and maximizing the similarity between the projections of the positive pair and minimizing the similarity between the projections of the negative pair by a contrastive loss function.

    3. The method of claim 1, wherein the data augmentation augments the first training image to a different view to derive a first augmented image, wherein the first augmented image is label-preserving in that the first training image and the first augmented image share a same ground-truth illuminant.

    4. The method of claim 1, wherein the step of constructing positive and negative contrastive pairs further comprises: deriving a novel illuminant by interpolation or extrapolation between the first illuminant and the second illuminant; synthesizing a first augmented image having the first scene and the first illuminant, a second augmented image having the second scene and the first illuminant, a third augmented image having the first scene and the novel illuminant, and a fourth augmented image having the second scene and the novel illuminant by the data augmentation; and constructing an easy positive contrastive pair by including the first training image and the first augmented image, constructing an easy negative contrastive pair by including the first training image and the fourth augmented image, constructing a hard positive contrastive pair by including the first training image and the second augmented image, and constructing a hard negative contrastive pair by including the first training image and the third augmented image.

    5. The method of claim 4, wherein the data augmentation extracts canonical colors from the first and second training images to form color checkers, fits a color mapping matrix and an inverse color mapping matrix to map between the two color checkers, derives two additional color mapping matrices from the color mapping matrix and inverse color mapping matrix for the novel illuminant, applies the color mapping matrix to the second training image to synthesize the second augmented image, and applies the two additional color mapping matrices to the first and second training images to synthesize the third and fourth augmented images respectively.

    6. The method of claim 5, wherein the color mapping matrix and inverse color mapping matrix are full color transformation matrices and the two additional color mapping matrices are full color transformation matrices.

    7. The method of claim 5, wherein the color mapping matrix and inverse color mapping matrix are reduced from full color transformation matrices to diagonal matrices, and the two additional color mapping matrices are derived from an identity matrix, the color mapping matrix, and inverse color mapping matrix, wherein the third and fourth augmented images are synthesized by simplified neutral color mapping using the two additional color mapping matrices.

    8. The method of claim 4, further comprising: mapping each representation to a projection in a latent projection space by a feature projection function; computing a first loss for the representations of the easy positive contrastive pair and easy negative contrastive pair, a second loss for the representations of the easy positive contrastive pair and hard negative contrastive pair, a third loss for the representations of the hard positive contrastive pair and easy negative contrastive pair, and a fourth loss for the representations of the hard positive contrastive pair and hard negative contrastive pair; and computing a contrastive loss by a sum of the first, second, third and fourth losses.

    9. The method of claim 1, wherein the step of constructing positive and negative contrastive pairs further comprises: synthesizing a first augmented image having the second scene and the first illuminant and a second augmented image having the first scene and the second illuminant by the data augmentation; constructing the positive contrastive pair by including the first training image and the first augmented image and constructing the negative contrastive pair by including the first training image and the second augmented image.

    10. The method of claim 9, wherein the data augmentation extracts canonical colors from the first and second training images to form color checkers, fits a color mapping matrix and an inverse color mapping matrix to map between the two color checkers, and applies the color mapping matrix and the inverse color mapping matrix to the first and second training images to synthesize the first and second augmented images.

    11. The method of claim 1, wherein the color constancy model is trained by scene-invariant and illuminant-dependent representations, so that representations of a same scene under different illuminants are far from each other and representations of different scenes under a same illuminant are close to each other.

    12. An apparatus conducting contrastive learning for color constancy in an image or video processing system, the apparatus comprising one or more electronic circuits configured for: receiving input data associated with a first training image captured in a first scene under a first illuminant, and a second training image captured in a second scene under a second illuminant; constructing at least a positive contrastive pair and at least a negative contrastive pair by applying a data augmentation to the first and second training images, wherein each positive contrastive pair contains two images having an identical illuminant and each negative contrastive pair contains two images having different illuminants; extracting representations of the images in the positive and negative contrastive pairs by a feature extraction function; and training a color constancy model by contrastive learning, wherein the color constancy model is trained by learning that representations in each positive contrastive pair are closer than the representations in each negative contrastive pair.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0020] Various embodiments of this disclosure that are proposed as examples will be described in detail with reference to the following figures, wherein like numerals reference like elements, and wherein:

    [0021] FIG. 1 illustrates a relationship of representations for scene-invariant and illuminant-dependent representations.

    [0022] FIG. 2 illustrates a framework of a contrastive learning for color constancy system that learns generalized and illuminant-dependent feature representations according to an embodiment of the present invention.

    [0023] FIG. 3 illustrates an embodiment of formation for contrastive pairs and color augmentation.

    [0024] FIG. 4 is a flowchart illustrating applying a data augmentation to synthesize augmented images for better contrastive pair construction according to an embodiment of the present invention.

    [0025] FIG. 5 is a flowchart of contrastive learning for color constancy according to an embodiment of the present invention.

    DETAILED DESCRIPTION OF THE INVENTION

    [0026] It will be readily understood that the components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the systems and methods of the present invention, as represented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.

    [0027] Reference throughout this specification to “an embodiment”, “some embodiments”, or similar language means that a particular feature, structure, or characteristic described in connection with the embodiments may be included in at least one embodiment of the present invention. Thus, appearances of the phrases “in an embodiment” or “in some embodiments” in various places throughout this specification are not necessarily all referring to the same embodiment; these embodiments can be implemented individually or in conjunction with one or more other embodiments. Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention.

    [0028] To avoid learning spurious correlations by deep learning models that focus on scene or object related features, contrastive learning may be used to regularize deep learning models to learn scene-invariant and illuminant-dependent representations. Contrastive learning is a framework that learns general and robust feature representations by comparing similar and dissimilar pairs, inspired by Noise Contrastive Estimation (NCE) and the N-pair loss. As illustrated in FIG. 1, in contrast to image classification problems, the representations of the same scene under different illuminants should be far from each other for color constancy contrastive learning. On the contrary, the representations of different scenes under the same illuminant should be close to each other. However, conventional self-supervised contrastive learning often generates easy or trivial contrastive pairs that are not very useful for learning generalized feature representations.

    [0029] A deep learning based method for color constancy is designed to learn desired representations by contrastive learning according to embodiments of the present invention. The desired representations are scene-invariant and illuminant-dependent, so that the representations of the same scene under different illuminants are far from each other, while the representations of different scenes under the same illuminant are close to each other. Contrastive pairs generated by self-supervised contrastive learning are usually not good enough for regularizing deep learning models for color constancy. Embodiments of the present invention construct more useful contrastive pairs for color constancy contrastive learning by data augmentations. Data augmentations are found to be effective in contrastive pair construction for conducting successful contrastive learning; for example, data augmentations such as random cropping, flipping, and rotation have been widely used in classification, object detection, and semantic segmentation to improve model quality. Various works rely on manually designed augmentations to reach their best results. To ease such efforts, strategy search or data synthesis has been used to improve data quality and diversity. However, popular data augmentation strategies for image recognition and classification may not be suitable for the color constancy task. For example, most previous data augmentations in contrastive learning are designed for high-level vision tasks such as object recognition and seek illuminant-invariant features, which can be detrimental for color constancy. A data augmentation such as color dropping converts an sRGB image to a gray-scale one, making the color constancy task even more difficult. Consequently, color domain knowledge is incorporated to design data augmentations suitable for contrastive learning on color constancy according to some embodiments of the present invention. The color constancy task works best in the linear color space, where the linear relationship to scene radiance is preserved. This prevents the use of non-linear color jittering augmentations such as contrast, saturation, and hue.

    [0030] Methodology-Formulation FIG. 2 illustrates an overview of the Contrastive Learning for Color Constancy (CLCC) method. Contrastive learning is incorporated into the main color constancy task to learn generalized and illuminant-dependent feature representations. The learning problem setting follows the majority of learning-based color constancy research, which focuses only on the white balance step of estimating the illuminant L from the input raw image I_raw:


    \hat{L} = f_\theta(h_\phi(I_{raw}));

    where h_ϕ is the feature extractor that produces visual representations for I_raw, f_θ is the illuminant estimation function, and L̂ is the estimated illuminant. Both h_ϕ and f_θ are parameterized by deep neural networks with arbitrary architecture designs, where θ and ϕ can be trained via back-propagation.
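
    For illustration only, the formulation above may be sketched as follows in Python (PyTorch). The backbone layout and feature width are assumptions for the sketch; the invention allows h_ϕ and f_θ to be deep neural networks of arbitrary architecture.

```python
import torch
import torch.nn as nn

class ColorConstancyModel(nn.Module):
    """Minimal sketch of L_hat = f_theta(h_phi(I_raw)); layer sizes are illustrative."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # h_phi: feature extractor producing a visual representation of I_raw
        self.h_phi = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # f_theta: illuminant estimation head mapping the representation to an RGB illuminant
        self.f_theta = nn.Linear(feat_dim, 3)

    def forward(self, i_raw):
        feat = self.h_phi(i_raw)           # representation reused by contrastive learning
        return self.f_theta(feat), feat    # estimated illuminant L_hat and the representation
```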

    [0031] The overall learning objective can be decomposed into two parts, illuminant estimation for color constancy and contrastive learning for better representations, as shown in FIG. 2:

    \mathcal{L}_{total} = \lambda \mathcal{L}_{illuminant} + \beta \mathcal{L}_{contrastive};

    For the illuminant estimation task, a commonly used angular error is adopted:

    \mathcal{L}_{illuminant} = \arccos\left( \frac{\hat{L} \cdot L}{\|\hat{L}\| \, \|L\|} \right);

    where L̂ is the estimated illuminant and L is the ground-truth illuminant. The datasets for color constancy are relatively small because it is difficult to collect training data with corresponding ground-truth illuminants, so training a deep learning color constancy model with only the supervision ℒ_illuminant usually does not generalize well. Contrastive learning is applied to train the color constancy model in various embodiments of the present invention so that it generalizes better even with a small training dataset.
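
    A hedged sketch of the illuminant-estimation loss follows; the angular error matches the equation above, while the weights λ and β of the total objective are hyperparameters whose values are not specified here.

```python
import torch
import torch.nn.functional as F

def angular_error_loss(l_hat, l_gt, eps=1e-7):
    """L_illuminant = arccos((L_hat . L) / (||L_hat|| ||L||)), averaged over a batch."""
    cos = F.cosine_similarity(l_hat, l_gt, dim=-1)
    return torch.arccos(cos.clamp(-1 + eps, 1 - eps)).mean()

# Total objective: L_total = lambda * L_illuminant + beta * L_contrastive,
# where L_contrastive is defined in paragraph [0036] below.
```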

    [0032] In some embodiments of the CLCC, fully-supervised contrastive learning is used for color constancy. The essential building blocks of contrastive learning as shown in FIG. 2 include a stochastic data augmentation t(⋅)~T, a feature extraction function h_ϕ, a feature projection function g_ψ, a similarity metric function s(⋅), contrastive pair formulation, and a contrastive loss function ℒ_contrastive. The stochastic data augmentation augments a sample image I to a different view t(I). Note that t(⋅) is required to be label-preserving, meaning that I and t(I) still share the same ground-truth illuminant L. The feature extraction function h_ϕ extracts the representation of t(I), which is further used for the downstream color constancy task. The feature projection function g_ψ maps the representation h_ϕ(t(I)) to a projection z that lies on a unit hypersphere; g_ψ is typically only required when learning representations and is discarded once the learning is finished. The similarity metric function measures the similarity between latent projections (z_i, z_j). Anchor I, positive I⁺, and negative I⁻ samples jointly compose the positive pair (I, I⁺) and the negative pair (I, I⁻) in the contrastive pair formulation. For the color constancy task, a positive pair should share the same illuminant label L, while a negative pair should have different ones. The contrastive loss function aims to maximize the similarity between the projections of the positive pair (z, z⁺) and minimize the similarity between the projections of the negative pair (z, z⁻) in the latent projection space.
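
    The projection function g_ψ and the similarity metric s(⋅) may be realized as below. The two-layer MLP is an assumption borrowed from common contrastive learning practice; the invention only requires some feature projection onto a unit hypersphere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """g_psi: maps a representation h_phi(t(I)) to a projection z on the unit hypersphere.
    Discarded after contrastive learning finishes; the MLP depth and width are assumptions."""
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                 nn.Linear(in_dim, out_dim))

    def forward(self, h):
        return F.normalize(self.mlp(h), dim=-1)  # unit-norm projection z

def similarity(z_i, z_j):
    """s(.): cosine similarity; a plain dot product for unit-norm projections."""
    return (z_i * z_j).sum(dim=-1)
```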

    [0033] In self-supervised contrastive learning, two random training images I_i and I_j with different scene content are given; a positive contrastive pair is formed with two randomly augmented views of the same image (t(I_i), t′(I_i⁺)), and a negative contrastive pair is formed with views of two different images (t(I_i), t′(I_j⁻)). Such a naive formulation introduces two potential drawbacks. One is sampling bias: the potential to sample a false negative pair that shares very similar illuminants, for example L_i ≃ L_j. The other is the lack of hardness: the positive t(I_i⁺), derived from the same image as the anchor t(I_i), could share similar scene content, which alone suffices to let the neural network easily distinguish it from the negative t′(I_j⁻) with apparently different scene content. To alleviate sampling bias and increase the hardness of contrastive pairs, methods of the present invention leverage label information, extending self-supervised contrastive learning into fully-supervised contrastive learning, where the essential data augmentation is specifically designed to be label-preserving for the color constancy task.

    [0034] Contrastive Learning for Color Constancy FIG. 3 illustrates the realization of each component in the fully-supervised contrastive learning framework according to an embodiment of the present invention. The first stage of contrastive learning is contrastive pair formulation from two randomly sampled training images I_XA and I_YB, where I_XA is defined as a linear raw-RGB image captured in scene X under illuminant L_A, and I_YB is a linear raw-RGB image captured in scene Y under illuminant L_B. In various embodiments of the present invention, a positive pair shares an identical illuminant while a negative pair has different illuminants. Four contrastive pairs are generated from the two randomly sampled training images I_XA and I_YB according to this embodiment: an easy positive pair (t(I_XA), t′(I⁺_XA)), an easy negative pair (t(I_XA), t′(I⁻_YC)), a hard positive pair (t(I_XA), t′(I⁺_YA)), and a hard negative pair (t(I_XA), t′(I⁻_XC)). The easy positive pair contains two images having an identical scene X and illuminant L_A, and the easy negative pair contains two images having different scenes (X, Y) and different illuminants (L_A, L_C). The hard positive pair contains two images having different scenes (X, Y) but an identical illuminant L_A, and the hard negative pair contains two images having an identical scene X but different illuminants (L_A, L_C).

    [0035] Images I_YC, I_YA, and I_XC are synthesized by replacing one scene's illuminant with another. A novel illuminant L_C is derived by interpolation or extrapolation between the illuminants L_A and L_B of the two training images. A redundant hard negative sample I_XB is not required in this embodiment. The function t is a stochastic, perturbation-based, illuminant-preserving data augmentation composed of random intensity, random shot noise, and random Gaussian noise, as sketched below.
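
    A minimal sketch of such a perturbation t(⋅) is given below. The intensity range and noise levels are assumed values; any perturbation that scales all channels equally and adds noise preserves the illuminant label.

```python
import torch

def t_augment(img, intensity_range=(0.8, 1.2), shot_std=0.01, gauss_std=0.01):
    """Stochastic, illuminant-preserving augmentation t(.): random intensity,
    random shot noise, and random Gaussian noise (parameter values are assumptions)."""
    gain = torch.empty(1).uniform_(*intensity_range)
    out = img * gain                                  # random intensity; all channels scaled equally
    out = out + torch.randn_like(out) * out.clamp(min=0).sqrt() * shot_std  # signal-dependent shot noise
    out = out + torch.randn_like(out) * gauss_std     # additive Gaussian read noise
    return out.clamp(0.0, 1.0)
```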

    [0036] The next stages of contrastive learning are the similarity metric and the contrastive loss function. Once the contrastive pairs are defined in the image space, a feature extraction function h_ϕ and a feature projection function g_ψ are used to encode those views t(⋅) into the latent projection space z. The contrastive loss is computed as the sum of InfoNCE losses over the constructed contrastive pairs:

    \mathcal{L}_{contrastive} = \ell_{NCE}(z_{XA}, z^{+}_{XA}, z^{-}_{YC}) + \ell_{NCE}(z_{XA}, z^{+}_{XA}, z^{-}_{XC}) + \ell_{NCE}(z_{XA}, z^{+}_{YA}, z^{-}_{YC}) + \ell_{NCE}(z_{XA}, z^{+}_{YA}, z^{-}_{XC}).

    The InfoNCE loss ℓ_NCE can be computed as:

    \ell_{NCE} = -\log \frac{\exp(s^{+} / \tau)}{\exp(s^{+} / \tau) + \sum_{n=1}^{N} \exp(s^{-}_{n} / \tau)};

    where s⁺ and s⁻ are the cosine similarity scores of the positive and negative pairs respectively:

    s^{+} = s(z, z^{+}); \quad s^{-} = s(z, z^{-}).

    The InfoNCE loss can be viewed as performing an (N+1)-way classification realized by a cross-entropy loss with N negative pairs and 1 positive pair, where τ is the temperature scaling factor.
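
    A sketch of ℓ_NCE and the four-term contrastive loss is shown below, assuming one negative per anchor (N = 1) as in the pair formulation above; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, z_neg, tau=0.1):
    """InfoNCE as an (N+1)-way classification with N = 1 negative; tau is assumed."""
    s_pos = (z * z_pos).sum(dim=-1) / tau
    s_neg = (z * z_neg).sum(dim=-1) / tau
    logits = torch.stack([s_pos, s_neg], dim=-1)   # class 0 is the positive pair
    labels = torch.zeros(z.shape[0], dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)

def clcc_contrastive_loss(z_xa, z_xa_pos, z_ya_pos, z_yc_neg, z_xc_neg):
    """Sum of the four InfoNCE terms over easy/hard positive and negative pairs."""
    return (info_nce(z_xa, z_xa_pos, z_yc_neg) + info_nce(z_xa, z_xa_pos, z_xc_neg)
            + info_nce(z_xa, z_ya_pos, z_yc_neg) + info_nce(z_xa, z_ya_pos, z_xc_neg))
```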

    [0037] Raw-Domain Color Augmentation The goal of the proposed data augmentation is to synthesize more diverse and harder positive and negative samples for CLCC by manipulating illuminants, such that the color constancy solution space is better constrained. Images I_YC, I_YA, and I_XC are synthesized from two randomly sampled images (I_XA, L_A) and (I_YB, L_B) by the following procedure. Twenty-four linear raw-RGB colors C_A ∈ ℝ^{24×3} and C_B ∈ ℝ^{24×3} of the color checker are extracted from I_XA and I_YB respectively using an off-the-shelf color checker detector. Given the detected color checkers C_A and C_B, a linear color mapping matrix M_AB ∈ ℝ^{3×3} that transforms C_A to C_B can be solved by any standard least-squares method, and the inverse color mapping matrix M_BA can be derived as M_AB^{-1}. Accordingly, images I_XB and I_YA can be augmented as:

    I_{XB} = I_{XA} M_{AB}; \quad I_{YA} = I_{YB} M_{BA}.
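
    The mapping fit may be sketched in a few lines of NumPy; the least-squares solver is standard, and the image layout (H x W x 3 linear raw-RGB) is an assumption of the sketch.

```python
import numpy as np

def fit_color_mapping(c_a, c_b):
    """Fit M_AB (3x3) such that c_a @ M_AB ~= c_b for 24x3 color checker readings,
    and derive the inverse mapping M_BA = M_AB^{-1}."""
    m_ab, *_ = np.linalg.lstsq(c_a, c_b, rcond=None)
    m_ba = np.linalg.inv(m_ab)
    return m_ab, m_ba

# Illuminant swapping between the two scenes (images as H x W x 3 arrays):
#   i_xb = i_xa @ m_ab   # scene X rendered under illuminant L_B
#   i_ya = i_yb @ m_ba   # scene Y rendered under illuminant L_A
```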

    [0038] The above data augmentation procedure produces novel samples I_XB and I_YA, but uses only the pre-existing illuminants L_A and L_B from the training images. To synthesize a novel sample I_XC under a novel illuminant L_C that does not exist in the training data set, a color checker C_C can be synthesized by channel-wise interpolation or extrapolation from the existing color checkers C_A and C_B as:

    C_{C} = (1 - w) C_{A} + w C_{B};

    where w can be randomly sampled from a uniform distribution over an appropriate range [w_min, w_max]. For example, a new color checker may be synthesized using w = 0.5 for interpolation, or using w = −1.5 or 1.5 for extrapolation. Note that w should not be close to zero, to avoid yielding a false negative sample I_XC = I_XA for contrastive learning. To synthesize I_XC more realistically, that is, more accurately on chromatic colors, a full color transformation matrix M_AC is used to map I_XA to I_XC, and a full color transformation matrix M_BC is used to map I_YB to I_YC:

    I_{XC} = I_{XA} M_{AC}; \quad I_{YC} = I_{YB} M_{BC}.

    [0039] FIG. 4 is a flowchart illustrating an example of the data augmentation applied to contrastive learning for color constancy according to an embodiment of the present invention. In step S402, color checkers C_A and C_B for a pair of training images I_XA and I_YB are detected. A color mapping matrix M_AB is computed for transforming color checker C_A to C_B, and an inverse color mapping matrix M_BA is computed for transforming color checker C_B to C_A, in step S404. The data augmentation applies color mapping to swap the pre-existing illuminants of the two training images I_XA and I_YB via the estimated color mapping matrices M_AB and M_BA in step S406. In step S408, augmented images I_XC and I_YC with a novel illuminant corresponding to a novel color checker C_C are synthesized via interpolation or extrapolation using the detected color checkers C_A and C_B.

    [0040] In some embodiments, the color transformation matrix M_AC can be efficiently computed from the identity matrix 𝕀 and M_AB without solving least squares, and similarly the color transformation matrix M_BC can be efficiently computed from the identity matrix 𝕀 and M_BA without solving least squares:

    M_{AC} = (1 - w) \mathbb{I} + w M_{AB}; \quad M_{BC} = w \mathbb{I} + (1 - w) M_{BA}.
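
    A sketch of this shortcut follows; the sampling range for w is an assumed choice made to keep w away from zero, per the caution above.

```python
import numpy as np

def novel_illuminant_mappings(m_ab, m_ba, w_range=(0.5, 1.5)):
    """M_AC = (1 - w) I + w M_AB and M_BC = w I + (1 - w) M_BA, avoiding a second
    least-squares solve. The range for w is an assumption that keeps w away from 0."""
    w = np.random.uniform(*w_range)
    eye = np.eye(3)
    m_ac = (1 - w) * eye + w * m_ab
    m_bc = w * eye + (1 - w) * m_ba
    return m_ac, m_bc

# i_xc = i_xa @ m_ac   # scene X under the novel illuminant L_C
# i_yc = i_yb @ m_bc   # scene Y under the novel illuminant L_C
```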

    [0041] The above synthesis method can be limited by the performance of color checker detection. When color checker detection fails, the full color checkers C_A and C_B can be reduced to the neutral colors L_A and L_B, meaning that the color transformation matrix M_AB is reduced from a full matrix to a diagonal matrix. This is equivalent to first performing white balance (WB) on I_XA with L_A and subsequently performing an inverse WB with L_B, as sketched below. Even when chromatic colors cannot be correctly mapped, contrastive learning for color constancy with this simplified neutral color mapping can still obtain a performance improvement over the baseline.
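
    A sketch of the diagonal fallback follows, assuming l_a and l_b are RGB illuminant vectors:

```python
import numpy as np

def neutral_color_mapping(i_xa, l_a, l_b):
    """Fallback when color checker detection fails: reduce M_AB to a diagonal matrix,
    equivalent to white balancing I_XA with L_A and inverse white balancing with L_B."""
    m_diag = np.diag(np.asarray(l_b, dtype=np.float64) / np.asarray(l_a, dtype=np.float64))
    return i_xa @ m_diag
```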

    [0042] Evaluation Following the standard evaluation protocol, angular errors of various methods are evaluated on the two public benchmark datasets NUS-8 and Gehler. The Gehler dataset has 568 linear raw-RGB images captured by two cameras, and the NUS-8 dataset has 1736 linear raw-RGB images captured by eight cameras. The CLCC method achieves state-of-the-art mean angular error on the NUS-8 dataset, a 17.5% improvement over FC4 with a similar model size. Other competitive methods, such as C4 and IGTN, use far more model parameters (3 times and more than 200 times, respectively) but give worse mean angular error. The CLCC method provides significant improvements over the baseline network SqueezeNet-FC4 across all scoring metrics and reaches the best mean metric as well as the best worst-25% metric. This indicates that the embodiment of fully-supervised contrastive learning not only improves overall performance when no massive training data is available, but also improves robustness via effective contrastive pair construction. On the Gehler dataset, the CLCC method stays competitive, with less than a 0.1 performance gap behind the best performing approach C4, whose model size is three times larger. Methods achieving better scores than the CLCC method either require substantially more complexity or utilize supplemental data: the C4 method has three times more parameters, which may facilitate memorizing more sensor characteristics than the CLCC method, and the FFCC method needs meta-data from the camera to reach the best median metric. When no auxiliary data is used, the CLCC method performs better than FFCC-4 channels on all metrics. The CLCC method also improves robustness in worst cases; the improvement in worst-case performance increases especially in regions that suffer from data sparsity. This supports the aim of the contrastive learning design, which learns better illuminant-dependent features that are robust and invariant to scene content.

    [0043] Representative Flowchart for an Embodiment of Present Invention FIG. 5 is a flowchart illustrating embodiments of a contrastive learning method for color constancy in an image or video processing system. The image or video processing system receives input data of a first training image captured in a first scene under a first illuminant and a second training image captured in a second scene under a second illuminant in step S502. A data augmentation is applied to the first and second training images to synthesize positive and negative augmented images in step S504. Each positive augmented image has the first illuminant and each negative augmented image has an illuminant different from the first illuminant. The image or video processing system constructs one or more positive contrastive pairs and one or more negative contrastive pairs in step S506. Each positive contrastive pair includes the first training image and a positive augmented image, and each negative contrastive pair includes the first training image and a negative augmented image. A feature extraction function is used to extract representations of the images in the positive and negative contrastive pairs in step S508. The image or video processing system trains a color constancy model by contrastive learning in step S510. The color constancy model is trained so that representations in each positive contrastive pair are closer than the representations in each negative contrastive pair.
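
    For illustration, steps S502 to S510 may be combined into a single training step as sketched below, reusing the earlier sketches (ColorConstancyModel, ProjectionHead, t_augment, angular_error_loss, clcc_contrastive_loss); the batch layout and loss weights are assumptions.

```python
def training_step(model, proj_head, optimizer, batch, lam=1.0, beta=1.0):
    """One training step over an anchor image and its four augmented pair members.
    `batch` is assumed to hold i_xa, its ground-truth illuminant l_a, and the
    synthesized images i_xa_pos, i_ya_pos, i_yc_neg, i_xc_neg from the augmentation."""
    l_hat, feat = model(t_augment(batch["i_xa"]))        # S508: extract anchor representation
    z_xa = proj_head(feat)
    z = {k: proj_head(model(t_augment(batch[k]))[1])     # S508: project pair members
         for k in ("i_xa_pos", "i_ya_pos", "i_yc_neg", "i_xc_neg")}
    loss = (lam * angular_error_loss(l_hat, batch["l_a"])              # illuminant estimation
            + beta * clcc_contrastive_loss(z_xa, z["i_xa_pos"], z["i_ya_pos"],
                                           z["i_yc_neg"], z["i_xc_neg"]))  # S510
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```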

    [0044] Embodiments of contrastive learning for color constancy may be implemented in a circuit integrated into a video compression chip or program code integrated into video compression software to perform the processing described above. For example, synthesis of positive and negative contrastive pairs may be realized in program code to be executed on a computer processor, a Digital Signal Processor (DSP), a microprocessor, or a field programmable gate array (FPGA). These processors can be configured to perform particular tasks according to the invention by executing machine-readable software code or firmware code that defines the particular methods embodied by the invention.

    [0045] The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described examples are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.