SYSTEM AND METHOD OF TRAINING VISION TRANSFORMER ON SMALL-SCALE DATASETS

20240212330 · 2024-06-27

Abstract

A deep learning training system and method includes an imaging system for capturing medical images, a machine learning engine, and a display. The machine learning engine selects a small-scale subset of images from a training dataset, generates global views by randomly selecting regions in one image that cover a majority of the image, generates local views by randomly selecting regions covering less than a majority of the image, receives the generated global views as a first sequence of non-overlapping image patches, receives the generated global views and the generated local views as a second sequence of non-overlapping image patches, and trains parameters in a student-teacher network to predict a class of objects by self-supervised view prediction using the first sequence and the second sequence. The teacher parameters are updated via an exponential moving average of the student network parameters. The parameters in the teacher network are transferred to a vision transformer, and the vision transformer is trained by supervised learning.

Claims

1. A deep learning training system, comprising: an imaging system for capturing medical images; processing circuitry of a machine learning engine configured to select a subset of images from a training dataset of the captured medical images, generate global views by randomly selecting regions in one image of the subset of images covering a majority of the image, generate local views by randomly selecting regions covering less than a majority of the image of the one image, receive the generated global views as a first sequence of non-overlapping image patches, receive the generated global views and the generated local views as a second sequence of non-overlapping image patches, train parameters in a student-teacher network that includes a student network and a teacher network to predict a class of objects in the global views and the local views by self-supervised view prediction using the first sequence and the second sequence, wherein the processing circuitry updates the teacher parameters via exponential moving average of the student network parameters, initialize parameters in a vision transformer by transferring the trained parameters of the student-teacher network to the vision transformer, and perform supervised learning in the initialized vision transformer using the same subset of images; and an output device to output a class label for the one image.

2. The deep learning training system of claim 1, wherein the processing circuitry is configured to initialize the parameters in the vision transformer with trained parameters of the teacher network.

3. The deep learning training system of claim 1, wherein the processing circuitry is configured to train the parameters in the teacher network on the first sequence.

4. The deep learning training system of claim 1, wherein the processing circuitry is configured to train the parameters by self-supervised learning of view prediction, wherein the self-supervised learning is performed using unlabeled data.

5. The deep learning training system of claim 1, wherein the processing circuitry is further configured with a multi-layer perceptron (MLP) projection head that receives feature representations from each of the student network and the teacher network, and a classification MLP that receives feature representations from the vision transformer.

6. The deep learning training system of claim 1, wherein the student network and the teacher network have the same network structure.

7. The deep learning training system of claim 1, wherein the medical images are microscopic images of cells, and the system further comprises: a display device, wherein the processing circuitry is further configured to select a subset of images from a training dataset of the cell images, generate global views by randomly selecting regions in one image of the subset of images covering a majority of the image, and generate local views by randomly selecting regions covering less than a majority of the image of the one image; and wherein the display device displays an image of a cell showing a segmented region.

8. The deep learning training system of claim 7, wherein the processing circuitry is further configured with a segmentation MLP that receives feature representations from the vision transformer and outputs a segmented image of the cell.

9. The deep learning training system of claim 1, wherein each of the student network, teacher network and vision transformer is configured to be interchanged with a different vision transformer network.

10. The deep learning training system of claim 1, wherein each of the student network, teacher network and vision transformer includes a distillation token with the non-overlapping image patches.

11. A non-transitory computer readable storage medium storing program instructions for a deep learning training framework, which when executed by processing circuitry of a machine learning engine, perform a method comprising: selecting a subset of images from a training dataset of captured medical images; generating global views by randomly selecting regions in one image of the subset of images covering a majority of the image; generating local views by randomly selecting regions covering less than a majority of the image of the one image; receiving the generated global views as a first sequence of non-overlapping image patches; receiving the generated global views and the generated local views as a second sequence of non-overlapping image patches; training parameters in a student-teacher network that includes a student network and a teacher network to predict a class of objects in the global views and the local views by self-supervised view prediction using the first sequence and the second sequence, wherein the teacher parameters are updated via exponential moving average of the student network parameters; initializing parameters in a vision transformer by transferring the trained parameters of the student-teacher network to the vision transformer; performing supervised learning in the initialized vision transformer using the same subset of images; and outputting, via an output device, a class label for the one image.

12. The computer readable storage medium of claim 11, further comprising: initializing the parameters in the vision transformer with trained parameters of the teacher network.

13. The computer readable storage medium of claim 11, further comprising training the parameters in the teacher network on the first sequence.

14. The computer readable storage medium of claim 11, further comprising training parameters by self-supervised learning of view prediction, wherein the self-supervised learning is performed using unlabeled data.

15. The computer readable storage medium of claim 11, further comprising: a multi-layer perceptron (MLP) projection head that receives feature representations from each of the student network and the teacher network; and a classification MLP that receives feature representations from the vision transformer.

16. The computer readable storage medium of claim 11, wherein the student network and the teacher network have the same network structure.

17. The computer readable storage medium of claim 11, wherein the medical images are microscopic images of cells, wherein the method further comprises: selecting a subset of images from a training dataset of the cell images, generating global views by randomly selecting regions in one image of the subset of images covering a majority of the image; generating local views by randomly selecting regions covering less than a majority of the image of the one image, and displaying an image of a cell showing a segmented region.

18. The computer readable storage medium of claim 17, further comprising: a segmentation MLP that receives feature representations from the vision transformer and outputs a segmented image of a cell.

19. The computer readable storage medium of claim 11, wherein each of the student network, teacher network and vision transformer is configured to be interchanged with a different vision transformer network.

20. The computer readable storage medium of claim 11, wherein each of the student network, teacher network and vision transformer includes a distillation token with the non-overlapping image patches.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:

[0018] FIG. 1 is a graph of trainable parameters and generalization for vision transformers;

[0019] FIG. 2 is a system for training and evaluating machine learning models;

[0020] FIGS. 3A, 3B, 3C, 3D are graphs of vision transformers trained with different weight initialization schemes;

[0021] FIG. 4 is a block diagram for a training framework, in accordance with an exemplary aspect of the disclosure;

[0022] FIG. 5 is a block diagram of a DeiT vision transformer architecture;

[0023] FIGS. 6A, 6B are block diagrams of a Swin vision transformer architecture;

[0024] FIG. 7 is a block diagram of a CaiT vision transformer architecture;

[0025] FIGS. 8A, 8B, 8C illustrate CLS tokens from heads of the last block of a vision transformer on low-resolution test samples from Tiny-ImageNet;

[0026] FIGS. 9A-9G illustrate the attention of the CLS token from the heads of the last block of ViT across different approaches;

[0027] FIGS. 10A-10D illustrate self-attention for different vision transformers;

[0028] FIGS. 11A, 11B, 11C illustrate the effect of data size on self-supervised learning for weight initialization;

[0029] FIGS. 12A, 12B are flow diagrams of segmentation of microscopic cell images; and

[0030] FIG. 13 is a block diagram of a computer system.

DETAILED DESCRIPTION

[0031] In the drawings, like reference numerals designate identical or corresponding parts throughout the several views. Further, as used herein, the words a, an and the like generally carry a meaning of one or more, unless stated otherwise. The drawings are generally drawn to scale unless specified otherwise or illustrating schematic structures or flowcharts.

[0032] Furthermore, the terms approximately, approximate, about, and similar terms generally refer to ranges that include the identified value within a margin of 20%, 10%, or preferably 5%, and any values therebetween.

[0033] To alleviate problems associated with training vision transformers on small-scale datasets, an effective two-stage framework, embodied for example as a method and/or system, is provided to train vision transformers (also referred to as ViTs) on small-scale low-resolution datasets from scratch. The two-stage framework includes low-resolution view prediction as a weight initialization scheme. The two-stage framework provides a solution to the problem of sensitivity of ViTs to weight initialization, where ViTs converge to vastly different solutions depending on the network initialization. Conventional approaches perform pre-training (a type of initialization) with large-scale data to capture inductive biases from the data and follow up with transfer learning on small datasets. In the absence of huge datasets, however, the present approach has considered that it may be possible for ViTs to benefit from inductive biases learned directly on the target small dataset, such as CIFAR10 or CIFAR100. To this end, the present self-supervised weight learning scheme provides a solution that improves feature prediction of low-resolution global and local views via self-distillation. See Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650-9660, 2021, incorporated herein by reference in its entirety. The present approach includes a transition from self-supervised to supervised learning for small-scale datasets. In the second stage, the same ViT network is finetuned on the same target dataset using cross-entropy loss.
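As a rough illustration of the global/local view generation described above, the following sketch produces global views covering a majority of the image area and local views covering less. This is a simplification in the style of multi-crop self-distillation; the crop counts, area ranges, and output sizes below are illustrative assumptions, not values from the disclosure.

```python
import random
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbor resize of an HxWxC array to size x size."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[ys][:, xs]

def random_view(img, scale_range, out_size, rng=random):
    """Crop a region whose area fraction falls in scale_range, then resize."""
    h, w = img.shape[:2]
    frac = rng.uniform(*scale_range)
    ch = max(1, int(h * frac ** 0.5))
    cw = max(1, int(w * frac ** 0.5))
    y = rng.randint(0, h - ch)
    x = rng.randint(0, w - cw)
    return resize_nn(img[y:y + ch, x:x + cw], out_size)

def make_views(img, n_global=2, n_local=6):
    # Global views cover a majority of the image area; local views cover less.
    g = [random_view(img, (0.5, 1.0), 32) for _ in range(n_global)]
    l = [random_view(img, (0.05, 0.4), 16) for _ in range(n_local)]
    return g, l
```

The teacher would receive only the global views (the first sequence), while the student receives both global and local views (the second sequence).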

[0034] For purposes of this disclosure, small-scale datasets refer to datasets with a small percentage of training samples or low-resolution images. Small-scale can include a small percentage of labeled training images, for example on the order of one percent, while achieving performance substantially equivalent to training with the entire training set. Small-scale can also include low-resolution images that achieve performance as if trained with high-resolution images.

[0035] Also, scale size can depend on the type of problem. A full rose flower training dataset may include 150 images of roses; in that case, approximately 15 images constitute a small-scale dataset. The full MovieLens dataset has 20,000,263 samples; in that case, approximately 200,000 samples constitute a small-scale dataset. ImageNet is a dataset of over 14 million images. The small-scale dataset known as Tiny ImageNet contains 100,000 images across 200 classes (500 per class), downsized to 64×64 color images.

[0036] The present method solves an important problem in the realm of Vision Transformers (ViTs), which struggle when trained on a small number of samples. ViTs are data-hungry architectures that lack inductive biases and hence require a huge amount of data for successful training and decent performance. In the medical domain, where the amount of training data is often small, it is challenging to efficiently train models on such limited data. However, the present method can effectively leverage the information from a small number of training samples and hence provides better generalization performance on test samples. Therefore, the present method can provide an effective solution to the problem of data scarcity in the medical domain.

[0037] Most of the existing approaches in Vision Transformers (ViTs), and in computer vision in general, work on high-resolution inputs of 224×224 or 384×384, and in some cases up to 512×512 pixels. However, when the data has low resolution, the input information in the form of features contained in each sample is not enough for the model to train effectively. Therefore, in such cases, the models struggle to train properly and suffer reduced performance. The present method scales well to low-resolution inputs and successfully trains ViTs on these low-dimensional inputs while being computationally efficient. The ability to train well on low-resolution inputs can aid in the medical imaging domain, where input samples sometimes have low quality that makes them difficult for the model to identify.

[0038] Further, the present training method helps the Vision Transformer (ViT) learn the shapes of the objects in the image. Such a property can aid in segmenting class-specific objects from unseen test samples without any supervision. Such segmentation abilities on tiny images have strong potential in the domain of medical image segmentation. For instance, in the case of single-cell segmentation, the object of interest (the cell) is tiny. Using conventional approaches, segmenting such a small object becomes extremely hard for the model. However, the present method, which effectively learns the semantic shapes of objects in small low-resolution inputs, can be effectively applied in cases where the object of interest is small.

[0039] FIG. 1 is a graph of trainable parameters and generalization that illustrates the improvement provided by the present approach over conventional vision transformers. The present approach is simple in nature and yet outperforms by notable margins both in terms of trainable parameters and generalization (top-1 accuracy) on Tiny-ImageNet. See Lee et al.; Yahui Liu et al.; Touvron et al. (International Conference on Machine Learning (2021)); and Le et al. This shows that the inductive biases learned by the present self-supervised approach serve as an effective weight initialization to substantially improve ViT optimization during supervised training. The present approach has been demonstrated to be beneficial to different ViT designs over multiple small-scale datasets.

[0040] Consequently, the present approach is agnostic to ViT architectures, independent of changes in loss functions, and provides significant gains in comparison to different weight initialization schemes and existing works. See Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. PMLR; Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015; Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019; and Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W. Ronny Huang, and Tom Goldstein. Gradinit: Learning to initialize neural networks for stable and efficient training.

[0041] The present training method has been demonstrated on five small datasets across different monolithic and non-monolith Vision Transformers. The present training method provides a self-supervised weight learning scheme from low-resolution views created on small datasets. This self-supervised weight learning scheme has been shown to be an effective weight initialization to successfully train ViTs from scratch, thus eliminating the need for large-scale pre-training. The present approach achieves the self-supervised inductive biases to improve the performance of ViTs on small datasets without modifying the network architecture or loss functions. The present training approach scales well with the input resolution. For instance, when trained on high-resolution samples, the present training method improves by 8% (CIFAR10) and 7% (CIFAR100) with respect to the state-of-the-art (SOTA) baseline for training ViTs on small datasets. Furthermore, the efficiency of the present approach is validated by observing its robustness against natural corruptions, and attention to salient regions in the input sample.

[0042] Different from conventional vision transformer approaches for small datasets, the present vision transformer architecture is provided without any modification to the internal layers or addition of new loss function. The present approach learns better generalizable features from the existing small target datasets.

[0043] Different from conventional vision transformers trained with self-supervised learning, the present vision transformer applies self-supervision for low-resolution small dataset to observe substantial improvements.

[0044] Different from vision transformer weight initialization approaches, the initial weights of the present vision transformer are learned using self-supervised learning directly from small datasets without any changes in the architecture or the optimizer.

[0045] FIG. 2 is a diagram of a machine learning system in accordance with an exemplary aspect of the disclosure. In an exemplary embodiment, a server 202 or artificial intelligence (AI) workstation may be configured for medical image segmentation for medical diagnosis. With such a configuration, one or more client computers 212 may be used to perform medical image segmentation for several medical images at a time. In the embodiment, the server 202 may be connected to a cloud service 210. The cloud service 210 may be accessible via the Internet. The cloud service 210 may provide a database system and may serve streaming video. Mobile devices 204, 206 may access medical images served by the cloud service 210. Viewers of the medical images served by the cloud service 210 may be provided with a medical diagnosis based on the segmented medical images.

[0046] An aspect is a medical diagnosis service having one or more servers 202 and one or more client computers 212. Medical images may be obtained from various imaging and/or scanning devices 230 and stored in a database 220. The various devices 230 can include CT scanning devices or MRI imaging devices, to name a few. The medical diagnosis service can make a medical diagnosis, so that viewers, users and/or physicians can make informed decisions of a likely medical diagnosis based on a medical image. In some embodiments, the medical images are cellular images taken with a microscope.

[0047] FIGS. 3A, 3B, 3C, 3D are graphs of vision transformers trained with different weight initialization schemes. As mentioned above, it has been determined that neural networks are sensitive to weight initialization schemes. See Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. Advances in Neural Information Processing Systems, 31, 2018, incorporated herein by reference in its entirety. To better understand the sensitivity to weight initialization, several vision transformers are trained with different weight initialization schemes including Uniform, Xavier, Truncated normal, and Gradinit. All models are trained for 100 epochs. The default training setting for small-scale datasets as proposed by Lee et al. is used in all experiments for consistent comparisons. The graphs illustrate that ViT training can be unstable depending upon weight initialization, e.g., CaiT performs poorly when initialized with Gradinit. See Touvron et al. (International Conference on Computer Vision (2021)). Similarly, the generalization of ViT and Swin varies widely with different weight initialization methods.

[0048] FIG. 4 is a flow diagram and framework for a training method, in accordance with an exemplary aspect of the disclosure. The present framework 400 provides a solution to the wide variation in results due to the weight initialization problem. The present framework 400 learns a weight initialization from the given data distribution (Q) in a manner that injects the necessary inductive biases into the ViT architecture. The present training method consists of two stages: a Self-supervised View Prediction task 410 followed by a Supervised Label Prediction task 430. Both of these tasks are trained on the same data distribution (Q) with the same model backbone. An architectural difference between the two learning tasks is the self-supervised multi-layer perceptron (MLP) projection head 414 vs. the supervised MLP projection head 452. In this manner, the present training method does not depend on large-scale pretraining. ViT encoder designs are described next.
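In stage one, the teacher network is not trained by gradient descent; per the claims, its parameters track an exponential moving average (EMA) of the student's parameters. A minimal sketch of that update follows (the momentum value 0.996 is an assumption borrowed from common self-distillation practice, not a value stated in the disclosure):

```python
def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher parameters track an exponential moving average of the
    student parameters; the teacher receives no gradient updates."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Applied after every student optimization step, this keeps the teacher a slowly moving ensemble of past students.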

[0049] In order to demonstrate the ability of the present training method to handle different ViTs, different monolithic and non-monolithic (Swin and CaiT) ViTs are trained (Table 1) as encoders. See Touvron et al. (International Conference on Machine Learning); Ze Liu et al.; and Touvron et al. (Proceedings of the IEEE/CVF International Conference on Computer Vision).

DeiT (Data-Efficient Image Transformer)

[0050] FIG. 5 is a block diagram of an architecture of the DeiT. The DeiT model includes a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. The DeiT model adds a new token, the distillation token 506, to the initial embeddings (patch tokens 504 and class token 502). The class token 502 is a trainable vector, appended to the patch tokens before the first layer, that goes through the transformer layers and is then projected with a linear layer to predict the class. The distillation token 506 is used similarly to the class token 502. It interacts with other embeddings through self-attention 508 and is output by the network after the last layer. To obtain a full transformer block, a Feed-Forward Network (FFN) 510 is added on top of the self-attention layer 508. This FFN 510 is composed of two linear layers separated by a GeLU activation.

[0051] In an embodiment, the full transformer model is used in each of the student 412, teacher 422 and vision transformer 442.

[0052] In the original DeiT, the target objective is given by the distillation component of the loss. The target objective uses hard-label distillation, a variant of distillation in which the hard decision of the teacher is taken as a true label. Let Z_s be the logits of the student model and Z_t(c) the teacher logit for class c, let L_CE be the cross-entropy loss 516, and let ψ be the softmax function. Let y_t = argmax_c Z_t(c) 518 be the hard decision of the teacher. The objective associated with this hard-label distillation is:

[00001] L_global^hardDistill = (1/2) L_CE(ψ(Z_s), y) + (1/2) L_CE(ψ(Z_s), y_t)

[0053] For a given image, the hard label associated with the teacher may change depending on the specific data augmentation. The teacher prediction y_t plays the same role as the true label y.
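The hard-label distillation objective above can be sketched directly on raw logits. This is a minimal plain-Python illustration; the function and variable names are ours, and batching is omitted.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, label):
    """Cross-entropy against a hard (integer) label."""
    return -math.log(probs[label])

def hard_distill_loss(z_student, z_teacher, y_true):
    """L = 0.5 * CE(softmax(Zs), y) + 0.5 * CE(softmax(Zs), y_t),
    where y_t is the teacher's argmax (hard) decision."""
    y_t = max(range(len(z_teacher)), key=lambda c: z_teacher[c])
    p = softmax(z_student)
    return 0.5 * cross_entropy(p, y_true) + 0.5 * cross_entropy(p, y_t)
```

When the teacher agrees with the true label, the two terms coincide and the loss reduces to the ordinary cross-entropy on that class.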

[0054] The distillation embedding allows the model to learn from the output of the teacher 512, as in regular distillation, while remaining complementary to the class embedding.

Swin (Shifted Window Transformer)

[0055] FIG. 6A is a block diagram of the Swin Transformer architecture. FIG. 6A illustrates the tiny version (Swin-T). It first splits an input RGB image 602 into non-overlapping patches by a patch splitting module 604, like ViT. Each patch is treated as a token, and its feature is set as a concatenation of the raw pixel RGB values. In implementation, a patch size of 4×4 is used, and thus the feature dimension of each patch is 4×4×3 = 48. A linear embedding layer 616 is applied on this raw-valued feature to project it to an arbitrary dimension (denoted as C).

[0056] Several Transformer blocks with modified self-attention computation (Swin Transformer blocks 618) are applied on these patch tokens. The Transformer blocks maintain the number of tokens (H/4 × W/4), and together with the linear embedding are referred to as Stage 1 (610).

[0057] To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2×2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (2× downsampling of resolution), and the output dimension is set to 2C. Swin Transformer blocks 618 are applied afterwards for feature transformation, with the resolution kept at H/8 × W/8. This first block of patch merging and feature transformation is denoted as Stage 2 (620). The procedure is repeated twice, as Stage 3 (630) and Stage 4 (640), with output resolutions of H/16 × W/16 and H/32 × W/32, respectively. These stages jointly produce a hierarchical representation, with the same feature map resolutions as those of typical convolutional networks, e.g., VGG and ResNet. As a result, the proposed architecture can conveniently replace the backbone networks in existing methods for various vision tasks.
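The 2×2 patch-merging step described above can be sketched as a pure array operation (NumPy illustration; the learned linear projection from 4C down to 2C is omitted, and the grouping order follows the common Swin implementation convention):

```python
import numpy as np

def patch_merge(x):
    """Concatenate the features of each 2x2 group of neighboring patch
    tokens: (H, W, C) -> (H/2, W/2, 4C). A learned linear layer (omitted
    here) would then project the 4C features down to 2C."""
    return np.concatenate([x[0::2, 0::2], x[1::2, 0::2],
                           x[0::2, 1::2], x[1::2, 1::2]], axis=-1)
```

Each merge halves the spatial resolution and quadruples the channel count before the projection, which is how the token count shrinks stage by stage.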

[0058] The Swin Transformer block 618 is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows, with other layers kept the same. As illustrated in FIG. 6B, a Swin Transformer block 618 includes a shifted-window-based MSA module 684, followed by a 2-layer MLP 692 with GELU non-linearity in between. A LayerNorm (LN) layer 682 is applied before each MSA module 684 and each MLP 692, and a residual connection 686, 694 is applied after each module.

[0059] W-MSA 664 and SW-MSA 684 denote window-based multi-head self-attention using regular and shifted window partitioning configurations, respectively. The first module, W-MSA 664, uses a regular window partitioning strategy which starts from the top-left pixel, and the 8×8 feature map is evenly partitioned into 2×2 windows of size 4×4 (M=4). Then, the next module, SW-MSA 684, adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by (⌊M/2⌋, ⌊M/2⌋) pixels from the regularly partitioned windows.
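The regular and shifted window partitioning can be sketched as follows (NumPy illustration; the shifted variant is implemented here as a cyclic shift, as in the reference Swin implementation, and the attention computation within each window is omitted):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping MxM windows,
    returning (num_windows, M, M, C)."""
    h, w, c = x.shape
    x = x.reshape(h // M, M, w // M, M, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M, M, c)

def cyclic_shift(x, M):
    """Shift the feature map by floor(M/2) pixels in each spatial
    dimension so the next block's windows straddle the previous
    block's window boundaries."""
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
```

Alternating W-MSA on the plain partition with SW-MSA on the shifted partition is what introduces cross-window connections.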

[0060] The shifted window partitioning approach introduces connections between neighboring non-overlapping windows in the previous layer and is found to be effective in image classification, object detection, and semantic segmentation.

CaiT (Class-Attention in Image Transformers)

[0061] FIG. 7 is a block diagram of the CaiT architecture. This architecture aims at circumventing one of the problems of the ViT architecture: the learned weights are asked to optimize two contradictory objectives, (1) guiding the self-attention between patches while (2) summarizing the information useful to the linear classifier. The CaiT architecture 700 explicitly separates these two stages. The two stages include a self-attention stage 710 and a class-attention stage 720. The self-attention stage 710 is identical to a conventional ViT transformer and receives patch embeddings 704. The class-attention stage 720 is a set of layers that compiles the set of patch embeddings for an input image 702 into a class embedding CLS 722 that is subsequently fed to a linear classifier 728.

[0062] The class-attention stage alternates a layer referred to as multi-head class-attention (CA 724) with an FFN 708 layer. In this stage, only the class embedding 722 is updated. Like the class embedding fed to ViT and DeiT at the input of the transformer, it is a learnable vector. The main difference is that, in the CaiT architecture 700, information is not copied from the class embedding to the patch embeddings during the forward pass. Only the class embedding 722 is updated by the residual in the CA 724 and FFN 708 processing of the class-attention stage 720.

[0063] The role of the CA layer 724 is to extract the information from the set of processed patches. It is identical to an SA layer 706, except that it relies on the attention between (i) the class embedding x_class (initialized at CLS in the first CA) and (ii) itself plus the set of frozen patch embeddings x_patches.
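A single-head, projection-free sketch of this class-attention computation may clarify the asymmetry: the query comes only from the class embedding, while keys and values come from the class embedding concatenated with the patch embeddings. The learned W_q, W_k, W_v projections and the multi-head split are omitted for brevity.

```python
import numpy as np

def class_attention(x_class, x_patches):
    """The query is the class embedding; keys and values come from the
    concatenation [class; patches]. Only the class token is updated."""
    z = np.concatenate([x_class[None, :], x_patches], axis=0)  # (1+N, d)
    d = x_class.shape[0]
    scores = z @ x_class / np.sqrt(d)      # (1+N,) attention logits
    attn = np.exp(scores - scores.max())
    attn = attn / attn.sum()               # softmax over all tokens
    return attn @ z                        # weighted sum -> updated (d,)
```

Because the output has the same dimension as x_class, it can be added back residually without ever writing into the patch embeddings.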

[0064] These three ViTs are originally designed for higher resolution inputs (224 or 384) with patch sizes of 16 or 32. However, small-scale datasets have low-resolution inputs, e.g., 32 or 64 in the case of CIFAR and Tiny-ImageNet, respectively. Therefore, the patch size is reduced for such low-resolution inputs. Specifically, a patch size of 8 and 4 is set for inputs of size 64×64 and 32×32, respectively. Similarly, the original ViT designs are adapted for small datasets following Raghu et al. See Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34, 2021, incorporated herein by reference in its entirety. Table 1 presents the high-level details of these network architectures. Further ablations with different ViT attributes (e.g., depth and heads) are provided below.
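The arithmetic behind this patch-size choice is simple: halving the patch size together with the input resolution keeps the token sequence length fixed. A one-line check:

```python
def num_patches(img_size, patch_size):
    """Number of non-overlapping square patches in a square input."""
    return (img_size // patch_size) ** 2

# 32x32 inputs with patch size 4, and 64x64 inputs with patch size 8,
# both yield an 8x8 grid of 64 patch tokens.
```

This keeps the self-attention cost on small-scale inputs comparable to the original designs rather than collapsing the input to a handful of tokens.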

TABLE-US-00001 TABLE 1
Details of ViT encoders used in the present training method.

        Depth      Patch-size  Token Dimension  Heads       MLP-ratio  Window-size
ViT     9          [4, 8]      192              12          2          --
Swin    [2, 4, 6]  [2, 4]      96               [3, 6, 12]  2          4
CaiT    24         [4, 8]      192              4           2          --

Exploring ViT Attributes: Depth, Attention-Heads

[0065] The attributes of a ViT architecture are modified and the effect on model generalization is observed (top-1 accuracy %) across the CIFAR100 and Tiny-ImageNet datasets (Table 2). Specifically, the depth and the number of attention heads are varied to study the relation between ViT parameter complexity and generalization. Depth is the number of transformer layers. The analysis highlights the following insights. First, for a given training method, the performance of the model improves as the number of self-attention blocks is increased (e.g., from six to nine); however, generalization decreases when the number of self-attention blocks is increased further (e.g., to 12). This finding is consistent with Raghu et al., which shows reduced locality (inductive bias) within ViTs with a higher number of self-attention layers, adding further difficulty to ViT optimization. Second, increasing the number of heads within self-attention brings more diversity during training and leads to better results. Third, the present approach outperforms the baseline methods in all the given settings, validating the necessity of self-supervised weight initialization for supervised learning.

TABLE-US-00002
TABLE 2
The effect of ViT architectural attributes on its generalization (top-1 accuracy %).

                 ViT-Scratch          ViT-Drloc            SL-ViT               ViT (Present)
  Depth   Heads  CIFAR100   T-ImNet   CIFAR100   T-ImNet   CIFAR100   T-ImNet   CIFAR100   T-ImNet
  6       12     69.75      53.44     51.97      36.69     71.93      54.94     75.14      59.15
  6       6      68.76      52.25     53.12      39.12     71.01      53.68     72.92      55.83
  9       12     73.81      57.07     58.29      42.33     76.92      61.00     79.15      63.36
  9       6      69.88      53.56     55.50      45.93     72.01      54.63     73.59      58.18
  12      3      68.09      51.26     57.18      43.50     72.14      52.98     68.88      52.89
  12      12     71.23      54.55     56.50      45.71     74.04      56.13     77.22      56.41
  12      6      70.57      53.42     58.58      46.78     73.23      55.38     73.93      58.46

Self-Supervised View Prediction as Weight Initialization Scheme

[0066] The weights are initialized for low resolution small-scale datasets via a self-supervised training stage 410. Among the many self-supervised learning methods, a view prediction strategy is used. See Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597-1607. PMLR, 2020; Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640-9649, 2021; and Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Michael Ryoo. Self-supervised video transformer. In IEEE/CVF International Conference on Computer Vision and Pattern Recognition, June 2022, each incorporated herein by reference in their entirety. The self-supervised view prediction stage 410 does not require a memory bank, a large batch size, or negative mining. The self-supervised weights are used for initialization during the fine-tuning stage 430 directly from the low-resolution dataset. The view prediction pre-training 410 uses a student 412 (f_s) and teacher 422 (f_t) framework to predict different views of the same input sample 402 from each other, and thus follows the learning paradigm of knowledge self-distillation. See Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271-21284, 2020, incorporated herein by reference in its entirety. Both student 412 and teacher 422 represent the same ViT network but process different views, as explained next.

[0067] In the self-supervised view generation and prediction stage 410, a low resolution input 402 x is sampled from a small data distribution D. The height and width of the low-resolution input x are denoted h and w, respectively. During pre-training, the input is distorted and augmented to generate global 406 (x_g) and local 404 (x_l) views. Augmentations are used which preserve the semantic information of each selected view. See Caron et al. These augmentations include color jitter, gray scaling, solarization, random horizontal flip, and Gaussian blur. Global views 406 are generated by randomly selecting regions covering more than 50% of the input image, while local views 404 are generated by randomly selecting regions covering around 20-50% of the input image 402. The global 406 and local 404 views are further resized such that the ratio of the area of a local view to a global view is 1:4. For example, the global view 406 generated for a CIFAR sample is resized to a dimension of 32×32 and the local view 404 is resized to 16×16. Two global 406 and eight local 404 views are used to demonstrate the present method. Because the number of input tokens varies with the view size, the training method 400 uses Dynamic Position Embeddings 408 (DPE), which interpolate the missing tokens of smaller views whose height and width are less than the original sample size h×w. Both student 412 and teacher 422 networks process these multi-sized views and output the corresponding feature representations. The feature representation of each view is further processed by a 3-layer self-supervised MLP projection (MLP 414) of the student 412 and teacher 422 networks. It has been determined that the multi-layer projection 414 performs better than a single-layer MLP. Thus, each low-resolution view is converted into a 1024-dimensional feature vector. Ablative analysis of the effect of the output size of the self-supervised MLP projection head is provided below.
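The global/local view selection described above can be sketched as random crop-box sampling. The area ranges follow the text (global views cover more than 50% of the image, local views around 20-50%), while the roughly square crop shape and the function names are simplifying assumptions for illustration:

```python
import random

def sample_view(h, w, area_range, rng):
    """Randomly select a crop box whose area is a fraction of the h x w image,
    with the fraction drawn uniformly from area_range."""
    frac = rng.uniform(*area_range)
    side = frac ** 0.5                      # keep the crop roughly square
    ch, cw = max(1, round(h * side)), max(1, round(w * side))
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    return top, left, ch, cw

def generate_views(h, w, n_global=2, n_local=8, seed=0):
    """Two global views (>50% of the image area) and eight local views (20-50%)."""
    rng = random.Random(seed)
    global_views = [sample_view(h, w, (0.5, 1.0), rng) for _ in range(n_global)]
    local_views = [sample_view(h, w, (0.2, 0.5), rng) for _ in range(n_local)]
    return global_views, local_views
```

In the actual method each crop would then be resized (e.g., global to 32×32, local to 16×16 for CIFAR) and augmented before being fed to the networks.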

[0068] The teacher network 422 processes the global views 406 to generate target features (F_g_t) 432, while all the local 404 and global 406 views are forward-passed through the student network 412 to generate predicted features (F_g_s) 426 and (F_l_s) 428. These features are normalized to obtain F̃_g_t, F̃_g_s, and F̃_l_s. The student's parameters are updated by minimizing the following objective:

[00002] $$\mathcal{L} = -\tilde{F}_{g_t} \cdot \log\bigl(\tilde{F}_{g_s}\bigr) + \sum_{i=1}^{n} -\tilde{F}_{g_t} \cdot \log\bigl(\tilde{F}_{l_s}^{(i)}\bigr) \qquad (1)$$

where n represents the number of local views 404, specifically set to 8. The teacher parameters are updated via an exponential moving average (EMA) 416 of the student weights using θ_t ← λθ_t + (1 − λ)θ_s, where θ_t and θ_s denote the parameters of the teacher 422 and student 412 networks, respectively, and λ follows a cosine schedule from 0.996 to 1 during training. Further, centering and sharpening operations are applied to the teacher output. In this way the present method avoids mode collapse, similar to BYOL, and converges to a unique solution. See Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271-21284, 2020, incorporated herein by reference in its entirety.
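The EMA update and its cosine momentum schedule can be sketched in plain Python; the function names are illustrative, and parameters are treated as flat lists of floats for simplicity:

```python
import math

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule for the EMA coefficient lambda, ramping base -> final."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final - (final - base) * cos_term

def ema_update(teacher_params, student_params, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, applied elementwise."""
    return [lam * t + (1.0 - lam) * s
            for t, s in zip(teacher_params, student_params)]
```

As λ approaches 1 late in training, the teacher changes ever more slowly, stabilizing the targets it produces.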

[0069] The self-supervised view prediction objective (Eq. 1) on low resolution inputs induces locality in the vision transformer and encourages better intermediate feature representations which further aids during the fine-tuning stage on the same dataset.
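The view prediction objective of Eq. 1 can be sketched in plain Python. The softmax normalization and function names here are illustrative assumptions; the actual method additionally applies centering and sharpening to the teacher output, omitted for brevity:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def view_prediction_loss(teacher_global_logits, student_global_logits,
                         student_local_logits):
    """Eq. 1: cross-entropy from the normalized teacher global target to the
    student's global and local view predictions."""
    target = softmax(teacher_global_logits)
    def ce(logits):
        pred = softmax(logits)
        return -sum(t * math.log(p) for t, p in zip(target, pred))
    return ce(student_global_logits) + sum(ce(l) for l in student_local_logits)
```

When the student's predictions match the teacher's target distribution, the loss reduces to the target's entropy, its minimum over the student's outputs.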

Self-Supervised to Supervised Label Prediction

[0070] The present two-stage framework 400 effectively trains vision transformers 442 on small-scale low resolution datasets from scratch. In Caron et al., a student-teacher framework is trained by self-supervised learning, which begins by constructing different distorted views of an image with a multi-crop strategy. For a given image, a set contains global views and local views of smaller resolution. All crops are passed through the student while only the global views are passed through the teacher, in order to encourage local-to-global correspondences. Both networks share the same architecture with different sets of parameters.

[0071] Unlike the self-supervised learning of Caron et al., the present self-supervised to supervised learning initializes a given model with weights learned via the self-supervised stage 410 on the target dataset and then fine-tunes the model in the supervised learning stage 430 on the same corresponding dataset. Conventional practice instead initializes models with generic initialization schemes or ImageNet pre-trained weights. After the initialization, the present self-supervised to supervised learning stage 430 transfers weights from the teacher network 422 to a vision transformer 442 and replaces the self-supervised MLP projection head 414 with a randomly initialized MLP classifier 452. The model is then trained via a supervised objective as follows:

[00003] $$\mathcal{L}_{CE} = -\sum_{i=1}^{k} y_i \log\bigl(f(x)_i\bigr) \qquad (2)$$

[0072] where k is the output dimension of the final classifier, y represents the one-hot encoded ground-truth, and f(x)_i denotes the i-th predicted class probability.
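The supervised objective of Eq. 2 is the standard cross-entropy between a one-hot label and predicted class probabilities; a minimal sketch (hypothetical helper name):

```python
import math

def cross_entropy(pred_probs, one_hot):
    """Eq. 2: L_CE = -sum_i y_i * log(p_i) over the k classifier outputs."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, pred_probs) if y > 0)
```

For a one-hot label, only the log-probability of the true class contributes, so a uniform prediction over k classes incurs a loss of log(k).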

[0073] The classification MLP 452 provides predicted classification or semantic segmentation labels 454. The teacher 422 provides high quality target features during pre-training and hence proves useful for the fine-tuning stage 430. The ablation on the effect of the self-supervised weights is provided below in Table 8.
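The transfer step described above (copy the teacher backbone, discard the self-supervised projection head 414, and attach a freshly initialized classifier 452) can be sketched with a hypothetical flat-dict parameter format; the key names and the Gaussian initialization are illustrative assumptions, not the source's implementation:

```python
import random

def init_for_finetuning(teacher_weights, num_classes, seed=0):
    """Transfer teacher backbone weights and swap the self-supervised
    projection head for a freshly initialized classifier head.
    Assumes a flat dict of parameter lists keyed by layer name."""
    rng = random.Random(seed)
    model = {name: list(params) for name, params in teacher_weights.items()
             if name != "ssl_projection_head"}
    # Randomly initialized classifier replaces the SSL projection head.
    model["classifier_head"] = [rng.gauss(0.0, 0.02) for _ in range(num_classes)]
    return model
```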

Experimental Protocols

[0074] Experimental settings include dataset and training details, qualitative, and ablative analysis.

[0075] Datasets: The present approach is validated on five small-scale, low-resolution datasets: Tiny-ImageNet, CINIC10, CIFAR10, CIFAR100, and SVHN. See Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015; Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, and Amos J. Storkey. Cinic-10 is not imagenet or cifar-10, 2018. URL https://arxiv.org/abs/1810.03505; Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009; and Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks, 2013. URL https://arxiv.org/abs/1312.6082, each incorporated herein by reference in their entirety. Details about the dataset size, sample resolution, and the number of classes are provided in Table 3. Self-supervised initialization is learned directly from the small datasets. This makes it possible to train ViTs on these datasets without any large-scale pre-training.

TABLE-US-00003
TABLE 3
Dataset information for self-supervised training.

  Dataset         Train Size   Test Size   Dimensions   # Classes
  Tiny-ImageNet   100,000      10,000      64 × 64      200
  CIFAR10         50,000       10,000      32 × 32      10
  CIFAR100        50,000       10,000      32 × 32      100
  CINIC10         90,000       90,000      32 × 32      10
  SVHN            73,257       26,032      32 × 32      10

[0076] Self-supervised Training Setup: All models are trained with the Adam optimizer and a batch size of 256 via distributed learning over 4 Nvidia V100 32 GB GPUs. See Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv: 1412.6980, 2014, incorporated herein by reference in its entirety. The learning rate is linearly ramped up during the first 10 epochs using:

[00004] $$lr = 0.0005 \times \frac{\text{batch size}}{256}$$

After the first 10 epochs, the learning rate follows a cosine schedule. See Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016, incorporated herein by reference in its entirety. The student and teacher outputs are scaled by a temperature parameter, which is set to 0.1 for the student network, while it follows a linear warm-up from 0.04 to 0.07 for the teacher network.
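The warmup-then-cosine learning-rate rule described above can be sketched as a single function; the step granularity (epochs here) and the function name are assumptions for illustration:

```python
import math

def learning_rate(step, warmup_steps, total_steps, batch_size=256):
    """Linear warmup to lr = 0.0005 * batch_size / 256, then cosine decay to 0."""
    base_lr = 0.0005 * batch_size / 256
    if step < warmup_steps:
        return base_lr * step / warmup_steps       # linear ramp over warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With batch size 256 the peak rate is 0.0005, reached at the end of the 10-epoch warmup, after which the rate decays smoothly toward zero.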

[0077] Supervised Training Setup: The training framework of Lee et al. is used for supervised learning and applies standard data augmentations for consistency. Specifically, cutmix, mixup, auto-augment, and repeated augment are used. See Yun et al.; Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018; and Cubuk et al. (2020), each incorporated herein by reference in their entirety. Further, label smoothing, stochastic depth, and random erasing are used. See Szegedy et al. (2016); Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646-661. Springer, 2016; and Zhong et al., each incorporated herein by reference in their entirety. All models are trained for 100 epochs with a batch size of 256 on a single Nvidia V100 32 GB GPU. The Adam optimizer is used with a learning rate of 0.002 and a weight decay of 5e-2 with cosine scheduling.

Results

[0078] Generalization of different methods is provided with a comparative analysis presented in Table 4 across 3 different ViT architectures (Table 1). The present approach performs favorably against different ViT baselines as well as CNNs without adding any additional parameters or requiring changes to the architecture or loss functions. See Lee et al.; Yahui Liu et al. Note that all methods are trained on the original input resolution as provided in Table 3. A patch size of 8 is kept for Tiny-ImageNet to generate 64 input tokens for the ViT and CaiT architectures. The patch size is reduced to 4 so that the resultant number of tokens remains 64 for all other datasets such as CIFAR, SVHN, and CINIC10. Similarly, for the Swin architecture, a patch size of 4 is used for Tiny-ImageNet to obtain 256 tokens, while for the other datasets a patch size of 2 is used, which likewise produces 256 input tokens. These architectural settings are consistently followed for all the baselines (Table 4). The present approach consistently performs better than recent state-of-the-art methods (see Lee et al.; Yahui Liu et al.) for ViT training on small-scale datasets (Table 4). In particular, a significant gain is observed for the difficult cases where the ratio of the number of classes to input samples is higher, e.g., CIFAR100 and Tiny-ImageNet (Table 3). In this manner, the present approach paves the way to adapting ViTs to small datasets while also outperforming CNN based models. The effect of the present self-supervised weight initialization on convolutional networks is provided below.

TABLE-US-00004
TABLE 4
Generalization of different methods: comparative analysis (top-1 accuracy %).

  Model                     Params (M)   Tiny-ImageNet   CIFAR10   CIFAR100   CINIC10   SVHN
  ResNet56                  0.9          56.51           94.65     74.44      85.34     97.61
  ResNet110                 1.7          59.77           95.27     76.18      86.81     97.82
  EfficientNet B0           4.0          55.48           88.38     61.64      75.64     96.96
  ResNet18                  11.6         53.32           90.44     64.49      77.79     96.78
  ViT (scratch)             2.8          57.07           93.58     73.81      83.73     97.82
  SL-ViT (Arxiv21)          2.9          61.07           94.53     76.92      84.48     97.79
  ViT-Drloc (NeurIPS21)     3.15         42.33           81.00     58.29      71.50     94.02
  ViT (Present)             2.8          63.36           96.41     79.15      86.91     98.03
  Swin (scratch)            7.1          60.05           93.97     77.32      83.75     97.83
  SL-Swin (Arxiv21)         10.2         64.95           94.93     79.99      87.22     97.92
  Swin-Drloc (NeurIPS21)    7.7          48.66           86.07     65.32      77.25     95.77
  Swin (Present)            7.1          65.13           96.18     80.95      87.84     98.01
  CaiT (scratch)            7.7          64.37           94.91     76.89      85.44     98.13
  SL-CaiT (Arxiv21)         9.2          67.18           95.81     80.32      86.97     98.28
  CaiT-Drloc (NeurIPS21)    8.5          45.95           82.20     56.32      73.85     19.59
  CaiT (Present)            7.7          67.46           96.42     80.79      88.27     98.18

Self-Supervised Weight Initialization for CNNs

[0079] The present self-supervised weight initialization strategy improves the performance of Vision Transformers. The effect on CNN performance is shown in Table 5. Specifically, a ResNet-18 model is pre-trained on the Tiny-ImageNet and CIFAR100 datasets with the present self-supervised view prediction objective and fine-tuned on the same datasets using the supervised training framework. A slight improvement in model performance can be seen, as shown in Table 5. This indicates that the inherent inductive biases of CNNs ease their optimization even with non-learned weight initializations (such as truncated normal and Kaiming), in contrast to Vision Transformers.

TABLE-US-00005
TABLE 5
Effect of the self-supervised weight initialization scheme on CNNs.

  Model       Initialization              CIFAR100   Tiny-ImageNet
  ResNet-18   Trunc Normal                64.49      53.32
  ResNet-18   Kaiming                     64.08      52.19
  ResNet-18   Self-supervised (present)   65.00      53.48

[0080] Robustness to Input Resolution and Patch Sizes: A recent method projects the input samples to a higher resolution to train the Vision Transformer, e.g., the 32×32 input resolution of CIFAR is re-scaled to 224×224 during training. This significantly increases the number of input tokens and hence the quadratic complexity within self-attention (Table 6). In comparison, the present approach successfully trains ViTs on low resolution inputs while being computationally efficient, and it also scales well to high resolution inputs, where it outperforms by notable margins (Table 6). Thus the present training method proves effective on both low and high input resolutions. See Yahui Liu et al.

TABLE-US-00006
TABLE 6
The present training method scales well to high input resolutions.

  Model   Method    Input Resolution   Patch-size   No. of Tokens   Params (M)   CIFAR10   CIFAR100   SVHN
  Swin    Drloc     224 × 224          4            3136            28           83.89     66.23      94.23
  Swin    Drloc     32 × 32            2            64              7.7          86.07     65.32      95.77
  Swin    Present   224 × 224          4            3136            7.7          92.04     73.46      96.86

[0081] Robustness to Natural Corruptions: An analysis of the mean corruption error on CIFAR10 and CIFAR100 is provided in Table 7. Mean corruption error (lower is better) is reported against 18 natural corruptions such as fog, rain, noise, and blur. The present training approach increases model robustness against these corruptions. See Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018, incorporated herein by reference in its entirety.

TABLE-US-00007
TABLE 7
Mean corruption error (lower is better) on CIFAR10 and CIFAR100.

  Data       ViT (scratch)   SL-ViT   ViT (Present)   Swin (scratch)   SL-Swin   Swin (Present)
  CIFAR10    39.93           26.42    26.01           36.13            26.28     25.38
  CIFAR100   65.04           48.56    48.10           53.83            47.29     45.10

Attention to Salient Regions

[0082] FIGS. 8A, 8B, 8C illustrate the attention of the CLS token from the heads of the last block of a vision transformer on low-resolution test samples from Tiny-ImageNet. These attention maps demonstrate the effectiveness of the present self-supervised attention stage: the self-supervised pre-training learns to segment the class-specific features from unseen test samples without any supervision.

[0083] FIGS. 9A-9G illustrate the attention of the CLS token from the heads of the last block of ViT across different approaches. These CLS tokens demonstrate the effectiveness of the present supervised attention stage. All of the models are fine-tuned for 100 epochs on Tiny-Imagenet. The attention maps show that the present training method is able to capture the salient properties of the specific class in the input and has a sharp focused attention which is missing in conventional ViT baselines.

[0084] FIGS. 10A-10D further illustrate self attention for different vision transformers. The images show that the present training method can capture the salient objects in the image in comparison to conventional ViT baselines for which the attention is dispersed in the background.

[0085] The images in FIGS. 8A-8C, 9A-9G, and 10A-10D represent the attention scores of the class token, computed across the attention heads of the last ViT block and projected onto unseen test samples of Tiny-ImageNet. Based on these results, the present training method captures the shape of salient objects with minimal or no attention to the background, whereas in the conventional ViT baselines the attention is spread out into the background and fails to capture the shape of the salient object.

Ablative Analysis

[0086] FIGS. 11A, 11B, 11C illustrate the effect of data size on self-supervised learning for weight initialization. The ViT models are trained on 25%, 50%, and 75% of the total number of training samples across 3 datasets: CIFAR10, CIFAR100, and Tiny-ImageNet. See Touvron et al. (International Conference on Machine Learning (2021)). In the case of CIFAR10, the present training method achieves more than 90% top-1 accuracy with just 25% of the data and outperforms other approaches by notable margins. A similar trend is observed with the CIFAR100 and Tiny-ImageNet datasets.

[0087] Effect of Local-Global Crop Ratio: Local and global views are generated by randomly cropping certain regions from the original input image. The cropped area of each generated view is chosen from a specified range of values with respect to the original input size. The impact of the range of crop ratios for local and global views with respect to the original input size is analyzed in Table 9. The original input size of Tiny-ImageNet is 2× greater than the other datasets used in the experiments; therefore, a modified range of local-global crop ratios is used, as shown in Table 9 (right). A range of (0.2, 0.4) for the local view and (0.5, 1.0) for the global view works well for Tiny-ImageNet. Similarly, for the other, relatively lower-resolution datasets, the optimal ranges are (0.2, 0.5) and (0.7, 1.0) for the local and global views, respectively (Table 9).

TABLE-US-00008
TABLE 8
Self-supervised teacher weights transfer better than student weights.

  Model   Weights   CIFAR100   Tiny-ImageNet
  ViT     Student   77.27      61.02
  ViT     Teacher   79.15      63.36

TABLE-US-00009
TABLE 9
The impact of local-global crop ratios chosen during self-supervised training on the top-1 train accuracy of CIFAR10 and CIFAR100 (left), and Tiny-ImageNet (right).

  CIFAR10 and CIFAR100:
  Local View    Global View   CIFAR10 ViT   CIFAR10 Swin   CIFAR100 ViT   CIFAR100 Swin
  (0.1, 0.4)    (0.4, 1.0)    77.36         67.82          79.52          74.10
  (0.15, 0.4)   (0.4, 1.0)    78.90         68.12          78.14          74.06
  (0.2, 0.5)    (0.5, 1.0)    77.48         67.90          79.53          74.05
  (0.2, 0.5)    (0.6, 1.0)    78.87         65.52          79.51          74.09
  (0.2, 0.5)    (0.7, 1.0)    79.19         71.64          79.78          74.67

  Tiny-ImageNet:
  Local View     Global View   ViT     Swin
  (0.1, 0.3)     (0.3, 1.0)    64.37   74.33
  (0.15, 0.45)   (0.45, 1.0)   62.02   74.09
  (0.2, 0.4)     (0.5, 1.0)    64.82   75.13
  (0.2, 0.4)     (0.6, 1.0)    62.03   57.33
  (0.3, 0.5)     (0.5, 1.0)    61.69   73.92

[0088] Effect of Self-supervised MLP Dimensions: Table 10 shows the effect of the output dimension of the present self-supervised MLP projection head on model generalization during the supervised fine-tuning stage. The local-global crop ratios are fixed to their optimal values and the MLP head dimension is ablated over a range of values. Based on the top-1 accuracy on the train set (Table 10), a head dimension of 1024 gives the best overall results across the 3 datasets using the ViT and Swin architectures and is therefore chosen for all the experiments.

TABLE-US-00010
TABLE 10
Effect of the self-supervised MLP projection head dimension (top-1 train accuracy %).

             CIFAR10          CIFAR100         Tiny-ImageNet
  SSL Head   ViT     Swin     ViT     Swin     ViT     Swin
  512        78.77   79.46    71.63   74.04    66.84   74.10
  1024       79.19   79.78    71.64   74.67    65.82   75.13
  2048       78.83   79.48    71.15   73.87    65.05   73.95
  4096       78.92   79.50    71.03   74.21    66.48   74.03

[0089] Effect of Teacher vs. Student Weight Transfer: The performance of a ViT initialized with the present self-supervised weights from the student network is compared with initialization from the teacher network. In Table 8, the higher generalization (top-1 accuracy) obtained with the teacher weights corroborates the present strategy of transferring teacher rather than student weights for the supervised training stage. The self-supervised teacher weights transfer well as compared to the student.

[0090] Performance comparison with self-supervised learning based CNN: A comparison of ViT with ResNet18 (2.8 vs. 11.6 million parameters) is provided in Table 11. ViT's performance improves significantly in comparison to the self-supervised CNN. In addition, the present self-supervised approach is compared with different contrastive self-supervised methods that are mainly studied for CNNs (Table 12). The present method provides state-of-the-art results for ViTs in comparison to the self-supervised CNN.

[0091] Efficiency in terms of epochs: The present method, trained for 300 epochs (200 for self-supervised view matching and 100 for supervised label prediction), outperforms the current SOTA approach trained for 600 epochs (Table 13). The present approach is thus efficient in terms of epochs used while outperforming the current approach in Top-1 accuracy.

[0092] Self-supervised MLP layers: The present self-supervised projection MLP is modified to 3 layers, which reduces complexity relative to larger projection heads while increasing generalization, as shown in Table 14.

[0093] Analysis of MLP Head: The larger the size of the MLP head (e.g., 65536), the lower the performance on small-scale datasets (Table 10 and Table 15). This is likely because a large MLP head overfits the features of low resolution views. As observed in Table 10, an MLP head dimension of 1024 gives the best overall results on the train set across the 3 datasets using the ViT and Swin architectures.

TABLE-US-00011
TABLE 11
Top-1 accuracy comparison of a self-supervised (SS) trained CNN with the present approach.

  Model       Initialization   CIFAR100   Tiny-ImageNet
  ResNet-18   SS               65.00      53.48
  ViT         SS               79.15      63.36

TABLE-US-00012
TABLE 12
Comparison of other self-supervised learning techniques with the present method using the basic ViT architecture.

  Method    Tiny-ImageNet   CIFAR10   CIFAR100
  SimCLR    58.87           93.50     74.77
  MOCO-V3   52.39           93.95     72.22
  Present   63.36           96.41     79.15

TABLE-US-00013
TABLE 13
The present training method is efficient in terms of epochs used and outperforms the current approach in Top-1 accuracy.

  Method          Epochs                                     CIFAR100
  ViT-Drloc       600 (supervised)                           68.29
  ViT (Present)   200 (self-supervised) + 100 (supervised)   76.08

TABLE-US-00014
TABLE 14
Top-1 accuracy comparison of 3-layer MLP with 1-layer MLP training.

  Self-supervised MLP layers   CIFAR100   Tiny-ImageNet
  1-Layer                      76.69      60.54
  3-Layer (Present)            79.15      63.36

TABLE-US-00015
TABLE 15
Top-1 accuracy comparison of the projection head size used during self-supervised training.

  Head size        CIFAR100   Tiny-ImageNet
  65536            77.42      60.77
  1024 (Present)   79.15      63.36

[0094] FIG. 12A is a flow diagram of segmentation of microscopic cell images. Cell segmentation is usually the first step for downstream single-cell analysis in microscopy image-based biology and biomedical research. Deep learning has been widely used for image segmentation, but it is hard to collect a large number of labeled cell images to train models because manually annotating cells is extremely time-consuming and costly. Furthermore, the datasets used are often limited to one modality and lacking in diversity, leading to poor generalization of trained models. The present self-supervised to supervised learning approach provides a solution for cell segmentation that can be applied to various microscopy images across multiple imaging platforms and tissue types.

[0095] The present training method is applied to cell image segmentation, for example as seen in FIG. 12B. A microscopic image may initially contain multiple cells 1252. An individual cell image 1254 may be extracted from the initial image. The present framework is used to produce a segmented image identifying the boundary of the cell object 1256. A sequence of time-lapse images can be segmented in order to track movement of the cell object.

[0096] An initial microscopic image can include a whole-slide image (approximately 10,000 × 10,000 pixels). The present learning method can train on a dataset of highly varied images of cells containing segmented objects. The microscopic images may include microbe species or individual human cells, as well as colonies of cells. Microbe, or microorganism, species include bacteria. Cell type images can span microscope modalities (e.g., confocal microscopy, stereo microscopy, time-lapse imaging, super resolution microscopy), time resolutions, and magnifications.

[0097] The present learning method can be applied to develop deep learning models for single-cell analysis, including models for cell segmentation (whole-cell and nuclear) in 2D and 3D images as well as cell tracking in 2D time-lapse datasets. These deep learning models are applicable to data ranging from multiplexed images of tissues to dynamic live-cell imaging movies.

[0098] FIG. 13 is a block diagram illustrating an example computer system for implementing the machine learning training and inference methods according to an exemplary aspect of the disclosure. The computer system may be an AI workstation configured with an operating system, such as Ubuntu Linux OS, Windows, a version of Unix OS, or Mac OS. The computer system 1300 may include one or more central processing units (CPU) 1350 having multiple cores. The computer system 1300 may include a graphics board 1312 having multiple GPUs, each GPU having GPU memory. The graphics board 1312 may perform many of the mathematical operations of the disclosed machine learning methods. The computer system 1300 includes main memory 1302, typically random access memory RAM, which contains the software being executed by the processing cores 1350 and GPUs 1312, as well as a non-volatile storage device 1304 for storing data and the software programs. Several interfaces for interacting with the computer system 1300 may be provided, including an I/O Bus Interface 1310, Input/Peripherals 1318 such as a keyboard, touch pad, mouse, Display Adapter 1316 and one or more Displays 1308, and a Network Controller 1306 to enable wired or wireless communication through a network 99. The interfaces, memory and processors may communicate over the system bus 1326. The computer system 1300 includes a power supply 1321, which may be a redundant power supply.

[0099] In some embodiments, the computer system 1300 may include a server CPU and a graphics card by NVIDIA, in which the GPUs have multiple CUDA cores. In some embodiments, the computer system 1300 may include a machine learning engine 1312.

[0100] In summary, an effective strategy is provided for training vision transformers on small-scale low-resolution datasets without large-scale pre-training. The present training method enables learning of self-supervised inductive biases directly from the small-scale datasets. The network is initialized with the weights learned through self-supervision and then fine-tuned on the same dataset during supervised training. Extensive experiments demonstrate that the present training method serves as a better initialization scheme and hence trains ViTs from scratch on small datasets while performing favorably with respect to conventional state-of-the-art methods. Further, the present training method can be used in a plug-and-play manner with different ViT designs and training frameworks without any modifications to the architectures or loss functions.

[0101] Numerous modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.