Image Recognition

AI-in-Retail: Self-Supervised Learning in Computer Vision

Let us look at a recent advancement in Computer Vision using Self-Supervised Learning: "Emerging Properties in Self-Supervised Vision Transformers". The authors employ a variant of a previous SSL method and discover that Self-Supervised Vision Transformers naturally exhibit features useful for semantic segmentation, and that they outperform most existing methods on image retrieval tasks.

The approach, "Self-Distillation with No Labels" (DINO), interprets the existing "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning" approach from a Knowledge Distillation perspective, with some modifications of its own, to produce SOTA Vision Transformers that naturally segment the foreground from the background, along with other useful features that help in retrieval problems.

Before looking into DINO in depth, let us understand what Self-Supervised Learning is, how BYOL proposes an elegant way of extracting useful representations out of unlabelled data, and how DINO improves on top of BYOL.


Till now, most of the advancements in Deep Learning have focused on Supervised Learning, where the assumption is that loads of labeled data are available. While there is an active stream of research on generating synthetic data, it is not always feasible to generate seamless real-world data.

Even if loads of data are available, labeling them is expensive in terms of time and effort. Therefore there will always be scenarios where labeled data is limited. So how can we ensure that neural networks achieve high accuracy under such conditions?

Q. What if there is a way to learn useful representations out of the abundant unlabelled data?

This paradigm of learning representations from unlabelled data to improve a downstream network's performance (though the improvement is not guaranteed) is called Self-Supervised Learning. The different approaches to learning such representations are called pretext tasks.

Solution - Self-Supervised Learning(SSL)

Now that we know what Self-Supervised Learning is, let us look at the various pretext tasks available:

  1. Auto Encoders - A neural network learns to reconstruct its input image. The representations learned by the encoder can then be used for downstream tasks.
  2. Exemplar CNNs (Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks) - Create a set of class examples for every unlabelled image (treating each image as its own class) by applying various augmentations, and train a classifier on top of them.
  3. GANs - Similar to Auto Encoders, image generation using GANs can also produce useful representations.
  4. Contrastive Learning - The most promising of the lot. There has been a multitude of SSL approaches using contrastive learning, such as SimCLR, SwAV, and BYOL.

Q. We now know that there are a variety of methods available to learn representations from unlabelled data. Great! But what next? How do we make use of these representations?

In general, a network trained using SSL methods is used as initialization for the training on limited labeled data.

 Note: We won't be looking into all of the pretext tasks in detail. The scope of this article is limited to Contrastive Learning.

Contrastive Learning

Let us try to understand how neural networks learn representations using Contrastive Learning.

Contrastive Learning is one of many paradigms that fall under Deep Distance Metric Learning, where the objective is to learn a distance in a low-dimensional space that is consistent with the notion of semantic similarity. In simple terms (considering the image domain), it means learning similarity among images, where the distance is small for similar images and large for dissimilar ones. Siamese/twin networks are used: a positive or negative image pair is fed in, and similarity is learned using the contrastive loss.
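As a rough sketch (not any particular paper's implementation), the classic pairwise contrastive loss described above can be written as follows; the function name and margin value are our own illustrative choices:

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, is_positive, margin=1.0):
    """Classic pairwise contrastive loss: pull positive pairs
    together, push negative pairs apart beyond a margin."""
    d = np.linalg.norm(emb_a - emb_b)           # Euclidean distance between embeddings
    if is_positive:
        return 0.5 * d ** 2                     # positives: penalize any distance
    return 0.5 * max(0.0, margin - d) ** 2      # negatives: penalize only if closer than margin
```

A positive pair contributes zero loss only when its embeddings coincide, while a negative pair contributes zero loss once it is pushed past the margin.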

Q. Okay, so we can train a contrastive model only if we know the positives and negatives for a given dataset. But the whole point of SSL is to train without labels right?

Yes, hence the below solution has been proposed.

  1. Given an unlabelled dataset, take any image x, apply an augmentation to it, and create another view xa.
  2. Train a contrastive model by treating x and xa as a positive pair and every other image in the same batch as a negative.
  3. This approach, surprisingly, learns good representations that can be transferred to downstream tasks.

Q. The above approach is what forms the crux of SimCLR and SwAV methods. This is good to know, but why are we looking at SimCLR or SwAV when DINO is based on BYOL?

The important thing to note here is, when we treat every other sample in the batch as a negative (step 2 above), there is a chance that a similar image falls among the negatives, and this, in turn, produces not-so-ideal representations.

To get rid of negatives, as well as to avoid the possibility of learning dissimilar representations for similar images, another contrastive-learning-based SSL approach called BYOL was introduced, in which a neural network is trained with only positive pairs yet still learns very good representations.

Bootstrap Your Own Latent (BYOL)

BYOL is another Contrastive Learning-based SSL technique. It learns very good representations even though it uses no negative pairs during training.


  1. Take two networks with the same architecture but different parameters: a 'target' network (randomly initialized, and never updated by gradients) and a trainable 'online' network.
  2. Take an input image t and create an augmented view t1.
  3. Pass image t through the online network and image t1 through the target network to extract the predicted and target embeddings, respectively.
  4. Minimize the distance between the two embeddings (MSE loss).
  5. Update the target network, which is an exponential moving average of the previous online networks.
  6. Repeat steps 2–5.
Bootstrap Your Own Latent

To solve the problem of representation collapse (the trivial solution), BYOL's target network receives no gradient updates: it starts from a random initialization and is thereafter maintained as an exponential moving average of the online network, while the trainable online network learns the representations.
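The two moving parts described above can be sketched in a few lines (a minimal illustration, not BYOL's full predictor/projector pipeline; the function names and the momentum value are ours):

```python
import numpy as np

def byol_loss(online_pred, target_proj):
    """BYOL-style loss: MSE between the L2-normalized online prediction
    and the L2-normalized target projection (equals 2 - 2*cosine similarity)."""
    p = online_pred / np.linalg.norm(online_pred)
    t = target_proj / np.linalg.norm(target_proj)
    return np.sum((p - t) ** 2)

def ema_update(target_params, online_params, tau=0.99):
    """The target network's weights are an exponential moving average
    of the online network's weights; no gradients flow into the target."""
    return {k: tau * target_params[k] + (1 - tau) * online_params[k]
            for k in target_params}
```

Because the loss operates on normalized embeddings, only the direction matters: embeddings that agree up to scale incur zero loss.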


We now know what Self-Supervised Learning is, where Contrastive Learning fits into SSL and how BYOL learns rich representations from just positive pairs.

Let us now look at how DINO makes use of the approach used in BYOL.

Self Distillation with No Labels (DINO)

Self-Distillation with No Labels (DINO) is a recent SSL technique that follows an approach very similar to BYOL, but differs in the way the inputs are fed to the network and also makes use of Vision Transformers as the feature extractor.


  1. Take two networks with the same architecture but different parameters: a 'teacher' network (which receives no gradient updates) and a trainable 'student' network.
  2. Take an image x and apply two different augmentations to it to produce two different views x1 and x2.
  3. Pass image x1 through the student network and image x2 through the teacher network to extract the predicted and target embeddings, respectively.
  4. The output of the teacher network is centered with a mean computed over the batch.
  5. Both networks output a K-dimensional feature embedding, which is then normalized with a softmax to give p1 and p2.
  6. To make sure both embeddings are similar, a cross-entropy loss is used.
  7. Update the teacher network, which is an exponential moving average of the previous student networks.
Although the method seems similar to BYOL, there is one key difference in the way the inputs are fed in DINO: multi-crop training.

The input to the teacher network is the complete image, while the student network takes in the complete image as well as random crops of it. An example pair could be <a random crop of the whole image, the whole image>. The fact that the teacher sees only the entire image while the student sees both kinds forces the student to attend to the foreground objects in the image.
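The multi-crop pairing scheme can be sketched as follows (a hypothetical helper of ours, only to illustrate which view pairs enter the loss: the teacher sees global views, the student sees everything, and identical views are skipped):

```python
def multicrop_pairs(global_views, local_views):
    """Return (student_view, teacher_view) pairs for the DINO-style loss.
    Teacher: global views only. Student: global + local views.
    A view is never paired with itself."""
    student_views = global_views + local_views
    pairs = []
    for t in global_views:          # teacher only ever sees global views
        for s in student_views:     # student sees all views, global and local
            if s != t:
                pairs.append((s, t))
    return pairs
```

With 2 global and 3 local views, each global teacher view is paired with the 4 other student views, giving 8 loss terms per image.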

Self-Distillation with No Labels

Now that we understand how DINO works, let us see why there is this new notion of Student and Teacher networks here, which was not the case with BYOL.

Knowledge Distillation Interpretation

Knowledge Distillation is a technique in which a smaller student network is trained to match the output of a relatively larger teacher network. It is commonly used for model compression, where the objective is to train a smaller (student) network that matches the accuracy of a larger (teacher) network.
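A minimal sketch of this idea (Hinton-style soft-target distillation in cross-entropy form; the function name and temperature are our own illustrative choices):

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """Train the student to match the teacher's softened output distribution.
    A temperature > 1 softens both distributions, exposing the teacher's
    relative confidences across wrong classes ('dark knowledge')."""
    def softmax(x, t):
        x = x / t
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p_teacher = softmax(teacher_logits, temperature)      # soft targets (no gradient)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -np.mean((p_teacher * log_p_student).sum(axis=-1))
```

The loss is minimized exactly when the student's distribution matches the teacher's, which is the same cross-entropy-between-two-networks structure that DINO reuses without labels.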

When both the student and the teacher share the same architecture, it is called Self-Distillation.

One can now draw comparisons between BYOL and Self-Distillation. We can see that the target network (in BYOL) can be seen as the teacher network here and the online network (in BYOL) can be seen as the student network.

And since this whole process is self-supervised, i.e. uses no labels, the approach is called Self-Distillation with No Labels, aka DINO.

Comparison with other SSL approaches

Even though the approach followed in DINO seems similar to BYOL, it beats BYOL along with other SSL approaches such as SimCLR, SwAV, etc.

Also, DINO when combined with Vision Transformers (ViT) produces SOTA results on ImageNet - even better than a supervised baseline.

Linear and k-NN classification on ImageNet

This clearly shows that multi-crop training plays a significant role in DINO’s performance.

Properties of SSL ViT

The authors have observed that SSL, when combined with ViT specifically, exhibits strong semantic segmentation properties as well as useful features for image retrieval tasks, compared to supervised ViT or ConvNets.

Object Segmentation

Object segmentation: self-supervised vs supervised attention in an image

Image Retrieval

Image retrieval

Real-world applications

DINO can be used for a variety of applications, ranging from object segmentation and image classification with limited labeled data to image retrieval.

Q. How can DINO be exploited for image classification tasks?

Given a set of unlabelled images, we can train a network on that data using the DINO method and then either use its weights directly to find k-nearest neighbors or use the weights as initialization for the downstream task (which is usually trained on a limited labeled set).
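The k-nearest-neighbor route can be sketched as follows (a toy illustration on frozen feature vectors, assuming the embeddings have already been extracted by a DINO-pretrained backbone; the function name is ours):

```python
import numpy as np

def knn_predict(query_feat, train_feats, train_labels, k=3):
    """Classify a query by majority vote among its k nearest neighbors
    in the frozen feature space, using cosine similarity."""
    q = query_feat / np.linalg.norm(query_feat)
    f = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = f @ q                            # cosine similarity to every training sample
    nearest = np.argsort(sims)[-k:]         # indices of the k most similar samples
    votes = train_labels[nearest]
    return np.bincount(votes).argmax()      # majority vote
```

Note that no gradient training is involved at all here: the quality of the prediction rests entirely on how well the self-supervised features separate the classes.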

We hope you got the intuition behind Self-Supervised Learning, DINO, and the motivation behind using it.

We will be back with more interesting papers in the future.


We are rapidly expanding our AI engineering team towards building cutting-edge Computer Vision techniques to solve radically different challenges we face when applying AI at a global scale.

Our AI today automatically indexes thousands of SKUs every day to positively impact the manufacturing, distribution, and placement of essential products. If you want to reach out to us please leave us a message on the form below.

Author details
If you have any questions, you can reach Raghul Asokan through Linkedin.