Image Recognition
21/06/2021

AI in Retail: How Infilect Shortens Enterprise AI Setup with Self-Supervised Learning

With its suite of Image Recognition (IR) and Visual Intelligence (VI) based products, Infilect is the leading Enterprise SaaS provider to the worldwide retail industry. With its focus on environmental sustainability, Infilect serves retail companies around the world and works to improve the efficiency of the worldwide retail supply chain.

For instance, our Image Recognition AI helps retail manufacturers (think consumer goods, packaged ready-to-eat foods, beverages, and more) implement image recognition technology and automation inside third-party brick-and-mortar stores to gather critical shelf data. Gathering in-store data helps retail manufacturers understand how consumers see their products on the shelf and what factors influence their purchasing decisions. This helps optimise the production, distribution, and placement of products in order to reduce wastage and improve product sales.

Thus, these 'in-store insights' prove essential for retail decision makers, as they can have a direct impact on retail sales performance. Every retail brand, i.e., its sales and marketing teams, wants to ensure that all the merchandising activities undertaken inside a store, from product placement to product stocking to promotional activities (in short, "Retail Execution"), are implemented appropriately with a very low margin of error.

This brings us to the topic of today's blog post: how do we train highly accurate Image Recognition AI systems that can learn from a small amount of labeled data, so that the error margin stays very low?

“If intelligence is a cake, the bulk of the cake is self-supervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).” - Yann LeCun, head of Facebook AI
Image Courtesy - Analytics Vidhya-Medium

With the rapid progress made by the worldwide research community in developing high-fidelity AI systems, Self-Supervised Learning (SSL) has emerged as a very promising technique for answering precisely the question we posed above: how to learn with high accuracy from a small amount of labeled data. Hence, to kick off this series, we will share some of our experiences in developing and utilising state-of-the-art SSL techniques.


To begin with, it is quite well known by now that a large-scale labeled dataset is both a necessity and an impediment to training highly accurate AI models that work well in practice. A large volume of accurate, diverse labeled data can make training high-accuracy AI models a breeze. This is especially true for Image Recognition (i.e., Computer Vision) AI models that are trained using Deep Learning-based architectures. A carefully curated large-scale dataset not only results in high accuracy but also makes the AI models resilient to future (and hence unknown) inputs.


However, given that our Enterprise SaaS setting serves some of the Fortune 50 Consumer Packaged Goods (CPG) companies, it is imperative to develop and deploy AI models within just a couple of days; otherwise, product setup is delayed, and that can result in retail supply-chain losses. Furthermore, as part of our product setups, we are often required to identify thousands of given patterns in a multitude of retail environments, e.g., identifying a Stock Keeping Unit (SKU) that could be placed in various retail stores, ranging from Modern Trade (MT) supermarkets to General Trade (GT) Kirana stores.

Think about your favourite cookie, e.g., "Britannia Good Day Cashew Nut 10 Biscuits Pack", which could be placed on retail shelves in a variety of positions, angles, and quantities. How do we identify every occurrence of such an SKU by utilising just one reference pattern: the stock photo of the SKU, typically displayed on a white background? This is where SSL plays a significant role: it trains and initialises high-fidelity Deep Learning models using a large amount of unlabelled data, and these models can then be fine-tuned by providing just one reference pattern.

Image recognition of good day biscuits
Image Courtesy: Official Britannia Website

So how do we get a large amount of unlabelled data?

Today, as part of our InfiViz product, auditors and merchandisers of CPG companies capture a variety of retail-shelf photos from tens of thousands of stores across the globe. All of these photos are unlabelled when they arrive in our data lake. As the next step, through a carefully crafted AI workflow, we ingest hundreds of thousands of such photos every day in order to train our SSL modules. Every day, these modules fine-tune the underlying Deep Learning architectures. Thus, the model weights of the SSL modules are trained on a large volume of unlabelled data, and the modules learn the patterns of different SKUs in terms of their brands, packaging, sizes, shapes, etc.

But how exactly does an SSL module learn to automatically figure out different patterns? Imagine providing only half of a photo as input and asking the model to recreate the other half. That is exactly what these modules do. Given that we know the complete photo (and hence the ground truth), we can easily compute the difference between the half of the photo produced by our modules and the actual half. This difference gives our modules a signal to correct themselves, so that they start learning the common patterns of SKUs found on different retail shelves. It is precisely because of this self-correction process that these techniques are referred to as Self-Supervised Learning (SSL).
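The "recreate the other half" signal can be sketched in a few lines of plain Python. This is only an illustrative sketch, not our production code: the tiny grayscale `photo`, the helper names, the choice of mean squared error as the "difference", and the `mirror_guess` stand-in for a model are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[float]]  # rows of grayscale pixel values (assumed format)


def split_halves(image: Image) -> Tuple[Image, Image]:
    """Split an image into its left (visible) and right (hidden) halves."""
    mid = len(image[0]) // 2
    visible = [row[:mid] for row in image]
    hidden = [row[mid:] for row in image]
    return visible, hidden


def reconstruction_error(predicted: Image, actual: Image) -> float:
    """Mean squared pixel difference between the predicted and the true half."""
    total, count = 0.0, 0
    for p_row, a_row in zip(predicted, actual):
        for p, a in zip(p_row, a_row):
            total += (p - a) ** 2
            count += 1
    return total / count


def mirror_guess(visible: Image) -> Image:
    """Hypothetical stand-in for a model: guess the hidden half by mirroring."""
    return [list(reversed(row)) for row in visible]


# A toy 2x4 "photo" that happens to be left-right symmetric.
photo = [
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.2, 0.2, 1.0],
]
visible, hidden = split_halves(photo)
loss = reconstruction_error(mirror_guess(visible), hidden)
print(loss)
```

No human label is involved here: the hidden half of the photo itself acts as the ground truth, and a real module would adjust its weights to drive this error down across many photos.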


Let’s simplify the problem to a simple Cat vs Dog classifier. Imagine that you have tens of thousands of images in a folder that are of cats and dogs but you don’t know which photo is of a cat and which is of a dog.


Now, this is how SSL can help us separate them neatly into two folders, one for cats and the other for dogs.

  • Step 1: Imagine that we divide every image into 4 parts, rearrange them randomly, and always ask the SSL module to predict the original photo (that is, the arrangement of the 4 parts that forms a meaningful cat image).
Object detection of a cat
Image Courtesy: Analytics Vidhya-Medium
  • Step 2: We now train a Deep Learning network with the above SSL technique so that the weights in the network learn to recognise various patterns, such as the sizes, shapes, and colours of cats and dogs found in the wild. This is where the images in the entire folder, i.e., the unlabelled data, are shown to the SSL modules.
Image classification
Image Courtesy: Analytics Vidhya-Medium
  • Step 3: We now fine-tune the network from step 2 with a single-digit number of images of cats and dogs to produce a binary classifier that takes an image as input and produces 0 for a cat and 1 for a dog. This is where a small amount of labeled data is used to fine-tune the Deep Neural Network trained in step 2.
Image recognition - Extraction layers
Image Courtesy: Analytics Vidhya-Medium
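The puzzle in step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the image is a small grid of integers, the helper names are hypothetical, and the training target is represented as the tile order that restores the photo, which is one common way to pose this kind of task.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values (assumed format)


def split_into_quadrants(image: Image) -> List[Image]:
    """Cut an image into 4 tiles: top-left, top-right, bottom-left, bottom-right."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [row[:w] for row in image[:h]],
        [row[w:] for row in image[:h]],
        [row[:w] for row in image[h:]],
        [row[w:] for row in image[h:]],
    ]


def make_jigsaw_example(image: Image, rng: random.Random) -> Tuple[List[Image], List[int]]:
    """Shuffle the 4 tiles; the shuffle order is the free, label-less target."""
    tiles = split_into_quadrants(image)
    order = list(range(4))
    rng.shuffle(order)
    shuffled = [tiles[i] for i in order]
    return shuffled, order


def reassemble(shuffled: List[Image], order: List[int]) -> Image:
    """Undo the shuffle: what a perfectly trained module would output."""
    tiles: List[Image] = [[] for _ in range(4)]
    for position, original_index in enumerate(order):
        tiles[original_index] = shuffled[position]
    top = [left + right for left, right in zip(tiles[0], tiles[1])]
    bottom = [left + right for left, right in zip(tiles[2], tiles[3])]
    return top + bottom


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
shuffled, order = make_jigsaw_example(image, random.Random(0))
restored = reassemble(shuffled, order)
print(restored == image)
```

Because we did the shuffling ourselves, the correct answer is always known, so every unlabelled image in the folder becomes a training example without anyone tagging it as a cat or a dog.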

With just a few samples of cats and dogs, the network, which is already trained on a huge number of photos, can quickly learn to identify cats and dogs. This is exactly what we do when we have to identify thousands of SKUs from millions of shelf images captured in the wild.
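To give a feel for why a handful of labeled samples can be enough, here is a deliberately simplified sketch. It does not fine-tune a network; it swaps in a nearest-centroid classifier over feature vectors that we assume came from the pretrained network. The 2-dimensional feature values and function names are made up for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Average a list of feature vectors component by component."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fit_centroids(labelled: Dict[int, List[Vector]]) -> Dict[int, Vector]:
    """Summarise each class by the mean of its few labeled feature vectors."""
    return {label: centroid(vectors) for label, vectors in labelled.items()}


def predict(centroids: Dict[int, Vector], feature: Vector) -> int:
    """Return the label whose centroid is closest to the given feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], feature))


# Hypothetical features from the pretrained network: 0 = cat, 1 = dog.
few_shot = {
    0: [[0.9, 0.1], [0.8, 0.2]],
    1: [[0.1, 0.9], [0.2, 0.7]],
}
centroids = fit_centroids(few_shot)
prediction_cat = predict(centroids, [0.85, 0.15])
prediction_dog = predict(centroids, [0.15, 0.80])
print(prediction_cat, prediction_dog)
```

The point of the sketch is that once the features already separate cats from dogs, two labeled examples per class are sufficient to name the two groups; the same idea carries over to matching a shelf crop against one reference image per SKU.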

For more information, refer to "Self-Supervised Learning in Real-world Computer Vision Applications", which provides a crisp summary of the state of the art in SSL. We hope that you enjoyed this blog post. To know more, please get in touch with us.

We are rapidly expanding our AI engineering team to build cutting-edge Computer Vision techniques that solve the radically different challenges we face when applying AI at a global scale. Our AI today automatically indexes thousands of SKUs every day to positively impact the manufacturing, distribution, and placement of essential products.

If you wish to join us on this journey, please leave a comment in the form below or explore open opportunities on our Careers page.

Author details
Reach out to Vijay on LinkedIn