Facebook’s Product Recognition System Revolutionizes Online Shopping
A year ago, Facebook launched a unified computer vision model, GrokNet. Now, the social media giant is looking to use the model to power new applications on Facebook, such as product tagging, product suggestions, visual search, and more.
Facebook AI built the model to serve the world's largest social media platform. GrokNet is currently live on Facebook Marketplace, and the company plans to extend it to new applications across Facebook and Instagram.
How GrokNet works
GrokNet identifies which products appear in an image and predicts their categories, according to the Facebook blog. Unlike previous models, Facebook's product recognition system is an all-in-one model that scales to billions of photos across verticals including fashion, automotive, and home decor.
For example, when a seller posts an image on their Facebook page, the AI helps identify unlabeled items and suggests tags drawn from the seller's product catalog. Then, when a user views an untagged post from that seller, the system recommends similar products from the same catalog.
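A common way to implement this kind of tag suggestion is to embed both the post image and the catalog products in a shared vector space and return the nearest catalog entries. The sketch below illustrates that idea with a toy 4-dimensional embedding space and made-up product vectors; the blog does not disclose GrokNet's actual embedding dimensions or retrieval pipeline.

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between a query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return rows @ query

def suggest_tags(post_embedding, catalog_embeddings, catalog_names, top_k=2):
    """Return the top-k catalog products most similar to the post image."""
    scores = cosine_similarity(post_embedding, catalog_embeddings)
    best = np.argsort(scores)[::-1][:top_k]
    return [catalog_names[i] for i in best]

# Hypothetical catalog: each product is a 4-d embedding.
catalog = np.array([
    [0.90, 0.10, 0.00, 0.00],  # "leather sofa"
    [0.00, 0.80, 0.20, 0.00],  # "floor lamp"
    [0.85, 0.15, 0.10, 0.00],  # "fabric sofa"
])
names = ["leather sofa", "floor lamp", "fabric sofa"]

# Embedding of the seller's untagged photo (close to the sofas).
post = np.array([0.88, 0.12, 0.05, 0.00])
print(suggest_tags(post, catalog, names))  # the two sofas rank above the lamp
```

In production such a lookup would run against an approximate-nearest-neighbor index rather than a brute-force scan, but the ranking principle is the same.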
The visual below illustrates this. “These are only visual demonstrations – the exact experiences of the models may vary,” wrote Tamara Berg, head of scientific research at Facebook AI.
How GrokNet works (Source: Facebook)
Evolution of GrokNet
GrokNet grew out of an AI research project whose initial applications targeted Facebook Marketplace. The model analyzed search queries such as "mid-century modern sofa," matched them against the search index, and surfaced the most relevant results.
"With billions of product images uploaded to stores on Instagram and Facebook by sellers, it's hard to predict the right product at any given time," Berg said. Facebook has since extended the technology to other products. For example, Instagram users can now find similar dresses by tapping on a picture. "While it's still early days, we believe it will improve mobile shopping by making even more Instagram images shoppable," said Sean Bell, researcher at Facebook AI.
How GrokNet is different
Scaling product recognition beyond supervised learning and manual labeling remains a challenge. Additionally, the space of possible attribute-object combinations grows quickly, and many combinations appear only rarely in the data.
"We built a new model that learns from certain attribute-object pairs and generalizes to unseen combinations. So if you train on blue cars, blue skirts, and blue skies, you would still be able to recognize blue pants even if your model has never seen them during training," Facebook said. The new compositional framework was trained on 78 million public Instagram images, building on Facebook's earlier research that uses hashtags as weak supervision to achieve state-of-the-art image recognition.
Architecture of the compositional framework (Source: Facebook)
Facebook incorporated a new compositional framework that takes the weights of attribute and object classifiers and learns how to compose them into "attribute-object classifiers." "This allows for predicting combinations of attributes and objects not seen during training, and surpasses the standard approach of predicting individual attributes and objects," Berg said. In other words, it can accommodate millions of images and hundreds of thousands of fine-grained class labels and quickly generate predictions for new verticals.
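The core idea can be sketched in a few lines: take the learned weight vector of an attribute classifier and of an object classifier, compose them into a single attribute-object classifier, and score an image feature against the composed weights. Facebook learns the composition function; the toy example below substitutes a simple vector sum, and all weight values are hypothetical.

```python
import numpy as np

# Hypothetical weight vectors of individual attribute and object
# classifiers in a 4-d image-feature space.
attribute_w = {
    "blue": np.array([1.0, 0.0, 0.0, 0.0]),
    "red":  np.array([0.0, 1.0, 0.0, 0.0]),
}
object_w = {
    "car":   np.array([0.0, 0.0, 1.0, 0.0]),
    "pants": np.array([0.0, 0.0, 0.0, 1.0]),
}

def compose(attr, obj):
    """Compose an attribute-object classifier from individual weights.
    Facebook learns this composition; a plain sum stands in here."""
    return attribute_w[attr] + object_w[obj]

def score(image_feature, attr, obj):
    """Score how well an image feature matches an attribute-object pair."""
    return float(compose(attr, obj) @ image_feature)

# An image feature that activates the "blue" and "pants" dimensions.
blue_pants = np.array([0.9, 0.1, 0.0, 0.8])

# "blue pants" was never observed as a pair, yet it scores highest.
pairs = [(a, o) for a in attribute_w for o in object_w]
best = max(pairs, key=lambda p: score(blue_pants, *p))
print(best)  # ('blue', 'pants')
```

Because the composed classifier is built from parts seen individually during training, the model can rank a never-seen pair like "blue pants" above seen pairs such as "blue car."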
Facebook sampled objects and attributes from around the world while collecting the data to train these models. "Although the field of artificial intelligence is just starting to understand the challenges of fairness in AI, we are continually working to understand and improve the way our products work for everyone," said Bell.
Towards a multimodal model
In addition, to improve understanding of the content on its platform, Facebook leverages state-of-the-art multimodal advances across formats (image, text, and more). As a result, it has significantly improved the accuracy of product categorization.
Facebook combines the visual signals from the image with the associated text description to produce the model's final prediction. "We have found a great formula for a multimodal model, which includes a multitude of artificial intelligence frameworks and tools," Berg wrote. The formula includes Facebook AI's multimodal bitransformer, generalized as the MMF Transformer in Facebook AI's open-source Multimodal Framework (MMF), and a Transformer encoder pre-trained on public Facebook posts. The team found that early-fusion multimodal transformers outperform late-fusion architectures.
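The early-versus-late fusion distinction can be shown schematically. In early fusion, one encoder attends over the image and text tokens together, so cross-modal interactions happen inside the encoder; in late fusion, each modality is encoded separately and only the pooled outputs are combined. The sketch below uses mean-pooling as a stand-in for a transformer encoder, with random token embeddings; it illustrates the data flow, not Facebook's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(tokens):
    """Stand-in for a transformer encoder: mean-pool token embeddings.
    (A real encoder applies self-attention across all tokens it receives.)"""
    return tokens.mean(axis=0)

image_tokens = rng.normal(size=(5, 8))  # e.g., region features from an image
text_tokens = rng.normal(size=(7, 8))   # e.g., word-piece embeddings

# Early fusion: a single encoder sees image AND text tokens jointly.
early = encoder(np.concatenate([image_tokens, text_tokens], axis=0))

# Late fusion: each modality is encoded on its own; the pooled
# representations are merged only at the end.
late = (encoder(image_tokens) + encoder(text_tokens)) / 2.0

print(early.shape, late.shape)  # both produce an 8-d fused representation
```

Even with this trivial pooling encoder, the two strategies weight the modalities differently; with a real self-attention encoder, early fusion additionally lets text tokens attend to image tokens and vice versa, which is what drives the accuracy gains reported here.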
For images without textual details, Facebook added a modality dropout technique during training: it randomly removes the text or the image when both modalities are present, making the model robust to missing inputs. Compared to vision-only models, this approach yielded significant accuracy improvements. Facebook plans to extend these multimodal attribute models to other verticals.
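Modality dropout is simple to express in code: with some probability, zero out one of the two inputs whenever both are available, so the model cannot rely on always having both. The sketch below uses a hypothetical drop probability of 0.3; the blog does not state the rate Facebook used.

```python
import numpy as np

def modality_dropout(image_feat, text_feat, rng, p=0.3):
    """With probability p, zero out one modality when both are present,
    so the model learns to predict well from either input alone."""
    if image_feat is not None and text_feat is not None and rng.random() < p:
        if rng.random() < 0.5:
            image_feat = np.zeros_like(image_feat)  # drop the image
        else:
            text_feat = np.zeros_like(text_feat)    # drop the text
    return image_feat, text_feat

rng = np.random.default_rng(42)
img, txt = np.ones(4), np.ones(4)
batches = [modality_dropout(img, txt, rng) for _ in range(1000)]
frac = sum(1 for i, t in batches if not (i.any() and t.any())) / len(batches)
print(round(frac, 2))  # roughly 0.3: one modality dropped about 30% of the time
```

Note that at most one modality is ever dropped per example; dropping both would leave the model with no signal at all.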