Image Similarity Search System
Problem Statement
Design an image search engine that takes an image as input, searches for similar images, and returns links to the hosted matches.
Data
The Caltech101 dataset contains images from 101 object categories (e.g., “helicopter”, “elephant”, and “chair”) plus a background category for images that do not belong to any of the 101 object categories. Each object category has roughly 40 to 800 images, with most categories having about 50. Images are roughly 300×200 pixels. Provisions have been made to collect new labels and data through APIs.
Architecture
Data Store Component
Creation of an AWS S3 folder to store image data and serve it through hosted S3 links.
Creation of a MongoDB collection to store metadata and label information.
Creation of API endpoints for image data collectors and annotators to upload images and labels.
Creation of a Dockerfile for the pipeline.
Deploying the pipeline using AWS and GitHub Actions.
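The relationship between an uploaded image, its S3 location, and its metadata record can be sketched as follows. This is a minimal illustration, not the project's actual schema: the `ImageRecord` class, the `images/<label>/<filename>` key layout, and the bucket and region names are all assumptions. The URL follows the standard S3 virtual-hosted-style format, and the object is only reachable at that URL if the bucket's access policy allows public reads.

```python
from dataclasses import dataclass, asdict
from urllib.parse import quote


@dataclass
class ImageRecord:
    """Hypothetical metadata record for one uploaded image."""
    label: str      # e.g. a Caltech101 category name
    filename: str
    bucket: str     # assumed bucket name
    region: str     # assumed AWS region

    @property
    def s3_key(self) -> str:
        # Assumed key layout: one S3 "folder" per label.
        return f"images/{self.label}/{self.filename}"

    @property
    def public_url(self) -> str:
        # S3 virtual-hosted-style URL; quote() escapes unsafe characters
        # in the key while leaving "/" intact.
        return f"https://{self.bucket}.s3.{self.region}.amazonaws.com/{quote(self.s3_key)}"

    def to_document(self) -> dict:
        # Plain dict, in the shape that could be stored in a MongoDB collection.
        doc = asdict(self)
        doc["s3_key"] = self.s3_key
        doc["url"] = self.public_url
        return doc


record = ImageRecord("helicopter", "image_0001.jpg", "example-bucket", "us-east-1")
document = record.to_document()
```

Keeping the label in both the key and the metadata document means the label can be recovered from either store, at the cost of having to update both if an annotator corrects it.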
Model Trainer Component
Ingestion of Data from S3.
Preprocessing the images for Model Training.
Employing a pre-trained ResNet-34 architecture to generate embeddings.
Using Approximate Nearest Neighbors to build the embedding tree.
Uploading the model and artifacts to S3.
Creation of a Dockerfile for the pipeline.
Deploying the pipeline using AWS and GitHub Actions.
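To make the embedding tree idea concrete, here is a simplified, pure-Python sketch of one random-projection tree, the kind of structure that approximate nearest neighbor libraries such as Annoy build in bulk. The source does not name the library used, so this is a stand-in under stated assumptions: the function names, the `leaf_size` value, and the random 8-dimensional vectors (in place of real ResNet-34 embeddings) are all illustrative.

```python
import math
import random


def _is_left(vector, normal, offset):
    # Which side of the splitting hyperplane the vector falls on.
    return sum(n * v for n, v in zip(normal, vector)) > offset


def build_tree(vectors, ids, rng, leaf_size=4):
    """Recursively split ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # The hyperplane is the perpendicular bisector of two sampled points.
    a, b = rng.sample(ids, 2)
    normal = [x - y for x, y in zip(vectors[a], vectors[b])]
    offset = sum(n * (x + y) / 2 for n, x, y in zip(normal, vectors[a], vectors[b]))
    left = [i for i in ids if _is_left(vectors[i], normal, offset)]
    right = [i for i in ids if not _is_left(vectors[i], normal, offset)]
    if not left or not right:  # degenerate split, e.g. duplicate vectors
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(vectors, left, rng, leaf_size),
        "right": build_tree(vectors, right, rng, leaf_size),
    }


def query(tree, vectors, target, k=3):
    """Descend to one leaf, then rank only that leaf's items by distance."""
    node = tree
    while "leaf" not in node:
        side = "left" if _is_left(target, node["normal"], node["offset"]) else "right"
        node = node[side]
    return sorted(node["leaf"], key=lambda i: math.dist(vectors[i], target))[:k]


# Random vectors stand in for image embeddings in this sketch.
rng = random.Random(0)
embeddings = [[rng.random() for _ in range(8)] for _ in range(50)]
tree = build_tree(embeddings, list(range(len(embeddings))), rng)
neighbours = query(tree, embeddings, embeddings[7], k=3)
```

A single tree only inspects one leaf, so it can miss true neighbors that fall just across a split. Production libraries reduce this by building many trees and merging their candidates, trading index size for recall.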
Model Prediction Component
Downloading the model and artifacts.
Taking an input image from the user through an API.
Preprocessing the input.
Generating an embedding for the input using the previously built model architecture.
Employing Approximate Nearest Neighbors for similarity searching.
Getting the publicly hosted S3 links of the similar images.
Providing the links to the user.
Creation of a Dockerfile for the pipeline.
Deploying the pipeline using AWS and GitHub Actions.
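The final steps of the prediction flow, from a query embedding to a list of links, might look roughly like the sketch below. It makes several assumptions: the brute-force cosine ranking is only a stand-in for the approximate nearest neighbor lookup, the `similar_image_links` function and the record layout are hypothetical, and the URLs and three-dimensional embeddings are toy values.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_image_links(query_embedding, indexed_images, top_k=3):
    """Return the URLs of the top_k indexed images most similar to the query.

    Exhaustive scan for clarity; the real pipeline would ask the
    approximate nearest neighbor index for candidate ids instead.
    """
    ranked = sorted(
        indexed_images,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["url"] for item in ranked[:top_k]]


# Toy index: each entry pairs a hosted link with its stored embedding.
BASE = "https://example-bucket.s3.us-east-1.amazonaws.com/images"
indexed_images = [
    {"url": f"{BASE}/chair/image_0001.jpg", "embedding": [1.0, 0.0, 0.0]},
    {"url": f"{BASE}/elephant/image_0001.jpg", "embedding": [0.0, 1.0, 0.0]},
    {"url": f"{BASE}/chair/image_0002.jpg", "embedding": [0.9, 0.1, 0.0]},
]

links = similar_image_links([0.95, 0.0, 0.05], indexed_images, top_k=2)
```

Returning links rather than image bytes keeps the API response small and lets the client fetch the images directly from S3.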