A detailed introduction to Two Stage Object Detectors

Namrata Thakur
May 21, 2023

In the age of YOLOv8, why do we need to understand two-stage object detectors? Is it necessary, or is it redundant knowledge that can easily be skipped?

Let us enjoy a photo before starting this lengthy blog. Photo Credit: Me..! :)

This blog is written for students of machine learning (not PhD students though!!), more likely for someone who has just started exploring the different techniques in the field of object detection. It tries to explain the different architectures and why some of them are more popular today while others are used mainly in research projects rather than in other case studies. Let's start then!

The problem formulation is this: in a single image, there can be multiple objects of the same or different types, at multiple scales. So we have 3 tasks here: 1) detect the objects in the image, 2) draw a bounding box around each object, and 3) output the class label of each bounding box along with the box coordinates. The bounding box coordinates require 4 values: the x coordinate of the center, the y coordinate of the center, and the height and width of the box.
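To make the box format concrete, here is a tiny Python sketch (the function names are my own) converting between the (center, size) representation described above and the (corner, corner) representation that some libraries prefer:

```python
def cxcywh_to_corners(cx, cy, w, h):
    """Convert a (center-x, center-y, width, height) box to
    (x_min, y_min, x_max, y_max) corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_cxcywh(x_min, y_min, x_max, y_max):
    """Inverse conversion: corners back to center/size form."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2,
            x_max - x_min, y_max - y_min)

# Example: a 100-wide, 50-tall box centered at (200, 150)
print(cxcywh_to_corners(200, 150, 100, 50))  # (150.0, 125.0, 250.0, 175.0)
```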

The pipeline for object detection is mainly done through either of the two approaches: Single Stage Object Detector and Two Stage Object Detector.

Single Stage Object Detector

In a single-stage object detector, we go directly from the image to the classification scores and bounding box coordinates. The image is fed into a CNN feature extractor, and the extracted features are used directly for classification and for the regression of the bounding box coordinates. Single-stage object detectors are very fast and can be used for real-time object detection, but their performance is sometimes poorer than that of two-stage object detectors. Examples are the YOLO family, SSD, RetinaNet, etc.

Two-Stage Object Detector

The two-stage object detector divides the whole process into 2 steps:

Step 1: It first extracts the features using a CNN.

Step 2: It then extracts a set of regions of interest called object proposals, and the classification and localization happen only on these object proposals.

Two-stage object detectors are very powerful and extremely accurate, achieving very high mAP values. Hence, they are mostly used in domains like medical imaging, where classification accuracy is more important than speed. Examples of two-stage object detectors are the R-CNN family, SPP-Net, etc.

Below is the image depicting the two detector types:

Types of Object Detector

Let's first discuss the localization and classification tasks a little.

Localization and Classification

Suppose we have an image containing only one object, and we want to find the exact bounding box that encloses the object's position in the image. These coordinates have to be obtained through regression, so we have to map the image to the coordinates, and we use a CNN for this. We then train our network against the ground-truth coordinates using an L2 loss, in other words a squared loss. We have the CNN as a feature extractor, followed by a series of fully connected layers that produce the final regression output. We can also attach another set of fully connected layers that predicts the class scores for the object; the class scores are trained with a softmax (cross-entropy) loss. Below is an image describing it:

All images were taken from lecture slides of the CV3DST course of TUM
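To make this concrete, here is a minimal PyTorch-style sketch of such a network (the layer sizes and names are illustrative assumptions of mine, not taken from the slides): a shared convolutional feature extractor feeding two heads, one for class scores and one for the four box values.

```python
import torch
import torch.nn as nn

class LocalizeAndClassify(nn.Module):
    """Single-object localization: one class head + one box head
    on top of a shared CNN feature extractor."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(32 * 7 * 7, 256), nn.ReLU())
        self.class_head = nn.Linear(256, num_classes)  # trained with softmax/cross-entropy loss
        self.box_head = nn.Linear(256, 4)              # (cx, cy, w, h), trained with L2 loss

    def forward(self, x):
        h = self.fc(self.features(x))
        return self.class_head(h), self.box_head(h)

model = LocalizeAndClassify(num_classes=20)
scores, box = model(torch.randn(1, 3, 224, 224))  # scores: [1, 20], box: [1, 4]
```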

R-CNN Family

One of the most prominent architectures in the two-stage object detector segment is the R-CNN family of detectors. The first of the series is called Region-based CNN, or R-CNN.

R-CNN

In the R-CNN model, we use an algorithm called Selective Search (its explanation is beyond the scope of this blog) to extract roughly 2000 region proposals. We then only need to analyze these regions to produce the final output. We warp the region proposals to a fixed size and pass them through a CNN to classify the object in each region, judging every proposal to see if it contains an object like a person, cat, dog, etc.

R-CNN Architecture

We first extract the object proposals, then warp them to a fixed size of 227 × 227 and apply a CNN on top of every object proposal. From each CNN, we get the bounding box (bb) coordinates through regression and the classification scores through SVMs. If bounding boxes overlap, the convolutional computation for each pixel in the overlapping region is repeated two or three times, depending on how many boxes overlap there. So it is not the fastest of methods.
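As a rough sketch, the inference loop looks like the following (illustrative Python pseudocode; selective_search, warp, cnn, svms, and bbox_reg are hypothetical stand-ins for the real components):

```python
def rcnn_inference(image, selective_search, warp, cnn, svms, bbox_reg):
    """Illustrative R-CNN inference loop (all arguments are hypothetical
    stand-ins): one full CNN forward pass per region proposal."""
    detections = []
    for proposal in selective_search(image):           # ~2000 proposals
        crop = warp(image, proposal, size=(227, 227))  # warp to the fixed input size
        features = cnn(crop)                           # shared-weight CNN features
        scores = [svm(features) for svm in svms]       # one binary SVM per class
        box = bbox_reg(features, proposal)             # regression refines the proposal
        detections.append((scores, box))
    return detections
```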

Also, the bounding box regression doesn't predict the box coordinates from scratch; it just refines the bounding box location of the object proposal. What happens when the object (region) proposals we got from selective search do not match up with the actual objects? Since we are not training the selective search, we run the risk of missing objects. This is somewhat mitigated by the bounding box regression, which transforms the region proposal box toward the correct bounding box.
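Concretely, in the R-CNN papers the regression head predicts a transform (tx, ty, tw, th) of the proposal rather than absolute coordinates: the center is shifted proportionally to the proposal's size, and the width and height are rescaled exponentially. A minimal sketch of applying these deltas:

```python
import math

def apply_box_deltas(proposal, deltas):
    """Refine a proposal (px, py, pw, ph) with predicted deltas
    (tx, ty, tw, th), following the R-CNN parameterization."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    gx = px + pw * tx        # shift the center, scaled by proposal width
    gy = py + ph * ty        # shift the center, scaled by proposal height
    gw = pw * math.exp(tw)   # rescale the width
    gh = ph * math.exp(th)   # rescale the height
    return gx, gy, gw, gh

# Zero deltas leave the proposal unchanged; small deltas nudge it.
print(apply_box_deltas((100, 100, 50, 80), (0.0, 0.0, 0.0, 0.0)))
# (100.0, 100.0, 50.0, 80.0)
```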

All the convolutional neural networks (CNNs) share weights: we use the exact same convnet with the exact same weights and just apply it to each region proposal. If they didn't share weights, it wouldn't really work out, because we might have a different number of region proposals for each image, and even if we had the same 2000 region proposals for every image, it would be infeasible to train 2000 convnets separately.

Cons:

The biggest limitation is that it is very slow: we need one forward pass of our detector for each region proposal, which is very resource-intensive in real life. Also, the object proposal algorithm, selective search, is fixed, and the feature extractor, the SVM classifier, and the regression head are trained separately. So we are not exploiting the learning potential to its fullest, as the object proposal algorithm can never be improved by training the convnets.

Fast R-CNN

Fast R-CNN is a significant improvement over the R-CNN model.

Fall 2019 — University of Michigan Object Detection Lecture Slides

In Fast R-CNN, we have only one forward pass through the whole image, using one convnet applied to the entire image (instead of 2000 forward passes, one for every object proposal region). The final feature map of the convnet is then used as the base on which the object proposals define the Regions of Interest (ROIs): we extract features only at the regions we are interested in.

Before we feed these extracted features to the classifier and regressor, we have a series of fully connected (FC) layers. These FC layers expect a fixed-size input. That means the feature representations of all the object proposals, which have different sizes and aspect ratios, must be converted to a fixed size. The 'ROI Pooling' layer takes these differently sized feature maps and converts them to a fixed size so that the FC layers can work on them. Swapping the order of the CNN and the object proposal detection (the key difference between Fast R-CNN and R-CNN) lets us share a lot of computation across different image regions.

We process the whole image at high resolution through a single CNN. This CNN has no FC layers, only convolutional layers, and its output is a feature map containing convolutional features of the entire high-resolution image. This convnet is called the 'Backbone Network'. We use AlexNet, VGG, ResNet, etc. as the backbone.

We then run a region proposal algorithm like Selective Search on the image to get region proposals. We project the region proposals onto the feature map and then apply the cropping on the feature map itself.

After this, we apply a lightweight, small convnet called the 'Per-Region Network' that outputs the classification scores and the bounding box regression transforms for each detected region. The per-region computation is very fast because all the major computation has already been done in the backbone network, and only the last few layers run per region. We save computation by doing most of the work on a shared basis in the backbone network.
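Here is a hedged end-to-end sketch of this flow in PyTorch (the ResNet-18 backbone, image size, and hard-coded proposals are my own illustrative choices; torchvision.ops.roi_pool performs the pooling step explained in the next section):

```python
import torch
import torchvision

# Backbone: a ResNet-18 with its avgpool/FC layers removed, i.e. conv layers only.
resnet = torchvision.models.resnet18(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 800, 800)
feature_map = backbone(image)  # one forward pass for the whole image -> [1, 512, 25, 25]

# Proposals in (batch_index, x1, y1, x2, y2) image coordinates. A real system
# would get these from Selective Search; they are hard-coded here.
proposals = torch.tensor([[0., 50., 60., 250., 300.],
                          [0., 400., 100., 700., 500.]])

# Project the proposals onto the feature map (the backbone downsamples 32x)
# and pool every region to a fixed 7x7 spatial size for the FC heads.
regions = torchvision.ops.roi_pool(feature_map, proposals,
                                   output_size=(7, 7), spatial_scale=1 / 32)
print(regions.shape)  # torch.Size([2, 512, 7, 7])
```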

How to crop features:

In order to backpropagate, we need to backpropagate into the weights of the backbone network as well. So we need to crop and resize these features in a way that is differentiable, which is a little tricky. Let me introduce the concept of 'Region of Interest Pooling'.

Region Of Interest Pooling:

Say the backbone outputs a feature map of size L × K × C, where L × K is the spatial size (of the feature map, in this case) and C is the number of channels. The FC layers are trained to expect inputs of size H × W × C, so the feature map's output dimension and the FC layers' input dimension are not the same. On top of the feature map, we apply the object proposals. That means what we want to feed into the FC layers is not the full feature map but the 'small' feature map selected by the object proposal region (marked in green in the image). Since the object proposals can have any size, the selected feature maps can also have any size. We have to transform this 'detected' feature map into the dimension expected by the FC layers, i.e. H × W × C.

We have the output feature map of the backbone (yellow area) and the feature map selected by the object proposal (green area). If we zoom in on the selected feature map and put a grid of H × W (the dimension expected by the FC layers) on top of it, we can then create, through max-pooling, a feature map of size H × W × C: for each cell of the H × W grid, we compute the maximum value and use it to fill one pixel location of the H × W × C output. If we have a bigger object proposal, we still do the same pooling, just over a bigger number of pixels.

We can backpropagate through this operation (ROI Pooling) in the same way we backpropagate through a max-pool layer: we pass the gradient to the location corresponding to the maximum value.
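To make the mechanics explicit, here is a simplified, loop-based sketch of ROI pooling for a single region (real implementations such as torchvision.ops.roi_pool handle coordinate rounding and batching more carefully):

```python
import torch

def roi_pool_naive(feature_map, box, out_h, out_w):
    """Simplified ROI pooling for one region (illustrative, not optimized).
    feature_map: [C, L, K] tensor; box: (x1, y1, x2, y2) in feature-map
    coordinates. The region is divided into an out_h x out_w grid and each
    cell is max-pooled, so the gradient flows back to the argmax locations
    exactly as in an ordinary max-pool layer."""
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]  # the 'green' cropped feature map
    C, H, W = region.shape
    rows = []
    for i in range(out_h):
        cells = []
        for j in range(out_w):
            # Grid cell boundaries; a bigger proposal just means bigger cells.
            y_lo, y_hi = i * H // out_h, max((i + 1) * H // out_h, i * H // out_h + 1)
            x_lo, x_hi = j * W // out_w, max((j + 1) * W // out_w, j * W // out_w + 1)
            cells.append(region[:, y_lo:y_hi, x_lo:x_hi].amax(dim=(1, 2)))  # [C]
        rows.append(torch.stack(cells, dim=1))  # [C, out_w]
    return torch.stack(rows, dim=1)  # [C, out_h, out_w]

pooled = roi_pool_naive(torch.randn(512, 25, 25), (3, 4, 18, 20), 7, 7)
print(pooled.shape)  # torch.Size([512, 7, 7])
```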

Because of the above architecture, we see a significant improvement in speed.

Fast R-CNN is thus about 25 times faster than R-CNN. But this is still not the best model, as there is an even 'FASTER' detector: Faster R-CNN.

But that will be covered in the next blog, as this one has become quite lengthy..! There, I will also answer the question I asked at the very beginning: why do we need to understand these architectures, and where are they used in industry today?

That’s all for now. Thank you for reading to the end.

If you have any suggestions to improve it, please leave them in the comments..!!

References

  1. All the screenshots are taken from the lecture slides of the Technical University of Munich (TUM) CV3DST course and from the Fall 2019 lecture slides of the University of Michigan's object detection course.
  2. R-CNN: Girshick, R. (2013, November 11). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv.org. https://arxiv.org/abs/1311.2524
  3. Fast R-CNN: Girshick, R. (2015, April 30). Fast R-CNN. arXiv.org. https://arxiv.org/abs/1504.08083
