Real Time Object Detection with YOLO

The phrase “object detection” has become a buzzword in recent years, and the field has advanced rapidly. Its applications are remarkably diverse, from tracking objects and counting people to automating CCTV surveillance and annotating images and video.

The approaches used to tackle object detection are equally diverse, and they fall into two major categories: neural network approaches and non-neural approaches.

Non-neural approaches first extract hand-crafted features using techniques such as Scale-Invariant Feature Transform (SIFT) or Histogram of Oriented Gradients (HOG), and then classify them using techniques such as support vector machines (SVMs).

On the other end of the spectrum, neural networks are able to perform end-to-end object detection without any hand-defined features. They generally rely on convolutional neural networks (CNNs). Some examples are R-CNN, Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLO).

Real time object detection and its advantages

Consider an object detection model that takes a few seconds per image to detect objects. If we deploy this model where low latency is crucial, such as in a self-driving car, this delay is far too high to be of practical use. Even a fraction of a second could mean the difference between a fatal accident and a safe journey. For such scenarios we need a model that gives near real-time results: it should be able to detect objects and perform inference in a matter of milliseconds.

Slower models such as R-CNN and Faster R-CNN work well when real-time detection is not needed. However, they are too slow to be relied on when low latency is key.

Real-time models need to sense the environment, understand what is happening in a scene and react accordingly. The model should be able to both identify the objects and locate them by defining a bounding box around each one. In essence, we are performing two separate tasks here: identifying the objects in an image (object classification) and locating the objects with a bounding box (object localisation).

An example of real time object detection


YOLO was introduced in 2015 by Redmon et al. in their paper You Only Look Once: Unified, Real-Time Object Detection. In contrast to most deep-learning-based object detectors at the time, YOLO used a one-stage detection strategy. The name reflects an analogy with human vision: we take in a scene at a single glance, and the brain then infers what is in it. YOLO likewise processes the whole image in a single pass, which lets it overcome the speed problem that held back algorithms like R-CNN.

The algorithm frames object detection as a regression problem. It takes an image and simultaneously predicts bounding boxes and class probabilities in a single evaluation. YOLO uses features from the entire image to predict each bounding box. This approach makes it more prone to localisation errors, but far less likely to produce false positives on background regions.

Working of YOLO algorithm

The main aim of YOLO is to predict the class of an object and the bounding box that specifies the location of the object.
There are 4 main attributes that can be used to describe a bounding box:
* Center of the box (with the coordinates bx and by)
* Width of the box (bw)
* Height of the box (bh)
* The class of the object identified

Along with these parameters we also predict the probability that there is an object in the bounding box.

In contrast to many other algorithms, YOLO doesn’t search for regions of interest in the input image that could contain an object. It instead splits the image into an S x S grid. Each cell is then responsible for predicting K bounding boxes.
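As a rough illustration of what this grid implies for the size of the network output, the sketch below computes the shape of the prediction tensor. It assumes, as in the original YOLO paper, that each box carries 5 values (bx, by, bw, bh and an objectness confidence) and that one set of class probabilities is shared by all K boxes in a cell. The function name `yolo_output_shape` is hypothetical, chosen only for this example.

```python
def yolo_output_shape(S, K, num_classes):
    """Shape of the prediction tensor for an S x S grid with K boxes per cell.

    Assumes one shared set of class probabilities per cell (as in the
    original YOLO paper), not one set per box.
    """
    values_per_box = 5  # bx, by, bw, bh and the objectness confidence
    return (S, S, K * values_per_box + num_classes)


# Settings reported in the original paper for PASCAL VOC: S=7, 2 boxes, 20 classes.
print(yolo_output_shape(S=7, K=2, num_classes=20))  # (7, 7, 30)
```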

We consider an object to lie in a particular cell only if the centre coordinates of its bounding box lie in that cell. Because of this, the coordinates of the centre are always calculated relative to the cell, whereas the height and width are calculated relative to the size of the whole image.
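The encoding described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `encode_box` and the pixel-space input format (centre plus width and height) are assumptions made for the example.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S):
    """Convert a pixel-space box (centre cx, cy; size w, h) to grid targets.

    Returns (row, col, bx, by, bw, bh): (row, col) is the responsible cell,
    bx and by are offsets of the centre within that cell (0 to 1), and
    bw and bh are fractions of the full image size.
    """
    # Position of the centre measured in grid-cell units.
    gx = cx * S / img_w
    gy = cy * S / img_h
    # The cell containing the centre is responsible for the object.
    # Clamp so a centre on the right or bottom edge stays inside the grid.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Centre relative to the cell, size relative to the whole image.
    bx, by = gx - col, gy - row
    bw, bh = w / img_w, h / img_h
    return row, col, bx, by, bw, bh


# A 128 x 64 box centred at (224, 160) in a 448 x 448 image with a 7 x 7 grid.
row, col, bx, by, bw, bh = encode_box(224, 160, 128, 64, 448, 448, S=7)
print(row, col, bx, by)  # 2 3 0.5 0.5
```

Here each cell is 64 pixels wide, so the centre falls in row 2, column 3, exactly in the middle of that cell.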

In a single forward pass, YOLO determines the probability that a cell contains a certain class. The equation for this is:

Equation for probability that a cell contains a certain class
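
For reference, the original YOLO paper (Redmon et al.) expresses this class-specific confidence score for each box as the product of the conditional class probability predicted by the cell and the confidence of the box. The notation below follows the paper and may differ from the notation in the figure:

```latex
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
  = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
```

The score therefore reflects both how likely the class is to appear in the box and how well the predicted box fits the object.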

YOLO predicts multiple bounding boxes per grid cell. However, at training time we want only one bounding box predictor to be responsible for each object. Therefore the predictor whose box has the highest IoU (intersection over union) with the ground truth is chosen. At inference time, duplicate detections are handled with the help of non-max suppression, which lets us eliminate bounding boxes that overlap very closely.
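Since IoU underpins both steps, here is a minimal sketch of how it can be computed. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates, which is a convenience for this example; boxes in the centre, width and height form described earlier would need converting first.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes sharing a 1 x 1 corner: intersection 1, union 4 + 4 - 1 = 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the boxes coincide exactly, and an IoU of 0 means they do not overlap at all.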

The IoU of every other bounding box is calculated with respect to the box that has the highest class probability. Boxes whose IoU exceeds a certain threshold are then removed: a high IoU signifies that two boxes cover the same object, so the one with the lower probability is eliminated.

This process then repeats for the remaining bounding box with the next highest class probability, and so on until every box has been either kept or removed.
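The procedure above can be sketched as a short greedy loop. This is an illustrative version under stated assumptions: boxes are (x1, y1, x2, y2) corners, the threshold of 0.5 is an arbitrary example value, and all boxes belong to one class (in practice the procedure is usually run separately per class).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-max suppression."""
    # Candidate indices sorted from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Keep the highest-scoring remaining box.
        best = order.pop(0)
        keep.append(best)
        # Remove every remaining box that overlaps it too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap either and is kept.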

YOLO conceptual design (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

At this stage, almost all our work is done. The algorithm outputs a vector describing the bounding box and class of each detected object. The overall architecture of the algorithm is:

Architecture of YOLO (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

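The loss function is one of the most important components of any learning algorithm. The loss function used by YOLO is a sum-squared error that simultaneously penalises errors in everything the network predicts: the box centre, the box width and height, the confidence that a box contains an object, and the class probabilities.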

Loss function used by YOLO (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)
Variables used in the loss function by YOLO
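
For reference, the loss as given in the original paper by Redmon et al. can be written as follows. The notation is the paper's and may differ from the figures: x, y, w, h correspond to bx, by, bw, bh above, B is the number of boxes per cell (K in this article), C is the box confidence, p(c) is the class probability, and hats mark the ground-truth targets. The indicator 1^obj_ij is 1 when box predictor j in cell i is responsible for an object, and 1^obj_i is 1 when an object appears in cell i. The paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on width and height make a given size error count for more on small boxes than on large ones, and the smaller λ_noobj weight stops the many empty cells from dominating the loss.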

This was a brief explanation of the vast and interesting algorithm that is YOLO. We covered what object detection actually is, the problems associated with real-time object detection, and how the YOLO algorithm works. We saw why earlier models struggled to provide adequate real-time detection, and how YOLO's single-pass design addresses that problem of speed.

Moreover, YOLO is constantly evolving. There are multiple versions of it available, including the original YOLO, YOLO9000, YOLOv3, YOLOv4, YOLOv5 and many scaled versions. The model has become more nimble and accurate with each release, and it will hopefully advance even further in the future.

We hope we were able to give you a thorough grounding in the basic workings of the YOLO algorithm and the various concepts related to object detection.


