A Coder’s Guide to IoU, Non-Max Suppression, and Mean Average Precision

Vijayabhaskar J
Nov 11, 2020

Implementations from scratch in PyTorch for Object Detection

This article is aimed at readers who like to look at the code behind algorithms to understand how they work, rather than reading the math behind them.

Intersection over Union (IoU)

Intersection over Union (IoU), also known as Jaccard’s index, is a mathematical expression that calculates the “Area of Intersection” of two boxes over the “Area of Union” of the same two boxes. In object detection, the IoU between the ground truth and predicted bounding boxes is calculated to estimate how close our predicted bounding box is to the ground truth.

An IoU of 1 indicates that your predicted bounding box perfectly matches the ground truth box.

An IoU of 0 indicates that no part of your predicted bounding box overlaps the ground truth box.
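
Below is a minimal sketch of IoU in PyTorch, assuming the boxes are given in (x1, y1, x2, y2) corner format; the function name and the small epsilon are my own choices for illustration.

import torch

def intersection_over_union(box_pred, box_true):
    # Both inputs are tensors in (x1, y1, x2, y2) corner format.
    # Corners of the intersection rectangle:
    x1 = torch.max(box_pred[..., 0], box_true[..., 0])
    y1 = torch.max(box_pred[..., 1], box_true[..., 1])
    x2 = torch.min(box_pred[..., 2], box_true[..., 2])
    y2 = torch.min(box_pred[..., 3], box_true[..., 3])

    # clamp(min=0) handles boxes that do not overlap at all.
    intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_pred = (box_pred[..., 2] - box_pred[..., 0]) * (box_pred[..., 3] - box_pred[..., 1])
    area_true = (box_true[..., 2] - box_true[..., 0]) * (box_true[..., 3] - box_true[..., 1])
    union = area_pred + area_true - intersection

    return intersection / (union + 1e-6)  # the epsilon avoids division by zero

Because the coordinates are indexed with “...”, the same function works on a single box of shape (4,) or element-wise on a batch of shape (N, 4).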

Non-max Suppression

Non-max Suppression is a method for removing duplicate detections when an object detection algorithm produces multiple bounding boxes for the same object in an image.

Every object detection algorithm generates object confidence scores, which indicate how certain the detector is about that specific bounding box prediction.

Non-max Suppression can be broken down into several steps (a code sketch follows the list):

Step 1: Remove all predicted boxes with confidence scores lower than a user-defined threshold.

Step 2: While any predicted bounding boxes remain in the list of predicted bounding boxes:

  • Choose the bounding box with the highest confidence score and keep it as a final detection.
  • Compare this box’s IoU with every other predicted bounding box of the same class, and if the IoU exceeds the user-defined IoU threshold, discard that box as a duplicate detection.
  • Remove the chosen bounding box from the list of predicted bounding boxes and repeat.
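
Here is a sketch of these steps, reusing the intersection_over_union function from the previous section. It assumes each prediction is a list of the form [class_id, confidence, x1, y1, x2, y2]; the threshold defaults are illustrative, not values prescribed by the article.

import torch

def non_max_suppression(predictions, iou_threshold=0.5, conf_threshold=0.4):
    # predictions: list of [class_id, confidence, x1, y1, x2, y2]
    # Step 1: drop boxes below the confidence threshold.
    boxes = [box for box in predictions if box[1] >= conf_threshold]
    # Sort the survivors by confidence, highest first.
    boxes = sorted(boxes, key=lambda box: box[1], reverse=True)

    kept = []
    # Step 2: keep the most confident box, suppress its same-class duplicates, repeat.
    while boxes:
        chosen = boxes.pop(0)
        kept.append(chosen)
        boxes = [
            box for box in boxes
            if box[0] != chosen[0]  # a different class is never a duplicate
            or intersection_over_union(
                torch.tensor(chosen[2:]), torch.tensor(box[2:])
            ) < iou_threshold
        ]
    return kept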

Mean Average Precision (mAP)

Mean Average Precision is a metric used to measure the accuracy of the predictions; it is simply the average of the “Average Precision” calculated across all classes.

Average Precision (AP) can be defined as the Area Under the Precision-Recall Curve. To plot the Precision-Recall curve, we need to define what True Positive, False Positive, True Negative, and False Negative mean in the context of object detection.

  • True Positive: a predicted bounding box of class ‘c’ whose IoU with a ground truth box of the same class ‘c’ is ≥ the threshold.
  • False Positive: a predicted bounding box of class ‘c’ whose IoU with the ground truth boxes of the same class ‘c’ is < the threshold.
    Even if the IoU ≥ threshold, if another detection has already been counted as a TP for that ground truth bounding box, this detection is labeled as FP (as applied by the PASCAL VOC 2012 challenge).
  • True Negative: not applicable to object detection, as it would mean there is no ground truth box and the model did not predict a bounding box there; there are infinitely many positions that satisfy this.
  • False Negative: a ground truth bounding box that went undetected.

Precision and recall can be calculated as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

In practice, True Positive (TP) and False Positive (FP) are sufficient to calculate Precision and Recall, as Recall can be calculated directly as TP/len(ground_truths).

To obtain a list of TPs and FPs for the Precision-Recall graph, the TP and FP are cumulatively summed across all detections, which are sorted by confidence scores.
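
As a small illustration with made-up TP/FP flags (the numbers below are hypothetical, purely to show the cumulative-sum step):

import torch

# One 0/1 flag per detection, already sorted by confidence (highest first).
tp = torch.tensor([1., 1., 0., 1., 0.])
fp = 1 - tp
total_ground_truths = 4  # hypothetical number of ground truth boxes

tp_cumsum = torch.cumsum(tp, dim=0)  # tensor([1., 2., 2., 3., 3.])
fp_cumsum = torch.cumsum(fp, dim=0)  # tensor([0., 0., 1., 1., 2.])

precisions = tp_cumsum / (tp_cumsum + fp_cumsum + 1e-6)  # ≈ [1.00, 1.00, 0.67, 0.75, 0.60]
recalls = tp_cumsum / (total_ground_truths + 1e-6)       # ≈ [0.25, 0.50, 0.50, 0.75, 0.75]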

After calculating the TP_cumsum and FP_cumsum, the Average Precision can be calculated in a variety of ways; it is simply the AUC of the Precision x Recall curve. However, if you plot the curve, you will notice a zigzag pattern.

This makes calculating AUC less accurate; there are two popular solutions to this. They are as follows:

  • 11 point interpolation
  • All point interpolation

11 point interpolation

This is how the PASCAL VOC competition assesses object detection algorithms.

Simply explained:

“Average” of the maximum “precision” values at 11 equally spaced recall points between 0 and 1 (0, 0.1, …, 1.0), where for each point the maximum is taken over all precisions whose recall is ≥ that point.

This is so simple that it is better explained in code itself.
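
A minimal sketch of 11-point interpolation, assuming precisions and recalls are 1-D float tensors describing the precision-recall curve, ordered by increasing recall (the function name is my own):

import torch

def eleven_point_ap(precisions, recalls):
    recall_points = torch.linspace(0, 1, 11)  # 0.0, 0.1, ..., 1.0
    interpolated = []
    for r in recall_points:
        mask = recalls >= r
        # Maximum precision among all points whose recall is >= r (0 if there are none).
        p_max = precisions[mask].max() if mask.any() else torch.tensor(0.0)
        interpolated.append(p_max)
    # The 11-point AP is simply the average of these interpolated precisions.
    return torch.stack(interpolated).mean()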

Example:

Let's consider this data:

Plot:

Averaging all the maximum precision values (red dots) gives you the 11-point interpolated Average Precision.

AP = (1 + 0.6666 + 0.4285 + 0.4285 + 0.4285 + 0 + 0 + 0 + 0 + 0 + 0) / 11
AP = 0.2684, or 26.84%

All point interpolation

In all-point interpolation, instead of considering the maximum precision at 11 fixed points, all points where the recall value changes are considered. For each pair of consecutive recall values, the distance between them is calculated. For example, if recall changes from 0.1 to 0.25 to 0.33, the distances will be 0.15 and 0.08. Each distance can be thought of as the width of a rectangle whose height is the maximum precision at that recall position. With height and width in hand, the area of that part of the precision-recall curve can be calculated. Repeating this for every recall change covers the entire curve.

This is also better explained with code:
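
A sketch of all-point interpolation under the same assumptions (1-D float tensors ordered by increasing recall); the right-to-left maximum is what removes the zigzag before the rectangle areas are summed:

import torch

def all_point_ap(precisions, recalls):
    # Interpolate: each precision becomes the maximum precision to its right,
    # which flattens the zigzag pattern of the raw curve.
    interp = precisions.clone()
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = torch.max(interp[i], interp[i + 1])

    # Prepend recall = 0 so the first rectangle starts at the y-axis.
    recalls = torch.cat([torch.tensor([0.0]), recalls])

    ap = torch.tensor(0.0)
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]  # 0 wherever recall does not change
        height = interp[i - 1]               # interpolated precision in this region
        ap = ap + width * height
    return ap

Feeding it just the four points where recall changes in the example below (recalls 0.0666, 0.1333, 0.4, 0.4666 with precisions 1, 0.6666, 0.4285, 0.3043) reproduces the hand-computed 0.2456.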

Example:

Plot:

In this plot, for each pair of green dots, calculate the x-axis distance between them and multiply it by the maximum precision in that region to get the area of that region.

Summing all the areas gives you the area under the curve (the Average Precision):

AP = 1 × (0.0666 − 0) + 0.6666 × (0.1333 − 0.0666) + 0.4285 × (0.4 − 0.1333) + 0.3043 × (0.4666 − 0.4)
AP = 0.2456, or 24.56%

Finally, the complete code to calculate Mean Average Precision (mAP):
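
The following is a sketch of the whole pipeline rather than the article’s original gist. It assumes predictions of the form [image_idx, class_id, confidence, x1, y1, x2, y2] and ground truths of the form [image_idx, class_id, x1, y1, x2, y2], and it reuses intersection_over_union and all_point_ap from the earlier snippets.

import torch
from collections import Counter

def mean_average_precision(pred_boxes, true_boxes, iou_threshold=0.5, num_classes=20):
    # pred_boxes: list of [image_idx, class_id, confidence, x1, y1, x2, y2]
    # true_boxes: list of [image_idx, class_id, x1, y1, x2, y2]
    average_precisions = []

    for c in range(num_classes):
        detections = [d for d in pred_boxes if d[1] == c]
        ground_truths = [g for g in true_boxes if g[1] == c]
        if len(ground_truths) == 0:
            continue  # skip classes with no ground truth boxes

        # Track which ground truth boxes have already been matched, per image.
        gt_count_per_image = Counter(g[0] for g in ground_truths)
        matched = {img: torch.zeros(n) for img, n in gt_count_per_image.items()}

        # Sort detections by confidence, highest first.
        detections.sort(key=lambda d: d[2], reverse=True)
        tp = torch.zeros(len(detections))
        fp = torch.zeros(len(detections))

        for det_idx, det in enumerate(detections):
            # Ground truths of this class in the same image as the detection.
            img_gts = [g for g in ground_truths if g[0] == det[0]]

            best_iou, best_gt = 0.0, -1
            for gt_idx, gt in enumerate(img_gts):
                iou = intersection_over_union(torch.tensor(det[3:]), torch.tensor(gt[2:]))
                if iou > best_iou:
                    best_iou, best_gt = iou, gt_idx

            if best_gt >= 0 and best_iou >= iou_threshold and matched[det[0]][best_gt] == 0:
                tp[det_idx] = 1
                matched[det[0]][best_gt] = 1  # this ground truth is now taken
            else:
                fp[det_idx] = 1  # low IoU, or a duplicate detection of an already matched box

        tp_cumsum = torch.cumsum(tp, dim=0)
        fp_cumsum = torch.cumsum(fp, dim=0)
        recalls = tp_cumsum / (len(ground_truths) + 1e-6)
        precisions = tp_cumsum / (tp_cumsum + fp_cumsum + 1e-6)

        # AP for this class via the all-point interpolation defined earlier.
        average_precisions.append(all_point_ap(precisions, recalls))

    # mAP is the mean of the per-class APs.
    return sum(average_precisions) / len(average_precisions)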

The COCO competition uses variations of mAP.

From “A Survey on Performance Metrics for Object-Detection Algorithms” paper by Rafael Padilla, Sergio L. Netto, Eduardo A. B. da Silva:

The COCO detection challenge (bounding box) is a competition that provides bounding-box coordinates of more than 200,000 images comprising 80 object categories. The submitted works are ranked according to metrics gathered into four main groups.

  • AP: The AP is evaluated with different IOUs. It can be calculated for 10 IOUs varying in a range of 50% to 95% with steps of 5%, usually reported as AP@50:5:95. It also can be evaluated with single values of IOU, where the most common values are 50% and 75%, reported as AP50 and AP75 respectively.
  • AP Across Scales: The AP is determined for objects in three different sizes: small (with area < 32² pixels), medium (with 32² < area < 96² pixels), and large (with area > 96² pixels).
  • Average Recall (AR): The AR is estimated by the maximum recall values given a fixed number of detections per image (1, 10 or 100) averaged over IOUs and classes.
  • AR Across Scales: The AR is determined for objects in the same three different sizes as in the AP Across Scales, usually reported as AR-S, AR-M, and AR-L, respectively.
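
The first of these metrics can be sketched by running the mean_average_precision function above at the ten IoU thresholds and averaging; the toy pred_boxes and true_boxes below are hypothetical and only there to make the snippet runnable.

import torch

# Hypothetical toy data in the format assumed earlier.
pred_boxes = [[0, 0, 0.9, 10, 10, 50, 50], [0, 0, 0.6, 12, 12, 48, 48]]
true_boxes = [[0, 0, 11, 11, 49, 49]]

# AP@[.50:.05:.95]: average the mAP over IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = torch.linspace(0.5, 0.95, 10)
ap_per_threshold = [
    mean_average_precision(pred_boxes, true_boxes, iou_threshold=float(t))
    for t in iou_thresholds
]
coco_style_ap = sum(ap_per_threshold) / len(ap_per_threshold)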

References:

  • Rafael Padilla, Sergio L. Netto, Eduardo A. B. da Silva, “A Survey on Performance Metrics for Object-Detection Algorithms”, 2020.
