Decoding YOLOv3 output with Intel OpenVINO’s backend (Part 2)

Prashant Dandriyal
7 min read · Jun 1, 2021

Let’s get straight to where we left off in Part 1, where we were discussing the attributes of the `YoloParams` class.

This log is dumped by `log_params` in the class. Another important element in the class definition is `self.isYoloV3 = 'mask' in param`. This simply helps us determine whether the model being used is v3 or not: the `mask` parameter is exclusive to YOLOv3 and its tiny variant, and previous versions lack it.
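For context, the check lives in the class constructor; a minimal sketch (attributes trimmed down, not the demo’s full definition) looks like:

class YoloParams:
    def __init__(self, param, side):
        # `param` is the dictionary of layer parameters read from the IR
        self.side = side
        # 'mask' only appears for YOLOv3 / YOLOv3-tiny region layers,
        # so its presence separates v3 from the older versions
        self.isYoloV3 = 'mask' in param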

After the output layers have been extracted, we have arrays filled with mysteriously packed data, which is the treasure we seek. The packing scheme has been discussed in the theory part above. We write a parser function that parses/simplifies this and call it `parse_yolo_region()`. It takes in the array full of raw values (let’s call it the packed array) and gives out a list of all detected objects. The two output blobs have shapes (1, 255, 26, 26) and (1, 255, 13, 13); let us write them as (1, 255, side, side) for this blog (the `side` attribute of the `YoloParams` class holds exactly this value). The side x side dimensions represent the grid of cells, and the 255 values per cell are the packed array we showed earlier (3 anchors x 85 values each).
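Before diving into the loops, here is a rough layout of the parser (a minimal sketch, not the demo’s exact signature; the names are just for this walkthrough):

def parse_yolo_region(blob, resized_image_shape, params, threshold):
    # blob:   raw output of one detector layer, shape (1, 255, side, side)
    # params: the YoloParams instance describing this layer
    objects = []
    resized_image_h, resized_image_w = resized_image_shape
    # ... the decoding loops shown below live here, appending to `objects` ...
    return objects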

One method to decode this array for both the layers is:

for oth in range(0, blob.shape[1], 85):      # 255 values = 3 anchors x 85
    for row in range(blob.shape[2]):         # 13 (or 26) grid rows
        for col in range(blob.shape[3]):     # 13 (or 26) grid columns
            info_per_anchor = blob[0, oth:oth+85, row, col]
            x, y, width, height, prob = info_per_anchor[:5]

Next, we check whether any of the anchor boxes found an object and, if so, which class it belongs to. There are 80 classes, and the one with the highest probability is the answer.

if prob < threshold:
    continue
# The remaining 80 values (indices 5:85 of info_per_anchor) are the class scores
class_id = np.argmax(info_per_anchor[5:])   # np is numpy (import numpy as np)

At a confidence threshold of 0.1 (10%), the classes detected in our test image of the man with the bicycle are:

person  prob:0.19843937456607819
person  prob:0.7788506746292114
bicycle prob:0.8749380707740784
bicycle prob:0.8752843737602234

The x and y coordinates obtained are relative to the cell. To get the coordinates with respect to the entire image, we add the cell’s grid indices and finally normalise the result by the side parameter.

x = (col + x) / params.side
y = (row + y) / params.side

To relate this to the example explained above, these lines correspond to the following terms used in the original paper:

bx = (Cx + x) / params.side
by = (Cy + y) / params.side
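As a quick sanity check with made-up numbers (the cell indices and raw offsets below are hypothetical):

side = 13                      # 13x13 grid
row, col = 6, 7                # the cell that fired
x_raw, y_raw = 0.4, 0.7        # predicted offsets within that cell
x = (col + x_raw) / side       # (7 + 0.4) / 13 ≈ 0.569
y = (row + y_raw) / side       # (6 + 0.7) / 13 ≈ 0.515
# The box centre sits about 57% across the image and 52% down it.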

The raw width and height values can be very large or even negative, so we pass them through an exponential to turn them into positive scale factors (the bw = pw * e^tw relation from the original paper).

# exp comes from the standard library: from math import exp
try:
    width = exp(width)
    height = exp(height)
except OverflowError:
    continue

These values are still relative. To get the width and height of the box, we multiply them by the corresponding anchor width or height and then normalise by the image width or height respectively (fixed to 416x416 for v3 and v3-tiny). Why we normalise here will become clear in a moment.

size_normalizer = (resized_image_w, resized_image_h) if params.isYoloV3 else (params.side, params.side)
n = oth // 85   # which of the 3 anchors in this cell (0, 1 or 2)
width = width * params.anchors[2 * n] / size_normalizer[0]
height = height * params.anchors[2 * n + 1] / size_normalizer[1]
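Again with made-up numbers (the exponentiated values are illustrative; 81x82 is one of the stock yolov3-tiny anchor pairs):

anchor_w, anchor_h = 81, 82     # anchor pair, in pixels of the 416x416 input
w_exp, h_exp = 1.2, 1.1         # exp(width), exp(height) from the previous step
width  = w_exp * anchor_w / 416     # ≈ 0.234 of the image width
height = h_exp * anchor_h / 416     # ≈ 0.217 of the image height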

To similarly get the coordinates of the top-left and bottom-right corners of the box, we combine the x and y values determined earlier with the normalised width and height. Subtracting w/2 shifts the point from the centre of the box to its left boundary, and subtracting h/2 shifts it to its upper boundary; together they give the top-left corner of the box. To map these bounding boxes back onto the image, we scale them up using the image dimensions (here w_scale = h_scale = 416).

xmin = int((x - w / 2) * w_scale)
ymin = int((y - h / 2) * h_scale)
xmax = int(xmin + w * w_scale)
ymax = int(ymin + h * h_scale)
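For reference, one way to collect each surviving detection into a dictionary (a sketch; the keys simply mirror the printed output below, and the objectness score is used as the confidence):

objects.append({'xmin': xmin, 'ymin': ymin, 'xmax': xmax, 'ymax': ymax,
                'class_id': int(class_id), 'confidence': float(prob)})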

Now we have the desired detections from the two detector layers, and we pack them into dictionaries (as sketched above) to get:

In Layer detector/yolo-v3-tiny/Conv_12/BiasAdd/YoloRegion
Detected Objects
{'xmin': 707, 'xmax': 721, 'ymin': 53, 'ymax': 68, 'class_id': 8, 'confidence': 0.0016403508}
In Layer detector/yolo-v3-tiny/Conv_9/BiasAdd/YoloRegion
Detected Objects
{'xmin': 707, 'xmax': 721, 'ymin': 53, 'ymax': 68, 'class_id': 8, 'confidence': 0.0016403508}
{'xmin': 257, 'xmax': 454, 'ymin': 32, 'ymax': 323, 'class_id': 0, 'confidence': 0.29021382}
{'xmin': 247, 'xmax': 470, 'ymin': 31, 'ymax': 373, 'class_id': 0, 'confidence': 0.34315744}
{'xmin': 231, 'xmax': 534, 'ymin': 165, 'ymax': 410, 'class_id': 1, 'confidence': 0.6760541}
{'xmin': 232, 'xmax': 540, 'ymin': 188, 'ymax': 428, 'class_id': 1, 'confidence': 0.23595412}

But there are too many detections for just a single bicycle and person. This is an inherent issue with YOLO that leads to duplicate predictions, because it is very likely that two or more anchors of the same or different cells detect a particular object, with different or even the same probabilities. If we plot all these boxes on the image, we get:

To remove these duplicate boxes, we employ Non-Maximal Suppression and Intersection over Union.

NON-MAXIMAL SUPPRESSION:

Let’s not be perplexed by the fancy term; we are already familiar with the idea, just not the name. It refers to keeping only the most confident box among a group of overlapping detections of the same object and suppressing (discarding) the rest.

INTERSECTION OVER UNION (IOU):

If we have two bounding boxes, then IoU is defined as

IoU = area of overlap between the two boxes / area of their union [source]

It is used for two purposes:

  • It helps us benchmark the accuracy of our model’s predictions. Using it, we can figure out how well our predicted bounding box overlaps with the ground-truth bounding box. The higher the IoU, the better the performance. The results can be interpreted as

IoU for performance check

  • It helps us remove duplicate bounding boxes for the same object, which is exactly the problem we are facing with the cyclist test case. For this, we sort all the predictions/objects in descending order of their confidence. If two bounding boxes point to the same object, their IoU will be very high; in that case, we keep the box with the higher confidence (i.e., the first box) and reject the second one. If the IoU is very low, the two boxes most likely point to different objects of the same class (like different dogs or different cats in the same picture). In this article, we use IoU solely for this second purpose.
objects = sorted(objects, key=lambda obj: obj['confidence'], reverse=True)
for i in range(len(objects)):
    if objects[i]['confidence'] == 0:
        continue
    for j in range(i + 1, len(objects)):
        # We perform IoU on objects of the same class only
        if objects[i]['class_id'] != objects[j]['class_id']:
            continue
        if intersection_over_union(objects[i], objects[j]) > args.iou_threshold:
            objects[j]['confidence'] = 0
# Keep only objects above the --prob_threshold CLI parameter for drawing
objects = [obj for obj in objects if obj['confidence'] >= args.prob_threshold]
print(f"final objects:{objects}")

where intersection_over_union is defined as

def intersection_over_union(box_1, box_2):
    width_of_overlap_area = min(box_1['xmax'], box_2['xmax']) - max(box_1['xmin'], box_2['xmin'])
    height_of_overlap_area = min(box_1['ymax'], box_2['ymax']) - max(box_1['ymin'], box_2['ymin'])
    if width_of_overlap_area < 0 or height_of_overlap_area < 0:
        area_of_overlap = 0
    else:
        area_of_overlap = width_of_overlap_area * height_of_overlap_area
    box_1_area = (box_1['ymax'] - box_1['ymin']) * (box_1['xmax'] - box_1['xmin'])
    box_2_area = (box_2['ymax'] - box_2['ymin']) * (box_2['xmax'] - box_2['xmin'])
    area_of_union = box_1_area + box_2_area - area_of_overlap
    if area_of_union == 0:
        return 0
    return area_of_overlap / area_of_union
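A quick check with two made-up boxes shows the behaviour at the extremes and in between:

a = {'xmin': 0, 'ymin': 0, 'xmax': 10, 'ymax': 10}
b = {'xmin': 5, 'ymin': 5, 'xmax': 15, 'ymax': 15}
print(intersection_over_union(a, a))   # 1.0   (identical boxes)
print(intersection_over_union(a, b))   # 25 / 175 ≈ 0.143 (partial overlap)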

After this step, we are left with the filtered objects:

final objects:[{'xmin': 231, 'xmax': 534, 'ymin': 165, 'ymax': 410, 'class_id': 1, 'confidence': 0.6760541}, {'xmin': 247, 'xmax': 470, 'ymin': 31, 'ymax': 373, 'class_id': 0, 'confidence': 0.34315744}]

Now we have good detections; drawing the bounding boxes gives the following result at a confidence threshold of 0.1 (10%) and an IoU threshold of 0.4 (40%):

The entire code used here can be found in my GitHub Repo [HERE]. But I also suggest you look into the demo provided by Intel (Link in references).

I hope this article made sense. Feel free to point out discrepancies in the material; I will try my best to correct them and clarify any doubts.

If I could help you, you can also send me some crypto like Solana, Ethereum, Doge or coins on BSC. 🤠

SOL wallet: Geyvsojk1KAegziagYAJ5HWaDuAFYKZAS2KpqZeag9DM

BSC wallet: 0x47acC4B652166AE55fb462e4cD7DD26eFa97Da04

ETH wallet: 0x47acC4B652166AE55fb462e4cD7DD26eFa97Da04

Doge wallet: DFT6Pydzw7RAYG1ww4RRfctBKqyiXbbBwk
