Understanding 'scale_boxes' in YOLO Algorithm of CNN

Question

I'm studying Andrew Ng's Convolutional Neural Networks course and am in Week 3, which deals with object detection using the YOLO algorithm. I don't understand one part of the programming assignment that uses a function called 'scale_boxes'. This is how the course materials describe the function:

"*There're a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which we have provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh) which converts the yolo box coordinates (x,y,w,h) to box corners' coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape) YOLO's network was trained to run on 608x608 images. If you are testing this data on a different size image--for example, the car detection dataset had 720x1280 images--this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.*"

The function scale_boxes itself is defined as:

    def scale_boxes(boxes, image_shape):
        """ Scales the predicted boxes in order to be drawable on the image"""
        height = image_shape[0]
        width = image_shape[1]
        image_dims = K.stack([height, width, height, width])
        image_dims = K.reshape(image_dims, [1, 4])
        boxes = boxes * image_dims
        return boxes

It is used in the following function, 'yolo_eval':

    def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
        """
        Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

        Arguments:
        yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                        box_confidence: tensor of shape (None, 19, 19, 5, 1)
                        box_xy: tensor of shape (None, 19, 19, 5, 2)
                        box_wh: tensor of shape (None, 19, 19, 5, 2)
                        box_class_probs: tensor of shape (None, 19, 19, 5, 80)
        image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
        max_boxes -- integer, maximum number of predicted boxes you'd like
        score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
        iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

        Returns:
        scores -- tensor of shape (None, ), predicted score for each box
        boxes -- tensor of shape (None, 4), predicted box coordinates
        classes -- tensor of shape (None,), predicted class for each box
        """

        ### START CODE HERE ###

        # Retrieve outputs of the YOLO model (≈1 line)
        box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

        # Convert boxes to be ready for filtering functions (convert boxes box_xy and box_wh to corner coordinates)
        boxes = yolo_boxes_to_corners(box_xy, box_wh)

        # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
        scores, boxes, classes = yolo_filter_boxes(box_confidence,boxes,box_class_probs,score_threshold)

        # Scale boxes back to original image shape.
        boxes = scale_boxes(boxes, image_shape)

        # Use one of the functions you've implemented to perform Non-max suppression with
        # maximum number of boxes set to max_boxes and a threshold of iou_threshold (≈1 line)
        scores, boxes, classes = yolo_non_max_suppression(scores,boxes,classes,max_boxes,iou_threshold)

        ### END CODE HERE ###

        return scores, boxes, classes

I don't understand the need for the function 'scale_boxes'. There don't seem to be any answers to this in the discussion forums either, which is why I'm posting the question here.

Can someone please explain in detail what exactly this function does and why it is required?

Could you please provide your feedback on the answer? – 10xAI, commented Jul 18, 2020 at 14:15

10xAI · Accepted Answer · 2020-07-16 11:19:39Z

YOLO's network was trained to run on 608x608 images. If you are testing this data on a different size image--for example, the car detection dataset had 720x1280 images--this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.

Since you are using a pre-trained model, your image is resized to the size the model was trained on, whether you do it yourself or the model does it in the background.

Bounding-box values are simply coordinates on the image, so they change when the size of the image changes. Imagine the same face on a big image and on a tiny image.


So YOLO returns coordinates for the smaller image, and if you draw them over your original image, the box will not cover the full object. You therefore rescale the box by the ratio of the two image sizes.

You can achieve the same result by resizing your original image to the size YOLO was trained on. Then you need not scale the bounding box; you can simply draw the same box on the resized image.
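The rescaling described above can be sketched in NumPy. This is a minimal illustration, not the course's code: `rescale_boxes` is a hypothetical helper, the sample box is made up, and the sketch assumes the boxes are given in pixels on the 608x608 input in (y1, x1, y2, x2) order.

```python
import numpy as np

def rescale_boxes(boxes, from_shape, to_shape):
    """Rescale (y1, x1, y2, x2) pixel boxes from one image size to another.

    Hypothetical helper for illustration; shapes are (height, width).
    """
    from_h, from_w = from_shape
    to_h, to_w = to_shape
    # y coordinates scale with the height ratio, x coordinates with the width ratio.
    ratio = np.array([to_h / from_h, to_w / from_w, to_h / from_h, to_w / from_w])
    return np.asarray(boxes, dtype=float) * ratio

# A made-up box on the 608x608 network input.
boxes_608 = np.array([[152.0, 304.0, 456.0, 608.0]])

# The same box expressed on the original 720x1280 image.
boxes_orig = rescale_boxes(boxes_608, (608, 608), (720, 1280))
print(boxes_orig)
```

The sample box comes out at approximately (180, 640, 540, 1280): each coordinate keeps its relative position in the image, only the pixel units change.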

Thanks for the answer, but how does multiplying the box dimensions by 720x1280 solve the problem? (This is what the scale_boxes function does.) On inspecting the results after scaling, I found the boxes' length/width to be greater than 2000. Shouldn't they be confined to the range (720, 1280)? – Bharathi, commented Jul 20, 2020 at 7:13
If it is multiplied by (720, 1280), it means the coordinates were normalized by the input dimensions, i.e. by (608, 608), during training, so it amounts to the same thing. If the prediction is not good, the model can predict larger x, y values and the box can go out of range, since this is a regression problem. Try drawing it on the 608x608 image and see whether it looks good there. – 10xAI, commented Jul 20, 2020 at 8:02
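
To see why multiplying by (720, 1280) amounts to the same rescaling, here is a NumPy sketch. It assumes, as the comment above suggests, that the boxes are normalized to [0, 1], and that they are ordered (y1, x1, y2, x2), which is what the [height, width, height, width] multiplier in scale_boxes implies. `scale_boxes_np` is a hypothetical NumPy stand-in for the Keras version, `clip_boxes` is an optional extra step that is not part of the assignment, and the sample values are made up.

```python
import numpy as np

def scale_boxes_np(boxes, image_shape):
    """NumPy stand-in for scale_boxes: normalized boxes -> pixel coordinates."""
    height, width = image_shape
    image_dims = np.array([height, width, height, width], dtype=float).reshape(1, 4)
    return boxes * image_dims

def clip_boxes(boxes, image_shape):
    """Optional step (not in the assignment): confine boxes to the image."""
    height, width = image_shape
    limits = np.array([height, width, height, width], dtype=float)
    return np.clip(boxes, 0.0, limits)

normalized = np.array([
    [0.25, 0.5, 0.75, 1.0],   # well-behaved prediction inside [0, 1]
    [-0.1, 0.2, 1.6, 1.3],    # poor prediction that overshoots the image
])

scaled = scale_boxes_np(normalized, (720.0, 1280.0))
clipped = clip_boxes(scaled, (720.0, 1280.0))
print(scaled)
print(clipped)
```

The first box lands inside the 720x1280 image, while the second scales to values beyond it, which is how out-of-range coordinates can appear after scaling. Clipping is one way to confine such boxes before drawing.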
