YOLO algorithm - understanding training data

Question

I am taking the "Convolutional Neural Networks" course on Coursera, taught by Andrew Ng. I am in week 3 and confused about the YOLO algorithm. I checked the course forums on Coursera, but I am still not clear, and it seems that many people are confused about the same things:

1. What is the training data? All the images and the locations of the objects in them? So if an image contains one car and one man, the input would be the vector representation of the image, the locations of the centers (x, y), the heights and widths of the 2 objects, and an object code for what each box contains. In total: the vector representation of the image + 4 coordinates for the 2 centers + 4 values for the heights and widths of the 2 boxes + 2 object codes (one for the man and one for the car). Right?
2. What is the test data? Only the vector representation of the test images? The course talks about splitting the image into an s*s grid. I initially thought that a small square from the s*s grid is fed as input, but I don't think that is the case. The entire image must be fed. Is that correct?
3. How is a square from the s*s grid used?
4. Do the training images and test images have to be the same size? What do they mean by "The car detection dataset has 720x1280 images, which we've pre-processed into 608x608 images." in the week 3 assignment?

update 1------------------------------------------

1. If a training image has 2 objects and the maximum number of classes is 3, would the input be the vector representation of the image + an input vector of 8*2?
2. Would the length of the input vector change based on the number of objects in an individual image?
3. I am still not clear how the s*s grid is used. If there is a big car in the middle of the image and YOLO looks at a small square of the s*s grid at the center of the image, then it is impossible to detect that there is a car. We would have to provide a bigger square from the center to YOLO so that it understands there is a car. So what is the use of feeding a small square?
4. How do we feed a bigger square?
5. Does the feeding of a square from the s*s grid occur after multiple convolution layers (so that a smaller square basically represents a much larger area)?
6. Do you have an example of an image and its shrunk version? Is there some data science behind the shrinking?

--------------------update 2

I read the answer and watched the videos again, and it is still not clear to me.

2. Would the length of the input vector change based on the number of objects in an individual image? For example, if image1 has 5 objects, then its input is going to be much longer than that of an image which has only 1 object. How do we feed this kind of data, where the input does not have a fixed width? Do we find the image with the maximum number of objects, decide the length of the input from it, and pad the rest of the images with 0s (to make all input vectors the same length)?

I am still confused about the s*s grid. I initially asked: how is a square from the s*s grid used?

After reading the answer below, my updated question is: I am still not clear how the s*s grid is used. If there is a big car in the middle of the image and YOLO looks at a small square of the s*s grid at the center of the image, then it is impossible to detect that there is a car. We would have to provide a bigger square from the center to YOLO so that it understands there is a car. So what is the use of feeding a small square?

The answer (in the comment on the original answer) says: "3: A grid cell doesn't contain a whole bounding box, but only the mid point of a bounding box."

My confusion is this: as per the earlier discussion, we don't feed a grid cell individually. We feed the entire image once. So what is the point of creating the s*s grid? If the image is looked at only once, how does the algorithm detect, say, two objects (one big car and one small car) in a single go? We create the grid, and the grid is used only to find the midpoint of the object. But then the entire object is identified. I am still not clear on this part.

I feel that I am not the only one having a hard time understanding YOLO. I saw multiple threads in the comment section of the course asking similar questions, and I would appreciate patience and guidance.

oezguensi · Accepted Answer · 2018-12-21 02:42:32Z

For each bounding box you need
- p_c: any object / no object (background)
- b_x, b_y, b_w, b_h: x, y, width and height of the bounding box
- c_i: object i / no object i

For example, with 2 bounding boxes and 3 classes (e.g. car, person, traffic light), your input vector (strictly speaking, the label that accompanies the image) would look as follows (the superscript in brackets denotes the index of the bounding box):

$$
\begin{bmatrix}
p_{c}^{(1)}\\ b_{x}^{(1)}\\ b_{y}^{(1)}\\ b_{w}^{(1)}\\ b_{h}^{(1)}\\ c_{1}^{(1)}\\ c_{2}^{(1)}\\ c_{3}^{(1)}\\
p_{c}^{(2)}\\ b_{x}^{(2)}\\ b_{y}^{(2)}\\ b_{w}^{(2)}\\ b_{h}^{(2)}\\ c_{1}^{(2)}\\ c_{2}^{(2)}\\ c_{3}^{(2)}
\end{bmatrix}
$$
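
A minimal sketch of how such a vector could be filled in, using only plain Python lists. The names `CLASSES`, `MAX_BOXES` and `encode_boxes` are made up for illustration and do not come from the course. The point it tries to show is that the vector has a fixed number of box slots, so its length does not depend on how many objects an image contains; unused slots simply stay at zero, with p_c = 0 meaning "no object".

```python
CLASSES = ["car", "person", "traffic light"]  # the 3 example classes from the answer
MAX_BOXES = 2                                 # fixed number of box slots (assumption)
SLOT = 5 + len(CLASSES)                       # p_c, b_x, b_y, b_w, b_h, c_1..c_3


def encode_boxes(objects):
    """Pack a list of (class_name, b_x, b_y, b_w, b_h) into a fixed-length vector.

    Slots that are not used stay all zeros, so p_c = 0 marks "no object".
    """
    if len(objects) > MAX_BOXES:
        raise ValueError("more objects than box slots")
    y = [0.0] * (MAX_BOXES * SLOT)
    for i, (name, bx, by, bw, bh) in enumerate(objects):
        start = i * SLOT
        y[start] = 1.0                             # p_c: an object is present
        y[start + 1:start + 5] = [bx, by, bw, bh]  # box midpoint and size
        y[start + 5 + CLASSES.index(name)] = 1.0   # one-hot class entry
    return y


one_object = encode_boxes([("car", 0.5, 0.6, 0.4, 0.3)])
two_objects = encode_boxes([("car", 0.5, 0.6, 0.4, 0.3),
                            ("person", 0.2, 0.7, 0.1, 0.4)])
print(len(one_object), len(two_objects))  # both vectors have the same length
```

With one object, the second slot (entries 8 to 15) is all zeros; with two objects, both slots are filled. Either way the vector has 2 * 8 = 16 entries.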

On the test data: the whole image is fed into the model, not individual grid cells. That is essentially why YOLO is so fast: it looks at the whole image only once.
On the grid: the splitting is done by the CNN itself. Basically, each portion of a convolution's output corresponds to a grid cell. For example, the upper right cell in an image would correspond to the upper right part of the filter outputs (feature maps) in each layer. This is visualized on the left of this image:
On the image size: yes, they have to have the same size. That's what most CNNs expect. All images in the training set must have equal sizes, and so must the images in the test set. The image gets shrunk and deformed into a square of size 608x608.
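
To make the role of the grid more concrete, here is a hedged sketch of how the training labels could be laid out on the grid, while the network itself still receives the whole image. The grid size `S = 3`, the helper names and the choice to store the box size in units of one cell are assumptions made for illustration. Each object is written only into the cell that contains its midpoint, even when its box is larger than that cell.

```python
S = 3            # grid size (assumption; 3x3 keeps the example readable)
NUM_CLASSES = 3  # car, person, traffic light


def empty_target():
    """S x S grid of label vectors [p_c, b_x, b_y, b_w, b_h, c_1, c_2, c_3]."""
    return [[[0.0] * (5 + NUM_CLASSES) for _ in range(S)] for _ in range(S)]


def assign(target, class_id, x, y, w, h):
    """Write one object into the cell that contains its midpoint.

    x, y, w, h are given as fractions of the whole image (0..1).
    Returns the (row, col) of the responsible cell.
    """
    col = min(int(x * S), S - 1)  # cell column that holds the midpoint
    row = min(int(y * S), S - 1)  # cell row that holds the midpoint
    cell = target[row][col]
    cell[0] = 1.0                 # p_c
    cell[1] = x * S - col         # midpoint offset inside the cell, 0..1
    cell[2] = y * S - row
    cell[3] = w * S               # box size in cell units; may exceed 1
    cell[4] = h * S
    cell[5 + class_id] = 1.0      # one-hot class entry
    return row, col


target = empty_target()
big_car = assign(target, 0, 0.50, 0.55, 0.60, 0.40)    # spans several cells
small_car = assign(target, 0, 0.85, 0.20, 0.10, 0.08)  # fits inside one cell
print(big_car, small_car)
```

Here the big car ends up in the center cell with a width of roughly 1.8 cells, so the label describes a box that extends well beyond the cell responsible for it. Note that this sketch overwrites a cell if two midpoints fall into the same one; it does not cover the case of several objects per cell.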

I would recommend that you rewatch the video, because the questions get answered there. I also explained most of your updated questions already in the answer. Your updated questions 1 and 2 are explained in answer 1. For question 3: a grid cell doesn't contain a whole bounding box, but only the midpoint of a bounding box. Question 4 is hopefully clear after the last sentence. I couldn't quite understand 5. Question 6: it is just a stretched or shrunk image. No fancy algorithms involved. – oezguensi, Commented Dec 21, 2018 at 16:42
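
As an illustration of the "stretched or shrunk" remark, the following sketch resizes a tiny image with nearest-neighbour sampling and rescales a pixel bounding box from 720x1280 to 608x608. The course material quoted in the question does not say which interpolation is used, so nearest-neighbour is an assumption here, and `resize_nearest` and `scale_box` are hypothetical helper names.

```python
def resize_nearest(image, new_h, new_w):
    """Stretch or shrink a 2-D list of pixels with nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [[image[r * old_h // new_h][c * old_w // new_w]
             for c in range(new_w)]
            for r in range(new_h)]


def scale_box(box, old_size, new_size):
    """Rescale a pixel box (x_center, y_center, width, height).

    old_size and new_size are (height, width) of the image.
    """
    (old_h, old_w), (new_h, new_w) = old_size, new_size
    sx, sy = new_w / old_w, new_h / old_h  # horizontal and vertical factors differ
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)


# A 6x8 stand-in for a real image, shrunk to 4x4.
tiny = [[r * 10 + c for c in range(8)] for r in range(6)]
small = resize_nearest(tiny, 4, 4)

# A box in the center of a 720x1280 image, mapped to the 608x608 version.
box = scale_box((640, 360, 320, 180), (720, 1280), (608, 608))
print(len(small), len(small[0]), box)
```

Because width and height are scaled by different factors, objects are deformed: the 320x180 box above becomes approximately square. Box coordinates stored as fractions of the image width and height do not change under this stretch; only pixel coordinates need rescaling.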
