I am taking "Convolutional Neural Networks" on Coursera, taught by Andrew Ng. I am in week 3 and confused about the YOLO algorithm. I checked the course forums on Coursera, but I am still not clear, and it seems that many people are confused about the same things:
- What is the training data? All the images and the locations of the objects in them? So for an image with one car and one man in it, the input would be the vector representation of the image, the locations of the centers (x, y), the height and width of the 2 objects, and the object code of what is contained in each box. So in total: vector representation of the image + 4 coordinates of the 2 centers + 4 heights and widths of the 2 boxes + 2 object codes (one for the man and one for the car). Right?
- What is the test data? The vector representation of the test images only? The course talks about splitting the image into an s×s grid. I initially thought that a small square from the s×s grid is fed as an input, but I don't think that is the case. The entire image must be fed. Is that correct?
- How is a square from the s×s grid used?
- Do the training images and test images have to be the same size? What do they mean by "The car detection dataset has 720x1280 images, which we've pre-processed into 608x608 images." in the week 3 assignment?
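To make the resizing question concrete, here is a minimal sketch of what shrinking a 720x1280 image to 608x608 would do to a box label. It assumes a plain stretch-to-fit resize (no cropping or padding), which may not be what the assignment actually does; `rescale_box` and the example car box are hypothetical names and values, not from the course.

```python
def rescale_box(box, src_size, dst_size):
    """Map a pixel-space box (x_center, y_center, width, height) from an
    image of src_size to the same image resized to dst_size.

    Sizes are (height, width) tuples. Assumes a plain stretch resize.
    """
    x, y, w, h = box
    src_h, src_w = src_size
    dst_h, dst_w = dst_size
    # Horizontal quantities scale with the width ratio,
    # vertical quantities with the height ratio.
    return (x * dst_w / src_w, y * dst_h / src_h,
            w * dst_w / src_w, h * dst_h / src_h)


# Hypothetical car centered in a 720x1280 frame.
box_720p = (640.0, 360.0, 320.0, 180.0)
box_608 = rescale_box(box_720p, (720, 1280), (608, 608))
print(box_608)  # (304.0, 304.0, 152.0, 152.0)
```

Under this assumption the width and height are scaled by different factors, so the 16:9 box above becomes a square one after resizing.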
update 1------------------------------------------
- If a training image has 2 objects and the maximum number of classes is 3, would the input be the vector representation of the image + a vector of length 8*2?
- Would the length of the input vector change based on the number of objects in an individual image?
- I am still not clear how the s×s grid is used. If there is a big car in the middle of the image and YOLO looks at a small square from the center of the s×s grid, then it is impossible to detect that there is a car. We would have to provide a bigger square from the center to YOLO so that it understands there is a car. So what is the use of feeding a small square?
- How do we feed a bigger square?
- Does the process of feeding a square from the s×s grid occur after multiple convolution layers (so that a small square effectively represents a much larger area)?
- Do you have an example of an image and its shrunken version? Is there some data science around shrinking?
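For the questions above about the length of the label, here is a minimal sketch of the encoding as I understand it from the lectures: one 8-number vector (pc, bx, by, bh, bw, c1, c2, c3) per grid cell, so the label for an image is always s×s×8 regardless of how many objects it contains. The function name `encode_labels`, the 3×3 grid, and the example objects are assumptions for illustration, and the sketch ignores anchor boxes.

```python
S = 3            # grid size (assumed 3x3 for illustration)
NUM_CLASSES = 3  # assumed number of classes


def encode_labels(objects, s=S, num_classes=NUM_CLASSES):
    """Build an s x s grid of [pc, bx, by, bh, bw, c1..cN] vectors.

    objects: list of (x_center, y_center, width, height, class_id), with
    coordinates given as fractions of the whole image in [0, 1].
    """
    cell_len = 5 + num_classes
    grid = [[[0.0] * cell_len for _ in range(s)] for _ in range(s)]
    for x, y, w, h, cls in objects:
        # The cell containing the box midpoint is responsible for the object.
        col = min(int(x * s), s - 1)
        row = min(int(y * s), s - 1)
        one_hot = [1.0 if k == cls else 0.0 for k in range(num_classes)]
        # bx, by: midpoint offset inside the cell; bh, bw: size in cell
        # units, which may exceed 1. Without anchor boxes, a second object
        # whose midpoint lands in the same cell overwrites the first.
        grid[row][col] = [1.0, x * s - col, y * s - row, h * s, w * s] + one_hot
    return grid


one_object = encode_labels([(0.5, 0.5, 0.6, 0.4, 1)])
five_objects = encode_labels([
    (0.10, 0.10, 0.1, 0.1, 0), (0.50, 0.10, 0.1, 0.1, 1),
    (0.90, 0.10, 0.1, 0.1, 2), (0.10, 0.90, 0.1, 0.1, 0),
    (0.90, 0.90, 0.1, 0.1, 1),
])


def flat_len(grid):
    """Total number of values in the label grid."""
    return sum(len(cell) for row in grid for cell in row)


print(flat_len(one_object), flat_len(five_objects))
```

If this reading is right, cells with no object midpoint simply keep pc = 0, which is what keeps the label length fixed without any extra padding scheme.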
--------------------update 2
I read the answer and watched the videos again, and I am still not clear.
2. Would the length of the input vector change based on the number of objects in an individual image? For example, if image1 has 5 objects, its input would be much longer than that of an image with only 1 object. How do we feed data where the input is not of fixed width? Do we find the image with the maximum number of objects, use it to decide the input length, and pad the rest of the images with 0s (to make every input vector the same length)?
- I am still confused about the s×s grid. I initially asked: how is a square from the s×s grid used?
After reading the answer below, my updated question is this: I am still not clear how the s×s grid is used. If there is a big car in the middle of the image and YOLO looks at a small square from the center of the s×s grid, then it is impossible to detect that there is a car. We would have to provide a bigger square from the center to YOLO so that it understands there is a car. So what is the use of feeding a small square?
The answer (in a comment on the original answer) says: "3: A grid cell doesn't contain a whole bounding box, but only the mid point of a bounding box."
My confusion is this: as per the earlier discussion, we don't feed a grid cell individually. We feed the entire image once. So what is the point of creating the s×s grid? If the image is looked at only once, how does the algorithm detect, say, two objects (one big car and one small car) in a single pass? We create the grid, and the grid is used only to find the midpoint of the object, yet the entire object is identified. I am still not clear on this part.
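To pin down what I mean by "the grid is used only to find the midpoint", here is a minimal sketch of how I picture one cell's output being turned back into a full-image box. It assumes the lecture's convention as I understand it: bx and by locate the midpoint inside the cell, while bh and bw are measured in cell units and can be larger than 1. `decode_cell` and the two example cars are hypothetical.

```python
def decode_cell(row, col, bx, by, bh, bw, s):
    """Convert one grid cell's box into whole-image fractions.

    Returns (x_center, y_center, width, height), each relative to the
    full image. bh and bw are in cell units, so a value above 1 means
    the box extends beyond the cell that holds its midpoint.
    """
    x_center = (col + bx) / s
    y_center = (row + by) / s
    return (x_center, y_center, bw / s, bh / s)


S = 3  # assumed 3x3 grid

# Big car: midpoint in the center cell, box 2.4 cells wide and 1.8 tall.
big_car = decode_cell(row=1, col=1, bx=0.5, by=0.5, bh=1.8, bw=2.4, s=S)

# Small car: midpoint in the bottom-left cell, box well inside that cell.
small_car = decode_cell(row=2, col=0, bx=0.3, by=0.55, bh=0.24, bw=0.3, s=S)

print(big_car)
print(small_car)
```

In this picture both cars come out of the same single pass: each one is reported by a different cell, and the big car's box covers most of the image even though only its midpoint lies in the center cell.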
I feel that I am not the only one having a hard time understanding YOLO. I saw multiple threads in the comment section of the course asking similar questions, and I would appreciate your patience and guidance.