
# DL [Course 4/5] Convolutional Neural Networks [Week 3/4] Object detection

Key Concepts

• Remember the vocabulary of object detection (landmark, anchor, bounding box, grid, …)
• Understand the challenges of Object Localization, Object Detection and Landmark Finding
• Understand and implement intersection over union
• Understand and implement non-max suppression
• Understand how we label a dataset for an object detection application


## Detection algorithms

### Object Localization

#### What are localization and detection?

• What are localization and detection?
• Image classification
• “Car”
• 1 object
• Classification with localization
• “Car”
• localization (bounding box)
• 1 object
• Detection
• multiple objects

#### Classification with localization

• Classification with localization$\boxed{\text{image}}\rightarrow \boxed{\text{ConvNet}}\rightarrow \text{FC} \rightarrow \text{softmax}(4)+\underbrace{b_x,b_y,b_h,b_w}_{\text{bounding box}}$

#### Defining the target label y

• object
1. pedestrian
2. car
3. motorcycle
4. background
• Need to output $$b_x,b_y,b_h,b_w$$ and the class label (1-4)
• $$P_c$$: probability that an object exists$y=\begin{bmatrix} P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix}$
• class 1-3: $$P_c=1$$
• class 4: $$P_c=0$$
• case x=Car: $y=\begin{bmatrix} 1\\b_x\\b_y\\b_h\\b_w\\0\\1\\0 \end{bmatrix}$
• case x=background: $y=\begin{bmatrix} 0\\?\\?\\?\\?\\?\\?\\? \end{bmatrix}$
• don’t care
• Loss function$L(\hat y, y)=\begin{cases} (\hat y_1 - y_1)^2 + (\hat y_2 - y_2)^2 + \dots + (\hat y_8 - y_8)^2 & (y_1=1)\\ (\hat y_1 - y_1)^2 & (y_1=0) \end{cases}$
• error
• example
• squared error is used here just to simplify the description
• c1,c2,c3
• softmax
• bx,by,bh,bw
• squared error
• Pc
• logistic regression
• probably squared error
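As a rough illustration of the target label described above, here is a minimal sketch (the helper names and class mapping are mine, not the course's code) that builds the 8-dimensional y vector for an object or a background example:

```python
import numpy as np

# Hypothetical sketch of building y = [P_c, b_x, b_y, b_h, b_w, c1, c2, c3].
CLASSES = {"pedestrian": 0, "car": 1, "motorcycle": 2}

def make_label(obj_class=None, box=None):
    """Return the 8-dim target vector; box = (b_x, b_y, b_h, b_w)."""
    y = np.zeros(8)
    if obj_class is None:          # background: P_c = 0, rest is "don't care"
        return y                   # zeros stand in for the don't-care entries
    y[0] = 1.0                     # P_c = 1: an object is present
    y[1:5] = box                   # bounding box b_x, b_y, b_h, b_w
    y[5 + CLASSES[obj_class]] = 1  # one-hot class indicator c1..c3
    return y

y_car = make_label("car", (0.5, 0.7, 0.3, 0.4))
```

For a background example, the "don't care" entries would be masked out of the loss in practice; zeros are only a placeholder here.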

### Landmark Detection

• In more general cases, the NN can just output the X and Y coordinates of important points in an image; these points are sometimes called landmarks.
• face recognition
• landmark $$l_{1x},l_{1y}\dots l_{64x},l_{64y}$$
• along the eyes, the mouth shape…
• extract a few key points along the edges…
• image->convnet->y $y=\begin{bmatrix} P_c\\ l_{1x}\\l_{1y}\\\vdots \\ l_{64x}\\l_{64y} \end{bmatrix}$
• 128+1=129 units
• someone will have had to go through and laboriously annotate all of these landmarks.
• people pose detection
• landmark
• key positions like the midpoint of the chest, the left shoulder, elbow,wrist…
• 32+1=33 units (if you use 32 coordinates to specify the pose of the person)
• identity of landmark one must be consistent across different images.

### Object Detection

#### Sliding windows detection

• Car detection example
• training set
• x: car or other
• ConvNet -> y
• small rectangular region
• feed it into the ConvNet, then predict $$y \in \{0,1\}$$
• sliding through every region of this size
• repeat sliding window but use a larger window.
• Sliding Windows Detection Algorithm
• square boxes,and slide them across the entire image
• classify every square region with some stride as containing a car or not.
• computational cost
• running a ConvNet classifier on every sliding window is much more expensive.
• Solution: convolutional implementation of sliding windows
• before the rise of NN
• people used to use much simpler classifiers like a simple linear classifier over hand engineer features in order to perform object detection.
• In that era, because each classifier was relatively cheap to compute (it was just a linear function), Sliding Windows Detection ran okay.
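The naive algorithm above can be sketched as a double loop over window positions. `classifier` here is a hypothetical stand-in for the trained car/no-car ConvNet; the toy usage just flags windows with high mean intensity:

```python
import numpy as np

# Sketch of naive sliding-windows detection (my own helper, not course code).
def sliding_windows(image, window, stride, classifier):
    H, W = image.shape[:2]
    detections = []
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            crop = image[top:top + window, left:left + window]
            if classifier(crop):        # 1 = car, 0 = other
                detections.append((top, left, window))
    return detections

# Toy usage: "detect" windows whose mean intensity is very high.
img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0
hits = sliding_windows(img, window=4, stride=2,
                       classifier=lambda c: c.mean() > 0.9)
```

Every window position runs the classifier from scratch, which is exactly the computational cost the convolutional implementation below removes.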

### Convolutional Implementation of Sliding Windows

#### Turning FC layer into convolutional layers

• 14,14,3
• filter: 5,5
• 10,10,16
• max_pool: 2,2
• 5,5,16
• FC
• 400units
• FC
• 400units
• y: softmax(4) 4units
• 14,14,3
• filter: 5,5
• 10,10,16
• max_pool: 2,2
• 5,5,16
• FC, filter: 5,5,400
• 1,1,400
• FC, filter: 1,1,400
• 1,1,400
• filter 1,1
• 1,1,4

#### Convolution implementation of sliding windows

• 14,14,3
• filter: 5,5
• 10,10,16
• max_pool: 2,2
• 5,5,16
• FC, filter: 5,5,400
• 1,1,400
• FC, filter: 1,1,400
• 1,1,400
• FC: 1,1
• 1,1,4
• 16,16,3
• filter: 5,5
• 12,12,16
• max_pool: 2,2
• 6,6,16
• FC: 5,5
• 2,2,400
• FC: 1,1
• 2,2,400
• FC: 1,1
• 2,2,4
• Instead of running forward propagation on 4 subsets of the input image independently, it combines all four into one forward computation and shares a lot of the computation in the regions of the image that are common.
• 28,28,3
• filter: 5,5
• 24,24,16
• max_pool: 2,2
• 12,12,16
• FC: 5,5
• 8,8,400
• FC: 1,1
• 8,8,400
• FC: 1,1
• 8,8,4
• convolutionally make all the predictions at the same time by one forward pass through this big convnet.
• But one weakness is that the positions of the bounding boxes are not going to be very accurate.
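The layer shapes listed above follow from the standard valid-convolution size formula. A small sketch (helper names are mine) that reproduces the 1x1, 2x2, and 8x8 prediction grids for the 14x14, 16x16, and 28x28 inputs:

```python
# Valid convolution / pooling output size: (n - f) // stride + 1.
def conv_out(n, f, stride=1):
    return (n - f) // stride + 1

def sliding_windows_grid(n):
    """Spatial size of the final prediction map for an n x n input."""
    n = conv_out(n, 5)       # 5x5 conv:      14 -> 10, 16 -> 12, 28 -> 24
    n = conv_out(n, 2, 2)    # 2x2 max pool:  10 -> 5,  12 -> 6,  24 -> 12
    n = conv_out(n, 5)       # FC turned into a 5x5x400 conv
    return n                 # remaining 1x1 convs leave the size unchanged
```

Each cell of the resulting grid corresponds to one sliding-window position in the original image, computed in a single shared forward pass.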

### Bounding Box Predictions

Output accurate bounding boxes

#### YOLO algorithm

• YOLO algorithm$\begin{array}{ccc} \Box & \Box & \Box\\ \color{green}{[🚗]} & \Box & \color{yellow}{[🚗]}\\ \Box & \Box & \Box \end{array}\\ y=\begin{bmatrix} P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix} \color{purple}{ \begin{bmatrix} 0\\?\\?\\?\\?\\?\\?\\? \end{bmatrix} } \color{green}{ \begin{bmatrix} 1\\b_x\\b_y\\b_h\\b_w\\0\\1\\0 \end{bmatrix}} \color{yellow}{ \begin{bmatrix} 1\\b_x\\b_y\\b_h\\b_w\\0\\1\\0 \end{bmatrix} }$
• input image 100x100x3
• divided by 3×3
• Target output label: 3,3,8
• NN outputs precise bounding boxes
• whether there is an object associated with each of the 9 positions
• and if there is an object, what object it is.
• and where the bounding box for the object in that grid cell is, so long as you don’t have more than one object in each grid cell
• object assigned only to one of the 9 grid cells.
• with a finer grid, the chance of multiple objects appearing in the same grid cell is smaller.
• This is a convolutional implementation: you don’t run it 9 times on the 3×3 grid or 361 times on a 19×19 grid. Instead, it is one single convolutional implementation, a pretty efficient algorithm.
• YOLO paper is one of the harder papers to read.

#### Specify the bounding boxes

• $$0 \leq b_x, b_y \lt 1$$
• top left of the grid cell: 0,0
• bottom right of the grid cell: 1,1
• $$0 \leq b_w, b_h$$, could be > 1
• can be larger than the grid cell size.
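To make the parameterization concrete, here is a sketch (my own helper, not from the lecture) that converts a cell-relative box into absolute image coordinates for a g x g grid:

```python
# b_x, b_y are in [0, 1) relative to the cell; b_h, b_w are in units of
# the cell size and may exceed 1 when the object spans multiple cells.
def cell_box_to_image(row, col, bx, by, bh, bw, g, img_size):
    cell = img_size / g
    cx = (col + bx) * cell          # box center, image coordinates
    cy = (row + by) * cell
    w, h = bw * cell, bh * cell     # width/height scaled by cell size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

For example, a box centered in the middle cell of a 3×3 grid over a 99-pixel image, with b_h = b_w = 1, covers exactly that cell.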

### Intersection Over Union

• Evaluating object localization
• Intersection Over Union (IoU)$IoU=\frac {\text{Size of intersection area}}{\text{Size of union area}}$
• same: $$IoU = 1$$
• correct: $$IoU \geq 0.5$$
• this is just a convention, with no particularly deep theoretical reason for it.
• sometimes see people use more stringent criteria like 0.6 or maybe 0.7
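A direct implementation of the IoU definition above, for boxes in corner format (x1, y1, x2, y2):

```python
# IoU = intersection area / union area for two axis-aligned boxes.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 if boxes don't overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

Identical boxes give 1.0, disjoint boxes give 0.0, and the "correct if IoU ≥ 0.5" convention falls out of comparing this value to a threshold.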

### Non-max Suppression

• problem
• multiple detections of each object
• solution
• $$P_c$$
• at first takes the largest one
• the remaining rectangles with high overlap (high IoU) with the one you’ve just output get suppressed.
• next go through the remaining rectangles.
• non-max means
• you’re going to output your maximal probabilities classifications but suppress the close-by ones that are non-maximal.

#### Non-max suppression algorithm

• Each output prediction is: $\begin{bmatrix} p_c\\ b_x\\ b_y\\ b_h\\ b_w \end{bmatrix}$
• to simplify, assume we only do car detection, so drop the c1, c2, c3
• (see programming assignment)
• Discard all boxes with $$p_c \leq 0.6$$
• While there are any remaining boxes:
• Pick the box with the largest $$p_c$$ Output that as a prediction.
• Discard any remaining box with $$IoU \geq 0.5$$ with the box output in the previous step
• detect three objects say pedestrians,cars, and motorcycles
• output vector will have 3 components.
• independently carry out non-max suppression 3 times, each output classes.
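The algorithm above can be sketched for a single class as follows (function and variable names are my own; the 0.6 and 0.5 thresholds follow the text, and an inline IoU is included so the sketch is self-contained):

```python
# Single-class non-max suppression; boxes are (p_c, x1, y1, x2, y2).
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(boxes, score_thresh=0.6, iou_thresh=0.5):
    boxes = sorted((b for b in boxes if b[0] > score_thresh),
                   key=lambda b: b[0], reverse=True)   # discard low p_c
    kept = []
    while boxes:
        best, boxes = boxes[0], boxes[1:]   # pick the largest remaining p_c
        kept.append(best)
        boxes = [b for b in boxes           # suppress heavy overlaps
                 if iou(best[1:], b[1:]) < iou_thresh]
    return kept
```

With three classes you would run this once per class, as the notes say.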

### Anchor Boxes

• Overlapping objects
• Anchor box 1 (e.g., a tall shape chosen by hand, like a pedestrian)$\color{green}{\boxed{\displaylines{\cdot \\ \bullet \\ \cdot }}}$
• Anchor box 2 (e.g., a wide shape chosen by hand, like a car)$\color{yellow}{\boxed{\displaylines{\cdot \bullet \cdot }}}$

• $y=\begin{bmatrix} \color{green}{P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3}\\ \color{yellow}{P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 } \end{bmatrix} , \begin{bmatrix} \color{green}{1\\b_x\\b_y\\b_h\\b_w\\1\\0\\0}\\ \color{yellow}{1\\b_x\\b_y\\b_h\\b_w\\0\\1\\0 } \end{bmatrix} , \overbrace{\begin{bmatrix} \color{green}{0\\?\\?\\?\\?\\?\\?\\?}\\ \color{yellow}{1\\b_x\\b_y\\b_h\\b_w\\0\\1\\0 } \end{bmatrix}}^{\text{car only?}}$
• Anchor box algorithm
• Previously
• Each object in training image is assigned to grid cell that contains that object’s midpoint.
• output y: 3,3,8
• With two anchor boxes
• Each object in training image is assigned to grid cell that contains object’s midpoint and anchor box for the grid cell with highest IoU.
• output y: 3,3,16 or 3,3,2,8
• 2 anchor box
• 8 dimensions
• cases it can’t handle well (you just implement a default tiebreaker, but they happen quite rarely, especially with a 19×19 rather than a 3×3 grid):
• if you have 2 anchor boxes but three objects in the same grid cell
• if you have 2 objects associated with the same grid cell
• Why anchor box?
• it allows your learning algorithm to specialize better (e.g., wide objects like cars vs. tall, skinny objects like pedestrians)
• How do you choose anchor box?
• people used to just choose them by hand, maybe five or ten anchor boxes
• that span a variety of shapes covering the types of objects you want to detect.
• Advanced: a k-means algorithm for choosing anchor boxes (see the YOLO paper), but choosing by hand should work as well.
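One common way to implement the "highest IoU" anchor assignment is to compare shapes only, as if the object and anchor shared the same center. A sketch under that assumption (the anchor shapes below are made-up examples):

```python
# IoU of two (width, height) shapes aligned at the same center.
def shape_iou(wh_a, wh_b):
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(obj_wh, anchors):
    """Index of the anchor whose shape best matches the object."""
    return max(range(len(anchors)),
               key=lambda i: shape_iou(obj_wh, anchors[i]))

anchors = [(0.3, 0.8),   # tall, skinny: pedestrian-like
           (0.8, 0.3)]   # wide, flat: car-like
```

A tall, skinny object lands in anchor 0's slot of the output vector; a wide, flat one lands in anchor 1's slot, which is what lets each anchor specialize.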

### YOLO Algorithm

• Training
• objects
1. pedestrian
2. car
3. motorcycle
• y: 3,3,2,8$y=\begin{bmatrix} P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3\\ P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix} , \begin{bmatrix} 0\\?\\?\\?\\?\\?\\?\\?\\ 0\\?\\?\\?\\?\\?\\?\\? \end{bmatrix} , \begin{bmatrix} 0\\?\\?\\?\\?\\?\\?\\?\\ 1\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix}$
• Making predictions
• x:image
• y: 3,3,2,8$y= \begin{bmatrix} P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3\\ P_c\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix} , \begin{bmatrix} 0\\?\\?\\?\\?\\?\\?\\?\\ 0\\?\\?\\?\\?\\?\\?\\? \end{bmatrix} , \begin{bmatrix} 0\\?\\?\\?\\?\\?\\?\\?\\ 1\\b_x\\b_y\\b_h\\b_w\\c_1\\c_2\\c_3 \end{bmatrix}$
• Outputting the non-max suppressed outputs
• For each grid cell, get 2 predicted bounding boxes.
• Get rid of low probability predictions.
• For each class (pedestrian, car, motorcycle) use non-max suppression to generate final predictions.
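The 3,3,2,8 training target described above can be sketched as follows; the object tuple format and helper name are my own, matching the cell-and-anchor assignment rule in the anchor box section:

```python
import numpy as np

# Build the grid x grid x n_anchors x (5 + n_classes) training target.
# Each object is (cell_row, cell_col, anchor_index, box, class_index),
# where box = (b_x, b_y, b_h, b_w); unfilled slots stay 0 / don't-care.
def build_target(objects, grid=3, n_anchors=2, n_classes=3):
    y = np.zeros((grid, grid, n_anchors, 5 + n_classes))
    for (row, col, anchor, box, cls) in objects:
        y[row, col, anchor, 0] = 1.0      # P_c
        y[row, col, anchor, 1:5] = box    # b_x, b_y, b_h, b_w
        y[row, col, anchor, 5 + cls] = 1  # one-hot class (c1, c2, c3)
    return y

# One car (class index 1), midpoint in cell (1, 2), matched to anchor 1.
y = build_target([(1, 2, 1, (0.4, 0.5, 0.9, 1.2), 1)])
```

Reshaping this tensor to 3,3,16 gives the flattened per-cell vector shown in the training example above.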

### (Optional) Region Proposals

• Region proposal: R-CNN
• run your ConvNet classifier on just a few regions that make sense.
• segmentation algorithm
• It is very slow, but it has been influential and you might come across it in your own work.
• Faster algorithms
• R-CNN
• Propose regions. Classify proposed regions one at a time
• Output label + bounding box.
• actually quite slow
• Fast R-CNN
• Propose regions. Use convolution implementation of sliding windows to classify all the proposed regions.
• clustering step to propose the regions is still quite slow
• Faster R-CNN
• Use convolutional network to propose regions.
• faster than R-CNN but most implementations are usually still quite a bit slower than the YOLO algorithm.
• YOLO seems to Andrew like a more promising direction for the long term.

## Programming assignments

### What you should remember:

• YOLO is a state-of-the-art object detection model that is fast and accurate
• It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
• The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
• You filter through all the boxes using non-max suppression. Specifically:
• Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
• Intersection over Union (IoU) thresholding to eliminate overlapping boxes
• Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise.

References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s GitHub repository. The pre-trained weights used in this exercise came from the official YOLO website.