DL [Course 3/5] Structuring Machine Learning Projects [Week 2/2]

Key Concepts

Understand what multi-task learning and transfer learning are
Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets

[mathjax]

Course notes for Structuring Machine Learning Projects (deeplearning.ai)

Error Analysis

Carrying out error analysis

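If your algorithm is not yet at human-level performance, manually examining its mistakes shows where to focus.

Look at dev examples to evaluate ideas

Example: a cat classifier with 90% accuracy (10% error), much worse than hoped.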
- Should you try to make your cat classifier do better on dogs?
Error analysis:
- Get ~100 mislabeled (misclassified) dev set examples (5-10 min).
- Count up how many are dogs.
- This gives a “ceiling” on what the effort can achieve:
  - 5/100 mislabeled examples are dogs (5%): error drops from 10% to 9.5% at best.
  - 50/100 mislabeled examples are dogs (50%): error drops from 10% to 5% at best.
This simple counting procedure tells you whether the dog problem is worth focusing on.
Error analysis can save a lot of time when deciding which direction is the most important or the most promising to focus on.
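
The ceiling arithmetic above can be written as a small function. This is a minimal sketch; the function name `error_ceiling` is made up for illustration and the numbers are the ones from the notes.

```python
def error_ceiling(overall_error, fraction_in_category):
    """Best-case overall error if every error in one category were fixed."""
    return overall_error * (1.0 - fraction_in_category)

# Numbers from the notes: 10% overall error, ~100 misclassified dev examples.
few_dogs = error_ceiling(0.10, 5 / 100)    # 5 of the 100 errors are dogs
many_dogs = error_ceiling(0.10, 50 / 100)  # 50 of the 100 errors are dogs
print(f"{few_dogs:.3f} {many_dogs:.3f}")
```

The smaller the fraction, the less there is to gain, no matter how well the category is eventually handled.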

Evaluate multiple ideas in parallel

Ideas for cat detection:
- Fix pictures of dogs being recognized as cats
- Fix great cats (lions, panthers, etc.) being misrecognized
- Improve performance on blurry images

Image	Dog	Great Cats	Blurry	Instagram/Snapchat filter	Comments
1	✓			✓	Pit bull
2			✓	✓
3		✓	✓		Rainy day at zoo
…
% of total	8%	43%	61%	12%

Working on great cats and blurry images has the largest potential improvement.
The ceiling on how much performance could improve is much higher for these categories (43% and 61% of the errors) than for dogs (8%).
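
The spreadsheet tally can also be done in a few lines of code. This is a sketch under assumptions: the three rows mirror the three example rows shown in the table, and the tag names and the helper `category_shares` are hypothetical. The percentages in the table come from all ~100 examined images, not from these three.

```python
from collections import Counter

# Hypothetical hand-tagged rows: one set of tags per misclassified dev image.
tagged_errors = [
    {"dog", "instagram"},      # image 1
    {"blurry", "instagram"},   # image 2
    {"great_cat", "blurry"},   # image 3
]

def category_shares(rows):
    """Fraction of examined errors carrying each tag (tags may overlap)."""
    counts = Counter(tag for tags in rows for tag in tags)
    return {tag: n / len(rows) for tag, n in counts.items()}

shares = category_shares(tagged_errors)
for tag, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {share:.0%}")
```

Because one image can carry several tags, the shares do not have to sum to 100%.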

Clearing up incorrectly labeled data

Incorrectly labeled examples (in training set)

DL algorithms are quite robust to random errors in the training set
- e.g. the labeler randomly hit the wrong key
- no need to fix them when the total data set is big enough
They are less robust to systematic errors
- e.g. the labeler consistently labels white dogs as cats

Error analysis in dev/test set

Image	Dog	Great Cat	Blurry	Incorrectly labeled	Comments
…
98				✓	Labeler missed cat in background
99		✓
100				✓	Drawing of a cat; Not a real cat.
% of total	8%	43%	61%	6%

	Case 1	Case 2
Overall dev set error	10%	2%
Errors due to incorrect labels	0.6%	0.6%
Errors due to other causes	9.4%	1.4%
Conclusion	Fixing the incorrect labels is low priority	Worth fixing the incorrect labels
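
One way to see why the two columns of the table lead to different conclusions is to compute what share of the measured dev error is caused by wrong labels. This is a minimal sketch; the function name `label_error_share` is an assumption for illustration.

```python
def label_error_share(overall_error, label_error):
    """Fraction of the measured dev error that is caused by wrong labels."""
    return label_error / overall_error

first_column = label_error_share(0.10, 0.006)   # 10% overall, 0.6% from labels
second_column = label_error_share(0.02, 0.006)  # 2% overall, 0.6% from labels
print(f"{first_column:.0%} {second_column:.0%}")
```

In the second column a large share of the measured error comes from the labels themselves, so the dev set becomes a less reliable judge between two classifiers.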

The goal of the dev set is to help you select between two classifiers A and B.

Correcting incorrect dev/test set examples

Apply the same process to your dev and test sets to make sure they continue to come from the same distribution.
Consider examining examples your algorithm got right as well as ones it got wrong.
- This is not always done.
Train and dev/test data may now come from slightly different distributions.
Wrap-up advice
- In building practical systems, there is often also more manual error analysis and more human insight going into the system.
- Some engineers and researchers are reluctant to look at examples manually, but sitting down with 100 or 200 examples and counting the errors is worth it: these minutes, or maybe a small number of hours, of counting can really help you prioritize where to go next.

Build your first system quickly, then iterate

Speech recognition example

Noisy background
- Cafe noise
- Car noise
Accented speech
Far from microphone
Young children’s speech
Stuttering

Set up dev/test set and metric
Build initial system quickly
Use bias/variance analysis and error analysis to prioritize next steps
Build your first system quickly, then iterate
- build something quick and dirty, then let the analysis decide what to improve

Mismatched training and dev/test set

Training and testing on different distributions

Cat app example
- Data from webpages
  - 200,000
- Data from mobile app (care about this)
  - 10,000
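- Option 1
  - merge and shuffle
  - pros: train/dev/test come from the same distribution
  - cons: most of the 2,500 dev examples come from web pages (200k/210k); only about 119 are mobile app images
- Option 2
  - train: 205k (200k web + 5k mobile), dev: 2.5k mobile, test: 2.5k mobile
  - pros: dev/test aim at the target you care about (mobile app images)
  - cons: the training distribution differs from the dev/test distribution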
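
The second split of the cat app data can be sketched as follows. The tuples stand in for real images, and the function name `split_option_2` and its parameters are assumptions for illustration only.

```python
import random

def split_option_2(web, mobile, n_dev=2500, n_test=2500, seed=0):
    """Dev and test are mobile-only; leftover mobile data joins the web data in train."""
    rng = random.Random(seed)
    mobile = list(mobile)
    rng.shuffle(mobile)
    dev = mobile[:n_dev]
    test = mobile[n_dev:n_dev + n_test]
    train = list(web) + mobile[n_dev + n_test:]
    rng.shuffle(train)
    return train, dev, test

# Stand-in identifiers instead of real images (assumption for illustration).
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]
train, dev, test = split_option_2(web, mobile)
print(len(train), len(dev), len(test))

# With a merged and shuffled data set, the expected number of mobile images
# in a 2,500-example dev set is only about 119.
expected_mobile_in_dev = 2500 * 10_000 / 210_000
```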
Speech recognition example
- Training
  - Purchased data (from vendor)
  - Smart speaker control
  - Voice keyboard
  - 500k utterances (+10k moved over from the dev/test source)
- Dev/test
  - Speech-activated rearview mirror
  - dev/test = 5k/5k (the remaining 10k go to train)

Bias and Variance with mismatched data distributions

Cat classifier example

Assume humans get \(\approx 0\)% error

Training error 1%
Dev error 10%
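Is this a variance problem? Not necessarily:
- the training set is high-resolution;
- the dev set contains much more difficult images.
- Two hypotheses:
  - the algorithm does well only on data it saw in training and fails to generalize to unseen data (variance);
  - the training and dev sets come from different distributions (data mismatch).
- Because two things changed at the same time between training error and dev error, it is difficult to tell which one applies.
Training-dev set: same distribution as the training set, but not used for training.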

Training error	1%	1%
Training-dev error	9%	1.5%
Dev error	10%	10%
	Variance problem	Data mismatch

Human error (proxy for Bayes error)	0%	0%
Training error	10%	10%
Training-dev error	11%	11%
Dev error	12%	20%
	Avoidable bias	Avoidable bias + data mismatch

Bias/variance on mismatched training and dev/test sets

Human level error	4%	4%
	↕ avoidable bias
Training error	7%	7%
	↕ variance
Training-dev error	10%	10%
	↕ data mismatch
Dev error	12%	6%
	↕ degree of overfitting to dev set
Test error	12%	6%
		Dev/test error can sometimes be lower than training-dev error, when the dev/test distribution is much easier for the application you are working on.
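
The gaps in the tables above can be computed mechanically. This is a minimal sketch; the function name `error_gaps` is made up for illustration, and the inputs are the cat classifier numbers from this section, in percent.

```python
def error_gaps(human, train, train_dev, dev, test=None):
    """Break the error down into the gaps between consecutive error rates."""
    gaps = {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }
    if test is not None:
        gaps["overfitting to dev set"] = test - dev
    return gaps

# Training 1%, training-dev 9%, dev 10%: the big jump is train -> training-dev.
case_a = error_gaps(human=0, train=1, train_dev=9, dev=10)
# Training 1%, training-dev 1.5%, dev 10%: the big jump is training-dev -> dev.
case_b = error_gaps(human=0, train=1, train_dev=1.5, dev=10)

for gaps in (case_a, case_b):
    print(max(gaps, key=gaps.get), gaps)
```

A negative data mismatch gap (for example training-dev 10%, dev 6%) corresponds to the case where the dev/test distribution is easier than the training distribution.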

More general formulation

	General speech recognition		Rearview mirror speech recognition
Human level	“Human level” 4%	\(\leftrightarrow\)	6%
	\(\Updownarrow\)		\(\updownarrow\)	\(\updownarrow\) avoidable bias
Error on examples trained on	“Training error” 7%	\(\leftrightarrow\)	6%
	\(\Updownarrow\)		\(\updownarrow\)	\(\updownarrow\) variance
Error on examples not trained on	“Training-dev error” 10%	\(\Leftrightarrow\)	“Dev/Test error” 6%
		\(\longleftrightarrow\) data mismatch

\(\Updownarrow\): comparisons that are generally effective
\(\updownarrow\): comparisons that give additional insight

Addressing data mismatch

Carry out manual error analysis to try to understand difference between training and dev/test sets
- E.g. noisy audio (car noise), street numbers
Make training data more similar; or collect more data similar to dev/test sets
- E.g. simulate noisy in-car data
Artificial data synthesis, example 1
- “The quick brown fox jumps over the lazy dog.”(include a-z)
- + Car noise
- = Synthesized in-car audio
Caution in data synthesis
- 10k hours of speech and only 1 hour of car noise
- looping the 1 hour of car noise 10k times
  - risk of overfitting to that 1 hour of car noise, which is a very small subset of all possible car noise
- 10k hours of unique car noise
  - might improve performance, though this is not guaranteed
Artificial data synthesis, example 2
- computer-generated cars, with only 20 unique cars in the game
- the network overfits to that subset of 20 cars
When you have a data mismatch problem
- do error analysis
- understand how the training set and the dev set differ
- get more training data that looks like the dev set
- artificial data synthesis can work very well, but be cautious and bear in mind that you might accidentally be simulating only a tiny subset of the space of all possible examples
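
A toy version of the "speech + car noise" synthesis is sketched below. Plain Python lists stand in for real waveforms, and the function name `mix_with_noise` and its parameters are hypothetical. When the noise clip is shorter than the speech it wraps around, which is the looping that the caution above is about.

```python
import random

def mix_with_noise(speech, noise, noise_gain=1.0, rng=None):
    """Add a noise segment, starting at a random offset, to a speech clip.

    The noise wraps around when it is shorter than the speech, so a small
    noise collection gets reused over and over.
    """
    rng = rng or random.Random(0)
    start = rng.randrange(len(noise))
    return [
        s + noise_gain * noise[(start + i) % len(noise)]
        for i, s in enumerate(speech)
    ]

# Toy sample values standing in for real audio (assumption for illustration).
speech = [0.0, 0.2, 0.4, 0.2, 0.0, -0.2]
car_noise = [0.05, -0.05, 0.1]
synthesized = mix_with_noise(speech, car_noise, noise_gain=0.5)
```

With only three noise samples, every synthesized clip reuses the same values. The more unique noise is available, the less the model can memorize it.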

Learning from multiple tasks

Transfer learning

Image
Image recognition (pre-training)
- <1,000k examples>
Radiology diagnosis (fine-tuning)
- swap in the new data set
- (x: radiology images, y: diagnosis) <100 examples>
- replace the last output layer and randomly re-initialize the weights feeding into it: \(w^{[L]},b^{[L]}\)
- small data set: retrain only the last layer
- lots of data: retrain the whole network
Audio
- pre-training: speech recognition <10,000h>
- fine-tuning: wake word/trigger word detection <1h>
- the last output layer can also be replaced with several new layers
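
The layer swap can be illustrated without any deep learning framework. The sketch below represents a network as a list of dictionaries; this data structure and the names `new_layer` and `prepare_for_fine_tuning` are assumptions for illustration, not a real library API. It only shows which parameters are kept, which are re-initialized, and which are marked for retraining.

```python
import random

def new_layer(n_in, n_out, rng):
    """Randomly initialized dense layer, marked as trainable."""
    return {
        "W": [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
        "trainable": True,
    }

def prepare_for_fine_tuning(pretrained_layers, n_outputs, retrain_all=False, seed=0):
    """Keep the pre-trained layers and replace the output layer with a fresh one.

    retrain_all=False: small target data set, only the new last layer is trained.
    retrain_all=True:  plenty of target data, every layer is trained further.
    """
    rng = random.Random(seed)
    kept = [dict(layer, trainable=retrain_all) for layer in pretrained_layers[:-1]]
    n_in = len(pretrained_layers[-1]["W"][0])  # inputs feeding the old output layer
    return kept + [new_layer(n_in, n_outputs, rng)]

# Hypothetical tiny "pre-trained" network: 8 -> 5 -> 3 -> 1000 classes.
rng = random.Random(1)
pretrained = [new_layer(8, 5, rng), new_layer(5, 3, rng), new_layer(3, 1000, rng)]
radiology_model = prepare_for_fine_tuning(pretrained, n_outputs=1)
print([layer["trainable"] for layer in radiology_model])
```

In a real framework the same idea is usually expressed by freezing the earlier layers and attaching a new output layer; the exact calls depend on the framework.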

When transfer learning makes sense

Task A and B have the same input x.
You have a lot more data for Task A than Task B.
Low level features from A could be helpful for learning B.

Multi-task learning

Simplified autonomous driving example

camera image: \(x^{(i)}\)

	\(y^{(i)}: (4,1)\)
pedestrians	0
cars	1
stop sign	1
traffic lights	0

\[
Y=\begin{bmatrix}
\vdots & \vdots & \vdots & & \vdots \\
y^{(1)}& y^{(2)}& y^{(3)}& \dots & y^{(m)}\\
\vdots & \vdots & \vdots & & \vdots
\end{bmatrix}
\]

Neural network architecture

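\[
\hat y^{(i)}\in \Bbb R^{4\times1}\\
\text{Loss: }\frac1m \sum_{i=1}^m \underbrace{\sum_{j=1}^4 L(\hat y_j^{(i)}, y_j^{(i)})}_{\text{unlike softmax regression}}\\
L(\hat y_j^{(i)}, y_j^{(i)}) = -y_j^{(i)}\log \hat y_j^{(i)} - (1-y_j^{(i)})\log(1-\hat y_j^{(i)}) \quad \text{(usual logistic loss)}
\]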

Unlike softmax regression:
- One image can have multiple labels.
- It is still nothing more than training a single NN.
You could train 4 separate NNs, but:
- Earlier-layer features can be shared, so training 1 NN for all four tasks usually performs better.
Partially labeled data (“?” = label missing): sum only over the values of j that have a 0/1 label:
\[
Y=\begin{bmatrix}
1 & 1 && 0 && ?\\
0 & 1 && 1 && 1\\
? & ? &\dots& 1 &\dots& ?\\
? & ? && 0 && ?\\
\end{bmatrix}
\]
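
The loss with missing labels can be sketched in plain Python. The function names are made up for illustration, and `None` plays the role of the "?" entries. Here each inner list holds the four labels of one example, so the layout is the transpose of the matrix \(Y\) above.

```python
import math

def logistic_loss(y_hat, y):
    """Binary cross-entropy for one label."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def multitask_loss(Y_hat, Y):
    """Average over examples of the summed per-label losses.

    Y_hat[i][j] is the predicted probability of label j for example i;
    Y[i][j] is 0, 1, or None when that label was not annotated ("?").
    """
    total = 0.0
    for y_hat_i, y_i in zip(Y_hat, Y):
        total += sum(
            logistic_loss(p, y) for p, y in zip(y_hat_i, y_i) if y is not None
        )
    return total / len(Y)

# Two examples, four labels each: pedestrians, cars, stop signs, traffic lights.
Y = [[0, 1, 1, 0], [1, None, None, 0]]
Y_hat = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.9, 0.1, 0.5]]
loss = multitask_loss(Y_hat, Y)
print(loss)
```

The predictions for the unlabeled entries (0.9 and 0.1 in the second example) do not affect the result, because those terms are skipped in the sum.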

When multi-task learning makes sense

Training on a set of tasks that could benefit from having shared lower-level features.
Usually: amount of data you have for each task is quite similar.
You can train a big enough neural network to do well on all the tasks (otherwise performance suffers).

Multi-task learning is used less often than transfer learning.
With a small data set, use transfer learning.
The exception where multi-task learning is used a lot is computer vision object recognition.

End-to-end deep learning

What is end-to-end deep learning?

Speech recognition

x				“c a t”				y
audio	MFCC →	features	ML →	Phonemes	→	words	→	transcript
audio	→	→	→	→	→	→	→	transcript
audio	→	ML →	→	Phonemes	→	…	→	transcript

3,000h of data: the traditional pipeline approach actually works just as well.
10k-100k hours of data: the end-to-end approach suddenly starts to work very well.
Medium amount of data: bypass the features and have the neural network learn to output the phonemes directly.

Face recognition

Camera image (x) -> Identity (y)
- the person can be at any position in the camera's view
Camera image (x) -> Zoomed-in face (x’) -> Identity (y)
Why this works
1. Each of the two problems you’re solving is actually much simpler.
2. You have a lot of data for each of the two sub-tasks.
Breaking the problem down into two sub-problems gives better performance than a pure end-to-end deep learning approach.

Machine translation
- x: English
- y: French
- lots of English-to-French paired data
- end-to-end DL works well
Estimating child’s age
- X: x-ray picture
- Y: age of child
- steps: image -> bones -> age
- not enough data to train this task end-to-end
- task 1: bone segmentation is a simple problem
- task 2: map bone lengths to age using statistics on children's hand sizes
- the pipeline approach may work well

Whether to use end-to-end deep learning

Pros and cons of end-to-end deep learning

Pros
- Let the data speak
- Less hand-designing of components needed
Cons
- May need large amount of data
- Excludes potentially useful hand-designed components

Applying end-to-end deep learning

Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?
Autonomous driving example
- Choose the X->Y mappings carefully, depending on which tasks you can get data for.