Key Concepts
- Understand what multi-task learning and transfer learning are
- Recognize bias, variance, and data mismatch by looking at the performance of your algorithm on train/dev/test sets
[mathjax]
Notes from the Structuring Machine Learning Projects course (deeplearning.ai)
Error Analysis
Carrying out error analysis
- If your algorithm is not yet at human-level performance, manually examine its mistakes
Look at dev examples to evaluate ideas
- Example: 90% accuracy (10% error), much worse than human-level
- Should you try to make your cat classifier do better on dogs?
- Error analysis:
- Get ~100 misclassified dev set examples (5-10 min of work).
- Count up how many are dogs.
- This gives a "ceiling" on how much the effort can help (worked arithmetic below)
- 5/100 dogs among the misclassified examples (5%): then potentially from 10% to at best 9.5% error.
- 50/100 dogs among the misclassified examples (50%): then potentially from 10% to 5% error.
- This simple counting procedure tells you whether focusing effort on the dog problem is worthwhile.
- Error analysis can save you a lot of time in deciding what is the most important or most promising direction to focus on.
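The ceiling arithmetic spelled out, using the numbers from the bullets above:

\[
10\% \times \left(1 - \frac{5}{100}\right) = 9.5\%,
\qquad
10\% \times \left(1 - \frac{50}{100}\right) = 5\%
\]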
Evaluate multiple ideas in parallel
- Ideas for cat detection:
- Fix pictures of dogs being recognized as cats
- Fix great cats (lions, panthers, etc.) being misrecognized
- Improve performance on blurry images
| Image | Dog | Great cats | Blurry | Instagram/Snapchat | Comments |
|---|---|---|---|---|---|
| 1 | ✓ | | | ✓ | Pit bull |
| 2 | | | ✓ | ✓ | |
| 3 | | ✓ | ✓ | | Rainy day at zoo |
| … | | | | | |
| % of total | 8% | 43% | 61% | 12% | |
- You can do better by working on great cats and blurry images: the potential improvement is larger.
- The ceiling on how much you could improve performance is much higher there than for dogs (8%) or filters (12%); a counting sketch follows below.
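A minimal sketch of the counting spreadsheet, assuming the tags noted per misclassified image have been collected as Python sets (the tag names are illustrative):

```python
from collections import Counter

# Tags assigned while eyeballing ~100 misclassified dev examples.
# Each entry lists every category that applies to one image.
tags_per_image = [
    {"dog", "instagram"},
    {"blurry", "instagram"},
    {"great_cat", "blurry"},
    # ... one entry per examined example
]

counts = Counter(tag for tags in tags_per_image for tag in tags)
n = len(tags_per_image)
for tag, c in counts.most_common():
    print(f"{tag}: {c}/{n} = {100 * c / n:.0f}% of errors")
```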
Cleaning up incorrectly labeled data
Incorrectly labeled examples (in training set)
- DL algorithms are quite robust to random errors in the training set
- e.g., labels flipped by a labeler randomly hitting the wrong key
- no need to fix these when the total data set is big enough
- DL algorithms are less robust to systematic errors
- e.g., a labeler consistently labels white dogs as cats
Error analysis in dev/test set
| Image | Dog | Great cat | Blurry | Incorrectly labeled | Comments |
|---|---|---|---|---|---|
| … | | | | | |
| 98 | | | | ✓ | Labeler missed cat in background |
| 99 | | ✓ | | | |
| 100 | | | | ✓ | Drawing of a cat; not a real cat |
| % of total | 8% | 43% | 61% | 6% | |
| | Case 1 | Case 2 |
|---|---|---|
| Overall dev set error | 10% | 2% |
| Errors due to incorrect labels | 0.6% | 0.6% |
| Errors due to other causes | 9.4% | 1.4% |
| Conclusion | Fixing incorrect labels is not the priority | Worthwhile to fix the incorrect labels |
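The deciding quantity is the fraction of all errors attributable to incorrect labels:

\[
\frac{0.6\%}{10\%} = 6\% \text{ of errors (Case 1)}
\qquad\text{vs.}\qquad
\frac{0.6\%}{2\%} = 30\% \text{ of errors (Case 2)}
\]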
The goal of the dev set is to help you select between two classifiers A & B; if label noise is a large fraction of the measured difference between them, fixing the labels is what makes that comparison trustworthy.
Correcting incorrect dev/test set examples
- Apply same process to your dev and test sets to make sure they continue to come from the same distribution
- Consider examining examples your algorithm got right as well as ones it got wrong.
- This is not always done: it takes extra effort, since there are far more correct examples to look through.
- Train and dev/test data may now come from slightly different distributions; this is usually acceptable because the training set is much larger.
- Wrap-up advice:
- In building practical systems, there is often more manual error analysis and more human insight going into the systems than people expect.
- Some engineers and researchers are reluctant to manually look at examples, but sitting down to look at 100 or 200 examples and count up the errors is worth it: these minutes, or a small number of hours, can really help you prioritize where to go next.
Build your first system quickly, then iterate
Speech recognition example
- Noisy background
- Cafe noise
- Car noise
- Accented speech
- Far from microphone
- Young children’s speech
- Stuttering
- Set up dev/test set and metric
- Build initial system quickly
- Use bias/variance analysis & error analysis to prioritize next steps
- Build your first system quickly, then iterate
- build something quick and dirty
Mismatched training and dev/test set
Training and testing on different distributions
- Cat app example
- Data from webpages
- 200,000
- Data from mobile app (the distribution we care about)
- 10,000
- Option 1
- merge all 210k images, shuffle, then split into train/dev/test
- pros: train/dev/test come from the same distribution
- cons: a 2,500-image dev set would contain on average ~2,381 webpage images (200k/210k of the data) and only ~119 mobile app images, the ones we care about
- Option 2 (see the split sketch after this list)
- train: 205k (200k web + 5k mobile), dev: 2.5k mobile, test: 2.5k mobile
- pros: dev/test now aim squarely at the mobile app distribution
- cons: the training distribution differs from dev/test (mostly web images)
- Speech recognition example
- Training
- Purchased data (from vendor)
- Smart speaker control
- Voice keyboard
- training = 500k purchased utterances + 10k examples taken from the dev/test distribution
- Dev/test
- Speech-activated rearview mirror (the product we care about)
- dev/test = 5k/5k (with another 10k of this data moved into training)
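A minimal sketch of Option 2's split for the cat app, with shuffled index arrays standing in for the two image sources (sizes from the notes above):

```python
import numpy as np

rng = np.random.default_rng(0)
web = rng.permutation(200_000)     # webpage images
mobile = rng.permutation(10_000)   # mobile app images (the target distribution)

train = np.concatenate([web, mobile[:5_000]])  # 205k: 200k web + 5k mobile
dev = mobile[5_000:7_500]                      # 2.5k, mobile only
test = mobile[7_500:]                          # 2.5k, mobile only
```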
Bias and Variance with mismatched data distributions
Cat classifier example
Assume humans achieve \(\approx 0\%\) error
- Training error 1%
- Dev error 10%
- Variance problem? Not necessarily:
- the training set consists of high-resolution images
- the dev set contains much more difficult images
- Two competing hypotheses:
- the algorithm saw the training data but not the dev data (variance)
- train and dev come from different distributions (data mismatch)
- Because two things changed at the same time, it is difficult to tell which effect explains the gap.
- Training-dev set: same distribution as the training set, but not used for training.
| | Case 1 | Case 2 |
|---|---|---|
| Training error | 1% | 1% |
| Training-dev error | 9% | 1.5% |
| Dev error | 10% | 10% |
| Diagnosis | Variance problem | Data mismatch |
| | Case 3 | Case 4 |
|---|---|---|
| Human error (proxy for Bayes error) | 0% | 0% |
| Training error | 10% | 10% |
| Training-dev error | 11% | 11% |
| Dev error | 12% | 20% |
| Diagnosis | Avoidable bias | Avoidable bias + data mismatch |
Bias/variance on mismatched training and dev/test sets
| | Example 1 | Example 2 |
|---|---|---|
| Human-level error | 4% | 4% |
| ↕ avoidable bias | | |
| Training error | 7% | 7% |
| ↕ variance | | |
| Training-dev error | 10% | 10% |
| ↕ data mismatch | | |
| Dev error | 12% | 6% |
| ↕ degree of overfitting to dev set | | |
| Test error | 12% | 6% |

- In Example 2 the dev/test error is lower than the training-dev error; this can happen when the dev/test distribution is much easier than whatever data the app is trained on.
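The gaps are just pairwise differences, so the decomposition can be made mechanical; a tiny helper, run on the numbers from Example 1 above:

```python
def diagnose(human, train, train_dev, dev, test):
    """Decompose error gaps per the table above (values in percent)."""
    print(f"avoidable bias : {train - human:.1f}")
    print(f"variance       : {train_dev - train:.1f}")
    print(f"data mismatch  : {dev - train_dev:.1f}")
    print(f"dev overfitting: {test - dev:.1f}")

diagnose(human=4, train=7, train_dev=10, dev=12, test=12)
```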
More general formulation
| | General speech recognition | Rearview mirror speech recognition |
|---|---|---|
| Human level | "Human level" 4% | 6% |
| Error on examples trained on | "Training error" 7% | 6% |
| Error on examples not trained on | "Training-dev error" 10% | "Dev/Test error" 6% |

- Vertical gaps (\(\Updownarrow\)) down the left column: avoidable bias (4% → 7%) and variance (7% → 10%), the generally effective measurements
- Vertical gaps (\(\updownarrow\)) down the right column: the same avoidable bias / variance breakdown on rearview mirror data; filling these in can give additional insight
- Horizontal gap (\(\longleftrightarrow\)) on the bottom row: data mismatch
Addressing data mismatch
- Carry out manual error analysis to try to understand difference between training and dev/test sets
- E.g., the dev set audio is much noisier (car noise), and street numbers are frequently misrecognized
- Make training data more similar; or collect more data similar to dev/test sets
- E.g. simulate noisy in-car data
- Artificial data synthesis ex1
- "The quick brown fox jumps over the lazy dog." (contains every letter a-z)
- + Car noise
- = Synthesized in-car audio
- Caution in data synthesis:
- e.g., 10,000 h of speech but only 1 h of car noise
- looping the 1 h of car noise 10,000 times risks overfitting to that single hour, a very small subset of the space of all car noise
- with 10,000 h of unique car noise, performance might improve, but it is hard to know in advance (to the human ear the looped audio sounds fine)
- Artificial data synthesis ex2
- e.g., recognizing cars using CG images from a game with only 20 unique car designs
- risk of overfitting to that subset of 20 cars
- When you have a data mismatch problem:
- recommended: do error analysis to understand how the training set and dev set differ
- then get (or synthesize) more training data that resembles the dev set
- artificial data synthesis can work very well; just be cautious and bear in mind whether you might accidentally be synthesizing from only a tiny subset of the space of all possible examples (a mixing sketch follows below)
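A minimal sketch of the synthesis step, assuming `speech` and `noise` are 1-D float arrays at the same sample rate and the noise clip is longer than the speech; the SNR handling is an illustrative choice, not from the course:

```python
import numpy as np

def synthesize_in_car(speech, noise, rng, snr_db=10.0):
    """Mix clean speech with a random crop of car noise at a given SNR.
    Sampling a fresh noise segment each time helps avoid overfitting to a
    small noise corpus (the 1-hour-looped-10,000-times pitfall)."""
    start = rng.integers(0, len(noise) - len(speech))
    crop = noise[start : start + len(speech)]
    # Scale the noise so speech power / noise power matches snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(crop ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * crop
```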
Learning from multiple tasks
Transfer learning
- Image
- Image recognition (pre-training)
- <1,000k images>
- Radiology diagnosis (fine-tuning)
- swap in the new data set
- (x: radiology images, y: diagnosis) <100 examples>
- swap the last output layer and the weights feeding into it: \(w^{[L]},b^{[L]}\)
- small dataset: retrain the last layer only
- lots of data: retrain the whole network (the pre-trained weights serve as the initialization)
- Audio
- pre-training: Speech recognition <10,000h>
- fine-tuning: wake word/trigger word detection <1h>
- instead of only the output layer, you can swap in multiple new layers (a fine-tuning sketch follows below)
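A minimal fine-tuning sketch in tf.keras, assuming an ImageNet-pretrained ResNet50 stands in for the image-recognition network and a binary radiology label; the names and shapes are illustrative, not the course's code:

```python
import tensorflow as tf

# Pre-trained image-recognition network plays the role of "Task A";
# radiology diagnosis is "Task B".
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # small radiology dataset: train the new head only

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new w[L], b[L]
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# With lots of radiology data, set base.trainable = True instead and
# retrain the whole network (pre-training weights as initialization).
```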
When transfer learning makes sense
- Task A and B have the same input x.
- You have a lot more data for Task A than Task B.
- Low level features from A could be helpful for learning B.
Multi-task learning
Simplified autonomous driving example
camera image: \(x^{(i)}\)
| \(y^{(i)}: (4,1)\) | |
|---|---|
| pedestrians | 0 |
| cars | 1 |
| stop signs | 1 |
| traffic lights | 0 |
\[
Y=\begin{bmatrix}
\vdots & \vdots & \vdots & & \vdots \\
y^{(1)} & y^{(2)} & y^{(3)} & \dots & y^{(m)} \\
\vdots & \vdots & \vdots & & \vdots
\end{bmatrix}
\in \Bbb R^{4\times m}
\]
Neural network architecture
\[
\hat y^{(i)}\in \Bbb R^{4\times1},\qquad
J=\frac1m \sum_{i=1}^m \sum_{j=1}^4 \underbrace{L(\hat y_j^{(i)}, y_j^{(i)})}_{\text{usual logistic loss}}
\]
- Unlike softmax regression, one image can have multiple labels
- This is nothing more than training a single NN
- You could train 4 separate NNs, but:
- the earlier layers' features can be shared, so training 1 NN usually gives better performance
- missing labels: sum only over the values of j with a 0/1 label (a masked-loss sketch follows below): \[
Y=\begin{bmatrix}
1 & 1 & & 0 & & ?\\
0 & 1 & & 1 & & 1\\
? & ? & \dots & 1 & \dots & ?\\
? & ? & & 0 & & ?
\end{bmatrix}
\]
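A minimal numpy sketch of this masked loss, assuming \(\hat Y, Y \in \Bbb R^{4\times m}\) with the "?" labels stored as NaN:

```python
import numpy as np

def multitask_loss(y_hat, y):
    """Logistic loss summed over the 4 tasks and averaged over m examples.
    Missing labels (the '?' entries) are stored as NaN and skipped."""
    mask = ~np.isnan(y)                 # True only where a 0/1 label exists
    eps = 1e-12                         # avoid log(0)
    loss = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.sum(np.where(mask, loss, 0.0)) / y.shape[1]

# 4 tasks x m=3 examples; np.nan marks the '?' labels
y = np.array([[1, 0, np.nan],
              [0, 1, 1],
              [np.nan, 1, 0],
              [0, np.nan, 1]], dtype=float)
y_hat = np.full_like(y, 0.7)
print(multitask_loss(y_hat, y))
```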
When multi-task learning makes sense
- Training on a set of tasks that could benefit from having shared lower-level features.
- Usually: amount of data you have for each task is quite similar.
- Can train a big enough neural network to do well on all the tasks (otherwise performance suffers compared to separate networks).
- Multi-task learning is used less often than transfer learning in practice
- with a small data set, transfer learning is the usual choice
- a notable exception where multi-task learning is common: object detection in computer vision
End-to-end deep learning
What is end-to-end deep learning?
Speech recognition
x: audio → y: transcript ("c a t")

- Traditional pipeline: audio → (MFCC) features → (ML) phonemes → words → transcript
- Pure end-to-end: audio → transcript
- Intermediate: audio → (ML) phonemes → … → transcript
- ~3,000 h of data: the traditional pipeline approach actually works just as well
- 10k-100k h of data: the end-to-end approach suddenly starts to work very well
- with medium amounts of data: bypass the hand-designed features and have the neural network output the phonemes directly
Face recognition
- Camera image (x) → Identity (y)
- hard end-to-end: the face can appear at any position in the image
- Camera image (x) → Zoomed face (x') → Identity (y)
- Why the two-step approach works:
- each of the two problems being solved is actually much simpler
- there is a lot of data for each of the two sub-tasks
- here, breaking the problem into two sub-problems gives better performance than a pure end-to-end deep learning approach
- Machine translation
- x: English
- y: French
- large English-to-French parallel data sets exist
- so end-to-end DL works well here
- Estimating child’s age
- X: x-ray picture
- Y: age of child
- pipeline: image → bone segmentation/lengths → age
- not enough data to train this task end-to-end
- sub-task 1: bone segmentation is a simpler problem with enough data
- sub-task 2: mapping bone lengths to age can use known statistics of children's growth
- so the pipeline approach may work well
Whether to use end-to-end deep learning
Pros and cons of end-to-end deep learning
- Pros
- Let the data speak
- Less hand-designing of components needed
- Cons
- May need large amount of data
- Excludes potentially useful hand-designed components
Applying end-to-end deep learning
- Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?
- e.g., autonomous driving: learning image → steering purely end-to-end is appealing but data-starved
- Carefully choose which X → Y mappings to learn, based on which tasks you can actually get data for