Key Concepts
- Apply satisficing and optimizing metrics to set up your goal for ML projects
- Make correct ML strategy decisions based on observations of performance and the dataset
- Understand why Machine Learning strategy is important
- Choose a correct train/dev/test split of your dataset
- Use human-level performance to define your key priorities in ML projects
- Understand how to define human-level performance
[mathjax]
Course notes for Structuring Machine Learning Projects (deeplearning.ai)
Introduction to ML Strategy
Why ML Strategy
- Motivating example
- Example: a cat classifier at 90% accuracy; how do you improve it?
- Ideas
- Collect more data
- Collect more diverse training set
- Train algorithm longer with gradient descent
- Try Adam instead of gradient descent
- Try bigger network
- Try smaller network
- Try dropout
- Add L_2 regularization
- Network architecture
- Activation functions
- #hidden units
- …
Orthogonalization
- TV tuning example
- In this context, orthogonalization means the TV designers designed each knob to do only one thing, which makes it much easier to tune the TV so that the picture is centered where you want it.
- Chain of assumptions in ML
- Fit training set well on cost function
- bigger network
- better optimization algorithm (e.g. Adam)
- (early stopping: less orthogonalized)
- Fit dev set well on cost function
- Regularization
- Bigger train set
- Fit test set well on cost function
- Bigger dev set
- Performs well in real world
- Change dev set or cost function
- In a supervised learning system there are four separate "knobs" to tune: performance on the training set, the dev set, the test set, and the real world
- Learn how to diagnose what exactly is the bottleneck to your system's performance, and identify the specific set of knobs you can use to tune your system to improve that aspect of its performance
Setting up your goal
Single number evaluation metric
| Classifier | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| A | 95% | 90% | 92.4% |
| B | 98% | 85% | 91.0% |
- F1 Score: the harmonic mean of Precision and Recall\[
F_1=\frac{2}{\frac{1}{P}+\frac{1}{R}}
\]
- a single-number evaluation metric
- speeds up iteration (see the sketch below)
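A minimal sketch (my own Python, not course code) of using F1 as the single-number metric; the precision/recall values are taken from the table above.

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall,
# used as a single-number metric to rank classifiers A and B from the table.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean: 2 / (1/P + 1/R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)

classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}  # (precision, recall)
for name, (p, r) in classifiers.items():
    print(f"Classifier {name}: F1 = {f1_score(p, r):.1%}")
# A: 92.4%, B: 91.0% -> a single number makes the comparison immediate
```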
Satisficing and Optimizing metric
| Classifier | Accuracy | Running time |
| --- | --- | --- |
| A | 90% | 80 ms |
| B | 92% | 95 ms |
| C | 95% | 1,500 ms |
- One option: combine into a single number, e.g. cost = accuracy - 0.5 * runningTime (an artificial linear combination)
- Better: maximize accuracy
- subject to runningTime <= 100 ms (see the sketch below)
- Accuracy: optimizing metric
- Running time: satisficing metric
- With N metrics: pick 1 optimizing metric and N-1 satisficing metrics
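A small sketch of choosing a classifier with one optimizing metric and one satisficing constraint, using the numbers from the table above.

```python
# Sketch: optimize accuracy subject to the satisficing constraint
# runningTime <= 100 ms, using the three classifiers from the table.

classifiers = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

MAX_RUNNING_TIME_MS = 100  # satisficing: just has to be good enough

feasible = [c for c in classifiers if c["running_time_ms"] <= MAX_RUNNING_TIME_MS]
best = max(feasible, key=lambda c: c["accuracy"])  # optimizing: maximize accuracy
print(best["name"])  # -> "B" (C is more accurate but violates the constraint)
```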
Train/dev/test distributions
- Cat classification dev/test set
- Regions:
- US / UK / Other Europe / South America … dev set
- India / China / Other Asia / Australia … test set
- Bad idea: the dev and test sets would come from different distributions
- Instead, randomly shuffle all the data into the dev and test sets so that both come from the same distribution (see the sketch below)
- Guideline
- Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
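A sketch of that guideline, assuming a hypothetical `data_by_region` mapping: pool examples from all regions and shuffle before splitting, so dev and test reflect the same distribution.

```python
import random

# Hypothetical per-region data; in practice these would be real (image, label) pairs.
data_by_region = {
    "US": [("us_0001.jpg", 1), ("us_0002.jpg", 0)],
    "UK": [("uk_0001.jpg", 1)],
    "India": [("in_0001.jpg", 0)],
    "China": [("cn_0001.jpg", 1)],
    # ... remaining regions
}

def dev_test_split(examples, dev_fraction=0.5, seed=0):
    """Shuffle all examples, then split into dev and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Pool every region, then split: both sets now reflect the same mix of regions.
all_examples = [ex for region in data_by_region.values() for ex in region]
dev_set, test_set = dev_test_split(all_examples)
```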
Size of the dev and test sets
- Old way of splitting data
- train:test = 70:30
- train:dev:test = 60:20:20
- modern deep learning era (very large datasets, e.g. 1,000,000 examples)
- train:dev:test = 98:1:1
- Size of test set
- set your test set to be big enough to give high confidence in the overall performance of your system.
- e.g. a test set of 10,000 to 100,000 examples is often enough (see the sketch below)
- for some applications, having no separate test set (dev set only) might be OK, though it is not recommended
- the purpose of the test set is to give an unbiased estimate of the final system's quality before shipping
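A quick sketch contrasting the old 60/20/20 heuristic with a 98/1/1 split on a large dataset (illustrative numbers).

```python
# Sketch: how the split sizes work out. With 1,000,000 examples, 98/1/1 still
# leaves 10,000 examples each for dev and test, which is usually big enough to
# give high confidence in the system's overall performance.

def split_sizes(m, train_frac, dev_frac):
    n_train = int(m * train_frac)
    n_dev = int(m * dev_frac)
    n_test = m - n_train - n_dev
    return n_train, n_dev, n_test

print(split_sizes(10_000, 0.60, 0.20))     # old heuristic: (6000, 2000, 2000)
print(split_sizes(1_000_000, 0.98, 0.01))  # deep-learning era: (980000, 10000, 10000)
```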
When to change dev/test sets and metrics
- Cat dataset examples
- Metric: classification error
- Algorithm A: 3% error
- better on the metric
- but lets through a lot of pornographic images
- Algorithm B: 5% error
- worse on the metric, but better for users and the company
- change the evaluation metric (see the sketch at the end of this section): \[
Error = \underbrace{\frac 1{m_{dev}}}_{\color{red}{\frac1{\sum_i w^{(i)}}}} \sum_{i=1}^{m_{dev}} \color{red}{w^{(i)}}\, I\{ y_{predicted}^{(i)} \neq y^{(i)}\}\\
\color{red}{w^{(i)}=\begin{cases}
1 & \text{if } x^{(i)} \text{ is non-porn}\\
10 & \text{if } x^{(i)} \text{ is porn}
\end{cases}}
\]
- Orthogonalization for cat pictures: anti-porn
- So far we’ve only discussed how to define a metric to evaluate classifiers.(Place target: first knob)
- Worry separately about how to do well on this metric.(Aim/Shoot at target: another knob)\[
J=\underbrace{\frac 1m}_{\color{red}{\frac1{\sum_i w^{(i)}}}} \sum_{i=1}^m \color{red}{w^{(i)}} L( \hat y^{(i)},y^{(i)})
\]
- Another example
- Dev/test sets: high-resolution pictures
- User images: low-resolution pictures
- If doing well on your metric+dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
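A sketch of the re-weighted error defined above (the arrays are made up for illustration): misclassified pornographic images count 10x, and the normalizer becomes the sum of the weights.

```python
import numpy as np

def weighted_error(y_pred, y_true, is_porn, porn_weight=10.0):
    """Re-weighted dev-set error: w^(i) = 10 for porn images, 1 otherwise."""
    w = np.where(is_porn, porn_weight, 1.0)
    mistakes = (y_pred != y_true).astype(float)   # indicator of misclassification
    return float(np.sum(w * mistakes) / np.sum(w))

# Hypothetical predictions: one ordinary mistake and one porn image let through.
y_true  = np.array([1, 0, 1, 0, 1])
y_pred  = np.array([1, 1, 1, 0, 0])
is_porn = np.array([False, True, False, False, False])

print(weighted_error(y_pred, y_true, is_porn))  # 11/14 ~= 0.79 vs. plain error 2/5 = 0.4
```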
Comparing to human-level performance
Why human-level performance?
- Progress (accuracy over time) is often rapid until it approaches human-level performance, then slows down after surpassing it
- Bayes optimal error = the best possible error; performance can never exceed it
- Why does progress slow down after surpassing human-level performance?
- Human-level performance is often not far from the Bayes optimal error, so there is little headroom left
- While ML is worse than humans, certain tools work well for improving it; those tools are harder to use once you surpass human level
- Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:
- Get labeled data from humans.
- Gain insight from manual error analysis: why did a person get this right?
- Do better analysis of bias/variance
Avoidable bias
Cat classification example
| | dataA | dataB |
| --- | --- | --- |
| Humans (\(\approx\) Bayes error) | 1% | 7.5% |
| Training error | 8% | 8% |
| Dev error | 10% | 10% |
| Tactics: focus on | bias (underfitting) | variance (overfitting) |
- dataA: avoidable bias = 8% - 1% = 7%, which is much larger than the 2% variance, so focus on reducing bias
- dataB: avoidable bias = 8% - 7.5% = 0.5%, variance = 10% - 8% = 2%, so focus on reducing variance (see the sketch below)
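A small helper (my own sketch, not course code) that applies the table above: treat human-level error as a proxy for Bayes error, then compare avoidable bias against variance.

```python
# Sketch: decide whether to focus on bias or variance, as in the table above.

def diagnose(human_error, train_error, dev_error):
    avoidable_bias = train_error - human_error   # human-level error as Bayes proxy
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(0.010, 0.08, 0.10))  # dataA: bias ~0.07  > variance 0.02 -> "bias"
print(diagnose(0.075, 0.08, 0.10))  # dataB: bias ~0.005 < variance 0.02 -> "variance"
```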
Understanding human-level performance
Human-level error as a proxy for Bayes error
- Medical image classification example
- Typical human 3% error
- Typical doctor 1% error
- Experienced doctor 0.7% error
- Team of experienced doctors 0.5% error
- Bayes error <= 0.5%
- What is "human-level error" here? As a proxy for Bayes error, use the best available estimate: 0.5%
Error analysis example
| | Case A | Case B | Case C |
| --- | --- | --- | --- |
| Human-level error | 1% / 0.7% / 0.5% | 1% / 0.7% / 0.5% | 0.5% |
| Training error | 5% | 1% | 0.7% |
| Dev error | 6% | 5% | 0.8% |
| Focus on | bias | variance | both |
Summary of bias/variance with human-level performance
- Human-level error (proxy for Bayes error)
- ↕ Avoidable bias
- Training error
- ↕ Variance
- Dev error
Surpassing human-level performance
- Problems where ML significantly surpasses human-level performance
- Online advertising: estimating clicks
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals
- What these four examples have in common:
- learning from structured data, not natural perception tasks such as computer vision
- lots of data available
- Other areas where ML has surpassed human-level performance:
- Speech recognition
- Some image recognition tasks
- Medical: ECG, skin cancer, some radiology tasks
Improving your model performance (guideline)
- The two fundamental assumptions of supervised learning
- You can fit the training set pretty well
- ≈ low avoidable bias
- The training set performance generalizes pretty well to the dev/test set
- ≈ low variance
- Reducing (avoidable) bias and variance (see the sketch below)
- Human-level error
- ↕ Avoidable bias: to reduce it
- Train a bigger model
- Train longer / use better optimization algorithms: momentum, RMSprop, Adam
- NN architecture / hyperparameter search: RNN, CNN
- Training error
- ↕ Variance: to reduce it
- More data
- Regularization: L2, dropout, data augmentation
- NN architecture / hyperparameter search
- Dev error
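A sketch that turns the guideline above into a simple lookup: diagnose the larger gap, then suggest the matching tactics (the tactic lists are copied from the notes; the decision rule is my own simplification).

```python
# Sketch: map the bias/variance diagnosis to the tactics listed above.

TACTICS = {
    "avoidable bias": [
        "train a bigger model",
        "train longer / better optimizers (momentum, RMSprop, Adam)",
        "NN architecture / hyperparameter search (e.g. RNN, CNN)",
    ],
    "variance": [
        "get more data",
        "regularization (L2, dropout, data augmentation)",
        "NN architecture / hyperparameter search",
    ],
}

def suggest(human_error, train_error, dev_error):
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    problem = "avoidable bias" if avoidable_bias >= variance else "variance"
    return problem, TACTICS[problem]

problem, tactics = suggest(human_error=0.005, train_error=0.05, dev_error=0.06)
print(problem, tactics)  # avoidable bias dominates -> try the bias-reduction tactics
```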
Machine Learning flight simulator
- Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate.
- Sometimes we’ll need to train the model on the data that is available, and its distribution may not be the same as the data that will occur in production. Also, adding training data that differs from the dev set may still help the model improve performance on the dev set. What matters is that the dev and test set have the same distribution.
- You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?
- Use the data you have to define a new evaluation metric taking into account the new species.