Key Concepts
- Apply satisficing and optimizing metrics to set up your goal for ML projects
- Make correct ML strategy decisions based on observations of performance and the dataset
- Understand why Machine Learning strategy is important
- Choose a correct train/dev/test split of your dataset
- Use human-level performance to define your key priorities in ML projects
- Understand how to define human-level performance
[mathjax]
Course notes for Structuring Machine Learning Projects (deeplearning.ai)
Introduction to ML Strategy
Why ML Strategy
- Motivating example
- Example: a cat classifier at 90% accuracy; how do you improve it?
- Ideas
- Collect more data
- Collect more diverse training set
- Train algorithm longer with gradient descent
- Try Adam instead of gradient descent
- Try bigger network
- Try smaller network
- Try dropout
- Add L_2 regularization
- Network architecture
- Activation functions
- #hidden units
- …
Orthogonalization
- TV tuning example
- In this context, orthogonalization means the TV designers designed each knob to do only one thing, which makes it much easier to tune the TV so that the picture is centered where you want it.
- Chain of assumptions in ML
- Fit training set well on cost function
- bigger network
- better optimization algorithm (e.g. Adam)
- (early stopping: less orthogonalized)
- Fit dev set well on cost function
- Regularization
- Bigger train set
- Fit test set well on cost function
- Bigger dev set
- Performs well in real world
- Change dev set or cost function
- In a supervised learning system there are four separate "knobs" to tune: performance on the training set, the dev set, the test set, and the real world
- Learn how to diagnose what exactly is the bottleneck to your system's performance, and identify the specific set of knobs you can use to tune your system to improve that aspect of its performance
Setting up your goal
Single number evaluation metric
| Classifier | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| A | 95% | 90% | 92.4% |
| B | 98% | 85% | 91.0% |
- F1 Score: the harmonic mean of Precision and Recall\[
F_1=\frac{2}{\frac{1}{P}+\frac{1}{R}}
\]
- a single-number evaluation metric
- speeds up iteration (see the sketch below)
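A minimal sketch (my own Python, not course code) of using F1 as the single-number metric; the precision/recall values are taken from the table above.

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall,
# used as a single-number metric to rank classifiers A and B from the table.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean: 2 / (1/P + 1/R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)

classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}  # (precision, recall)
for name, (p, r) in classifiers.items():
    print(f"Classifier {name}: F1 = {f1_score(p, r):.1%}")
# A: 92.4%, B: 91.0% -> a single number makes the comparison immediate
```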
Satisficing and Optimizing metric
| Classifier | Accuracy | Running time |
| --- | --- | --- |
| A | 90% | 80 ms |
| B | 92% | 95 ms |
| C | 95% | 1,500 ms |
- One option: combine into a single number, e.g. cost = accuracy - 0.5 * runningTime (an artificial linear combination)
- Better: maximize accuracy
- subject to runningTime <= 100 ms (see the sketch below)
- Accuracy: optimizing metric
- Running time: satisficing metric
- With N metrics: pick 1 optimizing metric and N-1 satisficing metrics
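A small sketch of choosing a classifier with one optimizing metric and one satisficing constraint, using the numbers from the table above.

```python
# Sketch: optimize accuracy subject to the satisficing constraint
# runningTime <= 100 ms, using the three classifiers from the table.

classifiers = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

MAX_RUNNING_TIME_MS = 100  # satisficing: just has to be good enough

feasible = [c for c in classifiers if c["running_time_ms"] <= MAX_RUNNING_TIME_MS]
best = max(feasible, key=lambda c: c["accuracy"])  # optimizing: maximize accuracy
print(best["name"])  # -> "B" (C is more accurate but violates the constraint)
```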
Train/dev/test distributions
- Cat classification dev/test set
- Regions:
- US / UK / Other Europe / South America … dev set
- India / China / Other Asia / Australia … test set
- Bad idea: the dev and test sets would come from different distributions
- Instead, randomly shuffle all the data into the dev and test sets so that both come from the same distribution (see the sketch below)
- Guideline
- Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.
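A sketch of that guideline, assuming a hypothetical `data_by_region` mapping: pool examples from all regions and shuffle before splitting, so dev and test reflect the same distribution.

```python
import random

# Hypothetical per-region data; in practice these would be real (image, label) pairs.
data_by_region = {
    "US": [("us_0001.jpg", 1), ("us_0002.jpg", 0)],
    "UK": [("uk_0001.jpg", 1)],
    "India": [("in_0001.jpg", 0)],
    "China": [("cn_0001.jpg", 1)],
    # ... remaining regions
}

def dev_test_split(examples, dev_fraction=0.5, seed=0):
    """Shuffle all examples, then split into dev and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Pool every region, then split: both sets now reflect the same mix of regions.
all_examples = [ex for region in data_by_region.values() for ex in region]
dev_set, test_set = dev_test_split(all_examples)
```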
Size of the dev and test sets
- Old way of splitting data
- train:test = 70:30
- train:dev:test = 60:20:20
- modern deep learning era (very large datasets, e.g. 1,000,000 examples)
- train:dev:test = 98:1:1
- Size of test set
- set your test set to be big enough to give high confidence in the overall performance of your system.
- e.g. a test set of 10,000 to 100,000 examples is often enough (see the sketch below)
- for some applications, having no separate test set (dev set only) might be OK, though it is not recommended
- the purpose of the test set is to give an unbiased estimate of the final system's quality before shipping
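A quick sketch contrasting the old 60/20/20 heuristic with a 98/1/1 split on a large dataset (illustrative numbers).

```python
# Sketch: how the split sizes work out. With 1,000,000 examples, 98/1/1 still
# leaves 10,000 examples each for dev and test, which is usually big enough to
# give high confidence in the system's overall performance.

def split_sizes(m, train_frac, dev_frac):
    n_train = int(m * train_frac)
    n_dev = int(m * dev_frac)
    n_test = m - n_train - n_dev
    return n_train, n_dev, n_test

print(split_sizes(10_000, 0.60, 0.20))     # old heuristic: (6000, 2000, 2000)
print(split_sizes(1_000_000, 0.98, 0.01))  # deep-learning era: (980000, 10000, 10000)
```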
When to change dev/test sets and metrics
- Cat dataset examples
- Metric: classification error
- Algorithm A: 3% error
- better on the metric
- but lets through a lot of pornographic images
- Algorithm B: 5% error
- worse on the metric, but better for users and the company
- change the evaluation metric (see the sketch at the end of this section): \[
Error = \underbrace{\frac 1{m_{dev}}}_{\color{red}{\frac1{\sum_i w^{(i)}}}} \sum_{i=1}^{m_{dev}} \color{red}{w^{(i)}}\, I\{ y_{predicted}^{(i)} \neq y^{(i)}\}\\
\color{red}{w^{(i)}=\begin{cases}
1 & \text{if } x^{(i)} \text{ is non-porn}\\
10 & \text{if } x^{(i)} \text{ is porn}
\end{cases}}
\]
- Orthogonalization for cat pictures: anti-porn
- So far we’ve only discussed how to define a metric to evaluate classifiers.(Place target: first knob)
- Worry separately about how to do well on this metric.(Aim/Shoot at target: another knob)\[
J=\underbrace{\frac 1m}_{\color{red}{\frac1{\sum_i w^{(i)}}}} \sum_{i=1}^m \color{red}{w^{(i)}} L( \hat y^{(i)},y^{(i)})
\]
- Another example
- Dev/test sets: high-resolution pictures
- User images: low-resolution pictures
- If doing well on your metric+dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
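A sketch of the re-weighted error defined above (the arrays are made up for illustration): misclassified pornographic images count 10x, and the normalizer becomes the sum of the weights.

```python
import numpy as np

def weighted_error(y_pred, y_true, is_porn, porn_weight=10.0):
    """Re-weighted dev-set error: w^(i) = 10 for porn images, 1 otherwise."""
    w = np.where(is_porn, porn_weight, 1.0)
    mistakes = (y_pred != y_true).astype(float)   # indicator of misclassification
    return float(np.sum(w * mistakes) / np.sum(w))

# Hypothetical predictions: one ordinary mistake and one porn image let through.
y_true  = np.array([1, 0, 1, 0, 1])
y_pred  = np.array([1, 1, 1, 0, 0])
is_porn = np.array([False, True, False, False, False])

print(weighted_error(y_pred, y_true, is_porn))  # 11/14 ~= 0.79 vs. plain error 2/5 = 0.4
```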
Comparing to human-level performance
Why human-level performance?
- Progress (accuracy over time) is often rapid until it approaches human-level performance, then slows down after surpassing it
- Bayes optimal error = the best possible error; performance can never exceed it
- Why does progress slow down after surpassing human-level performance?
- Human-level performance is often not far from the Bayes optimal error, so there is little headroom left
- While ML is worse than humans, certain tools work well for improving it; those tools are harder to use once you surpass human level
- Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:
- Get labeled data from humans.
- Gain insight from manual error analysis: why did a person get this right?
- Do better analysis of bias/variance
Avoidable bias
Cat classification example
| | dataA | dataB |
| --- | --- | --- |
| Humans (\(\approx\) Bayes error) | 1% | 7.5% |
| Training error | 8% | 8% |
| Dev error | 10% | 10% |
| Tactics: focus on | bias (underfitting) | variance (overfitting) |
- dataA: avoidable bias = 8% - 1% = 7%, which is much larger than the 2% variance, so focus on reducing bias
- dataB: avoidable bias = 8% - 7.5% = 0.5%, variance = 10% - 8% = 2%, so focus on reducing variance (see the sketch below)
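A small helper (my own sketch, not course code) that applies the table above: treat human-level error as a proxy for Bayes error, then compare avoidable bias against variance.

```python
# Sketch: decide whether to focus on bias or variance, as in the table above.

def diagnose(human_error, train_error, dev_error):
    avoidable_bias = train_error - human_error   # human-level error as Bayes proxy
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(0.010, 0.08, 0.10))  # dataA: bias ~0.07  > variance 0.02 -> "bias"
print(diagnose(0.075, 0.08, 0.10))  # dataB: bias ~0.005 < variance 0.02 -> "variance"
```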
Understanding human-level performance
Human-level error as a proxy for Bayes error
- Medical image classification example
- Typical human 3% error
- Typical doctor 1% error
- Experienced doctor 0.7% error
- Team of experienced doctors 0.5% error
- Bayes error <= 0.5%
- What is "human-level error" here? As a proxy for Bayes error, use the best available estimate: 0.5%
Error analysis example
| | Case A | Case B | Case C |
| --- | --- | --- | --- |
| Human-level error | 1% / 0.7% / 0.5% | 1% / 0.7% / 0.5% | 0.5% |
| Training error | 5% | 1% | 0.7% |
| Dev error | 6% | 5% | 0.8% |
| Focus on | bias | variance | both |
Summary of bias/variance with human-level performance
- Human-level error (proxy for Bayes error)
- ↕ Avoidable bias
- Training error
- ↕ Variance
- Dev error
Surpassing human-level performance
- Problems where ML significantly surpasses human-level performance
- Online advertising: estimating clicks
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals
- What these four examples have in common:
- learning from structured data, not natural perception tasks such as computer vision
- lots of data available
- Other areas where ML has surpassed human-level performance:
- Speech recognition
- Some image recognition tasks
- Medical: ECG, skin cancer, some radiology tasks
Improving your model performance (guideline)
- The two fundamental assumptions of supervised learning
- You can fit the training set pretty well
- ≈ low avoidable bias
- The training set performance generalizes pretty well to the dev/test set
- ≈ low variance
- Reducing (avoidable) bias and variance (see the sketch below)
- Human-level error
- ↕ Avoidable bias: to reduce it
- Train a bigger model
- Train longer / use better optimization algorithms: momentum, RMSprop, Adam
- NN architecture / hyperparameter search: RNN, CNN
- Training error
- ↕ Variance: to reduce it
- More data
- Regularization: L2, dropout, data augmentation
- NN architecture / hyperparameter search
- Dev error
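A sketch that turns the guideline above into a simple lookup: diagnose the larger gap, then suggest the matching tactics (the tactic lists are copied from the notes; the decision rule is my own simplification).

```python
# Sketch: map the bias/variance diagnosis to the tactics listed above.

TACTICS = {
    "avoidable bias": [
        "train a bigger model",
        "train longer / better optimizers (momentum, RMSprop, Adam)",
        "NN architecture / hyperparameter search (e.g. RNN, CNN)",
    ],
    "variance": [
        "get more data",
        "regularization (L2, dropout, data augmentation)",
        "NN architecture / hyperparameter search",
    ],
}

def suggest(human_error, train_error, dev_error):
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    problem = "avoidable bias" if avoidable_bias >= variance else "variance"
    return problem, TACTICS[problem]

problem, tactics = suggest(human_error=0.005, train_error=0.05, dev_error=0.06)
print(problem, tactics)  # avoidable bias dominates -> try the bias-reduction tactics
```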
Machine Learning flight simulator
- Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate.
- Sometimes we’ll need to train the model on the data that is available, and its distribution may not be the same as the data that will occur in production. Also, adding training data that differs from the dev set may still help the model improve performance on the dev set. What matters is that the dev and test set have the same distribution.
- You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?
- Use the data you have to define a new evaluation metric taking into account the new species.