
# DL [Course 3/5] Structuring Machine Learning Projects [Week 1/2]

Key Concepts

• Apply satisficing and optimizing metrics to set up your goal for ML projects
• Make the correct ML strategy decisions based on observations of performance and the dataset
• Understand why Machine Learning strategy is important
• Choose a correct train/dev/test split of your dataset
• Use human-level performance to define your key priorities in ML projects
• Understand how to define human-level performance

[mathjax]

## Introduction to ML Strategy

### Why ML Strategy

• Motivating example
• 90%
• Ideas
• Collect more data
• Collect more diverse training set
• Train algorithm longer with gradient descent
• Try bigger network
• Try smaller network
• Try dropout
• Network architecture
• Activation functions
• #hidden units

### Orthogonalization

• TV tuning example
• In this context, orthogonalization means the TV designers built the knobs so that each knob controls only one property of the picture. This makes the TV much easier to tune until the picture is centered where you want it.
• Chain of assumptions in ML
1. Fit training set well on cost function
• bigger network
• (early stopping: less orthogonalized)
2. Fit dev set well on cost function
• Regularization
• Bigger train set
3. Fit test set well on cost function
• Bigger dev set
4. Performs well in real world
• Change dev set or cost function
• supervised learning system
• tune 4 knobs (train, dev, test, real world)
• Diagnose what exactly is the bottleneck to your system’s performance, then identify the specific set of knobs you can use to improve that aspect of its performance.

### Single number evaluation metric

• F1 Score = harmonic mean of Precision and Recall (not the arithmetic average): $F1=\frac2{\frac1P+\frac1R}$
• single number evaluation metric
• speed up iteration
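A minimal sketch of the idea (function name and the precision/recall numbers are illustrative, not from the course):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision P and recall R:
    F1 = 2 / (1/P + 1/R) = 2*P*R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Classifier A: P = 0.95, R = 0.90 -> F1 ~ 0.924
# Classifier B: P = 0.98, R = 0.85 -> F1 ~ 0.910
# A single number makes the comparison immediate: pick A.
```

Collapsing precision and recall into one number is what lets the team rank classifiers quickly instead of debating trade-offs case by case.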

### Satisficing and Optimizing metric

• combining metrics into one number such as cost = accuracy – 0.5 * runningTime is somewhat artificial; instead:
• maximize accuracy
• Subject to runningTime <= 100ms
• Accuracy: optimizing metric
• Running Time: satisficing metric
• With N metrics: 1 optimizing, N-1 satisficing
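The selection rule above can be sketched as a filter-then-maximize step (model names and numbers are made up for illustration):

```python
# Pick the best model: maximize accuracy (optimizing metric)
# subject to runningTime <= 100 ms (satisficing metric).
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},  # most accurate, but too slow
]

feasible = [m for m in models if m["runtime_ms"] <= 100]  # satisficing filter
best = max(feasible, key=lambda m: m["accuracy"])         # optimize among the rest
# best is model B: highest accuracy among models that are fast enough
```

Note the satisficing metric only needs to clear the threshold; once it does, further speed-ups do not change the ranking.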

### Train/dev/test distributions

• Cat classification dev/test set
• Regions (assigned to dev vs. test):
• US/UK/Other Europe/South America … dev set
• India/China/Other Asia/Australia … test set
• this is a bad idea: the dev and test sets should come from the same distribution
• Randomly shuffle data into dev/test
• Guideline
• Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

### Size of the dev and test sets

• Old way of splitting data
• train:test = 70:30
• train:dev:test = 60:20:20
• modern deep learning era
• train:dev:test = 98:1:1
• Size of test set
• set your test set to be big enough to give high confidence in the overall performance of your system.
• e.g. 10,000–100,000 examples are often enough
• for some applications, having no test set might be OK (dev set only)
• a separate test set lets you evaluate the final system’s quality with confidence before shipping

### When to change dev/test sets and metrics

• Cat dataset examples
• Metric: classification error
• Algorithm A: 3% error
• good for metric
• but letting through a lot of pornographic
• Algorithm B: 5% error
• good for users
• change evaluation metric: $Error: \underbrace{\frac 1{m_{dev}}}_{\color{red}{\frac1{\sum_i w^{(i)}}}} \sum_{i=1}^{m_{dev}} \color{red}{w^{(i)}} I\{ y_{pred}^{(i)} \ne y^{(i)}\}\\ \color{red}{w^{(i)}=\begin{cases} 1 \text{ if } x^{(i)} \text{ is non-porn}\\ 10 \text{ if } x^{(i)} \text{ is porn} \end{cases}}$
• Orthogonalization for cat pictures: anti-porn
1. First, define a metric to evaluate classifiers (place the target: first knob)
2. Worry separately about how to do well on this metric (aim/shoot at the target: another knob) $J=\underbrace{\frac 1m}_{\color{red}{\frac1{\sum_i w^{(i)}}}} \sum_{i=1}^m \color{red}{w^{(i)}} L( \hat y^{(i)},y^{(i)})$
• Another example
• Dev/test: high resolution picture
• user images: low resolution picture
• If doing well on your metric+dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
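The weighted error metric above can be sketched directly (function and variable names are illustrative; a misclassified pornographic image counts 10x):

```python
def weighted_error(y_pred, y_true, is_porn):
    """Weighted dev-set error:
    Error = (1 / sum_i w_i) * sum_i w_i * 1{y_pred_i != y_true_i},
    with w_i = 10 for porn images and 1 otherwise."""
    weights = [10 if porn else 1 for porn in is_porn]
    miss = sum(w for w, p, t in zip(weights, y_pred, y_true) if p != t)
    return miss / sum(weights)

# Four dev examples; the classifier misses examples 0 and 2,
# and example 2 is pornographic (weight 10):
y_true  = [1, 0, 1, 0]
y_pred  = [0, 0, 0, 0]
is_porn = [False, False, True, False]
err = weighted_error(y_pred, y_true, is_porn)
# err = (1 + 10) / (1 + 1 + 10 + 1) = 11/13 ~ 0.846
```

Under plain classification error the same predictions would score 2/4 = 0.5; the weighting makes letting porn through much more costly, matching what users care about.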

## Comparing to human-level performance

### Why human-level performance?

• Comparing to human-level performance
• accuracy/time
• human error vs. Bayes optimal error (= best possible error)
• progress often slows down after you surpass human-level performance. Why?
1. Human-level performance is often not far from Bayes optimal error, so little room for improvement remains.
2. While your model is below human level, certain tools can drive progress; those tools are harder to use once you surpass human level.
• Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:
• Get labeled data from humans.
• Gain insight from manual error analysis: Why did a person get this right?
• Better analysis of bias/variance

### Avoidable bias

Cat classification example

• Human error 1% vs. training error 8%: Avoidable Bias = 7%
• Training error 8% vs. dev error 10%: Variance = 2%
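The two gaps can be computed directly (function name is illustrative):

```python
def diagnose(human_err, train_err, dev_err):
    """Avoidable bias = training error - human-level error (Bayes proxy);
    variance = dev error - training error."""
    return train_err - human_err, dev_err - train_err

# Human 1%, training 8%, dev 10%:
bias, var = diagnose(0.01, 0.08, 0.10)
# bias = 7% and var = 2% -> avoidable bias dominates: focus on bias first
```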

### Understanding human-level performance

#### Human-level error as a proxy for Bayes error

• Medical image classification example
• Typical human 3% error
• Typical doctor 1% error
• Experienced doctor 0.7% error
• Team of experienced doctors 0.5% error
• Bayes error <= 0.5%
• What is human-level error? As a proxy for Bayes error, use the best available: 0.5%

### Surpassing human-level performance

• Problems where ML significantly surpasses human-level performance
• Product recommendations
• Logistics (predicting transit time)
• Loan approvals
• these examples share:
• learning from structured data, not natural perception tasks like computer vision
• Not natural perception
• lots of data
• other surpassing
• Speech recognition
• Some image recognition

• The two fundamental assumptions of supervised learning
1. You can fit the training set pretty well
• avoidable bias
2. The training set performance generalizes pretty well to the dev/test set.
• variance
• Reducing(avoidable) bias and variance
• Human-level
• Avoidable bias
• Train bigger model
• Train longer/better optimization algorithms: momentum, RMSProp, Adam
• NN architecture/hyperparameters search: RNN, CNN
• Training error
• Variance
• More data
• Regularization: L2, dropout, data augmentation
• NN architecture/hyperparameters search
• Dev error
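The recipe above can be sketched as a simple decision rule (a simplification: compare the two gaps and attack the larger one; the tactic lists come from the bullets above):

```python
def next_step(avoidable_bias, variance):
    """Suggest tactics based on which gap is larger (illustrative helper)."""
    if avoidable_bias >= variance:
        return ["train bigger model",
                "train longer / better optimizer (momentum, RMSProp, Adam)",
                "NN architecture / hyperparameter search"]
    return ["get more data",
            "regularization (L2, dropout, data augmentation)",
            "NN architecture / hyperparameter search"]

# Avoidable bias 7% vs. variance 2% -> reduce bias first:
# next_step(0.07, 0.02)[0] == "train bigger model"
# Avoidable bias 1% vs. variance 8% -> reduce variance first:
# next_step(0.01, 0.08)[0] == "get more data"
```

In practice you would iterate: recompute the gaps after each change, since fixing one gap often shifts which knob matters next.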

## Machine Learning flight simulator

• Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate.
• Sometimes we’ll need to train the model on the data that is available, and its distribution may not be the same as the data that will occur in production.  Also, adding training data that differs from the dev set may still help the model improve performance on the dev set. What matters is that the dev and test set have the same distribution.
• You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?
• Use the data you have to define a new evaluation metric taking into account the new species.