
# DL [Course 3/5] Structuring Machine Learning Projects [Week 2/2]

Key Concepts

• Understand what multi-task learning and transfer learning are
• Recognize bias, variance and data mismatch by looking at the performance of your algorithm on train/dev/test sets


## Error Analysis

### Carrying out error analysis

• If your algorithm is not yet at human-level performance, manually examining its mistakes can show you what to do next.

#### Look at dev examples to evaluate ideas

• Example: a cat classifier reaches 90% accuracy (10% error), which is much worse than you hoped for.
• Should you try to make your cat classifier do better on dogs?
• Error analysis:
• Get ~100 mislabeled dev set examples (5-10 min of work).
• Count up how many are dogs.
• This gives a “ceiling” on how much the effort can help:
• 5/100 mislabeled examples are dogs (5%): error could drop from 10% to 9.5% at best.
• 50/100 mislabeled examples are dogs (50%): error could drop from 10% to 5% at best.
• This simple counting procedure tells you whether it is worth focusing your effort on the dog problem.
• Error analysis can save you a lot of time when deciding which direction is most important or most promising to focus on.
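The ceiling arithmetic above can be written as a small helper. This is a minimal sketch; the function name `error_ceiling` is made up for illustration, and it assumes that fixing a category removes every error in it, which is the best case.

```python
def error_ceiling(total_error, category_count, sample_size):
    """Best-case error if every mistake in one category were fixed.

    total_error: current dev error as a fraction (0.10 for 10%).
    category_count / sample_size: share of the reviewed mislabeled
    examples that belong to the category (e.g. 5 dogs out of 100).
    """
    fraction = category_count / sample_size
    return total_error * (1 - fraction)


# The two cases from the notes: 5/100 dogs and 50/100 dogs.
print(f"{error_ceiling(0.10, 5, 100):.3f}")   # 0.095
print(f"{error_ceiling(0.10, 50, 100):.3f}")  # 0.050
```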

#### Evaluate multiple ideas in parallel

• Ideas for cat detection:
• Fix pictures of dogs being recognized as cats
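• Fix great cats (lions, panthers, etc.) being misrecognized
• Improve performance on blurry images
• Tally the mislabeled dev examples by category to estimate the potential improvement from each idea, e.g. from doing better on great cats or on blurry images.
• The category that accounts for the larger share of errors has a much higher ceiling on how much you could improve performance.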
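Evaluating several ideas in parallel amounts to tagging each reviewed error and counting the tags. The sketch below assumes a hand-made list of tags per mislabeled image; the five entries are invented for illustration and are not real results.

```python
from collections import Counter

# Hypothetical review notes: one set of tags per mislabeled dev image.
# An image can carry several tags, or none of the candidate categories.
reviewed = [
    {"blurry"},
    {"great_cat"},
    {"dog"},
    {"great_cat", "blurry"},
    set(),
]


def category_shares(examples):
    """Return the fraction of reviewed errors that carry each tag."""
    counts = Counter(tag for tags in examples for tag in tags)
    return {tag: n / len(examples) for tag, n in counts.items()}


shares = category_shares(reviewed)
for tag, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {share:.0%} of reviewed errors")
```

Because an image can carry more than one tag, the shares do not have to sum to 100%.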

### Clearing up incorrectly labeled data

#### Incorrectly labeled examples (in training set)

• DL algorithms are quite robust to random errors in the training set
• e.g. the labeler accidentally hit the wrong key
• No need to fix these when the total data set is big enough
• They are less robust to systematic errors
• e.g. the labeler consistently labels white dogs as cats

#### Error analysis in dev/test set

The goal of the dev set is to help you select between two classifiers, A and B.

#### Correcting incorrect dev/test set examples

• Apply the same process to your dev and test sets to make sure they continue to come from the same distribution.
• Consider examining examples your algorithm got right as well as ones it got wrong.
• This is not always done.
• Train and dev/test data may now come from slightly different distributions.
• Wrap-up advice
• In building practical systems, there is often more manual error analysis and more human insight going into the system than you might expect.
• Some engineers and researchers are reluctant to look at examples manually. But sitting down with 100 or 200 examples and counting the errors is worth it: these minutes, or maybe a small number of hours, of counting can really help you prioritize where to go next.

### Build your first system quickly, then iterate

Speech recognition example

• Noisy background
• Cafe noise
• Car noise
• Accented speech
• Far from microphone
• Young children’s speech
• Stuttering
• Set up dev/test set and metric
• Build initial system quickly
• Use bias/variance analysis and error analysis to prioritize next steps
• Build your first system quickly, then iterate
• Build something quick and dirty first

## Mismatched training and dev/test set

### Training and testing on different distributions

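• Cat app example
• Data from webpages: 200,000 images
• Data from mobile app: 10,000 images
• Option 1
• Merge both sources and shuffle before splitting
• Pros: train/dev/test come from the same distribution
• Cons: with a dev set of 2,500, most images come from webpages (200k/210k of the data) and only about 119 come from the mobile app
• Option 2
• Train: 205k (200k web + 5k mobile), dev: 2.5k mobile, test: 2.5k mobile
• Pros: dev/test aim at the target, the mobile app
• Cons: the training set contains web images, so its distribution differs from dev/test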
• Speech recognition example
• Training
• Purchased data (from vendor)
• Smart speaker control
• Voice keyboard
• 500k + 10k (taken from the dev/test source)
• Dev/test
• Speech-activated rearview mirror
• dev/test = 5k/5k (the other 10k goes to training)
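The numbers in the cat app example can be checked with a few lines of arithmetic. This is a sketch only; the constant and function names are chosen here for illustration.

```python
WEB, MOBILE = 200_000, 10_000   # image counts from the cat app example
DEV_SIZE = 2_500


def expected_mobile_in_dev_option1():
    """Option 1: after merging and shuffling, the dev set mirrors the mix."""
    return DEV_SIZE * MOBILE / (WEB + MOBILE)


def option2_split(mobile_for_train=5_000):
    """Option 2: dev/test are mobile only; the rest goes to training."""
    remaining = MOBILE - mobile_for_train
    dev = remaining // 2
    return {"train": WEB + mobile_for_train, "dev": dev, "test": remaining - dev}


print(round(expected_mobile_in_dev_option1()))  # 119
print(option2_split())  # {'train': 205000, 'dev': 2500, 'test': 2500}
```

Under Option 1 fewer than 5% of the dev images come from the distribution you care about, which is why Option 2 is preferred despite the train/dev mismatch.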

### Bias and Variance with mismatched data distributions

Cat classifier example

Assume humans get $$\approx 0$$% error.

• Training error 1%
• Dev error 10%
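• Is this a variance problem?
• Not necessarily: the train set may consist of high-resolution images,
• while the dev set contains images that are much more difficult.
• Two hypotheses
• Variance: the algorithm does well on data it saw in training but does not generalize to unseen data.
• Data mismatch: train and dev come from different distributions.
• Because two things changed at the same time between train and dev, it is difficult to tell which one explains the gap.
• Training-dev set: same distribution as the training set, but not used for training.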

#### More general formulation

• $$\Updownarrow$$: generally effective comparison
• $$\updownarrow$$: gives additional insight
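Once a training-dev set exists, the diagnosis reduces to comparing gaps between consecutive error levels. The sketch below is one way to write this down. The function name and the training-dev figures (9% and 1.5%) are assumptions for illustration; only the 0%, 1% and 10% values come from the example above.

```python
def diagnose(human, train, train_dev, dev, test=None):
    """Split the total error into gaps between consecutive levels (in %)."""
    gaps = {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
    }
    if test is not None:
        gaps["overfitting to dev set"] = test - dev
    return gaps


# Hypothetical case A: training-dev error is close to dev error.
case_a = diagnose(human=0, train=1, train_dev=9, dev=10)
print(max(case_a, key=case_a.get))  # variance

# Hypothetical case B: training-dev error is close to training error.
case_b = diagnose(human=0, train=1, train_dev=1.5, dev=10)
print(max(case_b, key=case_b.get))  # data mismatch
```

The same 1% training error and 10% dev error lead to opposite conclusions depending on where the training-dev error falls.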

### Addressing data mismatch

• Carry out manual error analysis to try to understand difference between training and dev/test sets
• E.g. noisy data (car noise), street numbers
• Make training data more similar; or collect more data similar to dev/test sets
• E.g. simulate noisy in-car data
• Artificial data synthesis ex1
• “The quick brown fox jumps over the lazy dog.”(include a-z)
• + Car noise
• = Synthesized in-car audio
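• Caution in data synthesis
• Suppose you have 10k hours of speech and 1 hour of car noise
• You could loop the 1 hour of car noise 10k times to cover all the speech,
• but the model may overfit to that 1 hour of car noise, which is a very small subset of all possible car noise
• 10k hours of unique car noise
• may improve performance, though that is not guaranteed
• Artificial data synthesis ex2
• Computer-generated (CG) cars: only 20 unique cars in a video game
• The model may overfit to this subset of 20 cars
• When you have a data mismatch problem
• Carry out error analysis
• to understand how the training set and dev set differ
• Get more training data that looks like the dev set
• Artificial data synthesis can work very well, but be cautious and bear in mind whether you might be accidentally simulating data from only a small subset of all possible examples.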
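The "speech + car noise" recipe can be sketched on raw sample lists. This is a simplified illustration, not a production audio pipeline: `mix_with_noise` and `noise_gain` are names invented here, and clips are plain lists of floats.

```python
import random


def mix_with_noise(speech, noise, noise_gain=0.3, rng=random):
    """Overlay a randomly chosen window of noise on a speech clip."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    # Pick a random start so different clips get different noise windows.
    start = rng.randrange(len(noise) - len(speech) + 1)
    window = noise[start:start + len(speech)]
    return [s + noise_gain * n for s, n in zip(speech, window)]


rng = random.Random(0)
speech = [0.0, 0.5, -0.5, 0.25]                    # toy "speech" samples
noise = [rng.uniform(-1, 1) for _ in range(100)]   # toy "car noise" samples
synthesized = mix_with_noise(speech, noise, rng=rng)
print(len(synthesized))  # 4
```

Random windows add variety, but every window still comes from the same recording. If the noise pool is tiny compared to the speech data, the caution above applies unchanged.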

## Learning from multiple tasks

### Transfer learning

• Image
• Image recognition (pre-training)
• about 1,000k examples
• Radiology diagnosis (fine-tuning)
• Swap in the new data set
• (x: radiology images, y: diagnosis), about 100 examples
• Replace the last output layer and the weights feeding into it, $$w^{[L]},b^{[L]}$$, with randomly initialized ones
• Small dataset: retrain only the last layer
• Lots of data: retrain the whole network
• Audio
• Pre-training: speech recognition, about 10,000 h
• Fine-tuning: wake word/trigger word detection, about 1 h
• You can also replace the last output layer with several new layers
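The layer swap can be illustrated schematically without a deep learning framework. In this sketch a network is just a list of layer dictionaries with a `trainable` flag; the representation, the function names and the layer sizes are all assumptions made for illustration, and no training is performed.

```python
import random


def new_layer(n_in, n_out, rng):
    """Randomly initialized dense layer: weights w and biases b."""
    return {
        "w": [[rng.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
        "trainable": True,
    }


def prepare_for_fine_tuning(pretrained, n_outputs, small_dataset, rng):
    """Swap the output layer; freeze earlier layers if data is scarce."""
    n_in = len(pretrained[-1]["w"][0])  # input size of the old output layer
    kept = [dict(layer, trainable=not small_dataset) for layer in pretrained[:-1]]
    return kept + [new_layer(n_in, n_outputs, rng)]


rng = random.Random(0)
# Stand-in for a network pre-trained on image recognition (1000 classes).
pretrained = [new_layer(8, 16, rng), new_layer(16, 16, rng), new_layer(16, 1000, rng)]
model = prepare_for_fine_tuning(pretrained, n_outputs=1, small_dataset=True, rng=rng)
print([layer["trainable"] for layer in model])  # [False, False, True]
```

With `small_dataset=False` every layer stays trainable, which corresponds to retraining the whole network starting from the pre-trained weights.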

#### When transfer learning makes sense

• Task A and B have the same input x.
• You have a lot more data for Task A than Task B.
• Low level features from A could be helpful for learning B.

#### Simplified autonomous driving example

camera image: $$x^{(i)}$$

$Y=\begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ y^{(1)} & y^{(2)} & y^{(3)} & \dots & y^{(m)} \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix}$

#### Neural network architecture

Prediction: $$\hat y^{(i)}\in \mathbb{R}^{4\times1}$$

$Loss = \frac1m \sum_{i=1}^m \underbrace{\sum_{j=1}^4 L(\hat y_j^{(i)}, y_j^{(i)})}_{\text{unlike softmax regression}}, \quad L = \text{usual logistic loss}$

• Unlike softmax regression:
• One image can have multiple labels
• This is still nothing more than training a single NN
• You could train 4 separate NNs instead, but:
• Earlier-layer features can be shared, so training one NN often gives better performance
• Examples with only some labels: sum only over the values of j with a 0/1 label and skip the “?” entries: $Y=\begin{bmatrix} 1 & 1 && 0 && ?\\ 0 & 1 && 1 && 1\\ ? & ? &\dots& 1 &\dots& ?\\ ? & ? && 0 && ?\\ \end{bmatrix}$
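The loss with missing labels can be sketched in plain Python. This is a minimal illustration that assumes predictions strictly between 0 and 1 and uses `None` for the “?” entries; the function names are chosen here and are not from the course.

```python
import math


def logistic_loss(y_hat, y):
    """Usual logistic (binary cross-entropy) loss for one label."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def multitask_loss(Y_hat, Y):
    """Average over m examples of the per-label losses.

    Y_hat, Y: lists of m examples, each with one entry per task.
    A label of None marks a missing ("?") entry and is skipped.
    """
    m = len(Y)
    total = 0.0
    for y_hat_i, y_i in zip(Y_hat, Y):
        for y_hat_ij, y_ij in zip(y_hat_i, y_i):
            if y_ij is not None:  # sum only over j with a 0/1 label
                total += logistic_loss(y_hat_ij, y_ij)
    return total / m


Y_hat = [[0.9, 0.2, 0.5, 0.5], [0.1, 0.8, 0.7, 0.4]]
Y = [[1, 0, None, None], [0, 1, 1, None]]
print(round(multitask_loss(Y_hat, Y), 4))
```

Each example here is a row, whereas the matrix $Y$ in the notes stores one example per column; the sum is the same either way.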

#### When multi-task learning makes sense

• Training on a set of tasks that could benefit from having shared lower-level features.
• Usually: amount of data you have for each task is quite similar.
• You can train a big enough neural network to do well on all the tasks (otherwise performance suffers).
• Multi-task learning is used less often than transfer learning.
• With a small data set, use transfer learning.
• The exception, where multi-task learning is used often, is object recognition in computer vision.

## End-to-end deep learning

### What is end-to-end deep learning?

#### Speech recognition

• 3,000 h of data: the traditional pipeline approach actually works just as well.
• 10,000-100,000 h of data: the end-to-end approach suddenly starts to work very well.
• Medium amount of data: bypass the hand-designed features and have the neural network learn to output the phonemes directly.

#### Face recognition

• Camera image (x) -> Identity (y)
• Hard to learn directly: the face can appear at any position in the image
• Camera image (x) -> Zoomed-in face (x’) -> Identity (y)
• Why this works
1. Each of the two problems you’re solving is actually much simpler.
2. You have a lot of data for each of the two sub-tasks.
• Breaking the task into two sub-problems results in better performance than a pure end-to-end deep learning approach.
• Machine translation
• x: English
• y: French
• Large English-to-French data sets are available
• End-to-end DL works well
• Estimating a child’s age
• x: X-ray picture of the hand
• y: age of the child
• Steps: image -> bones -> age
• There is not enough data to train this task end-to-end
• Task 1: bone segmentation is a simple problem
• Task 2: estimate the age from statistics on children’s hand sizes
• The pipeline approach may work well

### Whether to use end-to-end deep learning

#### Pros and cons of end-to-end deep learning

• Pros
• Let the data speak
• Less hand-designing of components needed
• Cons
• May need large amount of data
• Excludes potentially useful hand-designed components

#### Applying end-to-end deep learning

• Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?
• Example: autonomous driving
• Carefully choose which X -> Y mappings to learn, depending on which tasks you can get data for