
DL [Course 3/5] Structuring Machine Learning Projects [Week 2/2]

Key Concepts

  • Understand what multi-task learning and transfer learning are
  • Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets

Notes from the Structuring Machine Learning Projects course (deeplearning.ai).

Error Analysis

Carrying out error analysis

  • Error analysis is useful when your algorithm is not yet at human-level performance.

Look at dev examples to evaluate ideas

  • Example: 90% accuracy (10% error), which is much worse than what you need, and you notice some dogs are classified as cats.
    • Should you try to make your cat classifier do better on dogs?
  • Error analysis:
    • Get ~100 misclassified dev set examples (5-10 min of work).
    • Count up how many are dogs.
    • This gives a "ceiling" on how much the effort could help:
      • 5/100 of the misclassified examples are dogs (5%): fixing them takes you from 10% to at best 9.5% error.
      • 50/100 of the misclassified examples are dogs (50%): fixing them takes you from 10% to at best 5% error.
  • This simple counting procedure tells you whether it is worth the effort to focus on the dog problem (see the sketch after this list).
  • error analysis can save you a lot of time in terms of deciding what’s the most important or what’s the most promising direction to focus on.
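A minimal sketch of the counting step, assuming you have manually tagged each of the ~100 misclassified dev examples (the tags and error numbers below are hypothetical):

```python
from collections import Counter

# Hypothetical tags assigned during manual review of each mistake
# (in practice this list would have ~100 entries).
tags_per_example = [
    ["dog"], ["blurry"], ["great cat", "blurry"], ["dog"],
]

overall_dev_error = 0.10           # 10% dev error
n_reviewed = len(tags_per_example)

counts = Counter(tag for tags in tags_per_example for tag in tags)
for tag, count in counts.most_common():
    fraction = count / n_reviewed
    best_case = overall_dev_error * (1 - fraction)   # "ceiling" if fully fixed
    print(f"{tag:10s}: {fraction:.0%} of errors; "
          f"fixing all of them: {overall_dev_error:.1%} -> {best_case:.1%} dev error")
```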

Evaluate multiple ideas in parallel

  • Ideas for cat detection:
    • Fix pictures of dogs being recognized as cats
    • Fix great cats(lions,panthers,etc..)being misrecognized
    • Improve performance on blurry images
Image      | Dog | Great Cats | Blurry | Instagram/Snapchat | Comments
1          | ✓   |            |        |                    | Pit bull
2          |     |            |        |                    |
3          |     |            |        |                    | Raining day at zoo
% of total | 8%  | 43%        | 61%    | 12%                |
  • You can probably do better on great cats and blurry images, because that is where the potential improvement is.
  • The ceiling on how much you could improve performance is much higher for those categories.

Cleaning up incorrectly labeled data

Incorrectly labeled examples (in training set)

  • DL algorithms are quite robust to random errors in the training set
    • e.g. a labeler randomly hit the wrong key
    • no need to fix these when the total data set is big enough
  • They are less robust to systematic errors
    • e.g. a labeler consistently labels white dogs as cats

Error analysis in dev/test set

Image      | Dog | Great Cat | Blurry | Incorrectly labeled | Comments
98         |     |           |        | ✓                   | Labeler missed cat in background
99         |     |           |        |                     |
100        |     |           |        | ✓                   | Drawing of a cat; not a real cat
% of total | 8%  | 43%       | 61%    | 6%                  |

Whether fixing the incorrect labels is worth it depends on how much of the overall dev error they account for:

                               | Case A | Case B
Overall dev set error          | 10%    | 2%
Errors due to incorrect labels | 0.6%   | 0.6%
Errors due to other causes     | 9.4%   | 1.4%

  • Case A: incorrect labels account for only 0.6% out of 10% of dev error (6% of all errors), so they are not an important problem right now.
  • Case B: incorrect labels account for 0.6% out of 2% of dev error (30% of all errors), so it is worthwhile to fix up the incorrect labels (see the quick check below).
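A quick check of that arithmetic (the case labels are just for illustration):

```python
# Fraction of dev-set errors explained by incorrect labels in each case.
for case, overall, due_to_labels in [("A", 0.10, 0.006), ("B", 0.02, 0.006)]:
    share = due_to_labels / overall
    print(f"Case {case}: {share:.0%} of dev errors come from incorrect labels")
# Case A: 6%  -> probably not the best use of your time
# Case B: 30% -> worthwhile to fix up the labels
```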

Goal of dev set is to help you select between two classifiers A&B.

Correcting incorrect dev/test set examples

  • Apply the same correction process to your dev and test sets to make sure they continue to come from the same distribution.
  • Consider examining examples your algorithm got right as well as ones it got wrong.
    • this is not always done, since there are usually far more correct examples to look through
  • After correcting only dev/test labels, the train and dev/test data may come from slightly different distributions; this is usually acceptable.
  • Wrap-up advice
    • In building practical systems, often there's also more manual error analysis and more human insight that goes into the systems.
    • Some engineers and researchers are reluctant to manually look at examples, but sitting down to look at 100 or 200 examples and counting the errors takes only minutes or a small number of hours, and it can really help you prioritize where to go next.

Build your first system quickly, then iterate

Speech recognition example

  • Noisy background
    • Cafe noise
    • Car noise
  • Accented speech
  • Far from microphone
  • Young children’s speech
  • Stuttering
  • Set up dev/test set and metric
  • Build initial system quickly
  • Use Bias/Variance analysis & Error analysis to prioritize the next steps
  • Build your first system quickly, then iterate
    • build something quick and dirty

Mismatched training and dev/test set

Training and testing on different distributions

  • Cat app example
    • Data from webpages
      • 200,000 images
    • Data from the mobile app (the distribution we care about)
      • 10,000 images
    • Option 1
      • merge and shuffle all 210,000 images
      • pros: train/dev/test come from the same distribution
      • cons: of the 2,500 dev examples, on average ~2,381 come from webpages (200k/210k) and only ~119 from the mobile app
    • Option 2 (the course recommends this; see the sketch after this list)
      • train = 205k (200k web + 5k mobile), dev = 2.5k mobile, test = 2.5k mobile
      • pros: dev/test now aim at the mobile app distribution you care about
      • cons: the training set mostly contains web images, a different distribution from dev/test
  • Speech-activated rearview mirror example
    • Training data
      • purchased speech data (from vendors)
      • smart speaker control
      • voice keyboard
      • 500k utterances, plus 10k taken from the rearview-mirror data
    • Dev/test data
      • speech-activated rearview mirror recordings
      • dev/test = 5k/5k (the other 10k collected goes into training, as above)
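A minimal sketch of the Option 2 split for the cat app, assuming web_images and mobile_images are lists of (image, label) pairs (placeholder data below):

```python
import random

# Placeholder data standing in for the two sources.
web_images    = [(f"web_{i}.jpg", 1) for i in range(200_000)]
mobile_images = [(f"mobile_{i}.jpg", 1) for i in range(10_000)]

random.seed(0)
random.shuffle(mobile_images)

# Dev and test come only from the distribution we care about (mobile app).
dev  = mobile_images[:2_500]
test = mobile_images[2_500:5_000]

# Training set: all web images plus the remaining mobile images.
train = web_images + mobile_images[5_000:]

print(len(train), len(dev), len(test))   # 205000 2500 2500
```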

Bias and Variance with mismatched data distributions

Cat classifier example

Assume humans get \(\approx 0\%\) error.

  • Training error 1%
  • Dev error 10%
  • Is this a variance problem?
    • The training set images are high-resolution; the dev set contains much more difficult images.
    • Two possible explanations:
      • the algorithm saw the training data but not the dev data (variance);
      • the training and dev sets come from different distributions (data mismatch).
    • Because you changed two things at the same time, it is difficult to tell which one is to blame.
  • Training-dev set: same distribution as the training set, but not used for training.
                   | Case 1           | Case 2
Training error     | 1%               | 1%
Training-dev error | 9%               | 1.5%
Dev error          | 10%              | 10%
Diagnosis          | Variance problem | Data mismatch

                                    | Case 3         | Case 4
Human error (proxy for Bayes error) | 0%             | 0%
Training error                      | 10%            | 10%
Training-dev error                  | 11%            | 11%
Dev error                           | 12%            | 20%
Diagnosis                           | Avoidable bias | Avoidable bias + data mismatch

Bias/variance on mismatched training and dev/test sets

                   | Example A | Example B
Human level error  | 4%        | 4%
  ↕ avoidable bias
Training error     | 7%        | 7%
  ↕ variance
Training-dev error | 10%       | 10%
  ↕ data mismatch
Dev error          | 12%       | 6%
  ↕ degree of overfitting to the dev set
Test error         | 12%       | 6%

In Example B the dev/test error is lower than the training-dev error; this can happen when the dev/test distribution is much easier than whatever data your app is actually working on.
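A minimal sketch of reading off the three gaps from these four numbers (the helper name diagnose is made up for illustration):

```python
def diagnose(human_error, train_error, train_dev_error, dev_error):
    """Print the gaps used to attribute error to bias, variance, or mismatch."""
    print(f"avoidable bias : {train_error - human_error:.1%}")
    print(f"variance       : {train_dev_error - train_error:.1%}")
    print(f"data mismatch  : {dev_error - train_dev_error:.1%}")

diagnose(0.04, 0.07, 0.10, 0.12)   # Example A above
diagnose(0.04, 0.07, 0.10, 0.06)   # Example B: negative "mismatch" => easier dev set
```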

More general formulation

                                 | General speech recognition | Rearview mirror speech recognition
Human level                      | “Human level” 4%           | 6%
Error on examples trained on     | “Training error” 7%        | 6%
Error on examples not trained on | “Training-dev error” 10%   | “Dev/Test error” 6%

  • Moving down the left column gives the usual analysis: avoidable bias (4% → 7%), then variance (7% → 10%).
  • Moving across the bottom row measures data mismatch (training-dev 10% vs dev/test 6%).
  • The left-column comparisons are the generally useful ones; filling in the remaining cells of the grid (the right column here) can give additional insight, e.g. how hard the rearview-mirror data is even for humans (6%).

Addressing data mismatch

  • Carry out manual error analysis to try to understand the differences between the training and dev/test sets
    • E.g. the dev set has much more car noise, or more street numbers to transcribe
  • Make the training data more similar to the dev/test sets, or collect more data similar to them
    • E.g. simulate noisy in-car data (see the sketch after this list)
  • Artificial data synthesis, example 1
    • “The quick brown fox jumps over the lazy dog.” (contains every letter a-z)
    • + car noise
    • = synthesized in-car audio
  • Caution with data synthesis
    • Suppose you have 10,000 hours of clean speech but only 1 hour of car noise.
    • Looping that 1 hour of noise 10,000 times risks overfitting to it, since it is a tiny subset of the space of all car noise.
    • With 10,000 hours of unique car noise, performance may well improve, though it is not guaranteed.
  • Artificial data synthesis, example 2
    • Car images rendered from a video game with only 20 unique car designs
    • The model may overfit to that subset of 20 cars.
  • When you have a data mismatch problem
    • It is recommended to do error analysis to understand how the training set and dev set differ.
    • Then try to get more training data that looks like the dev set.
    • Artificial data synthesis can work very well, but be cautious and bear in mind that you might accidentally be synthesizing data from only a tiny subset of all possible examples.
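A minimal sketch of the additive synthesis step, assuming clean_speech and car_noise are 1-D NumPy arrays at the same sample rate (the SNR parameter and function name are illustrative, not from the course):

```python
import numpy as np

def synthesize_in_car_audio(clean_speech, car_noise, snr_db=10.0):
    """Mix clean speech with car noise at a target signal-to-noise ratio."""
    # Tile the noise to cover the speech; note that reusing the same short
    # noise clip across a huge speech corpus is exactly the overfitting
    # risk described above.
    reps = int(np.ceil(len(clean_speech) / len(car_noise)))
    noise = np.tile(car_noise, reps)[: len(clean_speech)]

    # Scale the noise so that speech power / noise power matches the SNR.
    speech_power = np.mean(clean_speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    return clean_speech + noise
```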

Learning from multiple tasks

Transfer learning

  • Image example
    • Image recognition (pre-training)
      • large data set (~1,000k images)
    • Radiology diagnosis (fine-tuning)
      • swap in the new data set: (x: radiology images, y: diagnosis), ~100 examples
      • swap out the last output layer and the weights feeding into it: \(w^{[L]},b^{[L]}\)
      • small dataset: retrain the last layer only (see the sketch after this list)
      • lots of data: retrain the whole network
  • Audio example
    • pre-training: speech recognition (~10,000 h of audio)
    • fine-tuning: wake word / trigger word detection (~1 h of audio)
    • here you might swap the last output layer for several new layers
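A minimal Keras sketch of this recipe, assuming an ImageNet-pretrained backbone stands in for the image-recognition pre-training and a binary radiology label stands in for the diagnosis (both are illustrative assumptions, not from the course):

```python
from tensorflow import keras

# Pre-trained backbone (the "image recognition" network), without its
# original output layer.
base = keras.applications.ResNet50(weights="imagenet",
                                   include_top=False, pooling="avg")
base.trainable = False          # small radiology dataset: freeze everything

# New output layer replacing w[L], b[L] for the radiology task.
inputs = keras.Input(shape=(224, 224, 3))
features = base(inputs, training=False)
outputs = keras.layers.Dense(1, activation="sigmoid")(features)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

# With lots of radiology data you could instead set base.trainable = True
# and retrain (fine-tune) the whole network at a lower learning rate.
```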

When transfer learning makes sense

  • Task A and B have the same input x.
  • You have a lot more data for Task A than Task B.
  • Low level features from A could be helpful for learning B.

Multi-task learning

Simplified autonomous driving example

camera image: \(x^{(i)}\)

\(y^{(i)}\) is a \((4,1)\) label vector:

\[
y^{(i)}=\begin{bmatrix}0\\1\\1\\0\end{bmatrix}
\quad
\begin{matrix}\text{pedestrians}\\ \text{cars}\\ \text{stop sign}\\ \text{traffic lights}\end{matrix}
\]

\[
Y=\begin{bmatrix}
\vdots & \vdots & \vdots & & \vdots \\
y^{(1)}& y^{(2)}& y^{(3)}& \dots & y^{(m)}\\
\vdots & \vdots & \vdots & & \vdots
\end{bmatrix}
\]

Neural network architecture

\[
\hat y^{(i)}\in \Bbb R^{4\times1},\qquad
\text{Loss}=\frac1m \sum_{i=1}^m \sum_{j=1}^4 L(\hat y_j^{(i)}, y_j^{(i)})
\]

where \(L\) is the usual logistic loss:

\[
L(\hat y_j^{(i)}, y_j^{(i)})=-y_j^{(i)}\log \hat y_j^{(i)}-\bigl(1-y_j^{(i)}\bigr)\log\bigl(1-\hat y_j^{(i)}\bigr)
\]

  • Unlike softmax regression
    • one image can have multiple labels
    • this is nothing more than training a single NN on a 4-dimensional label
  • You could train 4 separate NNs instead, but:
    • the earlier layers' features can be shared, so training one NN usually gives much better performance
  • Some labels may be missing ("?"): sum only over the values of j that have a 0/1 label (see the sketch after this list): \[
    Y=\begin{bmatrix}
    1 & 1 && 0 && ?\\
    0 & 1 && 1 && 1\\
    ? & ? &\dots& 1 &\dots& ?\\
    ? & ? && 0 && ?\\
    \end{bmatrix}
    \]
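A minimal TensorFlow sketch of that masked loss, assuming missing "?" labels are encoded as -1 (the encoding and function name are illustrative):

```python
import tensorflow as tf

def masked_multitask_loss(y_true, y_pred):
    """Sum of per-label logistic losses, skipping labels marked as missing.

    y_true: (batch, 4) tensor with entries 0.0, 1.0, or -1.0 for "?"
    y_pred: (batch, 4) sigmoid outputs in (0, 1)
    """
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    y = tf.where(mask > 0, y_true, tf.zeros_like(y_true))  # safe dummy value

    per_label = -(y * tf.math.log(y_pred + 1e-7)
                  + (1.0 - y) * tf.math.log(1.0 - y_pred + 1e-7))

    # Only labels that are actually 0/1 contribute to the sum over j.
    return tf.reduce_sum(per_label * mask, axis=-1)
```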

When multi-task learning makes sense

  • Training on a set of tasks that could benefit from having shared lower-level features.
  • Usually: amount of data you have for each task is quite similar.
  • Can train a big enough neural network to do well on all the tasks.(otherwise less performance)
  • Multi-task learning is used less often in practice than transfer learning.
  • With a small data set, transfer learning is usually the better fit.
  • An exception where multi-task learning is commonly used is object detection in computer vision.

End-to-end deep learning

What is end-to-end deep learning?

Speech recognition

Speech recognition: x = audio clip, y = transcript

  • Traditional pipeline: audio → MFCC features → ML → phonemes (e.g. “c”, “a”, “t”) → words → transcript
  • End-to-end: audio → transcript
  • Intermediate: audio → ML → phonemes → transcript
  • ~3,000 h of data: the traditional pipeline approach actually works just as well.
  • 10k-100k h of data: the end-to-end approach suddenly starts to work very well.
  • Medium amounts of data: bypass the hand-designed features and have the neural network learn to output the phonemes, then continue from there.

Face recognition

  • Camera image (x) -> Identity (y)
    • the face can appear at any position in the image
  • Camera image (x) -> Crop/zoom to the face (x’) -> Identity (y)
  • Why the two-step approach works
    1. Each of the two problems you’re solving is actually much simpler.
    2. You have a lot of data for each of the two sub-tasks.
  • Breaking the task down into two sub-problems gives better performance here than a pure end-to-end deep learning approach.
  • Machine translation
    • x: English sentence
    • y: French sentence
    • there are lots of (English, French) sentence pairs available
    • so end-to-end DL works well
  • Estimating a child’s age
    • X: x-ray picture of a hand
    • Y: age of the child
    • pipeline: image -> bones -> age
    • there is not enough (image, age) data to train this task end-to-end
    • task 1: bone segmentation from the image is a relatively simple problem
    • task 2: mapping bone measurements to age can use known statistics of children’s bone sizes
    • so the pipeline approach may work well

Whether to use end-to-end deep learning

Pros and cons of end-to-end deep learning

  • Pros
    • Let the data speak
    • Less hand-designing of components needed
  • Cons
    • May need large amount of data
    • Excludes potentially useful hand-designed components

Applying end-to-end deep learning

  • Key question: Do you have sufficient data to learn a function of the complexity needed to map x to y?
  • Autonomous driving
    • Carefully choose which X -> Y mappings to learn, based on what tasks you can actually get data for.

Machine Learning flight simulator

Autonomous driving (case study)
