
Machine Learning [Week 3/11] Logistic Regression

Notes from the Machine Learning course (Stanford University)

Classification and Representation

Classification

  • \(y\in\{0,1\}\)
    • 0: Negative Class
    • 1: Positive Class
  • Threshold classifier output \(h_\theta(x)\) at 0.5:$$
    \text{If }h_\theta(x)\geq 0.5, \text{ predict }"y=1"\\
    \text{If }h_\theta(x)\lt 0.5, \text{ predict }"y=0"
    $$

Hypothesis Representation

  • Logistic Regression Model
    • \(\text{Want }0\leq h_\theta(x) \leq 1\)
      \[
      h_\theta(x)=g(\theta^Tx)=g(z)=\frac{1}{1+e^{-z}}=\frac{1}{1+e^{-\theta^Tx}}
      \]
    • g function
      • The sigmoid function, also known as the logistic function (see the Octave sketch after this list)
  • Interpretation of Hypothesis Output
    • \(h_\theta(x)\) = estimated probability that \(y=1\) given input \(x\)
    • Example: If\[
      x=\begin{bmatrix}
      x_0\\
      x_1
      \end{bmatrix}
      = \begin{bmatrix}
      1\\
      \text{tumorSize}
      \end{bmatrix}
      \]
    • \[
      h_\theta(x)=P(y=1|x;\theta)\\
      y\in\{0,1\}
      \]
    • probability that y = 1, given x, parameterized by \(\theta\)
    • \[
      P(y=0|x;\theta)+P(y=1|x;\theta)=1\\
      P(y=0|x;\theta)=1-P(y=1|x;\theta)
      \]
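A minimal Octave sketch of the sigmoid function described above; the function name sigmoid is an assumption here, chosen to match the name used later in the MATLAB code memo:

function g = sigmoid(z)
  % element-wise sigmoid, so z may be a scalar, vector, or matrix
  g = 1 ./ (1 + exp(-z));
end

% sigmoid(0) = 0.5; large positive z approaches 1, large negative z approaches 0
sigmoid([-10 0 10])   % ans ~ 4.5e-05   0.5   1.0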

Decision Boundary

  • Logistic regression
    • $$
      h_\theta(x)=g(\theta^Tx)\\
      g(z)=\frac{1}{1+e^{-z}}
      $$
  • Discrete 0 or 1 classification
    • $$
      \begin{align}
      & h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline
      & h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align}
      $$
    • $$
      \begin{align*}
      & g(z) \geq 0.5 \newline
      & \text{when } z \geq 0\end{align*}
      $$
    • $$
      \begin{align*}
      & z=0,\; e^{0}=1 \Rightarrow g(z)=\frac12\newline
      & z \to \infty,\; e^{-z} \to 0 \Rightarrow g(z) \to 1 \newline
      & z \to -\infty,\; e^{-z} \to \infty \Rightarrow g(z) \to 0 \end{align*}
      $$
    • the input to g is \(\theta^Tx\)
    • $$
      \begin{align*}
      & h_\theta(x) = g(\theta^T x) \geq 0.5 \newline
      & \text{when } \theta^T x \geq 0\end{align*}
      $$
    • $$
      \begin{align*}
      & \theta^T x \geq 0 \Rightarrow y = 1 \newline
      & \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}
      $$
  • Decision Boundary
    • \[
      h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)
      \]
    • With, for example, \(\theta_0=-3,\ \theta_1=1,\ \theta_2=1\) (used in the Octave sketch after this list), predict $$
      "y=1" \text{ if } -3+x_1+x_2\geq0
      $$
    • Decision Boundary $$
      x_1+x_2=3
      $$
  • Non-linear decision boundaries
    • $$
      h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)
      $$
    • With, for example, \(\theta_0=-1,\ \theta_3=1,\ \theta_4=1\) and the other parameters zero, predict $$
      "y=1" \text{ if } -1+x_1^2+x_2^2\geq 0
      $$
    • Decision Boundary $$
      x_1^2+x_2^2= 1
      $$
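A small Octave sketch of how the linear decision boundary above turns into a discrete prediction, using the example parameters \(\theta_0=-3,\ \theta_1=1,\ \theta_2=1\); the specific input values are made up for illustration:

theta = [-3; 1; 1];        % example parameters from the linear decision boundary above
x = [1; 2; 2.5];           % feature vector with x0 = 1 (bias term), hypothetical values

z = theta' * x;            % theta^T x = -3 + 2 + 2.5 = 1.5
h = 1 / (1 + exp(-z));     % h_theta(x), about 0.82
prediction = (h >= 0.5)    % 1, since x1 + x2 = 4.5 >= 3 lies on the y=1 side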

Logistic Regression Model

  • How to choose parameters \(\theta\) ?
    • Training set:$$
      \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)})\dots (x^{(m)},y^{(m)}) \}
      $$
    • m examples$$
      x=\begin{bmatrix}
      x_0\\
      x_1\\
      \vdots\\
      x_n
      \end{bmatrix}\in\Bbb R^{n+1},\quad x_0=1,\quad y\in\{0,1\}\\
      h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}
      $$

Cost function

  • Linear regression: $$
    J(\theta)=\frac1m\sum_{i=1}^m \frac12(h_\theta(x^{(i)})-y^{(i)})^2
    $$
  • $$
    Cost(h_\theta(x),y)=\frac12(h_\theta(x)-y)^2
    $$
  • Logistic regression cost function (see the Octave sketch after this list) $$
    Cost(h_\theta(x),y)=\begin{cases}
    -\log(h_\theta(x)) & \text{if } y=1\\
    -\log(1-h_\theta(x)) & \text{if } y=0
    \end{cases}
    $$
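A tiny Octave sketch of this piecewise cost on a single example, just to illustrate that confident wrong predictions are penalized heavily; the value of h is made up:

h = 0.99;                  % hypothetical h_theta(x) for one training example
cost_if_y1 = -log(h)       % about 0.01: near-certain and correct if y = 1
cost_if_y0 = -log(1 - h)   % about 4.6:  near-certain and wrong if y = 0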

Simplified Cost Function and Gradient Descent

  • $$
    \begin{align*}
    & \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline
    & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline
    & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}
    $$
  • Logistic regression cost function$$
    \begin{align*}
    J(\theta)&=\frac1m\sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\\
    &=-\frac1m \sum_{i=1}^m [y^{(i)} \log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]
    \end{align*}
    $$
    • To fit parameters \(\theta\):$$\min_\theta J(\theta)$$
    • To make a prediction given new x:$$
      h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}
      $$
  • Remember that the general form of gradient descent is:$$
    \begin{align*}
    &\text{Repeat}:\{\\
    &\quad\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\\
    &\}
    \end{align*}
    $$
  • We can work out the derivative part using calculus to get:$$
    \begin{align*}
    &\text{Repeat}:\{\\
    &\quad\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\\
    &\}
    \end{align*}
    $$
  • A vectorized implementation (see the Octave sketch after this list) is: $$
    \theta := \theta-\frac{\alpha}{m}X^T(g(X\theta)-\vec{y})
    $$
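A minimal Octave sketch of this vectorized update loop, assuming a design matrix X (m x (n+1), first column all ones) and label vector y are already loaded; the learning rate and iteration count are made-up values:

m = length(y);
theta = zeros(size(X, 2), 1);
alpha = 0.1;                                    % hypothetical learning rate
num_iters = 400;                                % hypothetical number of iterations

for iter = 1:num_iters
  h = 1 ./ (1 + exp(-X * theta));               % g(X*theta), m x 1 vector of hypotheses
  theta = theta - (alpha / m) * X' * (h - y);   % vectorized gradient step
end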

Advanced Optimization

  • Optimization algorithm
  • Advantages
    • No need to manually pick \(\alpha\)
    • Often converges faster than gradient descent
  • Disadvantages
    • More complex

Octave

Write a function that evaluates the following two quantities for a given parameter value θ:

$$ \begin{align*} & J(\theta) \newline & \dfrac{\partial}{\partial \theta_j}J(\theta)\end{align*}$$

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Then use Octave's fminunc():

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
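As a concrete sketch (the function name logisticCostFunction and the use of an anonymous function to pass in X and y are assumptions, not part of the lecture), an unregularized logistic regression cost function for fminunc could look like:

function [jVal, gradient] = logisticCostFunction(theta, X, y)
  % cost J(theta) and its gradient for unregularized logistic regression
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % sigmoid(X * theta)
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % J(theta)
  gradient = (1/m) * X' * (h - y);                         % partial derivatives
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t) logisticCostFunction(t, X, y), initialTheta, options);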

Multiclass Classification

  • One-vs-all
    • Turn the multiclass problem into several one-vs-rest binary classification problems
    • Train a classifier \(h_\theta^{(i)}(x)\) for each class \(i\) to predict the probability that \(y=i\)
    • To make a prediction on a new input \(x\), pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\) (see the Octave sketch after this list)
  • $$
    \begin{align*}& y \in \lbrace0, 1, \dots, n\rbrace \newline
    & h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline
    & h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline
    & \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline
    & \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\newline\end{align*}
    $$
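A minimal Octave sketch of the one-vs-all prediction step, assuming the trained parameters for all classes are stacked into a matrix all_theta with one row per class (that layout is an assumption for this sketch):

% all_theta: K x (n+1), one row of parameters per class
% X:         m x (n+1) design matrix with a leading column of ones
probs = 1 ./ (1 + exp(-X * all_theta'));   % m x K matrix of h_theta^(i)(x) values
[~, predictions] = max(probs, [], 2);      % pick the class with the largest probability
                                           % (indices are 1-based; shift if classes are labeled from 0)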

Solving the Problem of Overfitting

Problem of Overfitting

  • underfitting, high bias
  • overfitting, high variance
  • Addressing overfitting
    • Reduce number of features.
      • manually select which features to keep.
      • Model selection algorithm (later in course).
    • Regularization
      • Keep all the features, but reduce magnitude/values of parameters \(\theta_j\).
      • Works well when we have a lot of features, each of which contributes a bit to predicting \(y\).

Cost function

  • Regularization
    • Keeping the parameters \(\theta\) small gives a smoother, simpler hypothesis (e.g. it reduces the influence of the higher-order terms); see the Octave sketch after this list
    • Regularization parameter: \(\lambda\)
    • $$
      J(\theta)=\frac{1}{2m}[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n \theta_j^2]
      $$
    • If \(\lambda\) is too large, every parameter except \(\theta_0\) is driven close to zero, which is essentially the same as underfitting with a flat line
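A small Octave sketch of this regularized cost for linear regression, assuming X, y, theta, and lambda already exist in the workspace; note that theta(1), i.e. \(\theta_0\), is excluded from the penalty:

m = length(y);
h = X * theta;                               % linear regression hypothesis
sq_err = sum((h - y) .^ 2);                  % sum of squared errors
penalty = lambda * sum(theta(2:end) .^ 2);   % regularization term, skipping theta_0
J = (1 / (2 * m)) * (sq_err + penalty);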

Regularized Linear Regression

  • Gradient descent
    • Repeat{ $$
      \begin{align*}
      \theta_0 &:= \theta_0 - \alpha \frac1m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} \dots (j=0)\\
      \theta_j &:= \theta_j - \alpha [\frac1m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j] \dots (j=1,2,3,\dots,n)
      \end{align*}
      $$
      }
    • $$
      \theta_j := \theta_j(1 - \alpha \frac{\lambda}m) - \alpha \frac1m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}
      $$
    • Each update shrinks \(\theta_j\) by a small factor, because $$ 1 - \alpha \frac{\lambda}m < 1$$
  • Normal equation
    • $$
      X=\begin{bmatrix}
      (x^{(1)})^T\\
      \vdots \\
      (x^{(m)})^T
      \end{bmatrix}\in\Bbb R^{m \times (n+1)}
      $$
    • $$
      y=\begin{bmatrix}
      y^{(1)}\\
      \vdots \\
      y^{(m)}
      \end{bmatrix}\in\Bbb R^{m}
      $$
    • \(\min_\theta J(\theta)\)
    • $$
      \text{If }\lambda\gt0\\
      \theta=(X^TX + \lambda \begin{bmatrix}
      0&&&&\\
      &1&&&\\
      &&1&&\\
      &&&\ddots&\\
      &&&&1
      \end{bmatrix})^{-1}X^Ty
      $$
  • Non-invertibility (optional/advanced)
    • Suppose \(m\leq n\)
    • $$
      \theta=(\underbrace{X^TX}_{\text{non-invertible}})^{-1}X^Ty
      $$
      • Octave: use pinv (pseudo-inverse)
    • Regularization makes it invertible (see the Octave sketch after this list):$$
      \text{If }\lambda\gt0\\
      \theta=(X^TX + \lambda \cdot L)^{-1}X^Ty
      $$
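A short Octave sketch of the regularized normal equation, assuming X, y, and lambda are defined as above:

n_plus_1 = size(X, 2);                      % n + 1 columns, including the bias column
L = eye(n_plus_1);
L(1, 1) = 0;                                % do not regularize theta_0
theta = (X' * X + lambda * L) \ (X' * y);   % backslash solves the system instead of explicitly inverting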

Regularized Logistic Regression

  • Logistic regression cost function$$
    \begin{align*}
    J(\theta)&=-\frac1m \sum_{i=1}^m [y^{(i)} \log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]
    \end{align*}
    $$
  • Logistic regression cost function$$
    \begin{align*}
    J(\theta)&=-\frac1m \sum_{i=1}^m [y^{(i)} \log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]+\frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2
    \end{align*}
    $$
    • The second sum, \(\sum_{j=1}^n \theta_j^2\), starts at \(j=1\) and therefore explicitly excludes the bias term \(\theta_0\) from regularization
  • Gradient descent
    • Repeat{ $$
      \begin{align*}
      \theta_0 &:= \theta_0 - \alpha \frac1m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} \dots (j=0)\\
      \theta_j &:= \theta_j - \alpha [\frac1m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j] \dots (j=1,2,3,\dots,n)
      \end{align*}
      $$
      }
    • The only difference from regularized linear regression is the hypothesis: $$h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$$

MATLAB code memo

Cost function (unregularized)

h_theta_x = sigmoid(X * theta);                                     % hypothesis for all examples
J = 1/m * sum(-y .* log(h_theta_x) - (1-y) .* log(1-h_theta_x));    % cross-entropy cost
grad = 1/m * sum((h_theta_x-y) .* X)';                              % gradient, one entry per theta_j

Cost function (regularized)

h_theta_x = sigmoid(X * theta);                                        % hypothesis for all examples

cost = 1/m * sum(-y .* log(h_theta_x) - (1-y) .* log(1-h_theta_x));    % unregularized part of the cost
theta_reg = [0; theta(2:end)];                                         % copy of theta with theta_0 zeroed out
reg = lambda / (2 * m) * sum(theta_reg' * theta_reg);                  % regularization term
J = cost + reg;

grad = 1/m * sum((h_theta_x-y) .* X)' + lambda / m * theta_reg;        % regularized gradient
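Assuming the regularized snippet above is wrapped in a function costFunctionReg(theta, X, y, lambda) that returns [J, grad] (a hypothetical name, not defined in these notes), it plugs into fminunc the same way as before:

options = optimset('GradObj', 'on', 'MaxIter', 400);
initial_theta = zeros(size(X, 2), 1);
lambda = 1;                                    % hypothetical regularization strength
[theta, J] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initial_theta, options);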
