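Andrew Ng Machine Learning Notes (Part 2)

Week 3

Logistic Regression Model

For classification, we start with binary classification, where the output $y$ takes only two values, 0 and 1: $y=1$ denotes the positive class and $y=0$ the negative class. The hypothesis $h_\theta(x)$ is defined as follows: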

$h_\theta(x) = g(\theta^T x)$

$z = \theta^T x$

$g(z) = \dfrac{1}{1 + e^{-z}}$

$g(z)$ is the "Sigmoid" function, also called the logistic function.

$h_{\theta}(x)$ gives the probability that the output is 1.

$h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta)$

$P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1$
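The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original notes: the function names `sigmoid` and `hypothesis` and the values of `theta` and `x` are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): the estimated probability that y = 1."""
    return sigmoid(np.dot(theta, x))

# Made-up parameters and one example; x[0] = 1 is the intercept term.
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 0.5])

p1 = hypothesis(theta, x)  # P(y = 1 | x; theta); theta^T x = 0 here, so p1 = 0.5
p0 = 1.0 - p1              # P(y = 0 | x; theta)
```

The two probabilities always sum to 1, which is why a single function $h_\theta(x)$ is enough to describe both classes.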

Decision Boundary

$h_\theta(x) \geq 0.5 \Rightarrow y = 1$

$h_\theta(x) < 0.5 \Rightarrow y = 0$

Since $g(z) \geq 0.5$ exactly when $z \geq 0$, this is equivalent to:

$\theta^T x \geq 0 \Rightarrow y = 1$

$\theta^T x < 0 \Rightarrow y = 0$
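As a small sketch of this rule, prediction only needs the sign of $\theta^T x$, so the sigmoid does not have to be evaluated. The function name `predict` and the numbers are assumptions chosen for illustration; with this `theta` the boundary is the line $x_1 + x_2 = 3$.

```python
import numpy as np

def predict(theta, X):
    """Return 1 where theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    return (X @ theta >= 0).astype(int)

# Made-up parameters: theta^T x = -3 + x1 + x2.
theta = np.array([-3.0, 1.0, 1.0])

# Each row is one example with a leading 1 for the intercept.
X = np.array([
    [1.0, 1.0, 1.0],  # x1 + x2 = 2, below the boundary
    [1.0, 2.0, 1.0],  # x1 + x2 = 3, exactly on the boundary
    [1.0, 3.0, 2.0],  # x1 + x2 = 5, above the boundary
])

labels = predict(theta, X)
```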

Cost Function

The logistic regression cost function is defined as follows:

$J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})$

$\mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \quad \text{if } y = 1$

$\mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \quad \text{if } y = 0$

Simplified cost function:

$\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$

From the cost function, we get the following properties:

$\mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y$

$\mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \text{ and } h_\theta(x) \rightarrow 1$

$\mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \text{ and } h_\theta(x) \rightarrow 0$
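These limits are easy to see numerically. The snippet below is only an illustration using the standard library; the function name `cost` and the sample values of $h$ are assumptions, and $h$ is kept strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def cost(h, y):
    """Per-example logistic cost: -y*log(h) - (1 - y)*log(1 - h), for 0 < h < 1."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# As h approaches 1, the cost for y = 1 shrinks toward 0
# while the cost for y = 0 grows without bound.
for h in (0.5, 0.9, 0.99, 0.999999):
    print(h, cost(h, 1), cost(h, 0))
```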

The vectorized form is as follows ($X$ is the $m \times (n+1)$ design matrix):

$h = g(X\theta)$

$J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right)$
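A minimal NumPy sketch of the vectorized cost, checked against the explicit sum over examples. The function names and the toy data are assumptions made for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_vectorized(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = y.size
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - (1 - y) @ np.log(1 - h)) / m

def cost_summed(theta, X, y):
    """The same cost written as an explicit sum over the m examples."""
    m = y.size
    total = 0.0
    for x_i, y_i in zip(X, y):
        h_i = sigmoid(x_i @ theta)
        total += y_i * np.log(h_i) + (1 - y_i) * np.log(1 - h_i)
    return -total / m

# Toy data: first column of X is the intercept term.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With theta = 0 every prediction is 0.5, so the cost equals log(2).
J0 = cost_vectorized(np.zeros(2), X, y)
```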

Gradient Descent

$J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right)$

To minimize the cost function $J(\theta)$, we again take partial derivatives and simultaneously update every $\theta_j$ with the rule below:

$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta)$
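For reference, here is the partial derivative this update uses. The derivation relies on the sigmoid identity $g'(z) = g(z)(1 - g(z))$, which is not stated in the notes above:

$\dfrac{\partial}{\partial \theta_j}J(\theta) = \dfrac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$

so each update becomes

$\theta_j := \theta_j - \dfrac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$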

In vectorized form:

$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$
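The vectorized update can be sketched as a short loop. This is an illustrative implementation under assumptions: the function names, the learning rate `alpha=0.1`, the iteration count, and the toy data (deliberately not linearly separable, so the parameters stay bounded) are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - (1 - y) @ np.log(1 - h)) / y.size

def gradient_descent(X, y, alpha=0.1, num_iters=500):
    """Repeat theta := theta - (alpha/m) * X^T (g(X theta) - y), starting from zeros."""
    m, n = X.shape
    theta = np.zeros(n)
    history = [cost(theta, X, y)]  # cost before any update
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
        history.append(cost(theta, X, y))
    return theta, history

# Toy data: first column is the intercept; labels overlap on purpose.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta, history = gradient_descent(X, y)
```

Tracking `history` is a simple way to check that the cost decreases from one iteration to the next; if it does not, the learning rate is probably too large.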

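Besides gradient descent, $\theta$ can also be optimized with other algorithms such as "Conjugate gradient", "BFGS", and "L-BFGS". They can optimize the parameters $\theta$ faster and more precisely, but the algorithms themselves are more complex.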

Multiclass Classification: One-vs-all

$y \in \lbrace 0, 1, \dots, n \rbrace$

$h_\theta^{(0)}(x) = P(y = 0 | x ; \theta)$

$h_\theta^{(1)}(x) = P(y = 1 | x ; \theta)$

$\cdots$

$h_\theta^{(n)}(x) = P(y = n | x ; \theta)$

$\mathrm{prediction} = \arg\max_i( h_\theta ^{(i)}(x) )$
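A sketch of the prediction step: evaluate every per-class classifier and pick the class whose $h_\theta^{(i)}(x)$ is largest. The matrix `all_theta` below holds hypothetical, already-trained parameters (one row per class) invented for this example; training each classifier is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, X):
    """Return, for each row of X, the index of the class with the highest h_theta."""
    probs = sigmoid(X @ all_theta.T)  # shape: (examples, classes)
    return np.argmax(probs, axis=1)

# Hypothetical parameters for 3 classes (intercept + 2 features each).
all_theta = np.array([
    [ 1.0, -2.0,  0.0],  # classifier for y = 0
    [-1.0,  2.0, -2.0],  # classifier for y = 1
    [-1.0,  0.0,  2.0],  # classifier for y = 2
])

# Three examples, each with a leading 1 for the intercept.
X = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 2.0],
])

pred = predict_one_vs_all(all_theta, X)
```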

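Overfitting

There are usually two ways to deal with overfitting:

Reduce the number of features
- Manually select which features to keep
- Use a model selection algorithm
Regularization
- Keep all the features, but reduce the magnitude of the parameters $\theta_j$
- Regularization works well when there are many useful features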

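Regularized Linear Regression

Introduce the regularization parameter $\lambda$; if $\lambda$ is too large, the model may underfit.

$\min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$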
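A minimal sketch of this objective for linear regression, where $h_\theta(x) = \theta^T x$. It assumes the penalty is scaled by $\frac{1}{2m}$ together with the squared error, which is what produces the $\frac{\lambda}{m}\theta_j$ term in the gradient descent update. The function name and the data are made up for illustration.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """(1 / 2m) * [ sum of squared errors + lam * sum_{j>=1} theta_j^2 ]."""
    m = y.size
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (residual @ residual + penalty) / (2 * m)

# Toy data that theta = [0, 1] fits exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

J_plain = regularized_cost(theta, X, y, lam=0.0)  # zero error, no penalty
J_reg = regularized_cost(theta, X, y, lam=3.0)    # penalty only: 3 * 1^2 / (2 * 3) = 0.5
```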

Applying gradient descent:

$\theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}$

$\theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right], \quad j \in \lbrace 1, 2, \dots, n \rbrace$

The update for $\theta_j$ can be rearranged as:

$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$

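In these updates, the term $\frac{\lambda}{m}\theta_j$ performs the regularization. For positive $\alpha$ and $\lambda$, the factor $1 - \alpha\frac{\lambda}{m}$ is less than 1, so every update shrinks $\theta_j$ slightly before the usual gradient step is applied.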
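The two ways of writing the $\theta_j$ update can be checked against each other numerically. This is a sketch for linear regression, with function names, data, and hyperparameters chosen only for illustration; in both versions $\theta_0$ is left unregularized.

```python
import numpy as np

def regularized_step(theta, X, y, alpha, lam):
    """One update: theta_j := theta_j - alpha * [gradient_j + (lam/m) * theta_j]."""
    m = y.size
    grad = X.T @ (X @ theta - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0  # theta_0 is not regularized
    return theta - alpha * (grad + reg)

def regularized_step_shrink(theta, X, y, alpha, lam):
    """Same update written as theta_j * (1 - alpha*lam/m) - alpha * gradient_j."""
    m = y.size
    grad = X.T @ (X @ theta - y) / m
    shrink = np.full(theta.shape, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # no shrinkage for theta_0
    return theta * shrink - alpha * grad

# Made-up data and starting point.
X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 0.5], [1.0, 3.0, 1.5], [1.0, 4.0, 3.0]])
y = np.array([2.0, 2.5, 4.0, 6.0])
theta = np.array([0.5, -1.0, 2.0])

new_a = regularized_step(theta, X, y, alpha=0.05, lam=2.0)
new_b = regularized_step_shrink(theta, X, y, alpha=0.05, lam=2.0)
```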

Normal Equation

$\theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty$

$\text{where}\ \ L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}$

$L$ is an $(n+1) \times (n+1)$ matrix; the 0 in its top-left entry leaves $\theta_0$ unregularized.
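A short NumPy sketch of the regularized normal equation. Here $L$ is built as an identity matrix with its top-left entry set to 0, and the system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is a choice made for this example rather than something the notes prescribe. The function name and the data are assumptions.

```python
import numpy as np

def normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, with L = identity except L[0, 0] = 0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Toy data lying exactly on the line y = x.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta_unreg = normal_equation(X, y, lam=0.0)  # recovers intercept 0, slope 1
theta_reg = normal_equation(X, y, lam=1.0)    # slope is pulled toward 0
```

With $\lambda = 1$ the slope shrinks from 1 to $2/3$ on this data, and the intercept rises to compensate, which shows the penalty acting only on $\theta_1$.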