
[Machine Learning] Logistic Regression: Principle and Derivation

1. Definition of Logistic Regression

Although logistic regression is a classification model, as its name suggests it is closely related to regression. As is well known, the linear regression equation is \(y = \theta^T x\). If we pass this linear score through the normalizing \(sigmoid\) function, the resulting \(y\) falls in the range \((0,1)\); we then set a threshold \(q\) and classify outputs greater than \(q\) as class 1 and outputs less than \(q\) as class 0. This is the logistic regression model, expressed as: \[\begin{aligned} y(x) & = sigmoid(\theta^T x) \\ & = \frac{1}{1+e^{-\theta^T x}} \end{aligned}\]
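The hypothesis and thresholding step above can be sketched in NumPy as follows (the function names and the default threshold \(q = 0.5\) are illustrative choices, not part of the original text):

```python
import numpy as np

def sigmoid(z):
    # squash any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # logistic regression hypothesis: sigmoid of the linear score theta^T x
    return sigmoid(theta @ x)

def predict_label(theta, x, q=0.5):
    # outputs above the threshold q are class 1, below are class 0
    return 1 if predict_proba(theta, x) > q else 0
```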

2. Derivation of Logistic Regression

In binary classification the label is either \(0\) or \(1\). If we regard each collected sample as an event, then the probability that the event is classified as \(1\) is \(p\), and the probability that it is classified as \(0\) is \(1-p\): \[\begin{aligned} P(y|x) = \begin{cases} p, & y = 1 \\ 1-p, & y = 0 \end{cases} \end{aligned}\] These two cases can be combined into a single expression (when \(y=1\) it evaluates to \(p\), and when \(y=0\) it evaluates to \(1-p\)): \[\begin{aligned} P(y_i|x_i) = p^{y_i} (1-p)^{1-y_i} \end{aligned}\]

Suppose we have a dataset of \(N\) samples, \(D = \lbrace (x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N) \rbrace\). The likelihood function is then: \[\begin{aligned} L(\theta) & = \prod_{i=1}^{N} P(y_i|x_i) \\ & = \prod_{i=1}^{N} p^{y_i} (1-p)^{1-y_i} \end{aligned}\]

Taking the logarithm of both sides: \[\begin{aligned} l(\theta ) = \ln(L(\theta)) & = \ln\left(\prod_{i=1}^{N} p^{y_i} (1-p)^{1-y_i}\right) \\ & = \sum_{i=1}^{N} \ln\left(p^{y_i} (1-p)^{1-y_i}\right) \\ & = \sum_{i=1}^{N}\left(y_i\ln(p) + (1-y_i)\ln(1-p)\right) \end{aligned}\]
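This log-likelihood translates directly into NumPy; in the sketch below the function name and the small `eps` guard against `log(0)` are my own additions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i*ln(p_i) + (1-y_i)*ln(1-p_i) ]
    # where p_i = sigmoid(theta^T x_i); X holds one sample per row
    p = sigmoid(X @ theta)
    eps = 1e-12  # numerical guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

With \(\theta = 0\) every \(p_i = 0.5\), so \(l(\theta) = N \ln(0.5)\), which gives a quick sanity check.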

Maximizing \(l(\theta) = \sum_{i=1}^{N}\left(y_i\ln(p) + (1-y_i)\ln(1-p)\right)\) is our goal; equivalently, its negation \(-l(\theta)\) is the logistic regression loss function (the cross-entropy loss). Next we take the partial derivative of \(l(\theta)\) with respect to \(\theta\), but before doing so we first differentiate \(p\): \[\begin{aligned} p' = f'(\theta) & = \left(\frac{1}{1+e^{-\theta^T x}}\right)' \\ & = -\frac{1}{(1+e^{-\theta^T x})^2} \cdot (1 + e^{-\theta^T x})' \\ & = -\frac{1}{(1+e^{-\theta^T x})^2} \cdot e^{-\theta^T x} \cdot (-x) \\ & = \frac{1}{(1+e^{-\theta^T x})^2} \cdot x(e^{-\theta^T x} + 1 - 1) \\ & = x\left(\frac{1}{1+e^{-\theta^T x}} - \frac{1}{(1+e^{-\theta^T x})^2}\right) \\ & = p(1 - p)x \end{aligned}\]

Similarly, substituting \(1-p\) gives: \[\begin{aligned} (1 - p)' = -p(1 - p)x \end{aligned}\]

Now that the derivatives of both \(p\) and \(1-p\) are in hand, we can differentiate \(l(\theta)\) itself (applying the chain rule to each logarithm): \[\begin{aligned} \frac{\partial l(\theta)}{\partial \theta} & = \sum_{i=1}^{N} \left(y_i\ln(p) + (1-y_i)\ln(1-p)\right)' \\ & = \sum_{i=1}^{N} \left[y_i\frac{p(1-p)x_i}{p} - (1-y_i)\frac{p(1-p)x_i}{1-p}\right] \\ & = \sum_{i=1}^{N} \left[y_i(1-p)x_i - p x_i(1-y_i)\right] \\ & = \sum_{i=1}^{N} (x_iy_i - px_i) \\ & = \sum_{i=1}^{N} (y_i - p)x_i \end{aligned}\]

\(p=\frac{1}{1+e^{-\theta^T x}}\) 代进去得到: \[\begin{aligned} \frac{\partial l(\theta)}{\partial \theta} = \sum_{i=i}^{N} (y_i - \frac{1}{1+e^{-\theta^T x_i}})x_i \end{aligned}\]

Maximum likelihood estimation requires maximizing the log-likelihood, and maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood: \[\begin{aligned} \theta^* = \mathop{\arg\max}\limits_{\theta} l(\theta) = \mathop{\arg\min}\limits_{\theta} \left(-l(\theta)\right) \end{aligned}\]

So when solving with gradient descent on \(-l(\theta)\), the parameter update rule is: \[\begin{aligned} \theta & = \theta - \alpha \frac{\partial (-l(\theta))}{\partial \theta} \\ & = \theta + \alpha \sum_{i=1}^{N} \left(y_i - \frac{1}{1+e^{-\theta^T x_i}}\right)x_i \end{aligned}\]

where \(\alpha\) is the learning rate.
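Putting the update rule together gives a minimal batch gradient-descent trainer; in the sketch below the function name, hyperparameters, and the toy linearly separable dataset are all illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, alpha=0.1, n_iter=1000):
    # batch gradient descent on the negative log-likelihood:
    # theta := theta + alpha * sum_i (y_i - p_i) x_i
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        theta += alpha * (X.T @ (y - p))
    return theta

# toy data: label 1 iff the first feature is positive;
# the constant second column plays the role of a bias term
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logreg(X, y)
```

On separable data like this the learned \(\theta\) keeps growing with more iterations (the likelihood has no finite maximizer), which is one reason practical implementations add regularization.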

3. Python Code

Train a logistic regression model on the iris dataset:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# import iris data
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

# fit
logreg = LogisticRegression(C=1e5, max_iter=1000)
logreg.fit(x_train, y_train)

# predict
y_pred = logreg.predict(x_test)

# classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Run the program:

$ python lr.py
precision recall f1-score support

setosa 1.00 0.94 0.97 16
versicolor 0.82 1.00 0.90 14
virginica 1.00 0.87 0.93 15

accuracy 0.93 45
macro avg 0.94 0.93 0.93 45
weighted avg 0.95 0.93 0.93 45