[Machine Learning] Deriving the Logistic Regression Formulas and a Python Implementation of Gradient Descent


In general, the binomial logistic regression model is a discriminative binary classifier represented by the conditional probability distribution $P(Y|X)$, where the random variable $X$ is real-valued and $Y$ takes the value 0 or 1. For a given $x$, we compare $P(Y=1|x)$ and $P(Y=0|x)$ and assign the class with the larger probability.

Derivation from the Linear Model

We start from generalized linear regression: $y = wx + b$, where $y$ is the regression target, $x$ is the input data, and $w$ and $b$ are the parameters and bias to be learned. We simplify this to $y = wx$, where $w$ is now a parameter vector that also absorbs the bias $b$. If we take the regression target $y$ to be the log odds of an event, i.e. the logarithm of the ratio of the probability $p$ that the event occurs to the probability $1-p$ that it does not, then:

$$\log\left(\frac{p}{1-p}\right) = wx$$

This is exactly the logistic regression model. Here $p$ can be taken as $P(Y=1|x)$, the probability of the class-1 event, and $1-p$ as $P(Y=0|x)$, the probability of the class-0 event. Solving the equation above for $p$ gives:

$$p = \frac{\exp(wx)}{1+\exp(wx)} = \sigma(wx)$$

which is the familiar form of the logistic function ($\sigma$). Viewed from another angle, the model is just a linear model $wx$ followed by a nonlinear transformation $\sigma$; put more plainly, it is a neural network with a single output neuron beyond the input layer, using the sigmoid activation $\sigma$, whose output $p$ is the probability of class 1.
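
As a minimal sketch of this "one-neuron network" view (the weights and input below are made-up illustrative values, not taken from the later experiment):

import numpy as np

def sigmoid(z):
    # the logistic function: maps the linear score wx to a probability in (0, 1)
    return 1. / (1. + np.exp(-z))

w = np.array([0.5, -1.2, 0.8, 0.1])   # parameter vector; the last entry plays the role of the bias b
x = np.array([1.0, 0.3, 2.0, 1.0])    # one input sample with a trailing 1 appended for the bias
p = sigmoid(np.dot(w, x))             # P(Y=1|x): linear model followed by sigma
print(p, 1 - p)                       # probability of class 1 and of class 0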

Maximum Likelihood Estimation and Cross-Entropy

To learn the logistic regression model, given a dataset $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, we use maximum likelihood estimation to find parameters $\hat{w}$ that maximize the probability of the observed samples under the model, i.e. that make the model fit the samples in the dataset as well as possible.

Since the samples $(x_i,y_i)$ in the dataset are mutually independent, let $p(x_i)$ denote the probability the model assigns to $x_i$ being class 1, so $1-p(x_i)$ is the probability it assigns to class 0. The probability of the whole dataset $T$ is then the product, over all samples, of the probability of each sample's actual class:

$$\prod_{1}^{N} [p(x_i)]^{y_i}[1-p(x_i)]^{1-y_i}$$

This is the likelihood function we need to maximize. Because it is a product, we usually work with the log-likelihood, which turns the product into a sum:

$$L(w) = \sum_{1}^{N} \left[ y_i \log p(x_i) + (1-y_i)\log(1-p(x_i)) \right]$$

Maximizing $L(w)$ yields the estimate $\hat{w}$ of $w$.

From another angle, $-L(w)$ is exactly a cross-entropy: cross-entropy measures, to some extent, the difference between two distributions, here between $p(x_i)$ and $y_i$. The closer the learned $p(x_i)$ is to the true target distribution $y_i$, the better the model and the smaller the cross-entropy. Ideally, when the two distributions coincide, the relative entropy (KL divergence) is 0 and the cross-entropy reaches its minimum, namely the entropy of the true target distribution, $-\sum y_i \log(y_i)$ (see 如何通俗的解釋交叉熵與相對熵? for details). When $-L(w)$ reaches its minimum, $L(w)$ reaches its maximum, and the parameters at that point are the ideal ones we are after. This agrees with the maximum-likelihood analysis above, so the two views are really the same thing.
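
A quick numerical check of this equivalence (the labels and predictions below are made-up values for illustration): computing $-L(w)$ directly and summing the per-sample binary cross-entropy give the same number.

import numpy as np

y = np.array([1, 0, 1, 1, 0])              # hypothetical true labels y_i
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # hypothetical model outputs p(x_i)

# negative log-likelihood of the dataset under the model, i.e. -L(w)
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# per-sample binary cross-entropy, summed -- the same expression, read as a
# distance between the predicted distribution p(x_i) and the target y_i
bce = sum(-(yi * np.log(pi) + (1 - yi) * np.log(1 - pi)) for yi, pi in zip(y, p))

print(nll, bce)   # identical up to floating-point error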

Optimization by Gradient Descent

We turn the problem of maximizing $L(w)$ into the optimization problem of minimizing $-L(w)$. The usual optimization methods are gradient descent and Newton's method; here we spell out gradient descent. Writing the probability $p(x_i)$ as the nonlinear function $\sigma(wx_i)$, the objective function becomes:

$$L = \sum_{i}^{N} -y_i \log \sigma(wx_i) - (1-y_i)\log(1-\sigma(wx_i))$$

Differentiating with respect to $wx_i$:

$$\begin{aligned}
\frac{\partial L}{\partial (wx_i)} &= \frac{\partial L}{\partial (\sigma(wx_i))}\,\frac{\partial (\sigma(wx_i))}{\partial (wx_i)} \\
&= \left[-y_i \frac{1}{\sigma(wx_i)} - (1-y_i)\frac{-1}{1-\sigma(wx_i)}\right]\sigma(wx_i)(1-\sigma(wx_i)) \\
&= -y_i(1-\sigma(wx_i)) + (1-y_i)\sigma(wx_i) \\
&= -y_i + y_i\sigma(wx_i) + \sigma(wx_i) - y_i\sigma(wx_i) \\
&= \sigma(wx_i) - y_i
\end{aligned}$$
where we used the derivative of the sigmoid function, $\sigma' = \sigma(1-\sigma)$. Intuitively: for a single sample $x_i$, the gradient of the linear function $wx_i$ is the activated output minus the true label, i.e. the output $p(x_i)$ minus $y_i$. More concretely, when $y_i$ is 1 the gradient is $p(x_i)-1$, and when $y_i$ is 0 the gradient is $p(x_i)$.
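
To sanity-check this result, a finite-difference comparison on a single made-up sample (the score and label below are arbitrary) should agree with $\sigma(wx_i)-y_i$:

import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def sample_loss(z, y):
    # per-sample loss as a function of the linear score z = w x_i
    return -y * np.log(sigmoid(z)) - (1 - y) * np.log(1 - sigmoid(z))

z, y = 0.37, 1.0   # hypothetical linear score and label
eps = 1e-6
numeric = (sample_loss(z + eps, y) - sample_loss(z - eps, y)) / (2 * eps)   # central difference
analytic = sigmoid(z) - y                                                   # derived gradient
print(numeric, analytic)   # should agree to about six decimal places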

With batch gradient descent, the gradient of the parameters $w$ in each epoch is:

$$\begin{aligned}
\frac{\partial L}{\partial w} &= \sum_{i}^{N} \frac{\partial L}{\partial (wx_i)}\,\frac{\partial (wx_i)}{\partial w} \\
&= \sum_{i}^{N} (\sigma(wx_i) - y_i)\times x_i
\end{aligned}$$

Python Implementation (Batch Gradient Descent)

Using the first 100 rows of the iris dataset, we implement a simple binary logistic regression classifier with batch gradient descent:

from sklearn import datasets
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

iris = datasets.load_iris()
X = iris.data[:100, :]
y = iris.target[:100].reshape((100, -1))


def logit(x):
    return 1. / (1. + np.exp(-x))


m, n = X.shape
alpha = 0.0065  # learning rate (step size)
w = np.random.random((n, 1))  # parameter vector
maxCycles = 30
J = pd.Series(np.arange(maxCycles, dtype=float))

for i in range(maxCycles):
    h = logit(np.dot(X, w))  # model outputs h for all samples
    J[i] = -(1 / 100.) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))  # record the objective value
    error = h - y  # gradient w.r.t. wx: output minus true label
    grad = np.dot(X.T, error)  # gradient w.r.t. w
    w -= alpha * grad  # update w along the negative gradient to minimize J
print(w)
J.plot()
plt.show()
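
As an optional sanity check (a sketch, assuming X, y, w and logit from the snippet above are still in scope), we can compare the training accuracy of the learned w against scikit-learn's LogisticRegression on the same 100 samples:

from sklearn.linear_model import LogisticRegression

# accuracy of our batch-gradient-descent parameters on the training data
pred = (logit(np.dot(X, w)) >= 0.5).astype(int)
print('BGD training accuracy:', (pred == y).mean())

# reference: scikit-learn's own solver on the same data
clf = LogisticRegression().fit(X, y.ravel())
print('sklearn training accuracy:', clf.score(X, y.ravel()))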

Comparison of Newton's Method and Gradient Descent

Newton's method directly uses the necessary condition for a function to have a minimum, namely that the first derivative $f'$ is zero, so the point $x_{k+1}$ after each iteration is characterized by $f'(x_{k+1})=0$. Since the first derivative can be computed from the second derivative times the step (a second-order Taylor expansion), $f'(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1}-x_k)$, we have $f'(x_k) + f''(x_k)(x_{k+1}-x_k) = 0$, i.e. each update step is $\Delta x = x_{k+1}-x_k = -f'(x_k)/f''(x_k)$. This $\Delta x$ plays the role of $\alpha$ in gradient descent, except that gradient descent usually updates with a very small, fixed step $\alpha$, whereas Newton's method uses the second-derivative information to jump directly to a relatively optimal point in one move, so it needs fewer iterations and converges faster.

For example, take the simple strongly convex function $f=x^2$: no matter where it starts, Newton's method finds the minimizer 0 in a single iteration, while gradient descent can only approach 0 gradually depending on the step size and the starting point. In practice, however, $f$ can be much more complicated; it may be non-convex overall but locally convex (for instance near the optimum), so running Newton's method alone for many iterations does not guarantee convergence to the optimum. It may therefore work better to first use gradient descent to find a reasonably good solution and then switch to Newton's method (following 曾文俊's answer). From this we can also see that Newton's method is picky about the function's properties and the choice of starting point. Another issue is that inverting the matrix of second derivatives is expensive (when $x_k$ is a parameter vector, $f''(x_k)$ is the matrix of second-order partial derivatives, i.e. the Hessian), so quasi-Newton methods are usually used instead.
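
A tiny sketch of that comparison on $f=x^2$ (the starting point and step size below are arbitrary choices):

# f(x) = x^2, f'(x) = 2x, f''(x) = 2
x0 = 5.0

# Newton's method: one step x - f'(x)/f''(x) = x - 2x/2 = 0, whatever the starting point
x_newton = x0 - (2 * x0) / 2.0
print('Newton after 1 step:', x_newton)

# gradient descent with a small fixed step alpha only creeps towards 0
alpha, x_gd = 0.1, x0
for _ in range(20):
    x_gd -= alpha * (2 * x_gd)   # x <- x - alpha * f'(x)
print('gradient descent after 20 steps:', x_gd)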

A personal note: the $f=x^2$ above has degree only 2, so the second derivative solves its optimization problem perfectly. But when the degree of $f$ is greater than 2, the Taylor expansion truncated at second order (Newton's method only expands to second order) leaves a higher-order remainder $R_n$, so the two sides are only approximately equal once $R_n$ is dropped, i.e. $f'(x_{k+1}) \approx f'(x_k) + f''(x_k)(x_{k+1}-x_k)$. Strictly speaking, the resulting $\Delta x$ does not move the parameters to the true optimum, only to a "relatively optimal" point, so multiple iterations are still necessary. A quick sketch of this on $f=x^4$ is given below.
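
In this sketch ($f=x^4$ is an arbitrary higher-degree example), the Newton step $\Delta x = -f'(x)/f''(x) = -x/3$ no longer reaches the minimizer 0 in one move, so several iterations remain necessary:

# f(x) = x^4, f'(x) = 4x^3, f''(x) = 12x^2, minimizer at x = 0
x = 3.0
for k in range(10):
    x = x - (4 * x**3) / (12 * x**2)   # Newton step: x <- x - x/3 = 2x/3
    print(k + 1, x)                    # approaches 0 geometrically instead of in one jump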