Introductory Math for Deep Learning / Machine Learning, Part 2: Gradients and Derivatives, Matrix Derivatives, Taylor Expansion, and More

Derivatives and Gradients

Derivative: the derivative of a univariate function at a point describes the rate of change of the function near that point.

f'(a) = \lim_{h \rightarrow 0} \frac{f(a+h)-f(a)}{h}
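
As a quick check of this definition (a minimal sketch of my own, not part of the original post; np.sin and the point a = 1.0 are just example choices), the difference quotient with a small h already approximates the derivative well:

```python
import numpy as np

def forward_difference(f, a, h=1e-6):
    """Difference quotient (f(a + h) - f(a)) / h from the definition above."""
    return (f(a + h) - f(a)) / h

print(forward_difference(np.sin, 1.0))  # ~0.5403
print(np.cos(1.0))                      # analytic derivative cos(1) ~ 0.5403
```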

Gradient: for a multivariate function, the role of the derivative is played by the gradient.

  • First-order derivative, i.e. the gradient (see the numerical sketch after this list):

\nabla f(\mathbf{X}) = \frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} = \begin{bmatrix}
\frac{\partial f(\mathbf{X})}{\partial x_1} \\
\frac{\partial f(\mathbf{X})}{\partial x_2} \\
\vdots \\
\frac{\partial f(\mathbf{X})}{\partial x_n}
\end{bmatrix}

  • Second-order derivative, the Hessian matrix:
    \mathbf{H}(\mathbf{X}) = \nabla^2 f(\mathbf{X}) = \begin{bmatrix}
    \frac{\partial^2 f(\mathbf{X})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{X})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{X})}{\partial x_1 \partial x_n} \\
    \frac{\partial^2 f(\mathbf{X})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{X})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{X})}{\partial x_2 \partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial^2 f(\mathbf{X})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{X})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{X})}{\partial x_n^2}
    \end{bmatrix}
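
As a numerical illustration of both definitions (my own sketch, not from the original post; the helper names and the quadratic test function are arbitrary), the gradient and Hessian can be approximated entry by entry with finite differences:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """grad[i] ~ df/dx_i via central differences."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def numerical_hessian(f, x, h=1e-5):
    """H[i, j] ~ d^2 f / (dx_i dx_j) via a four-point formula."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

f = lambda x: x[0] ** 2 + x[0] * x[1] + 2.0 * x[1] ** 2  # example function
x = np.array([1.0, 1.0])
print(numerical_gradient(f, x))  # ~ [3., 5.]  (analytic: [2*x1 + x2, x1 + 4*x2])
print(numerical_hessian(f, x))   # ~ [[2., 1.], [1., 4.]]
```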

The first-order and second-order derivatives are often written as f'(x) and f''(x).

Taylor expansion: the Taylor expansion of a univariate function is

f(x_k + \delta) \approx f(x_k) + f'(x_k)\delta + \frac{1}{2}f''(x_k)\delta^2 + \cdots + \frac{1}{n!}f^{(n)}(x_k)\delta^n
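
A minimal sketch of my own checking the univariate expansion: for f = exp, truncating after the second-order term already reproduces f(x_k + δ) to about three decimal places for a small δ.

```python
import numpy as np

f = df = d2f = np.exp  # f = exp, so every derivative is exp as well
xk, delta = 0.5, 0.1

taylor2 = f(xk) + df(xk) * delta + 0.5 * d2f(xk) * delta ** 2
print(f(xk + delta))  # exact:        ~1.8221
print(taylor2)        # second order: ~1.8218
```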
The Taylor expansion of a multivariate function (first three terms only) is

f(\mathbf{x}_k + \boldsymbol{\delta}) \approx f(\mathbf{x}_k) + \nabla^T f(\mathbf{x}_k)\,\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^T \nabla^2 f(\mathbf{x}_k)\,\boldsymbol{\delta}
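
The same check in the multivariate case (again my own sketch; the test function and the step δ are arbitrary examples, with the gradient and Hessian written out analytically): the second-order model f(x_k) + ∇ᵀf(x_k) δ + ½ δᵀ ∇²f(x_k) δ stays very close to the true value for a small step.

```python
import numpy as np

f    = lambda x: np.exp(x[0]) + x[0] * x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2.0 * x[1]],
                           [2.0 * x[1],   2.0 * x[0]]])

xk    = np.array([0.3, 0.7])
delta = np.array([0.05, -0.02])

model = f(xk) + grad(xk) @ delta + 0.5 * delta @ hess(xk) @ delta
print(f(xk + delta))  # exact value
print(model)          # second-order Taylor approximation, very close
```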

If ∇f(x_k) = 0, then x_k is called a stationary point. For a univariate function, such a point is usually a local extremum, i.e. a local maximum or minimum (strictly speaking it can also be an inflection point, e.g. f(x) = x³ at x = 0); if f is convex, it is the global minimum. Convex functions will be briefly introduced in the next part.

For a multivariate function, if ∇²f(x_k) ≻ 0 is positive definite, i.e. all of its eigenvalues are positive, then the third term above is positive and x_k is a strict local minimum (conversely, if ∇²f(x_k) ≺ 0 is negative definite, x_k is a strict local maximum). In the more complicated case where the eigenvalues of the Hessian have mixed signs, the matrix is indefinite and x_k is a saddle point: a local minimum along some dimensions and a local maximum along others. Saddle points are one of the central difficulties in training neural networks today; I may write about them in other posts when I have time, but let's return to the basics here.
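
A small sketch of my own illustrating this classification: the eigenvalues of the Hessian at a stationary point decide between minimum, maximum, and saddle. For f(x, y) = x² − y², the origin is a stationary point whose Hessian is diag(2, −2), so it is a saddle point.

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0).
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)
if np.all(eigvals > 0):
    print("strict local minimum")   # positive definite
elif np.all(eigvals < 0):
    print("strict local maximum")   # negative definite
else:
    print("saddle point")           # indefinite: mixed-sign eigenvalues (this case)
```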

Taylor expansion really is the foundation of many mathematical problems, so let's expand on it a little more.
Question: why do we choose the gradient direction when optimizing, and why is the gradient direction the direction of fastest change?

From the first two terms of the Taylor expansion, f(x_k + δ) ≈ f(x_k) + ∇ᵀf(x_k) δ. When δ is a vector with a fixed norm but an undetermined direction, f(x_k + δ) − f(x_k) ≈ ∇ᵀf(x_k) δ, and ∇ᵀf(x_k) δ = ||∇f(x_k)|| · ||δ|| cos(θ). The increase is largest when cos(θ) = 1, i.e. when δ points along the gradient direction, and the decrease is largest along the negative gradient direction. For minimization this is exactly gradient descent: taking δ along the negative gradient makes f(x) decrease fastest.
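
A small sketch of my own to back up this argument numerically: among perturbations δ of the same norm, no random direction decreases f by more than the negative-gradient direction (up to second-order terms in the step size). The quadratic test function and step size here are arbitrary choices.

```python
import numpy as np

rng  = np.random.default_rng(0)
f    = lambda x: (x[0] - 1.0) ** 2 + 3.0 * (x[1] + 2.0) ** 2         # example function
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 6.0 * (x[1] + 2.0)])  # its gradient

x, step = np.array([0.0, 0.0]), 0.01

# Decrease of f along 1000 random unit directions scaled to length `step`.
decreases = []
for _ in range(1000):
    d = rng.standard_normal(2)
    d /= np.linalg.norm(d)
    decreases.append(f(x) - f(x + step * d))

# Decrease along the negative-gradient direction of the same length.
d_star = -grad(x) / np.linalg.norm(grad(x))
print(max(decreases))               # best random direction
print(f(x) - f(x + step * d_star))  # negative gradient: essentially the largest decrease
```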

Summary of Matrix Derivatives

(1) Derivatives with respect to a scalar

  • Derivative of a scalar with respect to a scalar x:
    \frac{\partial y}{\partial x}

  • Derivative of a vector with respect to a scalar x:
    The derivative of a vector y = (y_1, y_2, …, y_n)ᵀ with respect to a scalar x is simply each element of y differentiated with respect to x; it can be written as

    \frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_n}{\partial x} \end{bmatrix}

  • Derivative of a matrix with respect to a scalar x:
    The derivative of a matrix with respect to a scalar is analogous to that of a vector: each element of the matrix is differentiated with respect to the scalar x (a small numerical sketch follows this list).

    \frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n1}}{\partial x} & \frac{\partial y_{n2}}{\partial x} & \cdots & \frac{\partial y_{nn}}{\partial x} \end{bmatrix}
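
A minimal sketch of my own for the scalar-argument cases above: the derivative is taken entry by entry, so a finite difference applied to the whole matrix matches the analytic entrywise derivative (the particular matrix-valued function used here is just an example).

```python
import numpy as np

# Y(x) = [[x^2, sin(x)], [exp(x), 3x]]; dY/dx is taken entry by entry.
Y  = lambda x: np.array([[x ** 2,  np.sin(x)], [np.exp(x), 3.0 * x]])
dY = lambda x: np.array([[2.0 * x, np.cos(x)], [np.exp(x), 3.0]])  # analytic entrywise derivative

x, h = 0.8, 1e-6
numeric = (Y(x + h) - Y(x - h)) / (2 * h)  # central difference on every entry at once
print(np.max(np.abs(numeric - dY(x))))     # ~0
```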

(2) Derivatives with respect to a vector

  • Derivative of a scalar with respect to a vector x
    The derivative of a scalar y with respect to a vector x = (x_1, x_2, …, x_n)ᵀ can be written as

    \frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}

  • Derivative of a vector with respect to a vector x
    The derivative of a vector function (a vector whose entries are functions) y = (y_1, y_2, …, y_n)ᵀ with respect to x = (x_1, x_2, …, x_n)ᵀ is

    \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n} \end{bmatrix}

The matrix ∂y/∂x obtained this way is called the Jacobian matrix.
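
A minimal sketch of my own: for y(x) = (x_1 x_2, sin(x_1))ᵀ the Jacobian is [[x_2, x_1], [cos(x_1), 0]], and a column-by-column finite difference reproduces it.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """J[i, j] ~ df_i/dx_j, approximated one column (one input coordinate) at a time."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])  # example vector function
x = np.array([0.5, 2.0])
print(numerical_jacobian(f, x))
# ~ [[2.    , 0.5],
#    [0.8776, 0. ]]   i.e. [[x2, x1], [cos(x1), 0]]
```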

  • Derivative of a matrix with respect to a vector
    The derivative of a matrix Y = (y_{ij})_{n×n} with respect to x = (x_1, x_2, …, x_n)ᵀ is the most complicated case to work out; it is written as

    \frac{\partial \mathbf{Y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x_1} & \frac{\partial y_{12}}{\partial x_2} & \cdots & \frac{\partial y_{1n}}{\partial x_n} \\ \frac{\partial y_{21}}{\partial x_1} & \frac{\partial y_{22}}{\partial x_2} & \cdots & \frac{\partial y_{2n}}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n1}}{\partial x_1} & \frac{\partial y_{n2}}{\partial x_2} & \cdots & \frac{\partial y_{nn}}{\partial x_n} \end{bmatrix}
(3) Derivatives with respect to a matrix

Usually only the derivative of a scalar with respect to a matrix is considered, i.e. the derivative of a scalar y with respect to a matrix X. The result is the gradient matrix, which can be written as:

\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{n1}} \\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{1n}} & \frac{\partial y}{\partial x_{2n}} & \cdots & \frac{\partial y}{\partial x_{nn}} \end{bmatrix}
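
A minimal sketch of my own for the scalar-by-matrix case: with y = tr(AᵀX) = Σ_{i,j} a_{ij} x_{ij}, the entrywise derivatives are ∂y/∂x_{ij} = a_{ij}, so the entries of the gradient matrix are exactly the entries of A (arranged directly or transposed, depending on which layout convention from the formula above is used).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

y = lambda X: np.trace(A.T @ X)  # y = sum_{i,j} a_ij * x_ij

# Entrywise finite differences dy/dx_ij.
h = 1e-6
G = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = h
        G[i, j] = (y(X + E) - y(X - E)) / (2 * h)

print(np.max(np.abs(G - A)))  # ~0: the entrywise gradient equals A
```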

The table below lists matrix-derivative forms commonly encountered in machine learning, for reference.

[Figure: reference table of common matrix-derivative forms]

The next post covers the basic concepts of the Hessian matrix and convex functions. To be continued.
