Stacked Autoencoders

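This post follows the UFLDL tutorial page Stacked Autoencoders. A stacked autoencoder (see the tutorial page Autoencoders and Sparsity and my earlier post on autoencoders and sparsity) is a multi-layer autoencoder: the output of one autoencoder (its hidden layer) is used as the input of the next. In effect, the encoder parts of several autoencoders are stacked on top of each other, followed by the corresponding decoder parts, giving an autoencoder with multiple hidden layers. This post introduces stacked autoencoders and the fine-tuning algorithm for them, then uses a stacked autoencoder to recognize MNIST digits.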

1. Overview of stacked autoencoders

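The earlier post Self-Taught Learning to Deep Networks explained that a deep network can be trained with greedy layer-wise training, in which only one hidden layer is trained at a time. Each layer can be trained in a supervised way (for example, by feeding each hidden layer into softmax regression and computing the classification error) or in an unsupervised way (for example, with a sparse autoencoder). Here we use the unsupervised sparse autoencoder to learn the features of each hidden layer. Because the network consists of several sparse autoencoder layers that encode layer by layer, we call it a stacked autoencoder.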

Encoding steps of the stacked autoencoder network:
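The formula image from the original page is not reproduced here. As a reconstruction in the notation of the UFLDL tutorial (assumptions: $W^{(l,1)}$ and $b^{(l,1)}$ are the encoder weights and bias of the $l$-th autoencoder, $a^{(1)}$ is the input, and $f$ is the sigmoid activation), the encoding step runs each layer forward in order:

```latex
\begin{aligned}
a^{(l)}   &= f\left(z^{(l)}\right) \\
z^{(l+1)} &= W^{(l,1)} a^{(l)} + b^{(l,1)}
\end{aligned}
```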

The decoding steps are:
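Again the formula image is missing. A reconstruction in the UFLDL tutorial's notation (assumptions: $n$ is the number of stacked autoencoders, $W^{(k,2)}$ and $b^{(k,2)}$ are the decoder weights and bias of the $k$-th autoencoder, and $a^{(n)}$ is the activation of the deepest hidden layer) runs the decoders in reverse order:

```latex
\begin{aligned}
a^{(n+l)}   &= f\left(z^{(n+l)}\right) \\
z^{(n+l+1)} &= W^{(n-l,2)} a^{(n+l)} + b^{(n-l,2)}
\end{aligned}
```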

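This is similar to the neural network model that Hinton built from stacked RBMs (the paper was published in Science in 2006 and is worth reading if you are interested). The only difference is that here we use sparse autoencoders rather than RBMs.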

  

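If we feed the last hidden layer, which is the highest-level feature representation of the raw data, into a softmax regression model, we can perform classification. Combining the stacked encoders with the softmax layer gives the complete network model.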
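The combined model can be sketched as a forward pass through two encoders followed by softmax. This is only an illustration: the layer sizes, the random weights, and the variable names below are assumptions and do not come from the exercise code.

```matlab
% Minimal sketch: two stacked encoders followed by a softmax layer.
% Sizes and random weights are illustrative assumptions, not trained values.
inputSize = 6; hidden1 = 4; hidden2 = 3; numClasses = 2; m = 5;
sigmoid = @(z) 1 ./ (1 + exp(-z));

x  = rand(inputSize, m);                      % one example per column
W1 = 0.1 * randn(hidden1, inputSize);  b1 = zeros(hidden1, 1);
W2 = 0.1 * randn(hidden2, hidden1);    b2 = zeros(hidden2, 1);
softmaxTheta = 0.1 * randn(numClasses, hidden2);

a2 = sigmoid(bsxfun(@plus, W1 * x,  b1));     % first-order features
a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));     % second-order features

scores = softmaxTheta * a3;
scores = bsxfun(@minus, scores, max(scores, [], 1));   % numerical stability
p = bsxfun(@rdivide, exp(scores), sum(exp(scores), 1)); % class probabilities
[~, pred] = max(p, [], 1);                    % predicted class per example
```

Each column of `p` sums to 1, and `pred` holds one class index per input column.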

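A stacked autoencoder has greater expressive power and enjoys all the benefits of a deep network. Since an autoencoder tends to learn a feature representation of its input, the first layer of a stacked autoencoder can learn first-order features, the second layer second-order features, and so on. For images, the first layer may learn edges, the second layer may learn how to combine edges into contours and corners, and higher layers may learn more concrete and meaningful features. These learned features make image tasks such as classification and retrieval easier.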

2. Fine-tuning stacked autoencoders

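As mentioned before, fine-tuning can improve what a deep network learns. Fine-tuning starts from the parameters of the already trained model and slightly adjusts the weights of every layer so that the network fits the data better. How is this done? With the BP (backpropagation) algorithm (see the tutorial page Backpropagation Algorithm and my post on neural networks). Fine-tuning with BP works as follows: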
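The algorithm figure is missing from this copy. The steps below are a reconstruction chosen to agree with the stackedAECost.m listing later in this post. The symbols are assumptions: $n_l$ is the index of the last hidden layer, $\theta$ is the softmax weight matrix, $I$ is the label indicator matrix (`groundTruth`), $P$ is the matrix of predicted class probabilities, $m$ is the number of examples, and $\circ$ is the element-wise product.

```latex
\begin{aligned}
&\text{1. Forward pass: compute } a^{(l)} = f\left(W^{(l-1)} a^{(l-1)} + b^{(l-1)}\right)
  \text{ for } l = 2, \dots, n_l. \\
&\text{2. Last hidden layer (input of softmax): }
  \delta^{(n_l)} = -\left(\theta^{T}(I - P)\right) \circ f'\left(z^{(n_l)}\right),
  \quad f'\left(z^{(l)}\right) = a^{(l)} \circ \left(1 - a^{(l)}\right). \\
&\text{3. For } l = n_l - 1, \dots, 2: \quad
  \delta^{(l)} = \left(\left(W^{(l)}\right)^{T} \delta^{(l+1)}\right) \circ f'\left(z^{(l)}\right). \\
&\text{4. Gradients: } \quad
  \nabla_{W^{(l)}} J = \frac{1}{m}\, \delta^{(l+1)} \left(a^{(l)}\right)^{T}, \qquad
  \nabla_{b^{(l)}} J = \frac{1}{m} \sum_{i=1}^{m} \delta^{(l+1)}_{:,i}.
\end{aligned}
```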

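Note that in step 2, when differentiating at the last hidden layer (the input layer of softmax), the loss is not the squared-error loss used in the standard BP derivation. It is the softmax loss, differentiated with respect to its input x. The expression can be obtained by working through the derivation carefully.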
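One way to convince yourself of that derivative is to compare it with finite differences. The sketch below is an illustration only, with assumed sizes, random weights, and made-up labels. It checks that the gradient of the softmax cost with respect to its input `a` is `-theta' * (G - P) / m`; multiplying element-wise by `a .* (1 - a)` then gives the error term with respect to `z`.

```matlab
% Sketch: numerically check the gradient of the softmax cost with respect
% to the softmax input a. All sizes, weights and labels are made up.
numClasses = 3; hiddenSize = 4; m = 5;
theta  = 0.5 * randn(numClasses, hiddenSize);
a      = rand(hiddenSize, m);                      % stand-in for last hidden layer
labels = [1 2 3 1 2];
G = full(sparse(labels, 1:m, 1, numClasses, m));   % indicator matrix

stab = @(s) bsxfun(@minus, s, max(s, [], 1));      % subtract column max
prob = @(s) bsxfun(@rdivide, exp(s), sum(exp(s), 1));
P = @(a) prob(stab(theta * a));                    % class probabilities
J = @(a) -sum(sum(G .* log(P(a)))) / m;            % softmax cost, no weight decay

gradA = -theta' * (G - P(a)) / m;                  % analytic dJ/da

numGrad = zeros(size(a));
h = 1e-5;
for k = 1:numel(a)
    e = zeros(size(a));
    e(k) = h;
    numGrad(k) = (J(a + e) - J(a - e)) / (2 * h);  % central difference
end
maxDiff = max(abs(gradA(:) - numGrad(:)));
```

If the formula is right, `maxDiff` should be tiny compared with the entries of `gradA`.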

3. Exercise: Implement deep networks for digit classification

This exercise classifies MNIST digits with a stacked autoencoder that has two hidden layers, followed by a softmax classifier.

    

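     Steps:

  1. Initialize the parameters;
  2. Train the first autoencoder on the raw data, then compute the L1 features;
  3. Train the second autoencoder on the L1 features, then compute the L2 features;
  4. Train a softmax classifier on the L2 features;
  5. Fine-tune the parameters of the combined stacked autoencoder + softmax model with BP;
  6. Test the model.
     Results: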
    
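      Displaying the encoder weights of the first autoencoder with the display_network function gives the image below:
   
    As for the second autoencoder, its input layer has size 200, which cannot be shown as a square image, and I have not yet worked out how to display the second-layer features.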
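One common heuristic, which is not part of the tutorial and is offered here only as a suggestion, is to ignore the sigmoid and project the second-layer weights back into pixel space through the first-layer weights. Each second-layer unit then becomes a 784-dimensional vector that can be drawn as a 28x28 image. The sketch uses random stand-in weights and assumes that `display_network` from the UFLDL starter code is on the path and takes one image per column.

```matlab
% Stand-ins for the trained weights; in the exercise these would be
% stack{1}.w (200 x 784) and stack{2}.w (200 x 200).
W1 = randn(200, 784);
W2 = randn(200, 200);

% Linear approximation: express each second-layer unit as a weighted
% combination of first-layer filters (the sigmoid is ignored).
W2inPixels = W2 * W1;              % 200 x 784, one row per second-layer unit

% Only draw if the UFLDL helper is available.
if exist('display_network', 'file')
    display_network(W2inPixels');  % each column is one 28x28 image
end
```

Because the nonlinearity is dropped, the resulting pictures are only a rough indication of what each second-layer unit responds to.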
 
     The final results are:
     Before Finetuning Test Accuracy: 92.080%
     After Finetuning Test Accuracy: 98.210%   
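    The tutorial says that the cost function of the whole model should include a weight-decay penalty only for the softmax layer, not for the W of the autoencoder layers. I do not fully understand why. When I added the penalty for those layers, the results were: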
    Before Finetuning Test Accuracy: 91.750%
    After Finetuning Test Accuracy: 97.430%
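For reference, applying weight decay to the autoencoder layers means changing two things together: the cost gains a term `lambda/2 * sum(W.^2)` per layer, and each weight gradient gains `lambda * W`. The sketch below shows only that bookkeeping, on a small made-up `stack` with the same field names as in stackedAECost.m. The data-term gradients are replaced by zeros, so it is not a full cost function.

```matlab
% Sketch: weight-decay bookkeeping for the stacked layers.
% The stack here is a small made-up example, not the trained network.
lambda = 1e-4;
stack = {struct('w', randn(3, 4), 'b', zeros(3, 1)); ...
         struct('w', randn(2, 3), 'b', zeros(2, 1))};

stackgrad = cell(size(stack));
penalty = 0;
for d = 1:numel(stack)
    % Placeholders standing in for the backpropagated data-term gradients.
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));

    penalty = penalty + sum(sum(stack{d}.w .^ 2));          % squared entries of W
    stackgrad{d}.w = stackgrad{d}.w + lambda * stack{d}.w;  % matching gradient term
end
decayCost = (lambda / 2) * penalty;   % would be added to the softmax cost
```

The biases are left unpenalized, as is usual for weight decay. If the cost term is added without the matching gradient term, or the other way round, the optimizer receives an inconsistent cost and gradient.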
MATLAB code:
stackedAEExercise.m
%% CS294A/CS294W Stacked Autoencoder Exercise
%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  stacked autoencoder exercise. You will need to complete code in
%  stackedAECost.m
%  You will also need to have implemented sparseAutoencoderCost.m and 
%  softmaxCost.m from previous exercises. You will need the initializeParameters.m
%  loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%  
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file. 
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to 
%  change the parameters below.
inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200;    % Layer 1 Hidden Size
hiddenSizeL2 = 200;    % Layer 2 Hidden Size
sparsityParam = 0.1;   % desired average activation of the hidden units.
% (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
%  in the lecture notes). 
lambda = 3e-3;         % weight decay parameter       
beta = 3;              % weight of sparsity penalty term       
%%======================================================================
%% STEP 1: Load data from the MNIST database
%
%  This loads our training data from the MNIST database files.
% Load MNIST database files
trainData = loadMNISTImages('mnist/train-images-idx3-ubyte');
trainLabels = loadMNISTLabels('mnist/train-labels-idx1-ubyte');
trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1
% Add the directory containing the L-BFGS implementation (minFunc) to the path
addpath minFunc/
%%======================================================================
%% STEP 2: Train the first sparse autoencoder
%  This trains the first sparse autoencoder on the unlabelled STL training
%  images.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.
%  Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);
%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the first layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL1"
%                You should store the optimal parameters in sae1OptTheta
% Train the first autoencoder
sae1OptTheta = sae1Theta;
options.Method = 'lbfgs';
options.maxIter = 400;
options.display = 'on';
[sae1OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
inputSize, hiddenSizeL1, ...
lambda, sparsityParam, ...
beta, trainData), ...
sae1Theta, options);
% -------------------------------------------------------------------------
fprintf('First autoencoder trained\n');
%%======================================================================
%% STEP 2: Train the second sparse autoencoder
%  This trains the second sparse autoencoder on the first autoencoder
%  features.
%  If you've correctly implemented sparseAutoencoderCost.m, you don't need
%  to change anything here.
% Use the encoder of the first autoencoder to get the first-order representation of the input data
[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
inputSize, trainData);
%  Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);
%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the second layer sparse autoencoder, this layer has
%                an hidden size of "hiddenSizeL2" and an inputsize of
%                "hiddenSizeL1"
%
%                You should store the optimal parameters in sae2OptTheta
% Train the second autoencoder
sae2OptTheta = sae2Theta;
[sae2OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
hiddenSizeL1, hiddenSizeL2, ...
lambda, sparsityParam, ...
beta, sae1Features), ...
sae2Theta, options);
% -------------------------------------------------------------------------
fprintf('Second autoencoder trained\n');
%%======================================================================
%% STEP 3: Train the softmax classifier
%  This trains the sparse autoencoder on the second autoencoder features.
%  If you've correctly implemented softmaxCost.m, you don't need
%  to change anything here.
% Use the second autoencoder to get the second-order representation of the input data
[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
hiddenSizeL1, sae1Features);
%  Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);
%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the softmax classifier, the classifier takes in
%                input of dimension "hiddenSizeL2" corresponding to the
%                hidden layer size of the 2nd layer.
%
%                You should store the optimal parameters in saeSoftmaxOptTheta 
%
%  NOTE: If you used softmaxTrain to complete this part of the exercise,
%        set saeSoftmaxOptTheta = softmaxModel.optTheta(:);
% Train the softmax model on the second-order features
options.maxIter = 100;
lambda = 1e-4;
softmaxModel = softmaxTrain(hiddenSizeL2, numClasses, lambda, ...
sae2Features, trainLabels, options);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);
% -------------------------------------------------------------------------
fprintf('Softmax classifier trained\n');
%%======================================================================
%% STEP 5: Finetune softmax model
% Fine-tuning requires the cost function and gradient of the whole network model
% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.
% Initialize the stack using the parameters learned
stack = cell(2,1);
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);
% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ]; % model parameters before fine-tuning
%% ---------------------- YOUR CODE HERE  ---------------------------------
%  Instructions: Train the deep network, hidden size here refers to the '
%                dimension of the input to the classifier, which corresponds 
%                to "hiddenSizeL2".
%
%
% Fine-tune the whole network with backpropagation
[stackedAEOptTheta, cost] = minFunc( @(p) stackedAECost(p, inputSize, hiddenSizeL2, ...
numClasses, netconfig, ...
lambda, trainData, trainLabels), ...
stackedAETheta, options);
% -------------------------------------------------------------------------
fprintf('Fine-tuning of the whole model finished\n');
%%======================================================================
%% STEP 6: Test 
%  Instructions: You will need to complete the code in stackedAEPredict.m
%                before running this part of the code
%
% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('mnist/t10k-images-idx3-ubyte');
testLabels = loadMNISTLabels('mnist/t10k-labels-idx1-ubyte');
testLabels(testLabels == 0) = 10; % Remap 0 to 10
[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
numClasses, netconfig, testData);
acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);
[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
numClasses, netconfig, testData);
acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);
% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy:  97.6%
%
% If your values are too low (accuracy less than 95%), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)

stackedAECost.m

function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
numClasses, netconfig, ...
lambda, data, labels)
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% netconfig:   the network configuration of the stack
% lambda:      the weight regularization penalty
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example
%% Unroll softmaxTheta parameter
% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
stackgrad{d}.w = zeros(size(stack{d}.w));
stackgrad{d}.b = zeros(size(stack{d}.b));
end
cost = 0; % You need to compute this
% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));
%% --------------------------- YOUR CODE HERE -----------------------------
%  Instructions: Compute the cost function and gradient vector for 
%                the stacked autoencoder.
%
%                You are given a stack variable which is a cell-array of
%                the weights and biases for every layer. In particular, you
%                can refer to the weights of Layer d, using stack{d}.w and
%                the biases using stack{d}.b . To get the total number of
%                layers, you can use numel(stack).
%
%                The last layer of the network is connected to the softmax
%                classification layer, softmaxTheta.
%
%                You should compute the gradients for the softmaxTheta,
%                storing that in softmaxThetaGrad. Similarly, you should
%                compute the gradients for each layer in the stack, storing
%                the gradients in stackgrad{d}.w and stackgrad{d}.b
%                Note that the size of the matrices in stackgrad should
%                match exactly that of the size of the matrices in stack.
%
depth = size(stack, 1);
a = cell(depth+1, 1);
a{1} = data; % input layer
Jweight = 0; % weight-decay penalty
m = size(data, 2); % number of examples
for i=2:numel(a)
    a{i} = sigmoid(stack{i-1}.w*a{i-1} + repmat(stack{i-1}.b, [1 size(a{i-1}, 2)]));
    %Jweight = Jweight + sum(sum(stack{i-1}.w).^2);
end
M = softmaxTheta*a{depth+1};
M = bsxfun(@minus, M, max(M, [], 1));
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));
Jweight = Jweight + sum(sum(softmaxTheta.^2));
% classification error term + weight-decay term
cost = -1/m .* groundTruth(:)'*log(p(:)) + lambda/2*Jweight; 
% gradient of the softmax layer
softmaxThetaGrad = -1/m .* (groundTruth - p)*a{depth+1}' + lambda*softmaxTheta;
% error terms of the hidden units (derivatives with respect to z)
delta = cell(depth+1, 1);
% error term of the last hidden layer, i.e. the input layer of softmax;
% each column of delta{depth+1} is the derivative for one example
delta{depth+1} = -softmaxTheta' * (groundTruth - p) .* a{depth+1} .* (1-a{depth+1});
for i=depth:-1:2
    delta{i} = stack{i}.w'*delta{i+1}.*a{i}.*(1-a{i});
end
for i=depth:-1:1
    stackgrad{i}.w = 1/m .* delta{i+1}*a{i}';
    stackgrad{i}.b = 1/m .* sum(delta{i+1}, 2);
end
% -------------------------------------------------------------------------
%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];
end
% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

stackedAEPredict.m

function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.
% theta: trained weights from the autoencoder
% visibleSize: the number of input units
% hiddenSize:  the number of hidden units *at the 2nd layer*
% numClasses:  the number of categories
% data: Our matrix containing the training data as columns.  So, data(:,i) is the i-th training example. 
% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).
%% Unroll theta parameter
% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.
depth = numel(stack);
a = cell(depth+1, 1);
a{1} = data;
m = size(data, 2);
for i=2:depth+1
    a{i} = sigmoid(stack{i-1}.w*a{i-1} + repmat(stack{i-1}.b, [1 m]));
end
% the class with the largest score is also the class with the largest probability
[prob, pred] = max(softmaxTheta*a{depth+1});
% -----------------------------------------------------------
end
% You might find this useful
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end