The Cocktail Party Problem and a Program Implementation

The "cocktail party problem" is a classic problem in computer speech recognition. Current speech recognition technology can already transcribe a single speaker with fairly high accuracy, but when two or more people speak at the same time, the recognition rate drops dramatically. This difficulty is known as the cocktail party problem.

The problem asks: given the mixed signals, how do we separate out the independent signal of each person speaking simultaneously at the cocktail party? When there are N signal sources, one usually assumes there are also N observed signals (for example, N microphones or recorders). This assumption means the mixing matrix is square, i.e. J = D, where D is the dimension of the input data and J is the dimension of the system model.
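As an illustration (a Python sketch, not part of the original program; the sources and the 2×2 mixing matrix below are made up), generating N = 2 observations from N = 2 sources looks like:

```python
import numpy as np

# Hypothetical sketch: build N = 2 observed signals from N = 2 sources
# with a square 2x2 mixing matrix A (sources and A are made-up values).
t = np.linspace(0.0, 1.0, 8000)
s1 = np.sin(2 * np.pi * 440 * t)            # source 1: a 440 Hz tone
s2 = np.sign(np.sin(2 * np.pi * 3 * t))     # source 2: a slow square wave
S = np.vstack([s1, s2])                     # sources, shape (2, 8000)

A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                  # square mixing matrix (J = D = 2)
X = A @ S                                   # the two "microphone" signals

print(X.shape)   # (2, 8000): as many observations as sources
```

Each row of `X` is what one microphone would record: a different weighted blend of the same two sources.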

 

To separate out the independent signal of each simultaneous speaker at the cocktail party, the usual approach is a blind source separation algorithm.

 

1 Blind Source Separation

1.1 Overview

Blind Source Separation (BSS) refers to recovering unobserved original signals from multiple observed mixed signals. The observed mixtures typically come from the outputs of several sensors, and the sensor output signals are assumed independent (linearly uncorrelated). The word "blind" emphasises two things: 1) the original signals are unknown; 2) the way the signals were mixed is also unknown.

 

1.2 Principle (a brief mathematical description)

To keep things simple, first consider the case of only two signal sources, so that there are also only two observed signals. Let s1 and s2 be the two source signals, x1 and x2 the two observed signals, and y1 and y2 the estimates of the source signals s1 and s2. The matrix A is the mixing matrix (it mixes the two source signals together to produce the two observed signals), and W is the weight matrix (used to extract the estimated source signals). The main flow of the BSS algorithm is shown below.

Figure 1. Main flow of the BSS algorithm

 

In matrix form, the flow in Figure 1 reads x = A s (mixing) and y = W x (unmixing):

Figure 2. The BSS flow in equation form

 

 

Having seen the simple case, let us look at something slightly harder: the general model of blind source separation.

With n independent signal sources and n independent observations, the observations and sources are related by

x = A s

where x = (x1, x2, ..., xn)ᵀ, s = (s1, s2, ..., sn)ᵀ, and A is an n × n coefficient (mixing) matrix. The original problem becomes: given x and the independence of the components of s, estimate s.

Suppose we have

y = W x

where y is the estimate of s and W is an n × n coefficient matrix. The problem then becomes how to estimate the matrix W effectively.

Thus the main work of BSS is to find a linear filter, that is, a weight matrix W. Under ideal conditions, W is the inverse of the matrix A.
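A quick sanity check of this ideal case, as a Python sketch with made-up values for A and s:

```python
import numpy as np

# Sketch of the ideal case: if W is exactly the inverse of the mixing
# matrix A, then y = W x = inv(A) A s recovers the sources s.
# A and s below are made-up example values.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # example mixing matrix
s = np.array([[0.2, -1.0, 0.7],
              [1.0,  0.4, -0.3]])     # two sources, three samples each
x = A @ s                             # observed mixtures
W = np.linalg.inv(A)                  # ideal unmixing matrix
y = W @ x                             # recovered sources

print(np.allclose(y, s))   # True
```

In practice A is unknown, so W cannot be computed this way; the point of BSS is to estimate W from x alone.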

Because the parameter space is not Euclidean, and in most cases carries a Riemannian metric, the natural gradient method is chosen to solve for W.

 

1.3 Computing W with the Natural Gradient Method

The natural gradient update is

W(n+1) = W(n) + η ( I + φ(y) yᵀ ) W(n),  with y = W(n) x

where W is the matrix we need to estimate, η is the step size, and φ is a nonlinear transform; matching the sep.m code below, φ(y) = 1 − 2g(y) with the logistic function g(u) = 1/(1 + e^(−u)).

In actual computation y is an m × k matrix, where m is the number of original signals and k is the number of sampling points.

 

Computation steps:

1) Initialise W(0) to the identity matrix;

2) Repeat steps 3) and 4) until the difference between W(n+1) and W(n) falls below a prescribed value (the measure of matrix difference can be chosen freely); sometimes a fixed number of iterations is prescribed instead;

3) Compute y = W(n) x;

4) Update W(n+1) = W(n) + η ( I + φ(y) yᵀ ) W(n).
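The four steps above can be sketched in Python (NumPy). This is a minimal, per-sample-averaged version of the sep.m update, not the author's exact script; the test signals, step size, and iteration count are assumptions of the sketch:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def natural_gradient_ica(x, eta=0.001, n_iter=200, tol=1e-8):
    """Estimate the unmixing matrix W by the natural-gradient steps above.

    x is an (m, k) array: m mixed signals, k sampling points. This is the
    per-sample-averaged form of the sep.m update,
    W <- W + eta * (I + (1 - 2*g(y)) y^T / k) W, with g the logistic function.
    """
    m, k = x.shape
    W = np.eye(m)                          # step 1: W(0) = identity
    for _ in range(n_iter):                # step 2: iterate to convergence
        y = W @ x                          # step 3: y = W(n) x
        dW = eta * (np.eye(m) + (1.0 - 2.0 * logistic(y)) @ y.T / k) @ W
        if np.max(np.abs(dW)) < tol:       # stop when the change is tiny
            break
        W = W + dW                         # step 4: natural-gradient update
    return W

# Tiny smoke run on two deterministic test signals:
x = np.vstack([np.sin(np.linspace(0, 20, 1000)),
               np.cos(np.linspace(0, 20, 1000))])
W = natural_gradient_ica(x)
print(W.shape)   # (2, 2)
```

As in the MATLAB program, the step size would normally be annealed (re-run with smaller eta) to fine-tune the solution.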

2 Program Implementation

1. main.m

% fftproject.m
% By: Johan Bylund (original author)
% Comments revised by: Harvey
close all;
clear, clf, format compact;
% Makes a matrix out of wav files in project directory.
% The arrays will all be the same length as the shortest one.
% Files longer than the shortest one will be truncated.
disp('Reading .wav files from project directory');
mic_1=wavread('000100100mix2.wav'); %Reading file from right microphone.
mic_2=wavread('000100100mix1.wav'); %Reading file from left microphone.
size(mic_1)
size(mic_2)
% The below operation makes them 1xN:
% i.e. reshape each recording into a single row (1-by-N) vector
mic_1=mic_1';
mic_2=mic_2';
% Truncate both recordings to a common length
if length(mic_1)>length(mic_2)
    mic_1=mic_1(1:length(mic_2));
else
    mic_2=mic_2(1:length(mic_1));
end
size(mic_1)
size(mic_2)
% Play the two original received signals
disp('Playing the recording of the right microphone (closer to music)');
soundsc(mic_1)
disp('Playing the recording of the left microphone (closer to me)');
soundsc(mic_2)
subplot(2,1,1)
plot(mic_1), axis([0 16000 -0.12 0.12]);
title('Right microphone (closer to music)')
xlabel('Sampled points');
subplot(2,1,2)
plot(mic_2), axis([0 16000 -0.12 0.12]);
title('Left microphone (closer to me)')
xlabel('Sampled points');
% I also choose to look at the frequency spectra of the signal:
% Display the signal spectra via the fast Fourier transform
Fs=8000; % Sampling frequency
% Matrix with the frequency spectra from the two microphones:
fftsounds=[real(fft(mic_1,Fs));real(fft(mic_2,Fs))];
f=[1:Fs/2];
figure(2)
subplot(2,1,1)
plot(f,fftsounds(1,f)), axis([0 4000 -15 15]);
title('Frequency spectra of the right microphone')
xlabel('Frequency (Hz)');
subplot(2,1,2)
plot(f,fftsounds(2,f)), axis([0 4000 -15 15]);
title('Frequency spectra of the left microphone')
xlabel('Frequency (Hz)');
% At first I tried the algorithm in the time domain, and it didn't
% manage to separate the two sources very well.
% After that I used the frequency spectra of the two microphone signals
% in the algorithm and it worked out much better.
% The algorithm starts here...
% As the original author notes above, working in the frequency domain
% gives better results than the time domain.
sounds=[real(fft(mic_1));real(fft(mic_2))];
% N="number of microphones"
% P="number of points"
[N,P]=size(sounds) % P=?, N=2, in this case.
permute=randperm(P); % Generate a permutation vector.
s=sounds(:,permute); % Time-scrambled inputs for stationarity.
x=s;
mixes=sounds;
% Spheres the data (normalisation).
mx=mean(mixes');
c=cov(mixes');
x=x-mx'*ones(1,P); % Subtract means from mixes.
wz=2*inv(sqrtm(c)); % Get decorrelating matrix.
x=wz*x; % Decorrelate mixes so cov(x')=4*eye(N);
w=pi^2*rand(N); % Initialise unmixing matrix.
M=size(w,2); % M=N usually
sweep=0; oldw=w; olddelta=ones(1,N*N);
Id=eye(M);
% L="learning rate, B="points per block"
% Both are used in sep.m, which goes through the mixed signals
% in batch blocks of size B, adjusting weights, w, at the end
% of each block.
%L=0.01; B=30; sep
%L=0.001; B=30; sep % Annealing will improve solution.
%L=0.0001; B=30; sep % ...and so on
%for multiple sweeps:
L=0.0001; B=30;
for I=1:100
sep; % For details see sep.m
end;
uu=w*wz*mixes; % make unmixed sources
% Plot the two separated vectors in the frequency domain.
figure(3)
subplot(2,1,1)
plot(f,uu(1,f)), axis([0 4000 -22 22]);
title('Frequency spectra of one of the separated signals')
xlabel('Frequency (Hz)');
subplot(2,1,2)
plot(f,uu(2,f)), axis([0 4000 -22 22]);
title('Frequency spectra of the other separated signal')
xlabel('Frequency (Hz)');
% Transform signals back to time domain.
uu(2,:)=real(ifft(uu(2,:)));
uu(1,:)=real(ifft(uu(1,:)));
disp('Playing the first of the separated vectors');
soundsc(uu(1,:))
% Plot the vector that is played above.
figure(4);
subplot(2,1,1)
plot(uu(1,:)), axis([0 16000 -0.12 0.12]);
title('Plot of one of the separated vectors (time domain)')
disp('Playing the second of the separated vectors');
soundsc(uu(2,:))
% Plot the vector that is played above.
subplot(2,1,2)
plot(uu(2,:)), axis([0 16000 -0.12 0.12]);
title('Plot of the other separated vector (time domain)')
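The sphering step in main.m (subtract the means, then multiply by wz = 2*inv(sqrtm(c))) can be sketched in Python as follows; the 2×2 matrix used to fabricate correlated test data is an assumption of the sketch:

```python
import numpy as np

# Python sketch of the sphering ("decorrelating") step used in main.m:
# subtract the mean, then multiply by wz = 2*inv(sqrtm(cov(x))) so that
# the covariance of the result is 4*eye(N), as the MATLAB comment states.
# The 2x2 matrix creating correlated test data is made up here.
rng = np.random.default_rng(0)
x = np.array([[1.0, 0.7],
              [0.2, 1.0]]) @ rng.normal(size=(2, 5000))

x = x - x.mean(axis=1, keepdims=True)     # subtract means from mixes
c = np.cov(x)                             # sample covariance (2x2)
d, E = np.linalg.eigh(c)                  # eigendecomposition of c
wz = 2 * (E @ np.diag(1.0 / np.sqrt(d)) @ E.T)   # 2*inv(sqrtm(c))
x_white = wz @ x                          # decorrelated: cov(x_white) = 4*I

print(np.round(np.cov(x_white), 2))
```

The eigendecomposition of the symmetric covariance is used here in place of MATLAB's `sqrtm`; the two give the same symmetric matrix square root.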

2. sep.m

% SEP goes once through the scrambled mixed speech signals, x 
% (which is of length P), in batch blocks of size B, adjusting weights,
% w, at the end of each block.
%
% I suggest a learning rate L, of 0.01 at least for 2->2 separation.
% But this will be unstable for higher dimensional data. Test it.
% Use smaller values. After convergence at a value for L, lower
% L and it will fine tune the solution.
%
% NOTE: this rule is the rule in our NC paper, but multiplied by w^T*w,
% as proposed by Amari, Cichocki & Yang at NIPS '95. This `natural
% gradient' method speeds convergence and avoids the matrix inverse in the 
% learning rule.
sweep=sweep+1; t=1;
noblocks=fix(P/B);
BI=B*Id;
for t=t:B:t-1+noblocks*B,
    u=w*x(:,t:t+B-1);
    w=w+L*(BI+(1-2*(1./(1+exp(-u))))*u')*w;
end;
sepout

3. sepout.m

% SEPOUT - put whatever textual output report you want here.
%  Called after each pass through the data.
%  If your data is real, not artificially mixed, you will need
%  to comment out line 4, since you have no idea what the matrix 'a' is.
% 
[change,olddelta,angle]=wchange(oldw,w,olddelta); 
oldw=w;
fprintf('****sweep=%d, change=%.4f angle=%.1f deg., [N%d,M%d,P%d,B%d,L%.5f] \n',...
sweep,change,180*angle/pi,N,M,P,B,L);
% w*wz*a     %should be a permutation matrix for artif. mixed data

The sound files used in the code above can be downloaded from: http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi

References

[1] Wikipedia (Chinese edition). Independent component analysis.
http://zh.wikipedia.org/wiki/%E7%8B%AC%E7%AB%8B%E6%88%90%E5%88%86%E5%88%86%E6%9E%90, accessed 2014-01-14.

[2] 52opencourse. Coursera course notes: Stanford Machine Learning, Lecture 1 "Introduction".
http://52opencourse.com/54/coursera%E5%85%AC%E5%BC%80%E8%AF%BE%E7%AC%94%E8%AE%B0-%E6%96%AF%E5%9D%A6%E7%A6%8F%E5%A4%A7%E5%AD%A6%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AC%AC%E4%B8%80%E8%AF%BE-%E5%BC%95%E8%A8%80-introduction, accessed 2014-01-14.

[3] Wikipedia (Chinese edition). Blind signal separation.
http://zh.wikipedia.org/wiki/%E7%9B%B2%E4%BF%A1%E5%8F%B7%E5%88%86%E7%A6%BB, accessed 2014-01-14.

[4] Johan Bylund. A humble attempt to solve the "Cocktail-party" problem using Blind Source Separation [EB]. http://www.vislab.uq.edu.au/education/sc3/2001/johan/johan.pdf.