[Speech Recognition] Chinese/English Language Identification with MATLAB Based on MFCC+LPC Features and an SVM [MATLAB Source Code Included, Issue 612]
I. Introduction to Audio Processing for Language Identification
1 Basic Principle
Language identification determines, from a segment of audio, which language is being spoken, for example English, Chinese, or French. The overall idea of a language identification project is to convert the speech data into a spectrogram or MFCC features and then analyze those features to decide which language the speech belongs to.
2 Public Dataset
Topcoder competition data (44.1 kHz MP3 recordings, 10 seconds each, 176 languages, 66,176 clips in total (176 × 376), including many low-resource languages).
3 Basic Audio Processing Pipeline
Speech is input, features are extracted from the audio signal, the features are analyzed, and a result is produced; the feature extraction step usually relies on spectrograms or MFCC features.
4 Details
4.1 Speech Input
The input audio can be a WAV (waveform audio) file, an MP3 file, or a signal captured from a microphone.
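As a minimal sketch of this step (the file name and recording length are placeholders), either source can be loaded into a sample vector and its sampling rate in MATLAB:
% Read a WAV/MP3/FLAC file into a sample vector and its sampling rate
[speech, fs] = audioread('example.wav');   % 'example.wav' is a placeholder file name
% Or record 5 seconds from the microphone at 16 kHz, 16-bit, mono
rec = audiorecorder(16000, 16, 1);
recordblocking(rec, 5);
speech = getaudiodata(rec);
fs = 16000;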
4.2 Audio Signal Feature Extraction
The goal of speech signal processing is to reveal how the energy of the speech is distributed across its frequency components. The usual mathematical tool is the Fourier transform, but the Fourier transform requires a stationary input signal, so the speech must first be cut into frames; each short segment (typically 20-30 ms) is called a frame. [Over such a short window the signal can be treated as stationary.]
Frame the speech → apply the FFT (discrete Fourier transform) to each frame → take the magnitude/energy of each FFT. These values are all non-negative; arranged like image pixels and displayed, they form a spectrogram.
In the spectrogram the x-axis is time and the y-axis is frequency, so the spectrogram shows how the energy is distributed over any frequency band of interest.
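A minimal sketch of this framing-plus-FFT view, assuming the Signal Processing Toolbox and a speech vector speech with sampling rate fs from the input step above; the 25 ms window and 10 ms shift match the settings used in the source code below (hamming here is the toolbox window function, not the handle defined later in the script):
win = round(0.025*fs);                          % 25 ms analysis window in samples
hop = round(0.010*fs);                          % 10 ms frame shift in samples
% Short-time Fourier transform: each column of S is the FFT of one windowed frame
[S, F, T] = spectrogram(speech, hamming(win), win-hop, win, fs);
% Display the log-magnitude spectrogram: x-axis time, y-axis frequency
imagesc(T, F, 20*log10(abs(S)+eps)); axis xy;
xlabel('Time (s)'); ylabel('Frequency (Hz)'); title('Spectrogram');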
II. Partial Source Code
clc;
clear;
load traindata Myfeature                 % load the precomputed training feature matrix
A1=zeros(1,30);                          % labels for the first 30 training samples (class 0)
A2=ones(1,30);                           % labels for the last 30 training samples (class 1)
Group=[A1,A2];
TrainData=Myfeature;
SVMStruct = svmtrain(TrainData,Group);   % train the SVM on the 60 labelled feature vectors
N=5.3;   % length (s) of the voiced segment used for feature extraction
Tw = 25; % analysis frame duration (ms)
Ts = 10; % analysis frame shift (ms)
alpha = 0.97; % preemphasis coefficient
R = [ 300 3700 ]; % frequency range to consider
M = 20; % number of filterbank channels
C = 13; % number of cepstral coefficients
L = 22; % cepstral sine lifter parameter
fs = 16000;     % sampling frequency (Hz)
hamming = @(N)(0.54-0.46*cos(2*pi*[0:N-1].'/(N-1)));   % Hamming window function handle
[filename, pathname] = uigetfile({'*.*';'*.flac'; '*.wav'; '*.mp3'; }, 'Select an audio file');
% no file selected
if filename == 0
return;
end
[speech,fs] = audioread([pathname, filename]);
[voice,fs]=extractvoice_simple(speech,-30, -20,0.2);   % extract the voiced segment (custom helper)
voicex=voice(1:N*16000);                               % keep the first N seconds (assumes 16 kHz audio)
[ mfccs, FBEs, frames ] = ...
mfcc( voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L );   % MFCC feature extraction
ceps_mfccx=mfccs(:);                                   % flatten the MFCC matrix into a column vector
[cep,ER]=lpces(voicex,17,256,256);                     % LPC analysis (custom helper)
ceps_lpc=cep(2:17,:);                                  % keep 16 LPC coefficients per frame
%[lpc,ER]=lpces(voice,12,256,256);
%ceps_lpcc=lpc2lpcc(cep);%LPCC
ceps_lpcx=ceps_lpc(:);                                 % flatten the LPC matrix into a column vector
ceps=[ceps_mfccx(1000:2000);ceps_lpcx(1:2000)];        % concatenate MFCC and LPC features (3001 values)
TestData = ceps';
languagex=svmclassify(SVMStruct,TestData);             % classify the feature vector with the trained SVM
if languagex == 1
language='Chinese'
else
language='English'
end
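% Note: svmtrain/svmclassify are the older Statistics Toolbox interface used in
% R2014a and have been removed from newer MATLAB releases. On current versions
% the same training and classification steps could be written, for example, as:
%   SVMModel  = fitcsvm(TrainData, Group);     % replaces svmtrain
%   languagex = predict(SVMModel, TestData);   % replaces svmclassify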
% t=[1:2000];
% figure
% scatter(t,ceps_lpcx(1:2000),50,'r');
% xlabel('sample point');
% ylabel('LPC');
% title('LPC features');
% hold on
% [filename, pathname] = uigetfile({'*.*';'*.flac'; '*.wav'; '*.mp3'; }, 'Select an audio file');
% % no file selected
% if filename == 0
% return;
% end
% [speech,fs] = audioread([pathname, filename]);
% [voice,fs]=extractvoice_simple(speech,-30, -20,0.2);
% voicex=voice(1:N*16000);
% [ mfccs, FBEs, frames ] = ...
% mfcc( voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L );
% ceps_mfccx=mfccs(:);
% [cep,ER]=lpces(voicex,17,256,256); ceps_lpc=cep(2:17,:);%LPC
%
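The feature matrix Myfeature loaded at the top of the script is prepared offline and is not included in this excerpt. Below is a hypothetical sketch of how it could be assembled with the same helpers and parameters; wavlist is a placeholder cell array of 60 training file paths, ordered to match the labels in Group:
Myfeature = zeros(numel(wavlist), 3001);                 % 60 x 3001 feature matrix
for i = 1:numel(wavlist)
    [speech, fs] = audioread(wavlist{i});
    [voice, fs]  = extractvoice_simple(speech, -30, -20, 0.2);
    voicex       = voice(1:N*16000);                     % same N seconds as in the test script
    mfccs        = mfcc(voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L);
    ceps_mfccx   = mfccs(:);
    [cep, ~]     = lpces(voicex, 17, 256, 256);
    ceps_lpc     = cep(2:17, :);
    ceps_lpcx    = ceps_lpc(:);
    Myfeature(i, :) = [ceps_mfccx(1000:2000); ceps_lpcx(1:2000)]';
end
save traindata Myfeature                                 % saved for the classification script above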
function [ H, f, c ] = trifbank( M, K, R, fs, h2w, w2h )
% TRIFBANK Triangular filterbank.
%
% [H,F,C]=TRIFBANK(M,K,R,FS,H2W,W2H) returns matrix of M triangular filters
% (one per row), each K coefficients long along with a K coefficient long
% frequency vector F and M+2 coefficient long cutoff frequency vector C.
% The triangular filters are between limits given in R (Hz) and are
% uniformly spaced on a warped scale defined by forward (H2W) and backward
% (W2H) warping functions.
%
% Inputs
% M is the number of filters, i.e., number of rows of H
%
% K is the length of frequency response of each filter
% i.e., number of columns of H
%
% R is a two element vector that specifies frequency limits (Hz),
% i.e., R = [ low_frequency high_frequency ];
%
% FS is the sampling frequency (Hz)
%
% H2W is a Hertz scale to warped scale function handle
%
% W2H is a warped scale to Hertz scale function handle
%
% Outputs
% H is a M by K triangular filterbank matrix (one filter per row)
%
% F is a frequency vector (Hz) of 1xK dimension
%
% C is a vector of filter cutoff frequencies (Hz),
% note that C(2:end) also represents filter center frequencies,
% and the dimension of C is 1x(M+2)
%
% Example
% fs = 16000; % sampling frequency (Hz)
% nfft = 2^12; % fft size (number of frequency bins)
% K = nfft/2+1; % length of each filter
% M = 23; % number of filters
%
% hz2mel = @(hz)(1127*log(1+hz/700)); % Hertz to mel warping function
% mel2hz = @(mel)(700*exp(mel/1127)-700); % mel to Hertz warping function
%
% % Design mel filterbank of M filters each K coefficients long,
% % filters are uniformly spaced on the mel scale between 0 and Fs/2 Hz
% [ H1, freq ] = trifbank( M, K, [0 fs/2], fs, hz2mel, mel2hz );
%
% % Design mel filterbank of M filters each K coefficients long,
% % filters are uniformly spaced on the mel scale between 300 and 3750 Hz
% [ H2, freq ] = trifbank( M, K, [300 3750], fs, hz2mel, mel2hz );
%
% % Design mel filterbank of 18 filters each K coefficients long,
% % filters are uniformly spaced on the Hertz scale between 4 and 6 kHz
% [ H3, freq ] = trifbank( 18, K, [4 6]*1E3, fs, @(h)(h), @(h)(h) );
%
% hfig = figure('Position', [25 100 800 600], 'PaperPositionMode', ...
% 'auto', 'Visible', 'on', 'color', 'w'); hold on;
% subplot( 3,1,1 );
% plot( freq, H1 );
% xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
% subplot( 3,1,2 );
% plot( freq, H2 );
% xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
% subplot( 3,1,3 );
% plot( freq, H3 );
% xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
% Reference
% [1] Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing:
% A guide to theory, algorithm, and system development.
% Prentice Hall, Upper Saddle River, NJ, USA (pp. 314-315).
% Author Kamil Wojcicki, UTD, June 2011
if( nargin~= 6 ), help trifbank; return; end; % very light input validation
f_min = 0; % filter coefficients start at this frequency (Hz)
f_low = R(1); % lower cutoff frequency (Hz) for the filterbank
f_high = R(2); % upper cutoff frequency (Hz) for the filterbank
f_max = 0.5*fs; % filter coefficients end at this frequency (Hz)
f = linspace( f_min, f_max, K ); % frequency range (Hz), size 1xK
fw = h2w( f );
% filter cutoff frequencies (Hz) for all filters, size 1x(M+2)
c = w2h( h2w(f_low)+[0:M+1]*((h2w(f_high)-h2w(f_low))/(M+1)) );
cw = h2w( c );
H = zeros( M, K ); % zero otherwise
for m = 1:M
% implements Eq. (6.140) on page 314 of [1]
% k = f>=c(m)&f<=c(m+1); % up-slope
% H(m,k) = 2*(f(k)-c(m)) / ((c(m+2)-c(m))*(c(m+1)-c(m)));
% k = f>=c(m+1)&f<=c(m+2); % down-slope
% H(m,k) = 2*(c(m+2)-f(k)) / ((c(m+2)-c(m))*(c(m+2)-c(m+1)));
% implements Eq. (6.141) on page 315 of [1]
k = f>=c(m)&f<=c(m+1); % up-slope
H(m,k) = (f(k)-c(m))/(c(m+1)-c(m));
k = f>=c(m+1)&f<=c(m+2); % down-slope
H(m,k) = (c(m+2)-f(k))/(c(m+2)-c(m+1));
end
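The body of the mfcc routine called in the script is not shown in this excerpt, but the trifbank filterbank above is typically used in it as follows: filter the frame magnitude spectra, take the logarithm, apply a DCT, and lifter the result. Below is a self-contained sketch under those assumptions (the noise signal, FFT size, and sqrt(2/M) DCT scaling are placeholders/assumptions; it requires the Signal Processing Toolbox and trifbank on the path):
fs = 16000; R = [300 3700]; M = 20; C = 13; L = 22;    % same settings as the main script
nfft = 512; K = nfft/2 + 1;                            % assumed FFT size
hz2mel = @(hz)(1127*log(1+hz/700));                    % Hertz -> mel warping
mel2hz = @(mel)(700*exp(mel/1127)-700);                % mel -> Hertz warping
H = trifbank(M, K, R, fs, hz2mel, mel2hz);             % M x K mel-spaced filterbank
x = randn(fs, 1);                                      % one second of noise as a stand-in signal
win = round(0.025*fs); hop = round(0.010*fs);          % 25 ms frames, 10 ms shift
frames = buffer(x, win, win-hop, 'nodelay');           % win x num_frames matrix of frames
MAG = abs(fft(bsxfun(@times, frames, hamming(win)), nfft));
MAG = MAG(1:K, :);                                     % keep the non-negative frequency bins
FBE = H * MAG;                                         % filterbank energies, M x num_frames
DCT = sqrt(2/M)*cos(pi/M*(0:C-1)'*((1:M)-0.5));        % DCT-II matrix, C x M
CC  = DCT * log(FBE);                                  % cepstral coefficients, C x num_frames
lifter = 1 + 0.5*L*sin(pi*(0:C-1)'/L);                 % sinusoidal liftering weights
mfccs = bsxfun(@times, CC, lifter);                    % liftered MFCCs, one column per frame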
III. Results
IV. MATLAB Version and References
1 MATLAB Version
2014a
2 References
[1] 韩纪庆, 张磊, 郑铁然. Speech Signal Processing (3rd Edition) [M]. Tsinghua University Press, 2019.
[2] 柳若边. Deep Learning: Speech Recognition Technology in Practice [M]. Tsinghua University Press, 2019.
Source: qq912100926.blog.csdn.net, author: 海神之光. Copyright belongs to the original author; please contact the author for permission to reprint.
Original link: qq912100926.blog.csdn.net/article/details/115139610