Train Your Own Speech Recognition System with Python

Source: WeChat official account CSDN

Author: 李秋鍵 / Editor: Caro

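This article is shared for study and discussion only. If it infringes any rights, please get in touch to have it removed.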

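Speech recognition technology has developed rapidly in recent years. From Siri on mobile phones and Microsoft's Cortana to smart speakers on many platforms, speech recognition projects are now widely deployed.

Speech recognition belongs to perceptual intelligence. Taking a machine from merely recognizing speech to understanding it moves into cognitive intelligence. How well a machine understands natural language has become a marker of whether it is intelligent, and natural language understanding is still the hard part today.

Most speech recognition platforms today rely on cloud services, so training a speech recognizer still seems mysterious to most people. Today we will therefore build our own speech recognition system with Python.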

[Figure: recognition result of the final model]

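Preparation before the experiment

We use Python 3.6.5. The libraries involved are: cv2 for image processing; NumPy for matrix operations; the Keras framework for training and loading the model; librosa and python_speech_features for extracting audio features; and glob and pickle for reading the local dataset. The code below also uses tqdm for progress bars and matplotlib for plotting the loss curves.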

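Preparing the dataset

The dataset is thchs30, a Chinese speech corpus from Tsinghua University.

The recordings are divided into four groups according to their text content: A (sentence IDs 1~250), B (sentence IDs 251~500), C (501~750), and D (751~1000). Groups A, B, and C contain 10,893 utterances from 30 speakers and are used for training. Group D contains 2,496 utterances from 10 speakers and is used for testing.

The data folder contains .wav files and .trn files. Each .trn file holds the transcript of the matching .wav file: the first line is the words, the second line is the pinyin, and the third line is the phonemes.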
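To make the three-line layout concrete, here is a minimal sketch that parses the text of one .trn file. The helper name `parse_trn` and the sample transcript are hypothetical; they only illustrate the format described above and are not copied from the dataset.

```python
def parse_trn(content):
    """Split the text of a .trn file into words, pinyin and phonemes.

    Assumes the three-line layout described above:
    line 1 = words, line 2 = pinyin, line 3 = phonemes.
    """
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("expected three non-empty lines in a .trn file")
    return {
        "words": lines[0].split(),
        "pinyin": lines[1].split(),
        "phonemes": lines[2].split(),
    }


# Hypothetical sample in the same layout (not taken from thchs30)
sample = "你好 世界\nni3 hao3 shi4 jie4\nn i3 h ao3 sh ix4 j ie4\n"
record = parse_trn(sample)

# The training text used later is line 1 with the spaces removed
text = "".join(record["words"])
```

Only the first line is used as the training target in this article; the pinyin and phoneme lines would matter if the model predicted syllables or phonemes instead of characters.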

[Figure: listing of the dataset files]

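Training the model

1. Extract MFCC features from the speech dataset:

A person's voice is produced through the vocal tract, and the shape of the vocal tract determines which sound comes out. If we know this shape accurately, we can accurately describe the phoneme being produced. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs (Mel-frequency cepstral coefficients) are a feature that describes this envelope.

[Figure: the extracted MFCC features]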
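One ingredient of MFCCs is the mel scale, which spaces the analysis filters densely at low frequencies and sparsely at high ones. The sketch below uses the common 2595 * log10(1 + f / 700) conversion. The function names and the choice of 26 filters between 0 and 8000 Hz are assumptions made for illustration, not values taken from the article.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


centers = mel_center_frequencies(num_filters=26, low_hz=0.0, high_hz=8000.0)

# Gaps between neighbouring centers grow with frequency
gaps = [b - a for a, b in zip(centers, centers[1:])]
print([round(c) for c in centers[:5]])
print([round(c) for c in centers[-5:]])
```

The widening gaps are the point: the filters resolve low frequencies finely, where much of the phonetic detail lies, and summarize high frequencies coarsely.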

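So, after reading the dataset, we extract the speech features and store them so they can be fed into the neural network for training. The corresponding code is as follows: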

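# Imports used in this section
import glob

import librosa
import numpy as np
from python_speech_features import mfcc
from tqdm import tqdm

# Read the dataset transcript files
text_paths = glob.glob('data/*.trn')
total = len(text_paths)
print(total)

with open(text_paths[0], 'r', encoding='utf8') as fr:
    lines = fr.readlines()
    print(lines)

# Read the contents of every .trn file into lists
texts = []
paths = []
for path in text_paths:
    with open(path, 'r', encoding='utf8') as fr:
        lines = fr.readlines()
        # Line 1 is the word-level transcript; drop the newline and the spaces
        line = lines[0].strip('\n').replace(' ', '')
        texts.append(line)
        # Drop the '.trn' suffix to get the .wav path.
        # rstrip('.trn') strips a set of characters, not a suffix, so slice instead.
        paths.append(path[:-len('.trn')])

print(paths[0], texts[0])

# Number of MFCC coefficients per frame
mfcc_dim = 13

# Load an audio file and trim the quiet parts at both ends
def load_and_trim(path):
    audio, sr = librosa.load(path)
    # Newer librosa releases renamed this to librosa.feature.rms(y=audio)
    energy = librosa.feature.rmse(audio)
    # Keep frames whose energy is at least 1/5 of the maximum
    frames = np.nonzero(energy >= np.max(energy) / 5)
    indices = librosa.core.frames_to_samples(frames)[1]
    audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0]
    return audio, sr

# Extract the audio features and store them
features = []
for i in tqdm(range(total)):
    path = paths[i]
    audio, sr = load_and_trim(path)
    features.append(mfcc(audio, sr, numcep=mfcc_dim, nfft=551))

print(len(features), features[0].shape)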
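The trimming in `load_and_trim` keeps the span between the first and the last frame whose RMS energy reaches one fifth of the loudest frame. The plain-Python sketch below illustrates that idea on a list of samples. `trim_by_energy`, the frame length of 4, and the toy signal are assumptions for illustration, and librosa's real framing (overlapping, centered frames) differs in detail.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return energies


def trim_by_energy(samples, frame_len=4, ratio=0.2):
    """Keep samples from the first to the last frame with energy >= ratio * max.

    Unlike the article's slice, this keeps the whole last loud frame.
    """
    energies = frame_rms(samples, frame_len)
    if not energies or max(energies) == 0:
        return []
    threshold = max(energies) * ratio
    loud = [i for i, e in enumerate(energies) if e >= threshold]
    start = loud[0] * frame_len
    end = min((loud[-1] + 1) * frame_len, len(samples))
    return samples[start:end]


# Toy signal: one quiet frame, two loud frames, one quiet frame
signal = [0.01] * 4 + [0.5, -0.5, 0.4, -0.4] + [0.3, -0.3, 0.2, -0.2] + [0.01] * 4
trimmed = trim_by_energy(signal)
```

Using a threshold relative to the loudest frame, rather than an absolute one, keeps the rule independent of the recording volume.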

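2. Preprocessing for the neural network:

Before feeding the data into the neural network for training, we normalize the MFCC features, mainly to speed up convergence, improve results, and reduce interference. Then we prepare the dataset and labels and define the inputs and outputs.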
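As a minimal illustration of this normalization, the sketch below standardizes each feature dimension to zero mean and unit variance, using statistics pooled over all frames. It is written in plain Python so that it runs without extra dependencies; the helper names and the tiny 2-dimensional frames are made up for the example.

```python
import statistics


def fit_mean_std(frames):
    """Per-dimension mean and population standard deviation over all frames."""
    columns = list(zip(*frames))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def normalize(frames, means, stds, eps=1e-14):
    """Subtract the mean and divide by the std; eps guards against division by zero."""
    return [
        [(value - m) / (s + eps) for value, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


# Hypothetical frames with 2 feature dimensions on very different scales
frames = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
means, stds = fit_mean_std(frames)
normalized = normalize(frames, means, stds)
```

After normalization both dimensions live on the same scale, so no single coefficient dominates the gradients. The same mean and standard deviation must be reused at test time, which is why the article later saves them alongside the dictionary.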

The corresponding code is as follows:

# Imports used in this section
import random

# Randomly pick 100 utterances to estimate the normalization statistics
samples = random.sample(features, 100)
samples = np.vstack(samples)

# Per-dimension mean of the MFCCs, used for normalization
mfcc_mean = np.mean(samples, axis=0)
# Per-dimension standard deviation, used for normalization
mfcc_std = np.std(samples, axis=0)
print(mfcc_mean)
print(mfcc_std)

# Normalize the features
features = [(feature - mfcc_mean) / (mfcc_std + 1e-14) for feature in features]

# Build the character list, ordered by frequency, and the char <-> id maps
chars = {}
for text in texts:
    for c in text:
        chars[c] = chars.get(c, 0) + 1

chars = sorted(chars.items(), key=lambda x: x[1], reverse=True)
chars = [char[0] for char in chars]
print(len(chars), chars[:100])

char2id = {c: i for i, c in enumerate(chars)}
id2char = {i: c for i, c in enumerate(chars)}

# Shuffle and split: 90% for training, 10% for testing
data_index = np.arange(total)
np.random.shuffle(data_index)
train_size = int(0.9 * total)
test_size = total - train_size
train_index = data_index[:train_size]
test_index = data_index[train_size:]

# Network inputs X (features) and targets Y (texts)
X_train = [features[i] for i in train_index]
Y_train = [texts[i] for i in train_index]
X_test = [features[i] for i in test_index]
Y_test = [texts[i] for i in test_index]
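The character dictionary built above can also be expressed with `collections.Counter`. The sketch below is an equivalent formulation on a toy corpus, not the author's code; the two sample sentences and the helper names are hypothetical. It also shows the round trip from text to ids and back.

```python
from collections import Counter


def build_vocab(texts):
    """Map each character to an id, most frequent character first."""
    counts = Counter(c for text in texts for c in text)
    chars = [c for c, _ in counts.most_common()]
    char2id = {c: i for i, c in enumerate(chars)}
    id2char = {i: c for i, c in enumerate(chars)}
    return char2id, id2char


def encode(text, char2id):
    """Turn a string into a list of character ids."""
    return [char2id[c] for c in text]


def decode(ids, id2char):
    """Turn a list of character ids back into a string."""
    return "".join(id2char[i] for i in ids)


# Hypothetical toy corpus
texts = ["今天天氣好", "天氣不好"]
char2id, id2char = build_vocab(texts)
ids = encode("天氣好", char2id)
```

Because ids run from 0 to `len(char2id) - 1`, the value `len(char2id)` is left free, and the article uses it both as the label padding value and as the extra CTC blank class.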

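3. Defining the neural network functions:

These include the training batch generator, the convolution layer function, the normalization function, the activation function, and so on.

Each utterance's MFCC feature array has two dimensions. The first is the number of short frames: the longer the original audio, the larger it is. The second is the MFCC feature dimension. Once we have this numeric representation of the raw speech, we can model it with WaveNet. Because the MFCC features form a one-dimensional sequence, we use Conv1D for the convolutions. "Causal" means that the output of a convolution depends only on the inputs at or before the current position, that is, no future features are used. It can be thought of as shifting the convolution window toward the past.

[Figure: WaveNet model structure]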
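To see what causal means in practice, the following sketch implements a 1-D causal dilated convolution on a plain list, padding with zeros on the left so that output t depends only on inputs at t, t - d, t - 2d, and so on. It is a toy single-channel version written for illustration; the function name and example values are assumptions, and the model itself relies on Keras `Conv1D` with `padding='causal'`.

```python
def causal_conv1d(signal, kernel, dilation=1):
    """Single-channel causal convolution with left zero-padding.

    out[t] = sum_k kernel[k] * signal[t - (K - 1 - k) * dilation],
    where positions before the start of the signal count as zero.
    """
    k_size = len(kernel)
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k in range(k_size):
            idx = t - (k_size - 1 - k) * dilation
            if idx >= 0:
                total += kernel[k] * signal[idx]
        out.append(total)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 1.0]  # sums the current sample and one earlier sample

plain = causal_conv1d(signal, kernel, dilation=1)    # looks back 1 step
dilated = causal_conv1d(signal, kernel, dilation=2)  # looks back 2 steps

# Changing a later sample must not change earlier outputs
changed = causal_conv1d(signal[:3] + [100.0, 5.0], kernel, dilation=1)
```

Raising the dilation lets the same small kernel reach further into the past without adding parameters, which is why the network stacks dilation rates 1, 2, 4, 8, 16.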

The details are as follows:

batch_size = 16

# Batch generator: yields 16 utterances at a time
def batch_generator(x, y, batch_size=batch_size):
    offset = 0
    while True:
        offset += batch_size

        # Reshuffle before the first batch and whenever the data is exhausted
        if offset == batch_size or offset >= len(x):
            data_index = np.arange(len(x))
            np.random.shuffle(data_index)
            x = [x[i] for i in data_index]
            y = [y[i] for i in data_index]
            offset = batch_size

        X_data = x[offset - batch_size: offset]
        Y_data = y[offset - batch_size: offset]

        X_maxlen = max([X_data[i].shape[0] for i in range(batch_size)])
        Y_maxlen = max([len(Y_data[i]) for i in range(batch_size)])

        # Pad features with zeros and labels with the extra id len(char2id)
        X_batch = np.zeros([batch_size, X_maxlen, mfcc_dim])
        Y_batch = np.ones([batch_size, Y_maxlen]) * len(char2id)
        X_length = np.zeros([batch_size, 1], dtype='int32')
        Y_length = np.zeros([batch_size, 1], dtype='int32')

        for i in range(batch_size):
            X_length[i, 0] = X_data[i].shape[0]
            X_batch[i, :X_length[i, 0], :] = X_data[i]

            Y_length[i, 0] = len(Y_data[i])
            Y_batch[i, :Y_length[i, 0]] = [char2id[c] for c in Y_data[i]]

        inputs = {'X': X_batch, 'Y': Y_batch, 'X_length': X_length, 'Y_length': Y_length}
        outputs = {'ctc': np.zeros([batch_size])}

        # Hand the batch to Keras; without this the function is not a generator
        yield (inputs, outputs)
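The padding step of the batch generator can be seen in isolation in the sketch below, which pads a batch of label sequences to a common length with the extra id (the vocabulary size) and records the true lengths. It uses plain lists rather than NumPy arrays, and the function name and toy ids are assumptions for illustration.

```python
def pad_labels(label_seqs, blank_id):
    """Pad label id sequences to the longest one with blank_id.

    Returns the padded batch and the original length of each sequence,
    which the CTC loss needs in order to ignore the padding.
    """
    max_len = max(len(seq) for seq in label_seqs)
    padded = [list(seq) + [blank_id] * (max_len - len(seq)) for seq in label_seqs]
    lengths = [len(seq) for seq in label_seqs]
    return padded, lengths


vocab_size = 5  # hypothetical: ids 0..4 are characters, 5 is the padding id
batch = [[0, 1, 2], [3], [4, 0]]
padded, lengths = pad_labels(batch, blank_id=vocab_size)
```

The feature arrays are padded the same way along the time axis, with zeros instead of an id, and their true frame counts go into `X_length`.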

# Imports used in this section
from keras import backend as K
from keras.layers import (Activation, Add, BatchNormalization, Conv1D,
                          Input, Multiply)
from keras.models import Model

epochs = 50
num_blocks = 3
filters = 128

X = Input(shape=(None, mfcc_dim,), dtype='float32', name='X')
Y = Input(shape=(None,), dtype='float32', name='Y')
X_length = Input(shape=(1,), dtype='int32', name='X_length')
Y_length = Input(shape=(1,), dtype='int32', name='Y_length')

# 1-D causal convolution layer with an optional dilation rate
def conv1d(inputs, filters, kernel_size, dilation_rate):
    return Conv1D(filters=filters, kernel_size=kernel_size, strides=1, padding='causal', activation=None,
                  dilation_rate=dilation_rate)(inputs)

# Batch normalization
def batchnorm(inputs):
    return BatchNormalization()(inputs)

# Activation layer
def activation(inputs, activation):
    return Activation(activation)(inputs)

# Residual block with a gated activation (tanh branch * sigmoid branch)
def res_block(inputs, filters, kernel_size, dilation_rate):
    hf = activation(batchnorm(conv1d(inputs, filters, kernel_size, dilation_rate)), 'tanh')
    hg = activation(batchnorm(conv1d(inputs, filters, kernel_size, dilation_rate)), 'sigmoid')
    h0 = Multiply()([hf, hg])

    ha = activation(batchnorm(conv1d(h0, filters, 1, 1)), 'tanh')
    hs = activation(batchnorm(conv1d(h0, filters, 1, 1)), 'tanh')

    # Residual output for the next block, skip output for the final sum
    return Add()([ha, inputs]), hs

h0 = activation(batchnorm(conv1d(X, filters, 1, 1)), 'tanh')
shortcut = []
for i in range(num_blocks):
    for r in [1, 2, 4, 8, 16]:
        h0, s = res_block(h0, filters, 7, r)
        shortcut.append(s)

h1 = activation(Add()(shortcut), 'relu')
h1 = activation(batchnorm(conv1d(h1, filters, 1, 1)), 'relu')

# Softmax output over all characters plus one extra class for the CTC blank
Y_pred = activation(batchnorm(conv1d(h1, len(char2id) + 1, 1, 1)), 'softmax')
sub_model = Model(inputs=X, outputs=Y_pred)

# Compute the CTC loss
def calc_ctc_loss(args):
    y, yp, ypl, yl = args
    return K.ctc_batch_cost(y, yp, ypl, yl)
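A rough way to judge how much context this stack can see is its receptive field. Assuming the layers are exactly as defined above (kernel size 7, dilation rates 1, 2, 4, 8, 16, repeated for 3 blocks, with the 1x1 convolutions adding nothing), each dilated layer widens the field by (kernel_size - 1) * dilation frames. The helper below is an illustrative calculation, not part of the original code.

```python
def receptive_field(kernel_size, dilations, num_blocks):
    """Number of input frames that can influence one output frame."""
    field = 1  # a single frame before any dilated convolution
    for _ in range(num_blocks):
        for d in dilations:
            field += (kernel_size - 1) * d
    return field


frames = receptive_field(kernel_size=7, dilations=[1, 2, 4, 8, 16], num_blocks=3)
print(frames)  # 559
```

If the default 10 ms frame step of python_speech_features is in effect (the `mfcc` call in this article does not override it), 559 frames correspond to roughly 5.6 seconds of past audio for each output frame.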

4. Training the model:

The training process is as follows:

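# Imports used in this section
import pickle

import matplotlib.pyplot as plt
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.layers import Lambda
from keras.optimizers import SGD

# The CTC loss is computed inside the graph by a Lambda layer
ctc_loss = Lambda(calc_ctc_loss, output_shape=(1,), name='ctc')([Y, Y_pred, X_length, Y_length])

# Training model: takes features, labels, and both lengths; outputs the loss
model = Model(inputs=[X, Y, X_length, Y_length], outputs=ctc_loss)

# Create the optimizer
optimizer = SGD(lr=0.02, momentum=0.9, nesterov=True, clipnorm=5)

# Compile: the 'ctc' output already is the loss, so the loss function passes it through
model.compile(loss={'ctc': lambda ctc_true, ctc_pred: ctc_pred}, optimizer=optimizer)

checkpointer = ModelCheckpoint(filepath='asr.h5', verbose=0)
lr_decay = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=1, min_lr=0.000)

# Start training (fit_generator matches the older Keras used here;
# newer versions accept generators in model.fit)
history = model.fit_generator(
    generator=batch_generator(X_train, Y_train),
    steps_per_epoch=len(X_train) // batch_size,
    epochs=epochs,
    validation_data=batch_generator(X_test, Y_test),
    validation_steps=len(X_test) // batch_size,
    callbacks=[checkpointer, lr_decay])

# Save the inference model (features in, character probabilities out);
# this overwrites the checkpoint file of the same name
sub_model.save('asr.h5')

# Save the dictionaries and the normalization statistics to a .pkl file
with open('dictionary.pkl', 'wb') as fw:
    pickle.dump([char2id, id2char, mfcc_mean, mfcc_std], fw)

# Plot the training and validation loss curves
train_loss = history.history['loss']
valid_loss = history.history['val_loss']
plt.plot(np.linspace(1, epochs, epochs), train_loss, label='train')
plt.plot(np.linspace(1, epochs, epochs), valid_loss, label='valid')
plt.legend(loc='upper right')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()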
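The `ReduceLROnPlateau` callback used in training multiplies the learning rate by `factor` once the monitored loss has stopped improving for `patience` epochs. The sketch below is a simplified re-implementation of that rule, for intuition only: it ignores Keras options such as `cooldown` and `min_delta`, and the loss values are invented.

```python
def simulate_lr_schedule(losses, lr, factor, patience, min_lr=0.0):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Invented per-epoch training losses
losses = [5.0, 4.0, 4.5, 3.0, 3.2, 3.3]
schedule = simulate_lr_schedule(losses, lr=0.02, factor=0.2, patience=1)
print(schedule)
```

With `patience=1` and `factor=0.2`, as in the article, a single epoch without improvement already cuts the learning rate to one fifth, so the schedule decays quickly once training plateaus.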

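Testing the model

We load the dictionary generated from our speech dataset and call the model to recognize the audio features.

The code is as follows: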

# Imports used in this section
import glob
import pickle

import librosa
import numpy as np
from keras import backend as K
from keras.models import load_model
from python_speech_features import mfcc

wavs = glob.glob('A2_103.wav')
print(wavs)

# Load the dictionaries and normalization statistics saved during training
with open('dictionary.pkl', 'rb') as fr:
    [char2id, id2char, mfcc_mean, mfcc_std] = pickle.load(fr)

mfcc_dim = 13
model = load_model('asr.h5')

index = np.random.randint(len(wavs))
print(wavs[index])

# Same trimming and feature extraction as in training
audio, sr = librosa.load(wavs[index])
energy = librosa.feature.rmse(audio)
frames = np.nonzero(energy >= np.max(energy) / 5)
indices = librosa.core.frames_to_samples(frames)[1]
audio = audio[indices[0]:indices[-1]] if indices.size else audio[0:0]
X_data = mfcc(audio, sr, numcep=mfcc_dim, nfft=551)
X_data = (X_data - mfcc_mean) / (mfcc_std + 1e-14)
print(X_data.shape)

# Predict per-frame character probabilities, then decode with CTC beam search
pred = model.predict(np.expand_dims(X_data, axis=0))
pred_ids = K.eval(K.ctc_decode(pred, [X_data.shape[0]], greedy=False, beam_width=10, top_paths=1)[0][0])
pred_ids = pred_ids.flatten().tolist()
print(''.join([id2char[i] for i in pred_ids]))
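The article decodes with beam search through `K.ctc_decode`. For intuition about what CTC decoding does, here is a sketch of the simpler greedy (best-path) variant, which is a different technique from the beam search used in the article: take the most likely class for each frame, merge consecutive repeats, then drop the blanks. The function name, the toy probabilities, and the two-character vocabulary are assumptions. The blank sits at the last index, in line with the `len(char2id) + 1` output classes of the model.

```python
def ctc_greedy_decode(probs, blank_id):
    """Best-path CTC decoding for one utterance.

    probs: list of per-frame probability lists, one entry per class.
    """
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded = []
    previous = None
    for class_id in best_path:
        # Emit a class only when it changes and is not the blank
        if class_id != previous and class_id != blank_id:
            decoded.append(class_id)
        previous = class_id
    return decoded


id2char = {0: "你", 1: "好"}  # hypothetical vocabulary; class 2 is the blank
probs = [
    [0.8, 0.1, 0.1],  # 你
    [0.7, 0.2, 0.1],  # 你 again (a repeat, merged)
    [0.1, 0.1, 0.8],  # blank
    [0.1, 0.8, 0.1],  # 好
    [0.2, 0.2, 0.6],  # blank
    [0.1, 0.7, 0.2],  # 好 again (kept, because a blank separates it)
]
ids = ctc_greedy_decode(probs, blank_id=2)
text = "".join(id2char[i] for i in ids)
```

The blank is what allows a genuinely doubled character to survive the merge step. Beam search keeps several candidate paths instead of one, which can recover transcripts that the single best path misses.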

With that, the whole program is complete.

[Figure: output of running the program]

Source code:

https://pan.baidu.com/s/1tFlZkMJmrMTD05cd_zxmAg

Extraction code: ndrr

The dataset has to be downloaded separately.

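About the author:

李秋鍵 is a CSDN blog expert and an author of CSDN 達人課 courses. He is currently a master's student at China University of Mining and Technology, and his development work includes award-winning entries in taptap competitions, among others.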
