I recently did an assignment on building a neural network language model with PyTorch; it is not a particularly hard task. But while working on it I found that most write-ups online are based on PyTorch 0.4 or older. For this assignment I wanted to try a somewhat cleaner approach, mainly using the PyTorch DataLoader to handle batching in a convenient way.

First, be clear about the training objective: predict the next word from the previous word (or, given the current word, predict the word that follows), so that the language model picks up lexical and syntactic regularities. The other thing to settle before building anything is the model's input and output: each input is an original sentence, and the corresponding output is that sentence shifted forward by one time step. Here is an example:

[Figure: input/output example]
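Concretely, writing the appended end-of-sentence marker as <EOS> (the exact token string is an assumption; see the preprocessing code below), one training pair looks like this:

# One concrete training pair; the slicing matches construct_input_output() below.
sentence = ['the', 'tragedy', 'of', 'hamlet', '<EOS>']
x = sentence[:-1]   # model input : ['the', 'tragedy', 'of', 'hamlet']
y = sentence[1:]    # model target: ['tragedy', 'of', 'hamlet', '<EOS>']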

With the task roughly clear, we can get going. First pick a corpus; which one doesn't really matter, and Wikipedia text would work just as well. Here I use Shakespeare's Hamlet as the training corpus.
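If you don't already have a text file lying around, one convenient way to get one is the copy of Hamlet that ships with NLTK's Gutenberg corpus; the file name hamlet.txt is just a placeholder, and any plain-text corpus works equally well:

# Optional helper: dump Shakespeare's Hamlet from NLTK's Gutenberg corpus into a plain-text file.
import nltk

nltk.download('gutenberg')
from nltk.corpus import gutenberg

with open('hamlet.txt', 'w', encoding='utf-8') as f:
    f.write(gutenberg.raw('shakespeare-hamlet.txt'))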

1. Read the corpus and build input/output pairs (data_loader.py)

def tokenize(lines):
    # The end-of-sentence token string was stripped from the original post;
    # '<EOS>' is assumed here (it only has to match the dataset's special tokens below).
    END = ['<EOS>']
    # Append the end-of-sentence symbol to every sentence
    sents = [str.lower(str(line)).split() + END for line in lines]
    return sents


# Optionally restrict the corpus to a suitable number of lines / sentence lengths
def readin(path):
    list_lines = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if len(line.strip()) == 0:  # skip empty lines
                continue
            words = line.split()
            # if 20 < len(words) < 500:
            list_lines.append(line)
    return list_lines[:100]


# Build the inputs and outputs
def construct_input_output(sents):
    input_sents = [words[:-1] for words in sents]
    output_sents = [words[1:] for words in sents]
    return input_sents, output_sents


def process(path):
    list_of_lines = readin(path)
    lines_tokens = tokenize(list_of_lines)
    input, output = construct_input_output(lines_tokens)
    return input, output
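A quick sanity check of the pipeline above (the file name is a placeholder for wherever your corpus lives):

input_sents, output_sents = process('hamlet.txt')
print(input_sents[0])    # first sentence without its last token
print(output_sents[0])   # the same sentence shifted left by one token, ending in the EOS symbol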

2. Build the dataset and dataloader from the corpus (data_loader.py)

This step makes batching much easier: compared with the hand-rolled batching schemes in most online tutorials, it is friendlier, more efficient and more readable. The other key piece is the dataset itself and the vocabulary (vocab) we need. We map words to vocabulary indices (word2index) and indices back to words (index2word) with two methods, vectorize and unvectorize. Most importantly, the dataset has to implement the two predefined methods for fetching an item and for reporting its length: it is built on from torch.utils.data import Dataset (see the PyTorch documentation), and the __getitem__ and __len__ methods must both be implemented.

import torch
from torch.nn import functional as F
from torch.utils.data import Dataset
from gensim.corpora.dictionary import Dictionary


class LangDataset(Dataset):
    def __init__(self, src_sents, trg_sents, max_len=-1):
        self.src_sents = src_sents
        self.trg_sents = trg_sents
        # Create the vocabulary for both the source and target.
        self.vocab = Dictionary(src_sents + trg_sents)
        # Patch the vocabulary and add the special symbols.
        # The token strings were stripped from the original post; '<pad>', '<unk>' and
        # '<EOS>' are assumed here, with the same indices the rest of the code relies on.
        special_tokens = {'<pad>': 0, '<unk>': 1, '<EOS>': 2}
        self.vocab.patch_with_special_tokens(special_tokens)
        # Keep track of how many data points.
        self._len = len(src_sents)
        if max_len < 0:
            # If it's not set, find the longest text in the data.
            max_src_len = max(len(sent) for sent in src_sents)
            self.max_len = max_src_len
        else:
            self.max_len = max_len

    def pad_sequence(self, vectorized_sent, max_len):
        # To pad the sentence:
        # Pad left = 0; Pad right = max_len - len of sent.
        pad_dim = (0, max_len - len(vectorized_sent))
        return F.pad(vectorized_sent, pad_dim, 'constant')

    def __getitem__(self, index):
        vectorized_src = self.vectorize(self.vocab, self.src_sents[index])
        vectorized_trg = self.vectorize(self.vocab, self.trg_sents[index])
        return {'x': self.pad_sequence(vectorized_src, self.max_len),
                'y': self.pad_sequence(vectorized_trg, self.max_len),
                'x_len': len(vectorized_src),
                'y_len': len(vectorized_trg)}

    def __len__(self):
        return self._len

    def vectorize(self, vocab, tokens):
        """
        :param tokens: Tokens that should be vectorized.
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx
        # Lets just cast list of indices into torch tensors directly =)
        return torch.tensor(vocab.doc2idx(tokens, unknown_word_index=1))

    def unvectorize(self, vocab, indices):
        """
        :param indices: Converts the indices back to tokens.
        :type indices: list(int)
        """
        return [vocab[i] for i in indices]

Here gensim's recent patch_with_special_tokens is used to add the special markers, such as pad, unk and the end-of-sentence symbol, to the vocabulary. At this point we can already turn raw words into sequences of vocabulary indices. Next we can build the model.
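The training code in section 4 calls a helper get_dataset_dataloader that is not shown in the post. A minimal sketch of what it presumably does, wrapping LangDataset in a standard torch DataLoader, could look like this; the signature is inferred from how train() uses it:

from torch.utils.data import DataLoader


def get_dataset_dataloader(path, batch_size):
    # Build the input/output token sequences, wrap them in the LangDataset above,
    # and let DataLoader handle batching and shuffling.
    input_sents, output_sents = process(path)
    lang_dataset = LangDataset(input_sents, output_sents)
    dataloader = DataLoader(dataset=lang_dataset, batch_size=batch_size,
                            shuffle=True, drop_last=True)
    return lang_dataset, dataloader, input_sents, output_sents

drop_last=True is an assumption, but a convenient one: the training loop initializes the hidden state with a fixed config.batch_size, so a smaller final batch would not match its shape.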

3. Build the model with an LSTM (model.py)

If you are familiar with the RNN family there isn't much to say: a three-layer structure of embedding layer, LSTM layer and fully connected layer. Printing the model gives:

RNNLM(
  (embed): Embedding(3910, 300)
  (lstm): LSTM(300, 1024, num_layers=2, batch_first=True)
  (fc): Linear(in_features=1024, out_features=3910, bias=True)
)

Here we use 300-dimensional word embeddings, a 2-layer LSTM, and a hidden size of 1024. The code:

import torch.nn as nn
from torch.nn import functional as F


class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1, dropout_p=0.5):
        super(RNNLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self._dropout_p = dropout_p

    def forward(self, x, h):
        # Embed word ids to vectors
        x = self.embed(x)
        # Forward propagate LSTM
        out, h = self.lstm(x, h)
        batch_size, seq_size, hidden_size = out.shape
        # Reshape output to (batch_size*sequence_length, hidden_size)
        out = out.contiguous().view(batch_size * seq_size, hidden_size)
        # apply dropout, then project to the vocabulary
        out = self.fc(F.dropout(out, p=self._dropout_p))
        out_feat = out.shape[-1]
        out = out.view(batch_size, seq_size, out_feat)
        return out, h

Here we only care about the per-step outputs, so we take the output at every time step and do nothing special with the hidden state. Because batching is involved, the batch and sequence-length dimensions have to be merged before the fully connected layer, otherwise the output cannot be fed into it; after the FC layer we reshape the output back into (batch, sequence, feature).
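A quick shape check of that forward pass, using the sizes from the printed model above; the initial hidden state is a pair of zero tensors shaped (num_layers, batch, hidden):

import torch

model = RNNLM(vocab_size=3910, embed_size=300, hidden_size=1024, num_layers=2)
x = torch.randint(0, 3910, (8, 20))                      # (batch, seq) of word ids
h0 = (torch.zeros(2, 8, 1024), torch.zeros(2, 8, 1024))  # (num_layers, batch, hidden)
out, h = model(x, h0)
print(out.shape)  # torch.Size([8, 20, 3910]): one distribution over the vocabulary per step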

4. The training procedure (train.py)

For training, the rough procedure is:

Read the inputs and outputs x and y and move them to the GPU (without a GPU it runs at turtle speed), and take the vocabulary size as the initialization parameter of the embedding layer.

Initialize the model: model = RNNLM(a pile of parameters ( ̄▽ ̄)~*).

Define the optimizer; here it is Adam (it tends to converge quickly, but SGD or anything else works too).

Define the loss: cross entropy loss, with one twist. Because sentences of different lengths are padded, the pad positions have to be masked out so they do not affect the loss. A nicer way to report a language model's loss is perplexity, which is simply the exponential (base e) of the cross-entropy loss; the relation is written out right after this list, and a derivation can be found in (Perplexity Vs Cross-entropy).

Record the loss and accuracy for every epoch.

Define the path where the model is saved.
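The relation mentioned above, written out for the mean per-token cross entropy, where N is the number of non-padding target tokens:

\mathrm{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\Big) = \exp(\text{cross-entropy loss})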

import math

import torch
import torch.optim as optim
from torch.autograd import Variable
from torch.nn import functional as F
from tqdm import tqdm

# RNNLM, config, device, path, get_dataset_dataloader and generate come from the
# project's other files (not shown in the post).


# Flatten the batched predictions and targets
def normalize_sizes(y_pred, y_true):
    if len(y_pred.size()) == 3:
        y_pred = y_pred.contiguous().view(-1, y_pred.size(2))
    if len(y_true.size()) == 2:
        y_true = y_true.contiguous().view(-1)
    return y_pred, y_true


# Accuracy, ignoring the padded positions
def compute_accuracy(y_pred, y_true, mask_index=0):
    y_pred, y_true = normalize_sizes(y_pred, y_true)
    _, y_pred_indices = y_pred.max(dim=1)
    correct_indices = torch.eq(y_pred_indices, y_true).float()
    valid_indices = torch.ne(y_true, mask_index).float()
    n_correct = (correct_indices * valid_indices).sum().item()
    n_valid = valid_indices.sum().item()
    return n_correct / n_valid * 100


# Sequence loss: cross entropy with the pad index masked out
def sequence_loss(y_pred, y_true, mask_index=0):
    y_pred, y_true = normalize_sizes(y_pred, y_true)
    return F.cross_entropy(y_pred, y_true, ignore_index=mask_index)

The training loop:

# The training loop
def train():
    lang_dataset, dataloader, input_sents, output_sents = \
        get_dataset_dataloader(path, config.batch_size)
    vocab_size = len(lang_dataset.vocab)
    model = RNNLM(vocab_size, config.embed_size, config.hidden_size,
                  config.num_layers, config.dropout_p)
    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    train_loss = []
    train_acc = []
    # initialize the loss
    best_loss = 9999999.0
    for epoch in range(config.num_epochs):
        # Initialize the hidden state (h_0, c_0)
        states = (Variable(torch.zeros(config.num_layers, config.batch_size, config.hidden_size)).to(device),
                  Variable(torch.zeros(config.num_layers, config.batch_size, config.hidden_size)).to(device))
        running_loss = 0.0
        running_acc = 0.0
        model.train()
        batch_index = 0
        for data_dict in tqdm(dataloader):
            batch_index += 1
            optimizer.zero_grad()
            x = data_dict['x'].to(device)
            y = data_dict['y'].to(device)
            y_pred, states = model(x, states)
            loss = sequence_loss(y_pred, y)
            # states are carried across batches, so the graph has to be kept
            loss.backward(retain_graph=True)
            optimizer.step()
            # running averages of loss and accuracy over the epoch
            running_loss += (loss.item() - running_loss) / batch_index
            acc_t = compute_accuracy(y_pred, y)
            running_acc += (acc_t - running_acc) / (batch_index + 1)
        print('Epoch = %d, Train loss = %f, Train accuracy = %f, Train perplexity = %f'
              % (epoch, running_loss, running_acc, math.exp(running_loss)))
        train_loss.append(running_loss)
        train_acc.append(running_acc)
        if running_loss < best_loss:
            # save the current best model
            torch.save(model, './model_save/best_model_epoch%d_loss_%f.pth' % (epoch, loss))
            best_loss = running_loss
        print(' '.join(generate(model, lang_dataset, 'the')))
    return train_loss, train_acc
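A note on loss.backward(retain_graph=True): it is needed here because the hidden states returned by the model are fed straight into the next batch, so they still reference the previous batch's graph. A common alternative, not what this post does but worth knowing, is to detach the states at the start of each batch and drop retain_graph; inside the for data_dict loop that would look like:

# Variant: truncate backpropagation at batch boundaries instead of retaining the whole graph.
states = tuple(s.detach() for s in states)   # cut the link to the previous batch's graph
y_pred, states = model(x, states)
loss = sequence_loss(y_pred, y)
loss.backward()                              # retain_graph is no longer needed
optimizer.step()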

Then you basically just wait for training to finish.

I also wrote a generator: take the trained model, feed it a first word by hand, and see what kind of sentence it comes up with; generation stops as soon as the end-of-sentence symbol is produced.

5. Generate a sentence from a given first word (generation.py)

def generate(input_word, dataset_p, model_p, word_len=100, temperature=1.0):
    model = torch.load(model_p)
    # get_dataset (not shown in the post) rebuilds the LangDataset from the corpus path
    dataset = get_dataset(dataset_p)
    model.eval()
    hidden = (Variable(torch.zeros(config.num_layers, 1, config.hidden_size)).to(device),
              Variable(torch.zeros(config.num_layers, 1, config.hidden_size)).to(device))
    # batch_size is 1
    start_idx = dataset.vectorize(dataset.vocab, [input_word])
    input_tensor = torch.stack([start_idx] * 1)
    input = input_tensor.to(device)
    word_list = [input_word]
    for i in range(word_len):
        # generate word by word
        output, hidden = model(input, hidden)
        word_weights = output.squeeze().data.div(temperature).exp().cpu()
        # sample the next word index in proportion to the predicted probabilities
        word_idx = torch.multinomial(word_weights, 1)[0]
        if word_idx == 2:
            # index 2 is the end-of-sentence symbol, so stop
            break
        input.data.fill_(word_idx)  # put new word into input
        word = dataset.unvectorize(dataset.vocab, [word_idx.item()])
        word_list.append(word[0])
    return word_list
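An example call; the paths are placeholders, and dataset_p should point at the same corpus used for training so that the rebuilt vocabulary matches the saved model:

words = generate('the', dataset_p='hamlet.txt', model_p='./model_save/best_model.pth')
print(' '.join(words))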

Here I trained on the small dataset and then used "the" as the first word for generation. Some generated results at different epochs:

Epoch = 4, Train loss = 4.982013, Train accuracy = 8.298720, Train perplexity = 145.767445

the voltemand, cold, of buried hamlet ***

Epoch = 11, Train loss = 2.527130, Train accuracy = 28.159594, Train perplexity = 12.517534

the terms terms use of of hamlet, the the states world late

Epoch = 15, Train loss = 1.226519, Train accuracy = 50.120127, Train perplexity = 3.409342

the tragedy of the prince of hamlet's

As the number of training epochs grows, both perplexity and loss go down, the word-prediction accuracy goes up, and the generated sentences become more and more coherent. Overall the results are quite decent.

It's a nice little exercise, and you can play with it on different corpora...

I'll publish the full code once the assignment has been handed in.
