Pytorch BiLSTM + CRF做NER

說明：本篇文章為Pytorch官網上BiLSTM CRF的說明

一。 BILSTM + CRF介紹

https：//www。

jianshu。com/p/97cb3b6db

573

1。介紹

基於神經網路的方法，在命名實體識別任務中非常流行和普遍。如果你不知道Bi-LSTM和CRF是什麼，你只需要記住他們分別是命名實體識別模型中的兩個層。

1。1開始之前

我們假設我們的資料集中有兩類實體——人名和地名，與之相對應在我們的訓練資料集中，有五類標籤：

B-Person， I- Person，B-Organization，I-Organization

假設句子x由五個字元w1，w2，w3，w4，w5組成，其中【w1，w2】為人名類實體，【w3】為地名類實體，其他字元標籤為“O”。

1。2BiLSTM-CRF模型

以下將給出模型的結構：

第一，句子x中的每一個單元都代表著由字嵌入或詞嵌入構成的向量。其中，字嵌入是隨機初始化的，詞嵌入是透過資料訓練得到的。所有的嵌入在訓練過程中都會調整到最優。

第二，這些字或詞嵌入為BiLSTM-CRF模型的輸入，輸出的是句子x中每個單元的標籤。

儘管一般不需要詳細瞭解BiLSTM層的原理，但是為了更容易知道CRF層的執行原理，我們需要知道BiLSTM的輸出層。

如上圖所示，BiLSTM層的輸出為每一個標籤的預測分值，例如，對於單元w0，BiLSTM層輸出的是

1。5 （B-Person）， 0。9 （I-Person）， 0。1 （B-Organization）， 0。08 （I-Organization） 0。05 （O）。

這些分值將作為CRF的輸入。 ### 1。3 如果沒有CRF層會怎樣

你也許已經發現了，即使沒有CRF層，我們也可以訓練一個BiLSTM命名實體識別模型，如圖所示：

由於BiLSTM的輸出為單元的每一個標籤分值，我們可以挑選分值最高的一個作為該單元的標籤。例如，對於單元w0，“B-Person”有最高分值—— 1。5，因此我們可以挑選“B-Person”作為w0的預測標籤。同理，我們可以得到w1——“I-Person”，w2—— “O” ，w3——“B-Organization”，w4——“O”。

雖然我們可以得到句子x中每個單元的正確標籤，但是我們不能保證標籤每次都是預測正確的。例如，圖4。中的例子，標籤序列是“I-Organization I-Person” and “B-Organization I-Person”，很顯然這是錯誤的。

1。4 CRF層能從訓練資料中獲得約束性的規則

CRF層可以為最後預測的標籤新增一些約束來保證預測的標籤是合法的。在訓練資料訓練過程中，這些約束可以透過CRF層自動學習到。這些約束可以是： I：句子中第一個詞總是以標籤“B-“ 或 “O”開始，而不是“I-” II：標籤“B-label1 I-label2 I-label3 I-…”，label1， label2， label3應該屬於同一類實體。例如，“B-Person I-Person” 是合法的序列，但是“B-Person I-Organization” 是非法標籤序列。 III：標籤序列“O I-label” is 非法的。實體標籤的首個標籤應該是 “B-“ ，而非 “I-“，換句話說，有效的標籤序列應該是“O B-label”。有了這些約束，標籤序列預測中非法序列出現的機率將會大大降低。

二。標籤的score和損失函式的定義

https：//

zhuanlan。zhihu。com/p/27

338210

Bi-LSTM layer的輸出維度是tag size，這就相當於是每個詞 w~i~ 對映到tag的發射機率值，設Bi-LSTM的輸出矩陣為P，其中P~i，j~代表詞w~i~對映到tag~j~的非歸一化機率。對於CRF來說，我們假定存在一個轉移矩陣A，則A~i，j~代表tag~i~轉移到tag~j~的轉移機率。對於輸入序列

對應的輸出tag序列 y，定義分數為

利用Softmax函式，我們為每一個正確的tag序列y定義一個機率值（Y~X~代表所有的tag序列，包括不可能出現的）

因而在訓練中，我們只需要最大化似然機率p（y|X）即可，這裡我們利用對數似然

所以我們將損失函式定義為-log（p（y|X）），就可以利用梯度下降法來進行網路的學習了。

loss function:

在對損失函式進行計算的時候，S（X，y）的計算很簡單，而

（下面記作logsumexp）的計算稍微複雜一些，因為需要計算每一條可能路徑的分數。這裡用一種簡便的方法，對於到詞w~i+1~的路徑，可以先把到詞w~i~的logsumexp計算出來，因為

因此先計算每一步的路徑分數和直接計算全域性分數相同，但這樣可以大大減少計算的時間。

三。對於損失函式的詳細解釋

這篇文章對於理解十分有用

https：//

blog。csdn。net/cuihuijun

1hao/article/details/79405740

舉例說【我愛中國人民】對應標籤【N V N】那這個標籤就是一個完整的路徑，也就對應一個Score值。接下來我想講的是這個公式：

這個公式成立是很顯然的，動筆算一算就知道了，程式碼裡其實就是用了這個公式的原理。

def

_forward_alg

（

self

，

feats

）：

# Do the forward algorithm to compute the partition function

init_alphas

torch

。

full

（（

，

self

。

tagset_size

），

10000。

）

# START_TAG has all of the score。

init_alphas

［

］［

self

。

tag_to_ix

［

START_TAG

］］

0。

# Wrap in a variable so that we will get automatic backprop

forward_var

init_alphas

# Iterate through the sentence

for

feat

feats

：

alphas_t

［］

# The forward tensors at this timestep

for

next_tag

range

（

self

。

tagset_size

）：

# broadcast the emission score： it is the same regardless of

# the previous tag

emit_score

feat

［

next_tag

］

。

view

（

，

）

。

expand

（

，

self

。

tagset_size

）

# the ith entry of trans_score is the score of transitioning to

# next_tag from i

trans_score

self

。

transitions

［

next_tag

］

。

view

（

，

）

# The ith entry of next_tag_var is the value for the

# edge （i -> next_tag） before we do log-sum-exp

next_tag_var

forward_var

trans_score

emit_score

# The forward variable for this tag is log-sum-exp of all the

# scores。

alphas_t

。

append

（

log_sum_exp

（

next_tag_var

）

。

view

（

））

forward_var

torch

。

cat

（

alphas_t

）

。

view

（

，

）

terminal_var

forward_var

self

。

transitions

［

self

。

tag_to_ix

［

STOP_TAG

］］

alpha

log_sum_exp

（

terminal_var

）

return

alpha

我們看到有這麼一段程式碼：

next_tag_var

forward_var

trans_score

emit_score

我們主要就是來講講他。首先這個演算法的思想是：假設我們要做一個詞性標註的任務，對句子【我愛中華人民】，我們要對這個句子做

意思就是對這個句子所有可能的標註，都算出來他們的Score，然後按照指數次冪加起來，再取對數。一般來說取所有可能的標註情況比較複雜，我們這裡舉例是長度為三，但是實際過程中，可能比這個要大得多，所以我們需要有一個簡單高效得演算法。也就是我們程式中得用得演算法，他是這麼算得：先算出【我，愛】可能標註得所有情況，取 log_sum_exp 然後加上轉換到【中國人民】得特徵值再加上【中國人民】對應得某個標籤得特徵值。其等價於【我，愛，中國人民】所有可能特徵值指數次冪相加，然後取對數。接下來我們來驗證一下是不是這樣

首先我們假設詞性一共只有兩種名詞N 和動詞 V 那麼【我，愛】得詞性組合一共有四種 N + N，N + V， V + N， V + V 那麼【愛】標註為N時得log_sum_exp 為

【愛】標註為 V時的 log_sum_exp為

我們的forward列表裡就是存在著這兩個值，即：

假設【中華人民】得詞性為N，我們按照程式碼來寫一下公式，在forward列表對應位置相加就是這樣

四。程式碼塊詳細說明：

https：//

blog。csdn。net/Jason__Li

ang/article/details/81772632

先說明兩個重要的矩陣：

feats：發射矩陣（emit score）是sentence 在embedding後，再經過LSTM後得到的矩陣（也就是LSTM的輸出），維度為11 * 5 （11為sentence 的length，5是標籤數）。這個矩陣表示經過LSTM後sentence的每個word對應的每個labels的得分）。表示發射機率。

self。transitions：轉移矩陣，維度為5

5，transitions［i］［j］表示label j轉移到label i的機率。transtion［i］維度為1

5，表示每個label轉移到label i的機率。表示機率矩陣

1。 def log_sum_exp（vec）

# compute log sum exp in numerically stable way for the forward algorithm

def

log_sum_exp

（

vec

）：

#vec是1*5， type是Variable

max_score

vec

［

，

argmax

（

vec

）］

# max_score維度是１，max_score。view（1，-1）維度是１＊１，

# max_score。view（1， -1）。expand（1， vec。size（）［1］）的維度是１＊５

max_score_broadcast

max_score

。

view

（

，

）

。

expand

（

，

vec

。

size

（）［

］）

# 裡面先做減法，減去最大值可以避免e的指數次，計算機上溢

return

max_score

torch

。

log

（

torch

。

sum

（

torch

。

exp

（

vec

max_score_broadcast

）））

你可能會疑問return 的結果為什麼先減去max_score。其實這是一個小技巧，因為一上來就做指數運算可能會引起計算結果溢位，先減去score，經過log_sum_exp後，再把max_score給加上。其實就等同於：

return

torch

。

log

（

torch

。

sum

（

torch

。

exp

（

vec

）））

2。 def neg_log_likelihood（self， sentence， tags）

如果你完整地把程式碼讀完，你會發現neg_log_likelihood（）這個函式是loss function。

loss

model

。

neg_log_likelihood

（

sentence_in

，

targets

）

我們來分析一下neg_log_likelihood（）函式程式碼：

def

neg_log_likelihood

（

self

，

sentence

，

tags

）：

# feats： 11*5 經過了LSTM+Linear矩陣後的輸出，之後作為CRF的輸入。

feats

self

。

_get_lstm_features

（

sentence

）

forward_score

self

。

_forward_alg

（

feats

）

gold_score

self

。

_score_sentence

（

feats

，

tags

）

return

forward_score

gold_score

你在這裡可能會有疑問：問什麼forward_score - gold_score可以作為loss呢。這裡，我們回顧一下我們在上文中說明的loss function函式公式：

你就會發現forward_score和gold_score分別表示上述等式右邊的兩項。

3。 def _forward_alg（self， feats）：

我們透過上一個函式的分析得知這個函式就是用來求forward_score的，也就是loss function等式右邊的第一項：

# 預測序列的得分

# 只是根據隨機的transitions，前向傳播算出的一個score

#用到了動態規劃的思想，但因為用的是隨機的轉移矩陣，算出的值很大score>20

def

_forward_alg

（

self

，

feats

）：

# do the forward algorithm to compute the partition function

init_alphas

torch

。

full

（（

，

self

。

tagset_size

），

10000。

）

# 1*5 而且全是-10000

# START_TAG has all of the score

# 因為start tag是4，所以tensor（［［-10000。， -10000。， -10000。， 0。， -10000。］］），

# 將start的值為零，表示開始進行網路的傳播，

init_alphas

［

］［

self

。

tag_to_ix

［

START_TAG

］］

# warp in a variable so that we will get automatic backprop

forward_var

init_alphas

# 初始狀態的forward_var，隨著step t變化

# iterate through the sentence

# 會迭代feats的行數次

for

feat

feats

：

#feat的維度是５依次把每一行取出來~

alphas_t

［］

# the forward tensors at this timestep

for

next_tag

range

（

self

。

tagset_size

）：

#next tag 就是簡單 i，從0到len

# broadcast the emission（發射） score：

# it is the same regardless of the previous tag

# 維度是1*5 LSTM生成的矩陣是emit score

emit_score

feat

［

next_tag

］

。

view

（

，

）

。

expand

（

，

self

。

tagset_size

）

# the i_th entry of trans_score is the score of transitioning

# to next_tag from i

trans_score

self

。

transitions

［

next_tag

］

。

view

（

，

）

# 維度是1*5

# The ith entry of next_tag_var is the value for the

# edge （i -> next_tag） before we do log-sum-exp

# 第一次迭代時理解：

# trans_score所有其他標籤到Ｂ標籤的機率

# 由lstm執行進入隱層再到輸出層得到標籤Ｂ的機率，emit_score維度是１＊５，5個值是相同的

next_tag_var

forward_var

trans_score

emit_score

# The forward variable for this tag is log-sum-exp of all the scores

alphas_t

。

append

（

log_sum_exp

（

next_tag_var

）

。

view

（

））

# 此時的alphas t 是一個長度為5，例如：

# ［tensor（0。8259）， tensor（2。1739）， tensor（1。3526）， tensor（-9999。7168）， tensor（-0。7102）］

forward_var

torch

。

cat

（

alphas_t

）

。

view

（

，

）

#到第（t-1）step時５個標籤的各自分數

# 最後只將最後一個單詞的forward var與轉移 stop tag的機率相加

# tensor（［［ 21。1036， 18。8673， 20。7906， -9982。2734， -9980。3135］］）

terminal_var

forward_var

self

。

transitions

［

self

。

tag_to_ix

［

STOP_TAG

］］

alpha

log_sum_exp

（

terminal_var

）

# alpha是一個0維的tensor

return

alpha

4。 def _score_sentence（self， feats， tags）

由2的函式分析，我們知道這個函式就是求gold_score，即loss function的第二項

# 根據真實的標籤算出的一個score，

# 這與上面的def _forward_alg（self， feats）共同之處在於：

# 兩者都是用的隨機轉移矩陣算的score

# 不同地方在於，上面那個函式算了一個最大可能路徑，但實際上可能不是真實的各個標籤轉移的值

# 例如：真實標籤是N V V，但是因為transitions是隨機的，所以上面的函式得到其實是N N N這樣，

# 兩者之間的score就有了差距。而後來的反向傳播，就能夠更新transitions，使得轉移矩陣逼近

#真實的“轉移矩陣”

# 得到gold_seq tag的score 即根據真實的label 來計算一個score，

# 但是因為轉移矩陣是隨機生成的，故算出來的score不是最理想的值

def

_score_sentence

（

self

，

feats

，

tags

）：

#feats 11*5 tag 11 維

# gives the score of a provied tag sequence

score

torch

。

zeros

（

）

# 將START_TAG的標籤３拼接到tag序列最前面，這樣tag就是12個了

tags

torch

。

cat

（［

torch

。

tensor

（［

self

。

tag_to_ix

［

START_TAG

］］，

dtype

torch

。

long

），

tags

］）

for

，

feat

enumerate

（

feats

）：

# self。transitions［tags［i + 1］， tags［i］］實際得到的是從標籤i到標籤i+1的轉移機率

# feat［tags［i+1］］， feat是step i 的輸出結果，有５個值，

# 對應B， I， E， START_TAG， END_TAG，取對應標籤的值

# transition【j，i】就是從i ->j 的轉移機率值

score

self

。

transitions

［

tags

［

］，

tags

［

］］

feat

［

tags

［

］］

score

self

。

transitions

［

self

。

tag_to_ix

［

STOP_TAG

］，

tags

［

］］

return

score

5。 def _viterbi_decode（self， feats）：

# 維特比解碼，實際上就是在預測的時候使用了，輸出得分與路徑值

# 預測序列的得分

def

_viterbi_decode

（

self

，

feats

）：

backpointers

［］

# initialize the viterbi variables in long space

init_vvars

torch

。

full

（（

，

self

。

tagset_size

），

10000。

）

init_vvars

［

］［

self

。

tag_to_ix

［

START_TAG

］］

# forward_var at step i holds the viterbi variables for step i-1

forward_var

init_vvars

for

feat

feats

：

bptrs_t

［］

# holds the backpointers for this step

viterbivars_t

［］

# holds the viterbi variables for this step

for

next_tag

range

（

self

。

tagset_size

）：

# next-tag_var［i］ holds the viterbi variable for tag i

# at the previous step， plus the score of transitioning

# from tag i to next_tag。

# we don‘t include the emission scores here because the max

# does not depend on them（we add them in below）

# 其他標籤（B，I，E，Start，End）到標籤next_tag的機率

next_tag_var

forward_var

self

。

transitions

［

next_tag

］

best_tag_id

argmax

（

next_tag_var

）

bptrs_t

。

append

（

best_tag_id

）

viterbivars_t

。

append

（

next_tag_var

［

］［

best_tag_id

］

。

view

（

））

# now add in the emssion scores， and assign forward_var to the set

# of viterbi variables we just computed

# 從step0到step（i-1）時5個序列中每個序列的最大score

forward_var

（

torch

。

cat

（

viterbivars_t

）

feat

）

。

view

（

，

）

backpointers

。

append

（

bptrs_t

）

# bptrs_t有５個元素

# transition to STOP_TAG

# 其他標籤到STOP_TAG的轉移機率

terminal_var

forward_var

self

。

transitions

［

self

。

tag_to_ix

［

STOP_TAG

］］

best_tag_id

argmax

（

terminal_var

）

path_score

terminal_var

［

］［

best_tag_id

］

# follow the back pointers to decode the best path

best_path

［

best_tag_id

］

for

bptrs_t

reversed

（

backpointers

）：

best_tag_id

bptrs_t

［

best_tag_id

］

best_path

。

append

（

best_tag_id

）

# pop off the start tag

# we don’t want to return that ti the caller

start

best_path

。

pop

（）

assert

start

self

。

tag_to_ix

［

START_TAG

］

# Sanity check

best_path

。

reverse

（）

# 把從後向前的路徑正過來

return

path_score

，

best_path

如果對於該函式還沒有太理解，可以參考這篇部落格：

https：//

blog。csdn。net/appleml/a

rticle/details/78579675

總結

以上就是我結合了幾篇比較不錯的部落格後的總結，歡迎大家提問。

Pytorch BiLSTM + CRF做NER

究竟什麼樣的婚外情，徹底斷不了？

DC的漫畫《蝙蝠俠：第一年》好在哪裡？該如何評價？

隨便看看

陳情令魏嬰是什麼武功？

13排巨幕廳最好的位置？

6個月的狗狗能吃奶糕的狗糧嗎？

白內障有啥好辦法治嗎，朋友們，手術的併發症後遺症太多了？

Pytorch BiLSTM + CRF做NER

究竟什麼樣的婚外情，徹底斷不了？

DC的漫畫《蝙蝠俠：第一年》好在哪裡？該如何評價？

猜你喜歡

法國勝澳大利亞全靠高科技，如何看待新技術在足球比賽尤其是世界盃上的作用？

如何鑑別MAC地址是否正確

看到過不少懟江澄的言語，為什麼呢？

隨便看看

陳情令魏嬰是什麼武功？

13排巨幕廳最好的位置？

6個月的狗狗能吃奶糕的狗糧嗎？

白內障有啥好辦法治嗎，朋友們，手術的併發症後遺症太多了？