BERT source code walkthrough: modeling.py

BERT is the encoder part of the Transformer. This walkthrough uses the google-bert source code as the example.

It is built around two important classes:

1. BertConfig

Most of the time only a few of these parameters are changed. Knowing them makes it easy to estimate the model size; for example, the hidden size is 768.

class BertConfig(object):

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """
    Constructor arguments:
    :param vocab_size: vocabulary size
    :param hidden_size: hidden layer output size
    :param num_hidden_layers: number of hidden (encoder) layers
    :param num_attention_heads: number of attention heads
    :param intermediate_size: intermediate layer size, used in the feed-forward part to map hidden_size -> intermediate_size
    :param hidden_act: hidden layer activation function
    :param hidden_dropout_prob: dropout of the hidden layers
    :param attention_probs_dropout_prob: dropout of the attention probabilities
    :param max_position_embeddings: maximum position embedding length, 512 by default
    :param type_vocab_size: vocabulary size of token_type_ids, which marks whether two sentences belong to the same pair; in practice it is usually 2, i.e. the ids are 0 or 1
    :param initializer_range: stddev/range of the weight initializer
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
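As noted above, these few numbers are enough to estimate the model size. A rough back-of-the-envelope estimate for the Chinese base model (my own illustration, assuming its vocabulary of 21128 entries; this is not code from modeling.py):

vocab_size, hidden, layers, inter, max_pos, type_vocab = 21128, 768, 12, 3072, 512, 2

# Embedding tables: word + position + token-type embeddings.
embeddings = (vocab_size + max_pos + type_vocab) * hidden

# One encoder layer: Q/K/V/output projections (with biases), the two
# feed-forward dense layers, and two layer norms (gamma + beta each).
per_layer = (4 * (hidden * hidden + hidden)
             + hidden * inter + inter
             + inter * hidden + hidden
             + 4 * hidden)

# Pooler: one dense layer hidden -> hidden.
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(total)  # about 1.02e8, i.e. roughly 102M parameters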

2. BertModel

The model itself. It is explained in three main parts (embeddings, encoder and pooler), which together map onto the structure of a single Transformer block.


The explanation starts from the initializer __init__, whose arguments are:

config: the BertConfig described above.

is_training: whether we are training. If not, dropout is skipped, since dropout only exists to avoid overfitting during training.

input_ids: the numeric representation of the input sentences, e.g. tf.constant([[31, 51, 99], [15, 5, 0]]) from the comments in the source.

input_mask: marks whether each position holds a real token; same length as input_ids.

token_type_ids: the type (segment id) of each token.

use_one_hot_embeddings: whether the initial word embedding lookup uses one-hot encoding.

scope: name of the tf variable scope, "bert" by default.

Three inputs are set up on entry: input_shape is [batch_size, seq_length], input_mask is [batch_size, seq_length], and token_type_ids is [batch_size, seq_length].

The main structure is as follows; each piece is covered in turn (a condensed sketch of the scope nesting follows this list).

variable scope "bert"
  variable scope "embeddings"
    word embeddings
    position embeddings and token-type (segment) embeddings
  variable scope "encoder"
  variable scope "pooler"
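For orientation, here is an abridged sketch of how these scopes nest inside __init__ (error handling, the is_training switch that zeroes the dropout rates, and some keyword arguments are left out; see the original BertModel for the full version):

with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # word embeddings
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)
    # add token-type and position embeddings, then layer norm + dropout
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        use_position_embeddings=True,
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        do_return_all_layers=True)
    self.sequence_output = self.all_encoder_layers[-1]

  with tf.variable_scope("pooler"):
    # take the [CLS] vector and project it (shown in detail in part (4))
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor, config.hidden_size, activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))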

(1) embedding: word embeddings

① Initialize the word embedding matrix of shape [vocab_size, embedding_size], i.e. vocabulary size x embedding dimension.

② Flatten input_ids, look the ids up in the embedding matrix (or multiply a one-hot encoding with it), then reshape the result to [batch_size, seq_length, embedding_size] as the output.

Detailed annotations:

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """
  Look up the word embeddings.
  :param input_ids: ids of the input tokens
  :param vocab_size: vocabulary size
  :param embedding_size: output embedding dimension
  :param initializer_range: stddev of the truncated-normal initializer
  :param word_embedding_name: name of the embedding variable in the network
  :param use_one_hot_embeddings: which lookup to use, one-hot matmul or tf.gather
  :return:
  """
  # The input is expected to be [batch_size, seq_length].
  # If it is 2-D, reshape it to 3-D: [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  # This matrix is [vocab_size, embedding_size], so every token id can fetch
  # its embedding row via one-hot multiplication.
  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  # Flatten the input ids, e.g. [[1,2],[3,4]] => [1,2,3,4]
  flat_input_ids = tf.reshape(input_ids, [-1])

  # Two ways to get the output matrix:
  # 1. multiply a one-hot encoding with embedding_table
  # 2. tf.gather, i.e. slice the rows out directly
  # Either way output is (batch_size * seq_length) * embedding_size
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  # Get [batch_size, seq_length] from input_ids
  input_shape = get_shape_list(input_ids)

  # The reshape target is input_shape[0:-1], i.e. [batch_size, seq_length],
  # plus [1 * embedding_size], so the output is [batch_size, seq_length, embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
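The two branches produce the same result; the one-hot route just expresses the lookup as a matrix multiplication. A tiny NumPy illustration of the equivalence (my own example, not from the source):

import numpy as np

np.random.seed(0)
vocab_size, embedding_size = 6, 4
embedding_table = np.random.randn(vocab_size, embedding_size)

flat_input_ids = np.array([1, 2, 3, 4])           # [[1, 2], [3, 4]] flattened

gathered = embedding_table[flat_input_ids]        # tf.gather equivalent
one_hot = np.eye(vocab_size)[flat_input_ids]      # [4, vocab_size]
multiplied = one_hot @ embedding_table            # tf.one_hot + tf.matmul equivalent

assert np.allclose(gathered, multiplied)          # both are (batch*seq, embedding_size)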

(2) embedding: position embeddings and token-type (segment) embeddings

① token type marks the type of each character, e.g. which sentence of a pair it belongs to. For "小明愛學習,小林愛學習" the segment ids [0,0,0,0,0,1,1,1,1,1] separate the first and second sentence; the bert-chinese model uses a default size of 2.

token_type_ids has shape [batch_size, seq_length]; the corresponding embedding table is [token_type_vocab_size, embedding_size].

The flattened input is multiplied with the embedding table and reshaped to [batch_size, seq_length, embedding_size], which is added directly to the word embeddings.

② use_position_embeddings: position embeddings. The attention mechanism itself carries no position information, so it is added separately at the input. Note that BERT's positions are learned parameters, not the sin/cos combinations of the original Transformer.

Initialize the embedding table with shape [max_position_embeddings, embedding_size].

Based on seq_length, simply slice out the first seq_length rows; at this point the matrix is [seq_length, embedding_size].

Because every batch uses the same position vectors, the slice is broadcast over batch_size to [batch_size, seq_length, embedding_size] so it can be added directly (a small NumPy illustration follows the annotated code below).

Add it to the output above.

③ Finally apply layer_norm_and_dropout and return the result.

Detailed annotations:

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """
  :param input_tensor: word embeddings looked up for the input ids
  :param use_token_type: whether to add token type embeddings
  :param token_type_ids: the token type ids
  :param token_type_vocab_size: size of the token type vocabulary; the Chinese BERT model uses 0/1 to mark the sentence pair, so it is 2
  :param token_type_embedding_name: name of the token type embedding variable
  :param use_position_embeddings: whether to add position embeddings
  :param position_embedding_name: name of the position embedding variable
  :param initializer_range: stddev of the truncated-normal initializer
  :param max_position_embeddings: maximum position embedding length
  :param dropout_prob: dropout rate used by the final layer_norm_and_dropout
  :return:
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Segment (token-type) information separating sentence A from sentence B;
  # token_type_vocab_size defaults to 2 in bert-chinese.
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # token_type_ids is [batch_size, seq_length].
    # Flatten it and, just like embedding_lookup above, fetch the segment embeddings.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    # Add the segment embeddings onto the word embeddings.
    output += token_type_embeddings

  # Position embedding information.
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      # full_position_embeddings is 512 * 768 in the Chinese base model:
      # the maximum position is 512 and each embedding has width 768.
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Positions are simply sequential, so no multiplication is needed;
      # slicing the first seq_length rows is enough.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Token type and word embeddings are both [batch_size, seq_length, embedding],
      # so they can be added directly, but position_embeddings is [seq_length, embedding].
      # Every batch uses the same position vectors, so broadcast over batch_size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  # Apply layer norm and dropout.
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
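The broadcast step is the only subtle part: position_embeddings is reshaped to [1, seq_length, width] so that adding it to the [batch_size, seq_length, width] output repeats the same position vectors for every example in the batch. A small NumPy illustration (my own, not from the source):

import numpy as np

batch_size, seq_length, width = 2, 3, 4
output = np.zeros((batch_size, seq_length, width))   # word + token-type embeddings
position_embeddings = np.arange(seq_length * width, dtype=float).reshape(seq_length, width)

# position_broadcast_shape ends up as [1, seq_length, width] for a rank-3 input
position_embeddings = position_embeddings.reshape(1, seq_length, width)
output += position_embeddings                         # broadcast over the batch dimension

# every example in the batch received the same position vectors
assert np.allclose(output[0], output[1])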

(3) encoder: the main body

attention_mask: a mask matrix is needed to know which positions hold real tokens. For example, with seq_length=10 but an input of only "小明愛學習", the mask is [1,1,1,1,1,0,0,0,0,0].

# Used when computing attention scores; the mask records which positions hold real tokens,
# e.g. seq_length=10 with input "小明愛學習" gives the mask [1,1,1,1,1,0,0,0,0,0]
attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)
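For reference, this helper broadcasts the [batch_size, to_seq_length] input_mask into a [batch_size, from_seq_length, to_seq_length] mask; a lightly abridged sketch of the original function:

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  # from_tensor: [batch_size, from_seq_length, ...], to_mask: [batch_size, to_seq_length]
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  # [batch_size, 1, to_seq_length]
  to_mask = tf.cast(tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # [batch_size, from_seq_length, 1] of ones, then a broadcast multiply:
  # every query position may attend to every non-padding key position.
  broadcast_ones = tf.ones(shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
  mask = broadcast_ones * to_mask   # [batch_size, from_seq_length, to_seq_length]
  return mask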

Then the attention blocks are built, num_hidden_layers=12 of them here.

Looping 12 times:

call attention_layer (the attention_layer function is described further below);

add a dense layer, build the residual connection by adding the input to the attention output, then apply layer_norm;

run the feed-forward part: with the Chinese-bert parameters, 768 -> 3072 (768*4) -> 768, followed by the residual connection and layer_norm again;

finally decide whether to return the outputs of all layers or only the last one.

Detailed annotations:

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """
  :param input_tensor: the fused embedding output from the previous step
  :param attention_mask: mask matrix indicating which positions hold real tokens
  :param hidden_size: hidden layer output size
  :param num_hidden_layers: number of layers in the main body
  :param num_attention_heads: number of attention heads
  :param intermediate_size: intermediate layer size (the dense layer after attention)
  :param intermediate_act_fn: activation of the intermediate layer
  :param hidden_dropout_prob: hidden layer dropout
  :param attention_probs_dropout_prob: dropout of the attention part
  :param initializer_range:
  :param do_return_all_layers: whether to return all layers
  :return:
  """
  # The heads split hidden_size evenly, so it must be divisible.
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # Keep the representation as a 2D tensor to avoid reshaping it back and
  # forth between 2D and 3D. Reshapes are normally free on GPU/CPU but can be
  # costly on TPU, so minimizing them helps the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      # prev_output holds the output of the most recent layer.
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # A dense layer, then the residual connection adding the input to the
        # attention output, followed by layer_norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # Feed-forward part: with the Chinese-bert parameters, 768 -> 3072 (768*4) -> 768,
      # then the residual connection and layer_norm again.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  # Return either every layer or only the last one.
  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
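The 2D/3D round-trip mentioned in the comments is handled by two small helpers; roughly, and with error handling trimmed (a sketch based on the original utilities in modeling.py):

def reshape_to_matrix(input_tensor):
  # [batch_size, seq_length, width] -> [batch_size * seq_length, width]
  if input_tensor.shape.ndims == 2:
    return input_tensor
  width = input_tensor.shape[-1]
  return tf.reshape(input_tensor, [-1, width])

def reshape_from_matrix(output_tensor, orig_shape_list):
  # [batch_size * seq_length, width] -> back to the original 3-D shape
  if len(orig_shape_list) == 2:
    return output_tensor
  output_shape = get_shape_list(output_tensor)
  orig_dims = orig_shape_list[0:-1]
  width = output_shape[-1]
  return tf.reshape(output_tensor, orig_dims + [width])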

The attention_layer function


1. Compute q, k and v; all three are computed from the fused embedding input.

2. Dot-product query and key to get the scores, then scale by 1/sqrt(d): tf.multiply is element-wise multiplication, so every score is multiplied by 1/sqrt(d), i.e. divided by sqrt(d).

3. Why the attention_mask? Masked positions get a very large negative number added to their scores. If a masked score were left at 0, softmax would still give it e^0 = 1 and it would influence the result; with a very large negative score, e raised to that value is essentially 0, so the softmax weight of masked positions is close to 0. A small numerical illustration follows this list.

4. Normalize with softmax to obtain the relevance (attention) matrix.

5. Fetch the values: multiply the relevance matrix attention_probs [B, N, F, T] with value [B, N, T, H].
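Point 3 can be seen with a tiny numerical example (my own illustration): adding -10000 to the masked scores drives their softmax weights to essentially zero, whereas leaving them at 0 would still give them weight.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.0, 0.0])   # last two positions are padding
mask = np.array([1.0, 1.0, 0.0, 0.0])

print(softmax(scores))                            # padding still gets ~8% weight each
print(softmax(scores + (1.0 - mask) * -10000.0))  # padding weights are ~0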

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """
  :param from_tensor: the fused embedding input from step one
  :param to_tensor: the fused embedding input from step one
  :param attention_mask: mask matrix indicating which positions hold real tokens
  :param num_attention_heads: number of heads, e.g. 12
  :param size_per_head: size of each head, e.g. 768 // 12 = 64
  :param query_act:
  :param key_act:
  :param value_act:
  :param attention_probs_dropout_prob:
  :param initializer_range:
  :param do_return_2d_tensor:
  :param batch_size:
  :param from_seq_length:
  :param to_seq_length:
  :return:
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or
        to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # q, k and v are all computed from the fused embedding input.
  # The input has batch_size*seq_length rows; the output width is
  # num_attention_heads*size_per_head, which is just hidden_size (768).
  # q, k and v are obtained the same way.
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # Rearrange the dimensions: the original layout is
  # [batch_size, seq_length, num_attention_heads, width];
  # permuting with [0, 2, 1, 3] gives
  # [batch_size, num_attention_heads, seq_length, size_per_head].
  # key gets the same treatment.
  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Dot-product query and key to get the scores, then scale by 1/sqrt(d):
  # tf.multiply is element-wise, so every score is multiplied by 1/sqrt(size_per_head).
  # The result is [batch_size, num_attention_heads, seq_length, seq_length].
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  # Masked positions get a very large negative number: when softmax is applied,
  # a score of 0 would still contribute e^0 = 1 and influence the result, but
  # e raised to a very large negative number is ~0, so the softmax weight of
  # masked positions is close to 0.
  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize with softmax to get the relevance (attention) matrix.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Dropout on the attention probabilities, so some positions are dropped.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # Fetch the values.
  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # Multiply the relevance matrix [B, N, F, T] with value [B, N, T, H].
  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  # The output is [B, F, N*H], i.e.
  # [batch_size, seq_length, num_attention_heads*size_per_head] (= hidden_size).
  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
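To make the shape bookkeeping concrete, here is what transpose_for_scores does to the query with the Chinese base configuration (an illustration with random data, not from the source):

import numpy as np

batch_size, seq_length, num_heads, size_per_head = 2, 128, 12, 64

# `query_layer` as produced by the dense layer: [B*F, N*H]
query_layer = np.random.randn(batch_size * seq_length, num_heads * size_per_head)

# transpose_for_scores: [B*F, N*H] -> [B, F, N, H] -> [B, N, F, H]
q = query_layer.reshape(batch_size, seq_length, num_heads, size_per_head)
q = q.transpose(0, 2, 1, 3)
print(q.shape)        # (2, 12, 128, 64), i.e. [B, N, F, H]

# scores: [B, N, F, H] x [B, N, H, T] -> [B, N, F, T]
scores = q @ q.transpose(0, 1, 3, 2)
print(scores.shape)   # (2, 12, 128, 128)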

(4) pooler

Only the vector of the first token is taken, because by this point every token has already attended to, and absorbed information from, all the others.

with tf.variable_scope("pooler"):
  # Take only the vector of the first token; each token has already
  # learned from all the other positions.
  # A dense layer then maps it to [batch_size, hidden_size].
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  self.pooled_output = tf.layers.dense(
      first_token_tensor,
      config.hidden_size,
      activation=tf.tanh,
      kernel_initializer=create_initializer(config.initializer_range))
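Putting it all together, here is a usage sketch along the lines of the example in the modeling.py docstring (abridged, and with num_attention_heads adjusted so it divides hidden_size evenly; `modeling` refers to the module itself):

# Inputs already converted into WordPiece token ids.
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

# [batch_size, hidden_size]: the pooled [CLS] vector from the pooler above,
# ready for a classification head.
pooled_output = model.get_pooled_output()

# [batch_size, seq_length, hidden_size]: per-token outputs of the last encoder layer.
sequence_output = model.get_sequence_output()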