BERT source code walkthrough: modeling.py

BERT is the encoder part of the Transformer. This walkthrough uses the google-bert source code as the example.

It is built around two important classes:

1. BertConfig

Most of the time only a few of these parameters are changed. Knowing them makes it easy to estimate the model size; for example, the hidden size is 768.

class BertConfig(object):

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """
    Constructor arguments:
    :param vocab_size: vocabulary size
    :param hidden_size: hidden layer output size
    :param num_hidden_layers: number of hidden (encoder) layers
    :param num_attention_heads: number of attention heads
    :param intermediate_size: intermediate layer size, used in the feed-forward part to map hidden_size -> intermediate_size
    :param hidden_act: hidden layer activation function
    :param hidden_dropout_prob: dropout of the hidden layers
    :param attention_probs_dropout_prob: dropout of the attention probabilities
    :param max_position_embeddings: maximum position embedding length, 512 by default
    :param type_vocab_size: vocabulary size of token_type_ids, which marks whether two sentences belong to the same pair; in practice it is usually 2, i.e. the ids are 0 or 1
    :param initializer_range: stddev/range of the weight initializer
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
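As noted above, these few numbers are enough to estimate the model size. A rough back-of-the-envelope estimate for the Chinese base model (my own illustration, assuming its vocabulary of 21128 entries; this is not code from modeling.py):

vocab_size, hidden, layers, inter, max_pos, type_vocab = 21128, 768, 12, 3072, 512, 2

# Embedding tables: word + position + token-type embeddings.
embeddings = (vocab_size + max_pos + type_vocab) * hidden

# One encoder layer: Q/K/V/output projections (with biases), the two
# feed-forward dense layers, and two layer norms (gamma + beta each).
per_layer = (4 * (hidden * hidden + hidden)
             + hidden * inter + inter
             + inter * hidden + hidden
             + 4 * hidden)

# Pooler: one dense layer hidden -> hidden.
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(total)  # about 1.02e8, i.e. roughly 102M parameters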

2. BertModel

The model itself. It is explained in three main parts (embeddings, encoder and pooler), which together map onto the structure of a single Transformer block.


The explanation starts from the initializer __init__, whose arguments are:

config: the BertConfig described above.

is_training: whether we are training. If not, dropout is skipped, since dropout only exists to avoid overfitting during training.

input_ids: the numeric representation of the input sentences, e.g. tf.constant([[31, 51, 99], [15, 5, 0]]) from the comments in the source.

input_mask: marks whether each position holds a real token; same length as input_ids.

token_type_ids: the type (segment id) of each token.

use_one_hot_embeddings: whether the initial word embedding lookup uses one-hot encoding.

scope: name of the tf variable scope, "bert" by default.

Three inputs are set up on entry: input_shape is [batch_size, seq_length], input_mask is [batch_size, seq_length], and token_type_ids is [batch_size, seq_length].

The main structure is as follows; each piece is covered in turn (a condensed sketch of the scope nesting follows this list).

variable scope "bert"
  variable scope "embeddings"
    word embeddings
    position embeddings and token-type (segment) embeddings
  variable scope "encoder"
  variable scope "pooler"
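For orientation, here is an abridged sketch of how these scopes nest inside __init__ (error handling, the is_training switch that zeroes the dropout rates, and some keyword arguments are left out; see the original BertModel for the full version):

with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # word embeddings
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)
    # add token-type and position embeddings, then layer norm + dropout
    self.embedding_output = embedding_postprocessor(
        input_tensor=self.embedding_output,
        use_token_type=True,
        token_type_ids=token_type_ids,
        token_type_vocab_size=config.type_vocab_size,
        use_position_embeddings=True,
        initializer_range=config.initializer_range,
        max_position_embeddings=config.max_position_embeddings,
        dropout_prob=config.hidden_dropout_prob)

  with tf.variable_scope("encoder"):
    attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)
    self.all_encoder_layers = transformer_model(
        input_tensor=self.embedding_output,
        attention_mask=attention_mask,
        hidden_size=config.hidden_size,
        num_hidden_layers=config.num_hidden_layers,
        num_attention_heads=config.num_attention_heads,
        intermediate_size=config.intermediate_size,
        do_return_all_layers=True)
    self.sequence_output = self.all_encoder_layers[-1]

  with tf.variable_scope("pooler"):
    # take the [CLS] vector and project it (shown in detail in part (4))
    first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
    self.pooled_output = tf.layers.dense(
        first_token_tensor, config.hidden_size, activation=tf.tanh,
        kernel_initializer=create_initializer(config.initializer_range))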

(1) embedding: word embeddings

① Initialize the word embedding matrix of shape [vocab_size, embedding_size], i.e. vocabulary size x embedding dimension.

② Flatten input_ids, look the ids up in the embedding matrix (or multiply a one-hot encoding with it), then reshape the result to [batch_size, seq_length, embedding_size] as the output.

Detailed annotations:

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """
  Look up the word embeddings.
  :param input_ids: ids of the input tokens
  :param vocab_size: vocabulary size
  :param embedding_size: output embedding dimension
  :param initializer_range: stddev of the truncated-normal initializer
  :param word_embedding_name: name of the embedding variable in the network
  :param use_one_hot_embeddings: which lookup to use, one-hot matmul or tf.gather
  :return:
  """
  # The input is expected to be [batch_size, seq_length].
  # If it is 2-D, reshape it to 3-D: [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  # This matrix is [vocab_size, embedding_size], so every token id can fetch
  # its embedding row via one-hot multiplication.
  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  # Flatten the input ids, e.g. [[1,2],[3,4]] => [1,2,3,4]
  flat_input_ids = tf.reshape(input_ids, [-1])

  # Two ways to get the output matrix:
  # 1. multiply a one-hot encoding with embedding_table
  # 2. tf.gather, i.e. slice the rows out directly
  # Either way output is (batch_size * seq_length) * embedding_size
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  # Get [batch_size, seq_length] from input_ids
  input_shape = get_shape_list(input_ids)

  # The reshape target is input_shape[0:-1], i.e. [batch_size, seq_length],
  # plus [1 * embedding_size], so the output is [batch_size, seq_length, embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
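The two branches produce the same result; the one-hot route just expresses the lookup as a matrix multiplication. A tiny NumPy illustration of the equivalence (my own example, not from the source):

import numpy as np

np.random.seed(0)
vocab_size, embedding_size = 6, 4
embedding_table = np.random.randn(vocab_size, embedding_size)

flat_input_ids = np.array([1, 2, 3, 4])           # [[1, 2], [3, 4]] flattened

gathered = embedding_table[flat_input_ids]        # tf.gather equivalent
one_hot = np.eye(vocab_size)[flat_input_ids]      # [4, vocab_size]
multiplied = one_hot @ embedding_table            # tf.one_hot + tf.matmul equivalent

assert np.allclose(gathered, multiplied)          # both are (batch*seq, embedding_size)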

(2) embedding: position embeddings and token-type (segment) embeddings

① token type marks the type of each character, e.g. which sentence of a pair it belongs to. For "小明愛學習,小林愛學習" the segment ids [0,0,0,0,0,1,1,1,1,1] separate the first and second sentence; the bert-chinese model uses a default size of 2.

token_type_ids has shape [batch_size, seq_length]; the corresponding embedding table is [token_type_vocab_size, embedding_size].

The flattened input is multiplied with the embedding table and reshaped to [batch_size, seq_length, embedding_size], which is added directly to the word embeddings.

② use_position_embeddings: position embeddings. The attention mechanism itself carries no position information, so it is added separately at the input. Note that BERT's positions are learned parameters, not the sin/cos combinations of the original Transformer.

Initialize the embedding table with shape [max_position_embeddings, embedding_size].

Based on seq_length, simply slice out the first seq_length rows; at this point the matrix is [seq_length, embedding_size].

Because every batch uses the same position vectors, the slice is broadcast over batch_size to [batch_size, seq_length, embedding_size] so it can be added directly (a small NumPy illustration follows the annotated code below).

Add it to the output above.

③ Finally apply layer_norm_and_dropout and return the result.

Detailed annotations:

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """
  :param input_tensor: word embeddings looked up for the input ids
  :param use_token_type: whether to add token type embeddings
  :param token_type_ids: the token type ids
  :param token_type_vocab_size: size of the token type vocabulary; the Chinese BERT model uses 0/1 to mark the sentence pair, so it is 2
  :param token_type_embedding_name: name of the token type embedding variable
  :param use_position_embeddings: whether to add position embeddings
  :param position_embedding_name: name of the position embedding variable
  :param initializer_range: stddev of the truncated-normal initializer
  :param max_position_embeddings: maximum position embedding length
  :param dropout_prob: dropout rate used by the final layer_norm_and_dropout
  :return:
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  # Segment (token-type) information separating sentence A from sentence B;
  # token_type_vocab_size defaults to 2 in bert-chinese.
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # token_type_ids is [batch_size, seq_length].
    # Flatten it and, just like embedding_lookup above, fetch the segment embeddings.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    # Add the segment embeddings onto the word embeddings.
    output += token_type_embeddings

  # Position embedding information.
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      # full_position_embeddings is 512 * 768 in the Chinese base model:
      # the maximum position is 512 and each embedding has width 768.
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Positions are simply sequential, so no multiplication is needed;
      # slicing the first seq_length rows is enough.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Token type and word embeddings are both [batch_size, seq_length, embedding],
      # so they can be added directly, but position_embeddings is [seq_length, embedding].
      # Every batch uses the same position vectors, so broadcast over batch_size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  # Apply layer norm and dropout.
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
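The broadcast step is the only subtle part: position_embeddings is reshaped to [1, seq_length, width] so that adding it to the [batch_size, seq_length, width] output repeats the same position vectors for every example in the batch. A small NumPy illustration (my own, not from the source):

import numpy as np

batch_size, seq_length, width = 2, 3, 4
output = np.zeros((batch_size, seq_length, width))   # word + token-type embeddings
position_embeddings = np.arange(seq_length * width, dtype=float).reshape(seq_length, width)

# position_broadcast_shape ends up as [1, seq_length, width] for a rank-3 input
position_embeddings = position_embeddings.reshape(1, seq_length, width)
output += position_embeddings                         # broadcast over the batch dimension

# every example in the batch received the same position vectors
assert np.allclose(output[0], output[1])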

(3) encoder: the main body

attention_mask: a mask matrix is needed to know which positions hold real tokens. For example, with seq_length=10 but an input of only "小明愛學習", the mask is [1,1,1,1,1,0,0,0,0,0].

# Used when computing attention scores; the mask records which positions hold real tokens,
# e.g. seq_length=10 with input "小明愛學習" gives the mask [1,1,1,1,1,0,0,0,0,0]
attention_mask = create_attention_mask_from_input_mask(input_ids, input_mask)
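For reference, this helper broadcasts the [batch_size, to_seq_length] input_mask into a [batch_size, from_seq_length, to_seq_length] mask; a lightly abridged sketch of the original function:

def create_attention_mask_from_input_mask(from_tensor, to_mask):
  # from_tensor: [batch_size, from_seq_length, ...], to_mask: [batch_size, to_seq_length]
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  batch_size = from_shape[0]
  from_seq_length = from_shape[1]

  to_shape = get_shape_list(to_mask, expected_rank=2)
  to_seq_length = to_shape[1]

  # [batch_size, 1, to_seq_length]
  to_mask = tf.cast(tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)

  # [batch_size, from_seq_length, 1] of ones, then a broadcast multiply:
  # every query position may attend to every non-padding key position.
  broadcast_ones = tf.ones(shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
  mask = broadcast_ones * to_mask   # [batch_size, from_seq_length, to_seq_length]
  return mask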

Then the attention blocks are built, num_hidden_layers=12 of them here.

Looping 12 times:

call attention_layer (the attention_layer function is described further below);

add a dense layer, build the residual connection by adding the input to the attention output, then apply layer_norm;

run the feed-forward part: with the Chinese-bert parameters, 768 -> 3072 (768*4) -> 768, followed by the residual connection and layer_norm again;

finally decide whether to return the outputs of all layers or only the last one.

Detailed annotations:

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """
  :param input_tensor: the fused embedding output from the previous step
  :param attention_mask: mask matrix indicating which positions hold real tokens
  :param hidden_size: hidden layer output size
  :param num_hidden_layers: number of layers in the main body
  :param num_attention_heads: number of attention heads
  :param intermediate_size: intermediate layer size (the dense layer after attention)
  :param intermediate_act_fn: activation of the intermediate layer
  :param hidden_dropout_prob: hidden layer dropout
  :param attention_probs_dropout_prob: dropout of the attention part
  :param initializer_range:
  :param do_return_all_layers: whether to return all layers
  :return:
  """
  # The heads split hidden_size evenly, so it must be divisible.
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # Keep the representation as a 2D tensor to avoid reshaping it back and
  # forth between 2D and 3D. Reshapes are normally free on GPU/CPU but can be
  # costly on TPU, so minimizing them helps the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      # prev_output holds the output of the most recent layer.
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # A dense layer, then the residual connection adding the input to the
        # attention output, followed by layer_norm.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # Feed-forward part: with the Chinese-bert parameters, 768 -> 3072 (768*4) -> 768,
      # then the residual connection and layer_norm again.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  # Return either every layer or only the last one.
  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
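The 2D/3D round-trip mentioned in the comments is handled by two small helpers; roughly, and with error handling trimmed (a sketch based on the original utilities in modeling.py):

def reshape_to_matrix(input_tensor):
  # [batch_size, seq_length, width] -> [batch_size * seq_length, width]
  if input_tensor.shape.ndims == 2:
    return input_tensor
  width = input_tensor.shape[-1]
  return tf.reshape(input_tensor, [-1, width])

def reshape_from_matrix(output_tensor, orig_shape_list):
  # [batch_size * seq_length, width] -> back to the original 3-D shape
  if len(orig_shape_list) == 2:
    return output_tensor
  output_shape = get_shape_list(output_tensor)
  orig_dims = orig_shape_list[0:-1]
  width = output_shape[-1]
  return tf.reshape(output_tensor, orig_dims + [width])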

The attention_layer function


1. Compute q, k and v; all three are computed from the fused embedding input.

2. Dot-product query and key to get the scores, then scale by 1/sqrt(d): tf.multiply is element-wise multiplication, so every score is multiplied by 1/sqrt(d), i.e. divided by sqrt(d).

3. Why the attention_mask? Masked positions get a very large negative number added to their scores. If a masked score were left at 0, softmax would still give it e^0 = 1 and it would influence the result; with a very large negative score, e raised to that value is essentially 0, so the softmax weight of masked positions is close to 0. A small numerical illustration follows this list.

4. Normalize with softmax to obtain the relevance (attention) matrix.

5. Fetch the values: multiply the relevance matrix attention_probs [B, N, F, T] with value [B, N, T, H].
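Point 3 can be seen with a tiny numerical example (my own illustration): adding -10000 to the masked scores drives their softmax weights to essentially zero, whereas leaving them at 0 would still give them weight.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.0, 0.0])   # last two positions are padding
mask = np.array([1.0, 1.0, 0.0, 0.0])

print(softmax(scores))                            # padding still gets ~8% weight each
print(softmax(scores + (1.0 - mask) * -10000.0))  # padding weights are ~0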

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """
  :param from_tensor: the fused embedding input from step one
  :param to_tensor: the fused embedding input from step one
  :param attention_mask: mask matrix indicating which positions hold real tokens
  :param num_attention_heads: number of heads, e.g. 12
  :param size_per_head: size of each head, e.g. 768 // 12 = 64
  :param query_act:
  :param key_act:
  :param value_act:
  :param attention_probs_dropout_prob:
  :param initializer_range:
  :param do_return_2d_tensor:
  :param batch_size:
  :param from_seq_length:
  :param to_seq_length:
  :return:
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])
    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or
        to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # q, k and v are all computed from the fused embedding input.
  # The input has batch_size*seq_length rows; the output width is
  # num_attention_heads*size_per_head, which is just hidden_size (768).
  # q, k and v are obtained the same way.
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # Rearrange the dimensions: the original layout is
  # [batch_size, seq_length, num_attention_heads, width];
  # permuting with [0, 2, 1, 3] gives
  # [batch_size, num_attention_heads, seq_length, size_per_head].
  # key gets the same treatment.
  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Dot-product query and key to get the scores, then scale by 1/sqrt(d):
  # tf.multiply is element-wise, so every score is multiplied by 1/sqrt(size_per_head).
  # The result is [batch_size, num_attention_heads, seq_length, seq_length].
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  # Masked positions get a very large negative number: when softmax is applied,
  # a score of 0 would still contribute e^0 = 1 and influence the result, but
  # e raised to a very large negative number is ~0, so the softmax weight of
  # masked positions is close to 0.
  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize with softmax to get the relevance (attention) matrix.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Dropout on the attention probabilities, so some positions are dropped.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # Fetch the values.
  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # Multiply the relevance matrix [B, N, F, T] with value [B, N, T, H].
  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  # The output is [B, F, N*H], i.e.
  # [batch_size, seq_length, num_attention_heads*size_per_head] (= hidden_size).
  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
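To make the shape bookkeeping concrete, here is what transpose_for_scores does to the query with the Chinese base configuration (an illustration with random data, not from the source):

import numpy as np

batch_size, seq_length, num_heads, size_per_head = 2, 128, 12, 64

# `query_layer` as produced by the dense layer: [B*F, N*H]
query_layer = np.random.randn(batch_size * seq_length, num_heads * size_per_head)

# transpose_for_scores: [B*F, N*H] -> [B, F, N, H] -> [B, N, F, H]
q = query_layer.reshape(batch_size, seq_length, num_heads, size_per_head)
q = q.transpose(0, 2, 1, 3)
print(q.shape)        # (2, 12, 128, 64), i.e. [B, N, F, H]

# scores: [B, N, F, H] x [B, N, H, T] -> [B, N, F, T]
scores = q @ q.transpose(0, 1, 3, 2)
print(scores.shape)   # (2, 12, 128, 128)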

(4) pooler

Only the vector of the first token is taken, because by this point every token has already attended to, and absorbed information from, all the others.

with tf.variable_scope("pooler"):
  # Take only the vector of the first token; each token has already
  # learned from all the other positions.
  # A dense layer then maps it to [batch_size, hidden_size].
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  self.pooled_output = tf.layers.dense(
      first_token_tensor,
      config.hidden_size,
      activation=tf.tanh,
      kernel_initializer=create_initializer(config.initializer_range))
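Putting it all together, here is a usage sketch along the lines of the example in the modeling.py docstring (abridged, and with num_attention_heads adjusted so it divides hidden_size evenly; `modeling` refers to the module itself):

# Inputs already converted into WordPiece token ids.
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
                             num_hidden_layers=8, num_attention_heads=8,
                             intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=token_type_ids)

# [batch_size, hidden_size]: the pooled [CLS] vector from the pooler above,
# ready for a classification head.
pooled_output = model.get_pooled_output()

# [batch_size, seq_length, hidden_size]: per-token outputs of the last encoder layer.
sequence_output = model.get_sequence_output()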