VOLO: Vision Outlooker for Visual Recognition

This article was written and published with Zhihu On VSCode.

Paper link: https://arxiv.org/abs/2106.1311

Code link: https://github.com/sail-sg/volo

Background


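Vision Transformer (ViT) models have brought a major shift to vision tasks and have begun to challenge the dominance of CNNs. However, when no extra training data is used, ViT models still lag behind some CNN-based models.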

"We find a major factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations."

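The authors argue that the main reason is that ViT models are very inefficient at encoding fine-level features into token representations.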

To address this, the authors propose a new Vision Outlooker module, which generates attention weights using only information from spatially neighboring pixels.

Method

Outlooker

An Outlooker consists of an outlook attention layer that encodes spatial information, followed by an MLP that exchanges information across channels.

For an input $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$, the Outlooker can be written as:

$\tilde{\mathbf{X}} = \mathrm{OutlookAtt}(\mathrm{LN}(\mathbf{X})) + \mathbf{X}$

$\mathbf{Z} = \mathrm{MLP}(\mathrm{LN}(\tilde{\mathbf{X}})) + \tilde{\mathbf{X}}$
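The two equations above describe a pre-norm residual block. The sketch below shows only that wiring in plain Python. It is an illustration under assumptions: `layer_norm` has no learnable scale or shift, tokens are plain lists of floats, and `attn_fn` and `mlp_fn` are hypothetical stand-ins for the outlook attention layer and the MLP.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token (a list of floats) to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def outlooker(tokens, attn_fn, mlp_fn):
    """tokens: list of tokens, each a list of C floats.
    attn_fn maps the whole list of normalized tokens to a new list (spatial mixing).
    mlp_fn maps a single normalized token to a new token (channel mixing)."""
    # X~ = OutlookAtt(LN(X)) + X
    mixed = attn_fn([layer_norm(t) for t in tokens])
    x_tilde = [[a + b for a, b in zip(m, t)] for m, t in zip(mixed, tokens)]
    # Z = MLP(LN(X~)) + X~
    z = []
    for t in x_tilde:
        y = mlp_fn(layer_norm(t))
        z.append([a + b for a, b in zip(y, t)])
    return z
```

With both stand-ins returning zeros, the residual paths pass the input through unchanged, which is a quick way to check the wiring.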

[Figure: outlook attention]

Outlook attention

Internally, outlook attention has two branches. Two linear projections with weights $\mathbf{W}_A \in \mathbb{R}^{C \times K^4}$ and $\mathbf{W}_V \in \mathbb{R}^{C \times C}$ map the input $\mathbf{X}$ to:

$\mathbf{A} \in \mathbb{R}^{H \times W \times K^4}$

$\mathbf{V} \in \mathbb{R}^{H \times W \times C}$

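For each position $(i, j)$, the values inside the $K \times K$ window centered at it are:

$\mathbf{V}_{\Delta_{i,j}} = \{ \mathbf{V}_{i+p-\lfloor \frac{K}{2} \rfloor,\, j+q-\lfloor \frac{K}{2} \rfloor} \}, \quad 0 \leq p, q < K$

The weight at position $(i, j)$, $\hat{\mathbf{A}}_{i,j}$, is reshaped to $K^2 \times K^2$, passed through a softmax, and multiplied directly with $\mathbf{V}_{\Delta_{i,j}}$:

$\mathbf{Y}_{\Delta_{i,j}} = \mathrm{matmul}(\mathrm{softmax}(\hat{\mathbf{A}}_{i,j}), \mathbf{V}_{\Delta_{i,j}})$

Each position is covered by several overlapping windows. Their contributions at that position are summed to give the final output:

$\tilde{\mathbf{Y}}_{i,j} = \sum_{0 \leq m, n < K} \mathbf{Y}_{\Delta_{i+m-\lfloor \frac{K}{2}\rfloor,\, j+n-\lfloor \frac{K}{2}\rfloor}}^{i,j}$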
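The window gathering, weighted average and overlap sum above can be traced for a single head in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `outlook_attention`, the nested-list layout, and zero padding at the borders are choices made here, and `A` and `V` are taken as given rather than produced by the two projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def outlook_attention(A, V, K=3):
    """A[i][j] holds K**4 logits, V[i][j] holds C values.
    Returns Y with the same [H][W][C] layout as V."""
    H, W, C = len(V), len(V[0]), len(V[0][0])
    r, k2 = K // 2, K * K
    out = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Gather the K*K neighbours of (i, j); out-of-range ones are zeros.
            coords = [(i + p - r, j + q - r) for p in range(K) for q in range(K)]
            window = [V[y][x] if 0 <= y < H and 0 <= x < W else [0.0] * C
                      for (y, x) in coords]
            # Reshape the K**4 logits to K*K rows of K*K and softmax each row.
            attn = [softmax(A[i][j][s * k2:(s + 1) * k2]) for s in range(k2)]
            for s, (y, x) in enumerate(coords):
                if not (0 <= y < H and 0 <= x < W):
                    continue  # rows that land in the padding are discarded
                for c in range(C):
                    # Row s of attn @ window, accumulated at the neighbour
                    # position it refers to (the sum over overlapping windows).
                    out[y][x][c] += sum(attn[s][t] * window[t][c] for t in range(k2))
    return out
```

With all logits equal, every softmax row is uniform, so each window contributes the mean of its (zero-padded) values to every position it covers.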

PyTorch-style pseudocode:

    # H: height, W: width, K: kernel size
    # x: input tensor (H, W, C)
    def init():
        v_pj = nn.Linear(C, C)
        attn = nn.Linear(C, K ** 4)
        unfold = nn.Unfold(K, padding=padding)
        fold = nn.Fold(output_size=(H, W), kernel_size=K, padding=padding)

    def outlook_attention(x):  # code in forward
        v = v_pj(x).permute(2, 1, 0)
        # Eqn. (3), embedding set of neighbors
        v = unfold(v).reshape(C, K * K, H * W).permute(2, 1, 0)
        a = attn(x).reshape(H * W, K * K, K * K)
        # Eqn. (4), weighted average
        a = a.softmax(dim=-1)
        x = matmul(a, v).permute(2, 1, 0).reshape(C * K * K, H * W)
        # Eqn. (5)
        x = fold(x).permute(2, 1, 0)
        return x

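Here unfold is a sliding-window operation. An unfold with a k*k kernel flattens the values inside each window (without doing any computation) and lays them out channel by channel. After the reshape, the data belonging to the same position (i, j) is gathered together.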
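To make the layout concrete, here is a plain-Python sketch of that sliding-window flattening. It is an illustration, not the library code: the functions `unfold` and `windows` are written here for the example, and stride 1 with zero padding of K // 2 is assumed so that there is one window per position.

```python
def unfold(x, K=3):
    """x: nested list [C][H][W]. Returns C*K*K rows of H*W values each,
    with zero padding and stride 1. The channel index varies slowest."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    r = K // 2
    rows = []
    for c in range(C):
        for p in range(K):
            for q in range(K):
                row = []
                for i in range(H):
                    for j in range(W):
                        y, xx = i + p - r, j + q - r
                        inside = 0 <= y < H and 0 <= xx < W
                        row.append(x[c][y][xx] if inside else 0)
                rows.append(row)
    return rows

def windows(x, K=3):
    """Regroup the unfold output as [H*W][K*K][C]: one neighbourhood per position."""
    C = len(x)
    rows = unfold(x, K)
    L = len(rows[0])
    return [[[rows[c * K * K + s][l] for c in range(C)]
             for s in range(K * K)]
            for l in range(L)]

x = [[[1, 2],
      [3, 4]]]            # C=1, H=W=2
print(len(unfold(x)), len(unfold(x)[0]))  # 9 4
print(windows(x)[0])      # 3x3 neighbourhood of position (0, 0)
```

No arithmetic happens in either function; values are only copied and reordered, which matches the description above.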

Multi-Head Outlook Attention

Simply split the input X into groups, apply the outlook operation to each group, and concatenate the results.
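The split-and-merge step can be sketched on a single token's channel vector. This is only an illustration of the grouping: `multi_head` and `head_fn` are hypothetical names, and `head_fn` stands in for the per-head outlook operation, which in the real model also mixes information across spatial positions.

```python
def multi_head(token, num_heads, head_fn):
    """Split a C-dim token into num_heads equal channel groups,
    apply head_fn to each group, and concatenate the results."""
    C = len(token)
    assert C % num_heads == 0, "C must be divisible by the number of heads"
    d = C // num_heads
    groups = [token[h * d:(h + 1) * d] for h in range(num_heads)]
    outputs = [head_fn(g) for g in groups]
    return [v for out in outputs for v in out]

# Each head replaces its channels with the group sum, so groups stay separate.
print(multi_head([1, 2, 3, 4, 5, 6], num_heads=3,
                 head_fn=lambda g: [sum(g)] * len(g)))
# [3, 3, 7, 7, 11, 11]
```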

Parameter comparison

With N = 5, C = 384 and K = 3, we clearly have NK⁴ < 2C (405 < 768).
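The inequality is easy to verify numerically. The snippet below uses only the values stated above; the variable names `lhs` and `rhs` are arbitrary.

```python
N, C, K = 5, 384, 3

lhs = N * K ** 4   # 5 * 81
rhs = 2 * C        # 2 * 384

print(lhs, rhs, lhs < rhs)  # 405 768 True
```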

Overall VOLO architecture

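The authors build VOLO on top of LV-ViT, but to get better results the model is organized in two stages.

To obtain finer-grained features, the first stage increases the number of tokens from 14x14 to 28x28 (that is, the patch size goes from 16x16 to 8x8) and applies Outlookers to these tokens.

The second stage downsamples the tokens to 14x14 to match the original LV-ViT structure, then applies a series of global Transformer blocks to produce the final output.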
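The token counts follow directly from the image and patch sizes. The sketch below assumes a 224x224 input, which is the `img_size` default in the PatchEmbed code; `token_grid` is a helper name chosen for this example.

```python
def token_grid(img_size, patch_size):
    """Return (tokens per side, total tokens) for non-overlapping square patches."""
    side = img_size // patch_size
    return side, side * side

print(token_grid(224, 16))  # (14, 196)
print(token_grid(224, 8))   # (28, 784)
```

Halving the patch size quadruples the number of tokens, which is why the second stage downsamples before the global attention blocks.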

    import torch
    import torch.nn as nn


    class PatchEmbed(nn.Module):
        """
        Image to Patch Embedding.
        Different with ViT use 1 conv layer, we use 4 conv layers to do patch embedding
        """

        def __init__(self, img_size=224, stem_conv=False, stem_stride=1,
                     patch_size=8, in_chans=3, hidden_dim=64, embed_dim=384):
            super().__init__()
            assert patch_size in [4, 8, 16]

            self.stem_conv = stem_conv
            if stem_conv:
                # Three-conv stem; only the first conv changes the resolution.
                self.conv = nn.Sequential(
                    nn.Conv2d(in_chans, hidden_dim, kernel_size=7, stride=stem_stride,
                              padding=3, bias=False),  # 112x112 when stem_stride=2
                    nn.BatchNorm2d(hidden_dim),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1,
                              padding=1, bias=False),  # 112x112 when stem_stride=2
                    nn.BatchNorm2d(hidden_dim),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1,
                              padding=1, bias=False),  # 112x112 when stem_stride=2
                    nn.BatchNorm2d(hidden_dim),
                    nn.ReLU(inplace=True),
                )

            # Fourth conv: projects to embed_dim; overall downsampling is patch_size (8x).
            self.proj = nn.Conv2d(hidden_dim, embed_dim,
                                  kernel_size=patch_size // stem_stride,
                                  stride=patch_size // stem_stride)
            self.num_patches = (img_size // patch_size) * (img_size // patch_size)

        def forward(self, x):
            if self.stem_conv:
                x = self.conv(x)
            x = self.proj(x)  # B, C, H, W
            return x

    # data = torch.randn((1, 3, 224, 224))
    # net = PatchEmbed(stem_conv=True)
    # net(data).shape  # torch.Size([1, 384, 28, 28])

PatchEmbed: the implementation uses four 2D convolutions to embed each 8x8 patch into a one-dimensional tensor of length 384.
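The 28x28 output can be checked with the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. The kernel sizes and paddings used below (7/3 for the first stem conv, 3/1 for the next two) are assumptions made for this sketch, picked so that a stride-2 stem reproduces the `# 112x112` comments; `conv_out` and `patch_embed_side` are helper names introduced here.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def patch_embed_side(img_size=224, stem_stride=2, patch_size=8):
    """Trace the feature-map side length through a 3-conv stem plus projection."""
    size = conv_out(img_size, kernel=7, stride=stem_stride, padding=3)
    size = conv_out(size, kernel=3, stride=1, padding=1)
    size = conv_out(size, kernel=3, stride=1, padding=1)
    k = patch_size // stem_stride  # projection kernel equals its stride
    return conv_out(size, kernel=k, stride=k)

print(conv_out(224, kernel=7, stride=2, padding=3))  # 112
print(patch_embed_side(stem_stride=2))               # 28
print(patch_embed_side(stem_stride=1))               # 28
```

Either way the total downsampling factor is the patch size, 8, so a 224x224 image yields a 28x28 grid of 384-dimensional tokens.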

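Experimental results

Model parameter counts and training settings

Using LV-ViT-S as the baseline, the ablations test how the number of Transformer blocks (Ts) replaced by Outlookers (Os), the number of heads, and the resolution affect the results

Classification accuracy and comparison

Semantic segmentation:

Cityscapes dataset

ADE20K dataset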
