Attention-based Multi-Patch Aggregation for Image Aesthetic Assessment (ACM MM 2018)

《Attention-based Multi-Patch Aggregation for Image Aesthetic Assessment》

Paper

Code

Abstract

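In AI systems for image aesthetic assessment, aggregation structures that integrate explicit information, such as image attributes and scene semantics, are effective and popular. However, such information is not always available, because manual annotation and expert design are costly. The paper proposes a novel multi-patch (MP) aggregation method for image aesthetic assessment. The method trains the model end to end using only aesthetic labels (that is, positive or negative) and no other visual attributes. An attention-based mechanism dynamically adjusts the weight of each patch during training to improve learning efficiency. In addition, the paper proposes a set of objectives built on three typical attention mechanisms (average, minimum, and adaptive) and evaluates them on the Aesthetic Visual Analysis (AVA) dataset.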

Approach

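The paper denotes an input image by $I$ and its ground-truth aesthetic label by $\hat{y}$, so the dataset is

$\{ I_i,\hat{y}_i \}_{i=1}^N$

where $\hat{y}_i=1$ means that image $I_i$ is a positive example and $\hat{y}_i=0$ means that it is a negative example. For this dataset, the problem of learning aesthetic labels can be written as: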

$\mathop{argmax}_{\theta} \frac{1}{|\mathcal{P}|} \sum_{\mathcal{p}\in \mathcal{P}} Pr(\tilde{y} =\hat{y} |\mathcal{p},\theta) \tag{1}$

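where $\mathcal{P}$ is the set of square patches cropped from the images in the dataset, $\tilde{y}$ is the predicted class label, and $\theta$ denotes the network parameters to be learned. In Eq. (1), $Pr(\tilde{y} =\hat{y} |\mathcal{p},\theta)$ is the predicted probability of the correct label, which in the paper's network is the output of the final softmax layer.

Directly optimizing Eq. (1) may lead to overfitting, especially when only aesthetic labels are available. The paper therefore proposes an attention mechanism to address this problem.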
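As a minimal sketch of the objective in Eq. (1), the snippet below averages the softmax probability of the correct label over the patches of one image. The function names and the example logits are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_correct_probability(patch_logits, label):
    """Objective of Eq. (1): average Pr(y~ = y^ | p, theta) over the patches."""
    probs = [softmax(logits)[label] for logits in patch_logits]
    return sum(probs) / len(probs)

# Hypothetical two-class logits (negative, positive) for three patches of one image.
patch_logits = [[0.0, 0.0], [0.0, math.log(3.0)], [math.log(3.0), 0.0]]
objective = mean_correct_probability(patch_logits, label=1)
```

With these logits the per-patch probabilities of the positive class are 0.5, 0.75, and 0.25, so the objective is their mean.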

Attention-based Objective Functions

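To train the aesthetic assessment model effectively, the paper assigns different weights to different image patches. For comparison, it defines three weighting mechanisms: $MP_{avg}$, $MP_{min}$, and $MP_{ada}$.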

$MP_{avg}$: Recall Jensen's inequality. Given a real-valued concave function $f$ and a finite set $S$ of points $x$ in its domain, Jensen's inequality can be written as:

$f \left ( \frac{1}{|S|}\sum_{x\in S}x \right) \geq \frac{1}{|S|}\sum_{x\in S} f(x) \tag{2}$

Equality holds only when all points coincide, $x_i=x_j\ (\forall x_i, x_j\in S)$, or when $f$ is linear. Applying Jensen's inequality to the concave function $\log$ gives $MP_{avg}$:

$\log \left( \frac{1}{|\mathcal{P}|} \sum_{\mathcal{p}\in \mathcal{P}}Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta) \right) \geq \underbrace{ \frac{1}{|\mathcal{P}|} \sum_{\mathcal{p}\in \mathcal{P}} \log \left( Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta) \right) }_{MP_{avg}} \tag{3}$

$\frac{\partial MP_{avg}}{\partial \theta}= \frac{1}{|\mathcal{P}|}\sum_{\mathcal{p}\in \mathcal{P}} \underbrace{\frac{1}{Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)} }_{weights}\cdot \frac{\partial Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)}{ \partial \theta} \tag{4}$

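If only one patch is selected per image, $MP_{avg}$ shares essentially the same training pipeline as an ordinary image classification model. It can therefore be trained more efficiently than existing aggregation models.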
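The bound in Eq. (3) and the implicit weights in Eq. (4) can be checked numerically. The sketch below is illustrative only: the probabilities are made-up values, and the helper names are assumptions rather than anything defined in the paper.

```python
import math

def mp_avg(probs):
    """MP_avg of Eq. (3): mean log-probability of the correct label."""
    return sum(math.log(p) for p in probs) / len(probs)

def mp_avg_weights(probs):
    """Per-patch gradient weights of Eq. (4): 1 / Pr."""
    return [1.0 / p for p in probs]

# Hypothetical correct-label probabilities for three patches.
probs = [0.9, 0.6, 0.2]
upper = math.log(sum(probs) / len(probs))  # left-hand side of Eq. (3)
lower = mp_avg(probs)                      # right-hand side of Eq. (3)
weights = mp_avg_weights(probs)
```

Because the weight is the reciprocal of the predicted probability, the least confident patch (0.2 here) receives the largest weight, which is the sense in which $MP_{avg}$ already acts as an attention mechanism.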

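$MP_{min}$: Another typical attention mechanism in many machine learning algorithms is to focus on improving the results for data points with moderate confidence, as in hinge loss [1] and hard example mining [2]. Building on these ideas, the paper defines $MP_{min}$ as follows: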

$\begin{align} \log \left ( \frac{1}{|\mathcal{P}|}\sum_{\mathcal{p}\in\mathcal{P}}Pr (\tilde{y}=\hat{y}|\mathcal{p},\theta) \right) &\geq \min_{\mathcal{p}\in\mathcal{P}}\frac{1}{|\mathcal{P}|} \log\left(Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta) \right)\\ &=\underbrace{ \frac{1}{|\mathcal{P}|}\log \left( Pr(\tilde{y}=\hat{y}|\mathcal{p}^m,\theta) \right)}_{MP_{min}} \end{align}\tag{5}$

where

$\mathcal{p}^m=\mathop{argmin}_{\mathcal{p}\in\mathcal{P}}Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)$

and the gradient of $MP_{min}$ is

$\begin{align} \frac{\partial MP_{min}}{\partial\theta} &=\frac{1}{|\mathcal{P}|}\frac{1}{Pr(\tilde{y}=\hat{y}|\mathcal{p}^m,\theta)}\cdot \frac{\partial Pr(\tilde{y}=\hat{y}|\mathcal{p}^m,\theta)}{\partial \theta}\\ &=\frac{1}{|\mathcal{P}|}\sum_{\mathcal{p}\in\mathcal{P}} \frac{\mathbb{I}(\mathcal{p}=\mathcal{p}^m)}{Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)} \cdot\frac{\partial Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)}{\partial \theta} \end{align}\tag{6}$

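Here $\mathbb{I}(\cdot)$ equals 1 when $\mathcal{p}=\mathcal{p}^m$ and 0 otherwise. As Eq. (6) shows, $MP_{min}$ considers only the patch with the lowest prediction confidence and ignores the other patches from the same image. In practice, the paper uses a soft version of $MP_{min}$ to avoid unstable training: each patch $\mathcal{p}$ is selected with probability proportional to $1-Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)$.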
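The hard and soft variants of $MP_{min}$ might be sketched as follows. This is an illustration under stated assumptions: the helper names, the example probabilities, and the uniform fallback for the degenerate case where every probability equals 1 are all introduced here and are not described in the paper.

```python
import math
import random

def mp_min(probs):
    """Hard MP_min of Eq. (5): only the least confident patch contributes."""
    return math.log(min(probs)) / len(probs)

def soft_min_distribution(probs):
    """Selection probabilities proportional to 1 - Pr (soft variant)."""
    scores = [1.0 - p for p in probs]
    total = sum(scores)
    if total == 0.0:
        # Assumption: fall back to uniform sampling if every patch has Pr = 1.
        return [1.0 / len(probs)] * len(probs)
    return [s / total for s in scores]

def sample_patch(probs, rng):
    """Draw one patch index according to the soft-min distribution."""
    dist = soft_min_distribution(probs)
    return rng.choices(range(len(probs)), weights=dist, k=1)[0]

# Hypothetical correct-label probabilities for three patches.
probs = [0.9, 0.6, 0.2]
dist = soft_min_distribution(probs)
chosen = sample_patch(probs, random.Random(0))
```

Sampling instead of taking a strict minimum means that patches other than the hardest one still get selected occasionally, which is one plausible reading of why the soft version trains more stably.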

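$MP_{ada}$: To exploit the benefits of patch selection during training in an end-to-end manner, the paper adaptively emphasizes the meaningful training instances, that is, the patches whose aesthetic label the current model predicts incorrectly. It is formulated as: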

$\begin{align} MP_{ada} &=\frac{1}{|\mathcal{P}|}\sum_{\mathcal{p}\in\mathcal{P}}\omega_{\beta} \cdot \log(Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)) \\ \omega_\beta&= \frac{Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)^{-\beta}-1}{Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)^{-\beta}}=1-Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)^\beta \end{align} \tag{7}$

$\begin{align} \frac{\partial MP_{ada}}{\partial\theta} &=\frac{1}{|\mathcal{P}|} \sum_{\mathcal{p}\in\mathcal{P}}\lambda\cdot \frac{\partial Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)}{\partial\theta}\\ \lambda&=\frac{1-(1+\beta\cdot\log Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta))\cdot(1-\omega_\beta)}{Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)} \end{align}\tag{8}$

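where $\beta$ is a positive hyperparameter that controls the adaptive weight. Because $Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)^{-\beta}$ takes values in $(1,+\infty)$, the paper subtracts 1 from it and then divides by the value itself, which normalizes the weight into $(0,1)$.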
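For readers who want the intermediate step, the factor $\lambda$ in Eq. (8) follows from Eq. (7) by the product rule. The shorthand $P = Pr(\tilde{y}=\hat{y}|\mathcal{p},\theta)$ is introduced here only for brevity; each summand of $MP_{ada}$ is then $(1-P^{\beta})\log P$, and $1-\omega_\beta = P^{\beta}$:

$\begin{align} \frac{\partial}{\partial P}\left[(1-P^{\beta})\log P\right] &= -\beta P^{\beta-1}\log P + \frac{1-P^{\beta}}{P} \\ &= \frac{1-(1+\beta\log P)\cdot P^{\beta}}{P} \\ &= \frac{1-(1+\beta\log P)\cdot(1-\omega_\beta)}{P} = \lambda \end{align}$

Multiplying by $\partial P / \partial\theta$ (the chain rule) and averaging over the patches gives Eq. (8).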

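Unlike hinge loss, which ignores data points that are already classified correctly, $MP_{ada}$ keeps assigning them a positive weight to help maintain correct predictions, much like focal loss [3].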
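A small numerical sketch can make Eqs. (7) and (8) concrete. It computes the adaptive weight, the objective, and the gradient factor $\lambda$, and compares $\lambda$ with a finite-difference slope. The function names and the value $\beta = 2$ are assumptions made for illustration only.

```python
import math

def omega(p, beta):
    """Adaptive weight of Eq. (7): 1 - p**beta."""
    return 1.0 - p ** beta

def mp_ada(probs, beta):
    """MP_ada of Eq. (7): weighted mean log-probability of the correct label."""
    return sum(omega(p, beta) * math.log(p) for p in probs) / len(probs)

def lam(p, beta):
    """Gradient factor lambda of Eq. (8)."""
    return (1.0 - (1.0 + beta * math.log(p)) * (1.0 - omega(p, beta))) / p

def numeric_slope(p, beta, eps=1e-6):
    """Central finite difference of omega(p) * log(p) with respect to p."""
    def term(q):
        return omega(q, beta) * math.log(q)
    return (term(p + eps) - term(p - eps)) / (2.0 * eps)

beta = 2  # hypothetical value, not taken from the paper
analytic = lam(0.3, beta)
numeric = numeric_slope(0.3, beta)
```

The analytic and numeric values agree closely, and `lam` stays positive for probabilities strictly between 0 and 1, which matches the remark that correctly classified patches keep a positive weight.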

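Experimental Details

Input images are resized, preserving the aspect ratio, so that the shorter side is 256 pixels, and a random 224×224 crop is then fed to the network. Random horizontal flipping is the only data augmentation. The network is ResNet-18 with the final classification layer changed to 2 outputs.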
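The geometry of this preprocessing can be sketched without any image library. The helpers below only compute sizes and crop coordinates; their names, the example image size, and the flip probability of 0.5 are assumptions for illustration, not details given in the paper.

```python
import random

def resize_short_side(width, height, target=256):
    """Scale (width, height) so the shorter side equals target, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def random_crop_box(width, height, size, rng):
    """Return (left, top, right, bottom) of a random size x size crop."""
    left = rng.randint(0, width - size)
    top = rng.randint(0, height - size)
    return left, top, left + size, top + size

rng = random.Random(0)
w, h = resize_short_side(1024, 768)    # hypothetical 1024x768 input image
box = random_crop_box(w, h, 224, rng)  # one 224x224 training patch
flip = rng.random() < 0.5              # assumed flip probability of 0.5
```

Calling `random_crop_box` several times on the same resized image yields the multiple patches that the objectives above aggregate.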

Experimental Results

Figure: change of the patch weights during training.

Figure: network structure used for training.

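Summary

The paper's main contribution is a modified training loss, and the loss in its $MP_{ada}$ mechanism closely resembles focal loss [3]. Focal loss places its weighting exponent $\gamma$ in the factor $(1-p_t)^\gamma$, whereas this paper places $\beta$ in the factor $(1-p_t^\beta)$; in both cases the factor multiplies the cross-entropy $-\log(p_t)$.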
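The difference between the two modulating factors is easy to see numerically. The sketch below writes both losses per sample; the function names are hypothetical and the exponent values are chosen only for illustration.

```python
import math

def focal_loss(p_t, gamma):
    """Focal loss without the class-balancing term: -(1 - p_t)**gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def mp_ada_loss(p_t, beta):
    """Negated per-patch MP_ada term: -(1 - p_t**beta) * log(p_t)."""
    return -(1.0 - p_t ** beta) * math.log(p_t)

# With an exponent of 1 the two losses coincide.
same = abs(focal_loss(0.3, 1) - mp_ada_loss(0.3, 1)) < 1e-12

# With an exponent of 2, MP_ada down-weights an easy example (p_t = 0.9) less.
easy_focal = focal_loss(0.9, 2)
easy_ada = mp_ada_loss(0.9, 2)
```

For an exponent of 2, $1-p_t^2 = (1-p_t)(1+p_t) \geq (1-p_t)^2$ on $[0,1]$, so the $MP_{ada}$ factor is never smaller than the focal-loss factor with the same exponent.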

References

[1] Ilya Loshchilov and Frank Hutter. 2015. Online Batch Selection for Faster Training of Neural Networks. Mathematics (2015).

https://arxiv.org/pdf/1511.06343

[2] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In IEEE Computer Vision and Pattern Recognition. IEEE, 761–769.

https://arxiv.org/pdf/1604.03540

[3] Tsung Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. 2017. Focal Loss for Dense Object Detection. In IEEE International Conference on Computer Vision. IEEE, 2999–3007.

https://arxiv.org/pdf/1708.02002
