1 Pointer Networks

paper

github


This paper by Vinyals et al. introduces the Pointer Network (PN) and applies it to classic combinatorial optimization problems, most notably the Travelling Salesman Problem (TSP), alongside convex hull and Delaunay triangulation tasks. An attention mechanism produces softmax probabilities that are used as pointers to elements of the input sequence, so the output is a permutation of the input; the model is trained with supervised learning. This paper is the foundation for the papers that follow.

In the standard Seq2Seq architecture, the original attention mechanism is:

\begin{array}{l}{u_{j}^{i}=v^{T} \tanh \left(W_{1} e_{j}+W_{2} d_{i}\right) \quad j \in(1, \ldots, n)} \\ {a_{j}^{i}=\operatorname{softmax}\left(u_{j}^{i}\right) \quad j \in(1, \ldots, n)} \\ {d_{i}^{\prime}=\sum_{j=1}^{n} a_{j}^{i} e_{j}}\end{array} \\

In the Pointer Network, the attention mechanism becomes:

\begin{aligned} u_{j}^{i} &=v^{T} \tanh \left(W_{1} e_{j}+W_{2} d_{i}\right) \quad j \in(1, \ldots, n) \\ p\left(C_{i} | C_{1}, \ldots, C_{i-1}, \mathcal{P}\right) &=\operatorname{softmax}\left(u^{i}\right) \end{aligned} \\
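As a minimal PyTorch sketch of this pointing step (the module and variable names here are my own, not from the papers' repositories):

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """u_j^i = v^T tanh(W1 e_j + W2 d_i); the softmax over u is the output distribution."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, enc_outputs, dec_state):
        # enc_outputs: (batch, n, hidden) encoder states e_1..e_n
        # dec_state:   (batch, hidden)    decoder state d_i
        u = self.v(torch.tanh(self.W1(enc_outputs) + self.W2(dec_state).unsqueeze(1)))
        # Unlike standard attention, no context vector d'_i is formed; the softmax itself
        # is p(C_i | C_1..C_{i-1}, P), a pointer over the n input positions.
        return torch.softmax(u.squeeze(-1), dim=-1)
```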

2 Neural Combinatorial Optimization with Reinforcement Learning

paper

github


The Pointer Network is trained with supervision on labels produced by a reference algorithm, so it cannot outperform that reference algorithm, and high-quality labels are hard to obtain; for instances beyond roughly n > 50 nodes its results degrade. To address the data-collection difficulty and limited accuracy of supervised training, Bello et al. train the PN with an Actor-Critic reinforcement learning algorithm and obtain near-optimal solutions on TSP instances with n = 100 nodes.

The tour length is:

L(\pi | s)=\left\|\mathbf{x}_{\pi(n)}-\mathbf{x}_{\pi(1)}\right\|_{2}+\sum_{i=1}^{n-1}\left\|\mathbf{x}_{\pi(i)}-\mathbf{x}_{\pi(i+1)}\right\|_{2}
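For concreteness, the tour length can be evaluated directly from the city coordinates; a small NumPy sketch (the random instance below is only an illustration):

```python
import numpy as np

def tour_length(coords, perm):
    """L(pi | s): consecutive Euclidean distances plus the edge closing the tour."""
    ordered = coords[perm]                    # cities in visiting order, shape (n, 2)
    legs = np.linalg.norm(np.diff(ordered, axis=0), axis=1).sum()
    closing = np.linalg.norm(ordered[0] - ordered[-1])
    return legs + closing

coords = np.random.rand(10, 2)                # 10 cities in the unit square
print(tour_length(coords, np.arange(10)))     # length of the identity tour
```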

Actor-Critic algorithm:

Policy gradient:

\nabla_{\theta} J(\theta) \approx \frac{1}{B} \sum_{i=1}^{B}\left(L\left(\pi_{i} | s_{i}\right)-b\left(s_{i}\right)\right) \nabla_{\theta} \log p_{\theta}\left(\pi_{i} | s_{i}\right)

Critic:

\mathcal{L}\left(\theta_{v}\right)=\frac{1}{B} \sum_{i=1}^{B}\left\|b_{\theta_{v}}\left(s_{i}\right)-L\left(\pi_{i} | s_{i}\right)\right\|_{2}^{2}
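A rough sketch of how these two objectives can be computed per batch in PyTorch; `log_probs`, `tour_lengths`, and `baseline` are assumed to come from the pointer network (actor) and the critic:

```python
import torch.nn.functional as F

def actor_critic_losses(log_probs, tour_lengths, baseline):
    # log_probs:    (B,) sum_t log p_theta(pi_t | s) for each sampled tour pi_i
    # tour_lengths: (B,) L(pi_i | s_i), a constant w.r.t. the actor parameters
    # baseline:     (B,) critic estimate b_{theta_v}(s_i)
    advantage = (tour_lengths - baseline).detach()     # no gradient through reward or critic here
    actor_loss = (advantage * log_probs).mean()        # REINFORCE with a learned baseline
    critic_loss = F.mse_loss(baseline, tour_lengths)   # ||b(s_i) - L(pi_i | s_i)||^2 over the batch
    return actor_loss, critic_loss
```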

Pointing mechanism:

u_{i}=\left\{\begin{array}{ll}{v^{\top} \cdot \tanh \left(W_{r e f} \cdot r_{i}+W_{q} \cdot q\right)} & {\text { if } i \neq \pi(j) \text { for all } j<i} \\ {-\infty} & {\text { otherwise }}\end{array}\right.

A\left(r e f, q ; W_{r e f}, W_{q}, v\right) \stackrel{\text { def }}{=} \operatorname{softmax}(u)

p(\pi(j) | \pi(<j), s) \stackrel{\text { def }}{=} A\left(e n c_{1: n}, \operatorname{dec}_{j}\right)
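The case distinction above is just a mask that forbids revisiting nodes; a sketch (the parameter tensors `W_ref`, `W_q`, `v` and their shapes are assumptions of mine):

```python
import torch

def pointer_attention(ref, q, W_ref, W_q, v, visited_mask):
    # ref: (batch, n, d) reference/encoder states r_i; q: (batch, d) decoder query
    # W_ref, W_q: (d, d); v: (d,); visited_mask: (batch, n) bool, True if i = pi(j) for some j < current step
    u = torch.tanh(ref @ W_ref + (q @ W_q).unsqueeze(1)) @ v     # (batch, n) compatibilities
    u = u.masked_fill(visited_mask, float('-inf'))               # already-visited nodes get -inf
    return torch.softmax(u, dim=-1)                              # A(ref, q) = softmax(u)
```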

Two tricks are added:

Softmax temperature:

A\left(r e f, q, T ; W_{r e f}, W_{q}, v\right) \stackrel{\text { def }}{=} \operatorname{softmax}(u / T) \\

Logit clipping:

A\left(r e f, q ; W_{r e f}, W_{q}, v\right) \stackrel{\text { def }}{=} \operatorname{softmax}(C \tanh (u)) \\
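Both tricks act only on the logits before the softmax (the paper evaluates them as separate variants; the sketch below simply shows where each enters, using the same masking convention as above):

```python
import torch

def pointer_distribution(u, visited_mask, T=1.0, C=10.0):
    # u: raw compatibilities v^T tanh(W_ref r_i + W_q q), shape (batch, n)
    # Temperature T > 1 flattens the distribution; C * tanh(.) bounds the logits to [-C, C].
    u = C * torch.tanh(u / T)
    u = u.masked_fill(visited_mask, float('-inf'))   # mask after clipping so visited nodes stay excluded
    return torch.softmax(u, dim=-1)
```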

3 Reinforcement Learning for Solving the Vehicle Routing Problem

paper

github


A drawback of the Pointer Network architecture is that whenever a dynamic element of the input sequence changes, the whole encoder has to be recomputed before the next node can be predicted, which adds computational cost. Nazari et al. replace the PN's encoder with an embedding layer, so that when the dynamic elements of the input change, the RNN part of the encoder no longer needs to be fully updated.

Embedding Layer: the inputs are mapped directly into a D-dimensional vector space, which simplifies the encoder by dropping its RNN; inputs at different time steps share the same input structure.
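A minimal sketch of such an embedding layer, assuming each node is described by its 2-D coordinates plus a demand and D = 128 (a shared kernel-size-1 convolution, i.e. the same linear map applied to every node):

```python
import torch
import torch.nn as nn

# Each node is embedded independently, so a change in one node's dynamic state
# (e.g. remaining demand) only requires re-embedding that node, not re-running an RNN.
embedder = nn.Conv1d(in_channels=3, out_channels=128, kernel_size=1)

nodes = torch.rand(32, 3, 11)       # hypothetical batch: (batch, features, n_nodes)
embeddings = embedder(nodes)        # (batch, 128, n_nodes)
```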


Attention Layer with Glimpse:

u_{t}^{i}=v_{a}^{T} \tanh \left(W_{a}\left[\bar{x}_{t}^{i} ; h_{t}\right]\right) \\

a_{t}=a_{t}\left(\bar{x}_{t}^{i}, h_{t}\right)=\operatorname{softmax}\left(u_{t}\right) \\

c_{t}=\sum_{i=1}^{M} a_{t}^{i} \bar{x}_{t}^{i} \\

\tilde{u}_{t}^{i}=v_{c}^{T} \tanh \left(W_{c}\left[\bar{x}_{t}^{i} ; c_{t}\right]\right) \\

P\left(y_{t+1} | Y_{t}, X_{t}\right)=\operatorname{softmax}\left(\tilde{u}_{t}^{i}\right) \\
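Putting the five equations together, the attention layer with glimpse might look like the sketch below (module and variable names are mine):

```python
import torch
import torch.nn as nn

class AttentionWithGlimpse(nn.Module):
    """A first attention builds the context c_t (the glimpse); a second one gives the pointer logits."""
    def __init__(self, dim):
        super().__init__()
        self.Wa = nn.Linear(2 * dim, dim, bias=False)
        self.va = nn.Linear(dim, 1, bias=False)
        self.Wc = nn.Linear(2 * dim, dim, bias=False)
        self.vc = nn.Linear(dim, 1, bias=False)

    def forward(self, node_emb, h_t):
        # node_emb: (batch, M, dim) embedded inputs x_bar_t^i; h_t: (batch, dim) decoder state
        h = h_t.unsqueeze(1).expand_as(node_emb)
        u = self.va(torch.tanh(self.Wa(torch.cat([node_emb, h], dim=-1)))).squeeze(-1)
        a = torch.softmax(u, dim=-1)                                   # a_t
        c = (a.unsqueeze(-1) * node_emb).sum(dim=1)                    # context c_t
        c = c.unsqueeze(1).expand_as(node_emb)
        u_tilde = self.vc(torch.tanh(self.Wc(torch.cat([node_emb, c], dim=-1)))).squeeze(-1)
        return torch.softmax(u_tilde, dim=-1)                          # P(y_{t+1} | Y_t, X_t)
```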

Pseudocode of the training algorithm:

[Figure: training algorithm pseudocode]

Pseudocode of the A3C algorithm:

[Figure: A3C algorithm pseudocode]

4 Learning Combinatorial Optimization Algorithms over Graphs

paper

github

[Figure: step-by-step solution of a Minimum Vertex Cover instance]

This algorithm is better suited to combinatorial optimization problems defined on graphs. The figure above shows the process of solving a Minimum Vertex Cover instance; the overall architecture combines Structure2Vec with Q-learning.

Graph embedding vector:

\mu_{v}^{(t+1)} \leftarrow \operatorname{relu}\left(\theta_{1} x_{v}+\theta_{2} \sum_{u \in \mathcal{N}(v)} \mu_{u}^{(t)}+\theta_{3} \sum_{u \in \mathcal{N}(v)} \operatorname{relu}\left(\theta_{4} w(v, u)\right)\right) \\
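One embedding round can be sketched as follows, using dense adjacency and weight matrices for simplicity (the layer name, tag encoding, and tensor layout are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Struct2VecLayer(nn.Module):
    """One iteration mu_v <- relu(theta1 x_v + theta2 sum_u mu_u + theta3 sum_u relu(theta4 w(v,u)))."""
    def __init__(self, p):
        super().__init__()
        self.theta1 = nn.Linear(1, p, bias=False)   # x_v: scalar tag (1 if v is in the partial solution S)
        self.theta2 = nn.Linear(p, p, bias=False)
        self.theta3 = nn.Linear(p, p, bias=False)
        self.theta4 = nn.Linear(1, p, bias=False)

    def forward(self, x, mu, adj, w):
        # x: (n, 1) node tags, mu: (n, p) embeddings, adj: (n, n) 0/1, w: (n, n) edge weights
        neighbor_sum = adj @ mu                                                     # sum_{u in N(v)} mu_u
        edge_term = (adj.unsqueeze(-1) * F.relu(self.theta4(w.unsqueeze(-1)))).sum(dim=1)
        return F.relu(self.theta1(x) + self.theta2(neighbor_sum) + self.theta3(edge_term))
```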

Q function:

\widehat{Q}(h(S), v ; \Theta)=\theta_{5}^{\top} \operatorname{relu}\left(\left[\theta_{6} \sum_{u \in V} \mu_{u}^{(T)}, \theta_{7} \mu_{v}^{(T)}\right]\right) \\

Q target:

y=\sum_{i=0}^{n-1} r\left(S_{t+i}, v_{t+i}\right)+\gamma \max _{v^{\prime}} \widehat{Q}\left(h\left(S_{t+n}\right), v^{\prime} ; \Theta\right) \\

Minimize squared loss:

\left(y-\widehat{Q}\left(h\left(S_{t}\right), v_{t} ; \Theta\right)\right)^{2} \\
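One fitted Q-learning update tying the target and loss together could look like this sketch (the `q_net(state)` interface, returning a value per candidate node, is an assumption; masking of already-selected nodes is omitted):

```python
import torch

def q_learning_loss(q_net, state_t, v_t, state_t_plus_n, n_step_reward, gamma=1.0):
    # n_step_reward: sum_{i=0}^{n-1} r(S_{t+i}, v_{t+i}) accumulated along the n-step rollout
    with torch.no_grad():
        y = n_step_reward + gamma * q_net(state_t_plus_n).max()   # bootstrapped n-step target
    q_pred = q_net(state_t)[v_t]                                  # Q_hat(h(S_t), v_t; Theta)
    return (y - q_pred) ** 2                                      # squared loss minimized by SGD
```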

5 Attention, Learn to Solve Routing Problems!

paper

github

This paper builds a more elaborate encoder out of Transformer-style self-attention blocks (MHA + FF) and trains the model with a rollout-baseline reinforcement learning algorithm, solving the TSP and the VRP as well as the variants OP, PCTSP and SPCTSP.

MHA:


Query, key and value projections:

\mathbf{q}_{i}=W^{Q} \mathbf{h}_{i}, \quad \mathbf{k}_{i}=W^{K} \mathbf{h}_{i}, \quad \mathbf{v}_{i}=W^{V} \mathbf{h}_{i} \\

Dot-product Compatibility:

u_{i j}=\left\{\begin{array}{ll}{\frac{\mathbf{q}_{i}^{T} \mathbf{k}_{j}}{\sqrt{d_{k}}}} & {\text { if } i \text { adjacent to } j} \\ {-\infty} & {\text { otherwise }}\end{array}\right. \\

Attention weights:

a_{i j}=\frac{e^{u_{i j}}}{\sum_{j^{\prime}} e^{u_{i j^{\prime}}}} \\

Combination vector:

\mathbf{h}_{i}^{\prime}=\sum_{j} a_{i j} \mathbf{v}_{j} \\

Multi-head attention (MHA) layer:

\operatorname{MHA}_{i}\left(\mathbf{h}_{1}, \ldots, \mathbf{h}_{n}\right)=\sum_{m=1}^{M} W_{m}^{O} \mathbf{h}_{i m}^{\prime} \quad d_{\mathbf{k}}=d_{\mathbf{v}}=\frac{d_{\mathbf{h}}}{M}=16 \\

Batch normalization:

\mathrm{BN}\left(\mathbf{h}_{i}\right)=\boldsymbol{w}^{\mathrm{bn}} \odot \overline{\mathrm{BN}}\left(\mathbf{h}_{i}\right)+b^{\mathrm{bn}} \\

Feed-forward (FF) layer:

\mathrm{FF}\left(\hat{\mathbf{h}}_{i}\right)=W^{\mathrm{ff}, 1} \cdot \operatorname{ReLU}\left(W^{\mathrm{ff}, 0} \hat{\mathbf{h}}_{i}+\boldsymbol{b}^{\mathrm{ff}, 0}\right)+\boldsymbol{b}^{\mathrm{ff}, 1} \\
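These equations are standard scaled dot-product multi-head attention; a compact sketch with M = 8 heads and d_h = 128 (the adjacency mask is dropped because the TSP graph is fully connected):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MHA_i(h_1..h_n) with M heads and d_k = d_v = d_h / M."""
    def __init__(self, d_h=128, M=8):
        super().__init__()
        self.M, self.d_k = M, d_h // M
        self.Wq = nn.Linear(d_h, d_h, bias=False)    # stacks all heads' W^Q
        self.Wk = nn.Linear(d_h, d_h, bias=False)
        self.Wv = nn.Linear(d_h, d_h, bias=False)
        self.Wo = nn.Linear(d_h, d_h, bias=False)    # stacks all heads' W^O_m

    def forward(self, h):                            # h: (B, n, d_h)
        B, n, d_h = h.shape
        q = self.Wq(h).view(B, n, self.M, self.d_k).transpose(1, 2)
        k = self.Wk(h).view(B, n, self.M, self.d_k).transpose(1, 2)
        v = self.Wv(h).view(B, n, self.M, self.d_k).transpose(1, 2)
        u = q @ k.transpose(-2, -1) / self.d_k ** 0.5            # compatibilities u_ij
        a = torch.softmax(u, dim=-1)                             # attention weights a_ij
        h_prime = (a @ v).transpose(1, 2).reshape(B, n, d_h)     # per-head combinations h'_im
        return self.Wo(h_prime)                                  # sum_m W^O_m h'_im
```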

Encoder:

[Figure: encoder architecture]

Initial node embeddings:

\mathbf{h}_{i}^{(0)}=W^{\mathbf{x}} \mathbf{x}_{i}+\mathbf{b}^{\mathbf{x}} \\

MHA sublayer: M = 8 heads

Batch normalization (BN):

\begin{array}{l}{\hat{\mathbf{h}}_{i}=\mathrm{B} \mathrm{N}^{\ell}\left(\mathbf{h}_{i}^{(\ell-1)}+\mathrm{M} \mathrm{H} \mathrm{A}_{i}^{\ell}\left(\mathbf{h}_{1}^{(\ell-1)}, \ldots, \mathbf{h}_{n}^{(\ell-1)}\right)\right)} \\ {\mathbf{h}_{i}^{(\ell)}=\mathrm{B} \mathrm{N}^{\ell}\left(\hat{\mathbf{h}}_{i}+\mathrm{FF}^{\ell}\left(\hat{\mathbf{h}}_{i}\right)\right)}\end{array} \\

FF sublayer: 512 hidden dimensions and ReLU activation

Graph embeddings:

\overline{\mathbf{h}}^{(N)}=\frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_{i}^{(N)} \\
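A sketch of one full encoder layer built from PyTorch's own `nn.MultiheadAttention` (bias terms and other details differ slightly from the paper's implementation), followed by the graph embedding:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """h_hat = BN(h + MHA(h)); h_out = BN(h_hat + FF(h_hat))."""
    def __init__(self, d_h=128, M=8, d_ff=512):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_h, M, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_h, d_ff), nn.ReLU(), nn.Linear(d_ff, d_h))
        self.bn1, self.bn2 = nn.BatchNorm1d(d_h), nn.BatchNorm1d(d_h)

    def forward(self, h):                                        # h: (B, n, d_h)
        attn, _ = self.mha(h, h, h)                              # self-attention over the nodes
        h_hat = self.bn1((h + attn).transpose(1, 2)).transpose(1, 2)
        return self.bn2((h_hat + self.ff(h_hat)).transpose(1, 2)).transpose(1, 2)

encoder = nn.Sequential(*[EncoderLayer() for _ in range(3)])     # N = 3 layers
h = encoder(torch.rand(4, 20, 128))                              # 4 instances, 20 nodes
h_graph = h.mean(dim=1)                                          # graph embedding h_bar^(N)
```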

Decoder:

[Figure: decoder architecture]

Context embedding:

\mathbf{h}_{(c)}^{(N)}=\left\{\begin{array}{ll}{\left[\overline{\mathbf{h}}^{(N)}, \mathbf{h}_{\pi_{t-1}}^{(N)}, \mathbf{h}_{\pi_{1}}^{(N)}\right]} & {t>1} \\ {\left[\overline{\mathbf{h}}^{(N)}, \mathbf{v}^{\mathrm{l}}, \mathbf{v}^{\mathrm{f}}\right]} & {t=1}\end{array}\right. \\

An MHA layer, acting as a glimpse, produces a new context embedding:

\mathbf{h}_{(c)}^{(N+1)}

Single query:

\mathbf{q}_{(c)}=W^{Q} \mathbf{h}_{(c)}

This MHA uses no skip connection, BN, or FF sublayer; the compatibilities are clipped before the softmax:

u_{(c) j}=\left\{\begin{array}{ll}{C \cdot \tanh \left(\frac{\mathbf{q}_{(c)}^{T} \mathbf{k}_{j}}{\sqrt{d_{k}}}\right)} & {\text { if } j \neq \pi_{t^{\prime}} \quad \forall t^{\prime}<t \quad d_{\mathbf{k}}=\frac{d_{\mathbf{h}}}{M}} \\ {-\infty} & {\text { otherwise }}\end{array}\right.

Softmax output probability:

p_{i}=p_{\boldsymbol{\theta}}\left(\pi_{t}=i | s, \boldsymbol{\pi}_{1: t-1}\right)=\frac{e^{u_{(c) i}}}{\sum_{j} e^{u_{(c) j}}} \\

Finally, the full model factorizes as:

p_{\theta}(\boldsymbol{\pi} | s)=\prod_{t=1}^{n} p_{\boldsymbol{\theta}}\left(\pi_{t} | s, \boldsymbol{\pi}_{1: t-1}\right) \\
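A sketch of a single decoding step for t > 1 (the glimpse MHA that produces h_(c)^(N+1) is skipped here, and `Wq`, `Wk` are assumed parameter matrices):

```python
import torch

def decoder_step(h_nodes, h_graph, h_first, h_last, Wq, Wk, visited_mask, C=10.0):
    # h_nodes: (n, d) final node embeddings; h_graph, h_first, h_last: (d,) each
    context = torch.cat([h_graph, h_last, h_first])      # context embedding h_(c)
    q = Wq @ context                                      # single query q_(c);  Wq: (d, 3d)
    k = h_nodes @ Wk.T                                    # keys k_j;             Wk: (d, d)
    u = C * torch.tanh(k @ q / k.shape[-1] ** 0.5)        # clipped compatibilities u_(c)j
    u = u.masked_fill(visited_mask, float('-inf'))        # nodes already in the tour
    return torch.softmax(u, dim=-1)                       # p_theta(pi_t = i | s, pi_{1:t-1})
```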

Rollout training algorithm:

[Figure: rollout training algorithm pseudocode]

6 Solving a New 3D Bin Packing Problem with Deep Reinforcement Learning Method

paper

Problem formulation:

[Figure: problem formulation]

Model:

[Figure: model architecture]

7 An Efficient Graph Convolutional Network Technique for the Travelling Salesman Problem

paper

github

[Figure: overall pipeline]

The input is an adjacency matrix; a graph convolutional network then produces a probability heat-map over the edges of that matrix, and a beam search finally extracts the tour from it. The algorithm is trained with supervision.

Input layer:

Node input feature:

\alpha_{i}=A_{1} x_{i}+b_{1} \\

The edge input feature:

\beta_{i j}=A_{2} d_{i j}+b_{2} \| A_{3} \delta_{i j}^{k-N N} \\

GCN:

Node feature vector:

x_{i}^{\ell+1}=x_{i}^{\ell}+\operatorname{ReLU}\left(\mathrm{BN}\left(W_{1}^{\ell} x_{i}^{\ell}+\sum_{j \sim i} \eta_{i j}^{\ell} \odot W_{2}^{\ell} x_{j}^{\ell}\right)\right) \text { with } \eta_{i j}^{\ell}=\frac{\sigma\left(e_{i j}^{\ell}\right)}{\sum_{j^{\prime} \sim i} \sigma\left(e_{i j^{\prime}}^{\ell}\right)+\varepsilon} \\

Edge feature vector:

e_{i j}^{\ell+1}=e_{i j}^{\ell}+\operatorname{ReLU}\left(\mathrm{BN}\left(W_{3}^{\ell} e_{i j}^{\ell}+W_{4}^{\ell} x_{i}^{\ell}+W_{5}^{\ell} x_{j}^{\ell}\right)\right) \\

MLP classifier:

p_{i j}^{\mathrm{TSP}}=\operatorname{MLP}\left(e_{i j}^{L}\right) \\
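A sketch of one residual gated graph convolution layer matching the node and edge updates above, with dense n x n edge tensors (module name and layout are mine; the final MLP over e_ij^L yields the heat-map p_ij):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGatedGCNLayer(nn.Module):
    """x and e updated with residual connections, BN, ReLU and the edge-gated aggregation eta_ij."""
    def __init__(self, h, eps=1e-20):
        super().__init__()
        self.W1, self.W2 = nn.Linear(h, h), nn.Linear(h, h)
        self.W3, self.W4, self.W5 = nn.Linear(h, h), nn.Linear(h, h), nn.Linear(h, h)
        self.bn_node, self.bn_edge = nn.BatchNorm1d(h), nn.BatchNorm1d(h)
        self.eps = eps

    def forward(self, x, e):
        # x: (n, h) node features, e: (n, n, h) edge features (fully connected graph)
        gate = torch.sigmoid(e)
        eta = gate / (gate.sum(dim=1, keepdim=True) + self.eps)        # eta_ij, normalized over j
        agg = (eta * self.W2(x).unsqueeze(0)).sum(dim=1)               # sum_j eta_ij * W2 x_j
        x_new = x + F.relu(self.bn_node(self.W1(x) + agg))
        e_pre = self.W3(e) + self.W4(x).unsqueeze(1) + self.W5(x).unsqueeze(0)
        e_new = e + F.relu(self.bn_edge(e_pre.reshape(-1, e.shape[-1])).reshape_as(e))
        return x_new, e_new

# p_ij^TSP = MLP(e_ij^L): an MLP on the last layer's edge features gives the edge heat-map.
```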

References:

[1] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. arXiv preprint arXiv:1506.03134, 2015.

[2] MohammadReza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takac. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pp. 9860–9870, 2018.

[3] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.

[4] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

[5] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

[6] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.