強化學習——MDPs求解之動態規劃

學習目標

策略評估（Policy Evaluation）

策略提升（Policy Improvement）

策略迭代（Policy Iteration）

值迭代（Value Iteration）

學習目標

1。理解策略評估（Policy Evaluation）和策略提升（Policy Improvement）；

2。理解策略迭代（Policy Iteration）演算法；

3。理解值迭代（Value Iteration）演算法；

4。理解策略迭代和值迭代的不同之處；

5。動態規劃方法的侷限性；

6。 Python實現格子世界（Gridworld）策略迭代和值迭代。

動態規劃（Dynamic Programming， DP）是一種解決複雜問題的方法，它透過定義問題狀態和狀態之間的關係，將複雜問題拆分成若干較為簡單的子問題，使得問題能夠以遞推（或者說分治）的方式去解決。

所以要能使用動態規劃，這種問題一要能夠分解成許多子問題，二要這些子問題能夠多次被迭代使用。

而馬爾科夫決策過程就正好滿足了這兩個條件，MDPs可以看成是各個狀態之間的轉移，而貝爾曼方程則將這個問題分解成了一個個狀態的遞迴求解問題，而值函式就用於儲存這個求解的結果，得到每一個狀態的最優策略，合在一起以後就完成了整個MDPs的求解。但是DP的使用時建立在我們知道MDP環境的模型的基礎上的，所以也稱其為model based method。

策略評估（Policy Evaluation）

策略評估如其字面意思，就是評價一個策略好不好。計算任意一個策略

$\pi$

的狀態值函式

$v_{\pi}\left( s \right)$

即可，這也叫做預測（Prediction），上一篇文章已經透過backup圖得到了的求解公式，如下：

$v_{\pi}\left( s \right) =\sum_{a\in \mathcal{A}}{\pi \left( a|s \right) \left( \mathcal{R}_{s}^{a}+\gamma \sum_{s$

那這個式子怎麼算呢？狀態

的值函式我也不知道啊。這裡我們會使用高斯-賽德爾迭代演算法來求解，先人為給一個初值，再根據下面的式子迭代求解，可以證明，當k趨於無窮時，最後是會收斂到

$v_{\pi}\left( s \right)$

的。

$v_{k+1}\left( s \right) =\sum_{a\in \mathcal{A}}{\pi \left( a|s \right) \left( \mathcal{R}_{s}^{a}+\gamma \sum_{s$

策略提升（Policy Improvement）

我們已經知道怎麼去評價一個策略好不好，那接下來就要找到那個最好的策略。每到一個狀態，我們可能就會想是不是需要改變一下策略，這樣也許能使回報更大，即選擇一個動作

$a\ne \pi \left( s \right)$

，然後再繼續遵循

$\pi$

，這種方式的值就是動作值函式（還記得在上一篇中提出那個思考嗎，這裡就是一個比較好的回答）：

$q_{\pi}\left( s,a \right) =\mathcal{R}_{s}^{a}+\gamma \sum_{s$

我們用一種貪婪的方式來提升我們策略，即選擇那個能使動作值函式最大的動作：

$\pi$

可以證明，改變了策略

$\pi$

以後，狀態值函式也變大了，即

$v_{\pi$

，具體證明參見學習資料。

策略迭代（Policy Iteration）

說完了策略評估和策略提升，策略迭代就簡單了，就是反覆使用策略評估和策略提升，最後會收斂到最優策略。

其虛擬碼如圖所示

Sutton的書中給了一個Gridworld的例子，如下圖所示，具體規則我就不翻譯了，大致就是說最上角和右下角是終點（終止狀態），每一步的reward都是-1，最終目的是要找到一個最優策略。

我們現在就用這個例子來用Python實現策略迭代。

import

numpy

from

lib。envs。gridworld

import

GridworldEnv

def

policy_eval

（

policy

，

env

，

discount_factor

1。0

，

theta

0。00001

）：

“”“

Evaluate a policy given an environment and a full description of the environment‘s dynamics。

Args：

policy：［S， A］ shaped matrix representing the policy。

env： OpenAI env。 env。P represents the transition probabilities of the environment。

env。P［s］［a］ is a list of transition tuples （prob， next_state， reward， done）。

env。nS is a number of states in the environment。

env。nA is a number of actions in the environment。

theta： We stop evaluation once our value function change is less than theta for all states。

discount_factor： Gamma discount factor。

Returns：

Vector of length env。nS representing the value function。

”“”

# Start with a random （all 0） value function

。

zeros

（

env

。

）

while

True

：

delta

# For each state， perform a “full backup”

for

range

（

env

。

）：

# Look at the possible next actions

for

，

action_prob

enumerate

（

policy

［

］）：

# For each action， look at the possible next states。。。

for

prob

，

next_state

，

reward

，

done

env

。

［

］［

］：

# Calculate the expected value

action_prob

prob

（

reward

discount_factor

［

next_state

］）

# How much our value function changed （across any states）

delta

max

（

delta

，

。

abs

（

［

］））

［

］

# Stop evaluating once our value function change is below a threshold

delta

theta

：

break

return

。

array

（

）

def

policy_improvement

（

env

，

policy_eval_fn

policy_eval

，

discount_factor

1。0

）：

“”“

Policy Improvement Algorithm。 Iteratively evaluates and improves a policy

until an optimal policy is found。

Args：

env： The OpenAI envrionment。

policy_eval_fn： Policy Evaluation function that takes 3 arguments：

policy， env， discount_factor。

discount_factor： gamma discount factor。

Returns：

A tuple （policy， V）。

policy is the optimal policy， a matrix of shape ［S， A］ where each state s

contains a valid probability distribution over actions。

V is the value function for the optimal policy。

”“”

def

one_step_lookahead

（

state

，

）：

“”“

Helper function to calculate the value for all action in a given state。

Args：

state： The state to consider （int）

V： The value to use as an estimator， Vector of length env。nS

Returns：

A vector of length env。nA containing the expected value of each action。

”“”

。

zeros

（

env

。

）

for

range

（

env

。

）：

for

prob

，

next_state

，

reward

，

done

env

。

［

state

］［

］：

［

］

prob

（

reward

discount_factor

［

next_state

］）

return

# Start with a random policy

policy

。

ones

（［

env

。

，

env

。

］）

env

。

while

True

：

# Evaluate the current policy

policy_eval_fn

（

policy

，

env

，

discount_factor

）

# Will be set to false if we make any changes to the policy

policy_stable

True

# For each state。。。

for

range

（

env

。

）：

# The best action we would take under the currect policy

chosen_a

。

argmax

（

policy

［

］）

# Find the best action by one-step lookahead

# Ties are resolved arbitarily

action_values

one_step_lookahead

（

，

）

best_a

。

argmax

（

action_values

）

# Greedily update the policy

chosen_a

！=

best_a

：

policy_stable

False

policy

［

］

。

eye

（

env

。

）［

best_a

］

# If the policy is stable we’ve found an optimal policy。 Return it

policy_stable

：

return

policy

，

env

GridworldEnv

（）

random_policy

。

ones

（［

env

。

，

env

。

］）

env

。

policy_eval

（

random_policy

，

env

）

policy

，

policy_improvement

（

env

）

（

“Policy Probability Distribution：”

）

（

policy

）

（

“”

）

（

“Reshaped Grid Policy （0=up， 1=right， 2=down， 3=left）：”

）

（

。

reshape

（

。

argmax

（

policy

，

axis

），

env

。

shape

））

（

“”

）

（

“Value Function：”

）

（

）

（

“”

）

（

“Reshaped Grid Value Function：”

）

（

。

reshape

（

env

。

shape

））

（

“”

）

得到如下結果：

可以看出，這和書上得到的最優策略時一致的。

值迭代（Value Iteration）

策略迭代有一個缺點，就是每一步都要進行策略評估，當狀態空間很大的時候是非常耗費時間的。值迭代是直接將貝爾曼最最佳化方程拿來迭代計算的，這一點是不同於策略迭代的，我們直接對比兩者的虛擬碼。

所以值迭代會直接收斂到最優值，從而我們就可以得到最優策略，因為它就是一個貪婪的選擇。再反過去看一下策略迭代的過程，策略評估過程是應用貝爾曼方程來計算當前最優策略下的值函式，接著進行策略提升，即在每個狀態都選擇一個最優動作來最大化值函式，以改進策略。但是想一下，在策略評估過程我們一定要等到它收斂到準確的值函式嗎？答案是不一定，我們可以設定一個誤差，中斷這個過程，用一個近似的值函式用以策略提升（格子世界的例子中就可以看出，在迭代到第三步以後，其實最優策略就已經確定了），而我們提出這個方法的時候並不是這麼做的，而是等到策略評價過程收斂，這是一個極端的選擇，相當於在迭代貝爾曼最最佳化方程！所以，換句話說，值迭代其實可以看成是策略迭代一個極端情況。

一般來說，策略迭代的收斂速度更快一些，在狀態空間較小時，最好選用策略迭代方法。當狀態空間較大時，值迭代的計算量更小一些。

同樣，還是以格子世界為例，用Python實現一遍值迭代演算法：

import numpy as np

from lib。envs。gridworld import GridworldEnv

def value_iteration（env， theta=0。0001， discount_factor=1。0）：

“”“

Value Iteration Algorithm。

Args：

env： OpenAI env。 env。P represents the transition probabilities of the environment。

env。P［s］［a］ is a list of transition tuples （prob， next_state， reward， done）。

env。nS is a number of states in the environment。

env。nA is a number of actions in the environment。

theta： We stop evaluation once our value function change is less than theta for all states。

discount_factor： Gamma discount factor。

Returns：

A tuple （policy， V） of the optimal policy and the optimal value function。

”“”

def one_step_lookahead（state， V）：

“”“

Helper function to calculate the value for all action in a given state。

Args：

state： The state to consider （int）

V： The value to use as an estimator， Vector of length env。nS

Returns：

A vector of length env。nA containing the expected value of each action。

”“”

A = np。zeros（env。nA）

for a in range（env。nA）：

for prob， next_state， reward， done in env。P［state］［a］：

A［a］ += prob * （reward + discount_factor * V［next_state］）

return A

V = np。zeros（env。nS）

while True：

# Stopping condition

delta = 0

# Update each state。。。

for s in range（env。nS）：

# Do a one-step lookahead to find the best action

A = one_step_lookahead（s， V）

best_action_value = np。max（A）

# Calculate delta across all states seen so far

delta = max（delta， np。abs（best_action_value - V［s］））

# Update the value function

V［s］ = best_action_value

# Check if we can stop

if delta < theta：

break

# Create a deterministic policy using the optimal value function

policy = np。zeros（［env。nS， env。nA］）

for s in range（env。nS）：

# One step lookahead to find the best action for this state

A = one_step_lookahead（s， V）

best_action = np。argmax（A）

# Always take the best action

policy［s， best_action］ = 1。0

return policy， V

env = GridworldEnv（）

policy， v = value_iteration（env）

print（“Policy Probability Distribution：”）

print（policy）

print（“”）

print（“Reshaped Grid Policy （0=up， 1=right， 2=down， 3=left）：”）

print（np。reshape（np。argmax（policy， axis=1）， env。shape））

print（“”）

print（“Value Function：”）

print（v）

print（“”）

print（“Reshaped Grid Value Function：”）

print（v。reshape（env。shape））

print（“”）

輸出結果與策略迭代一致。

參考

［1］ Reinforcement Learning： An Introduction- Chapter 4： Dynamic Programming

［2］ David Silver‘s RL Course Lecture 3 - Planning by Dynamic Programming（video， slides）

［3］ Quora：

https：//www。

quora。com/How-is-policy

-iteration-different-from-value-iteration

Sergio Valcarcel Macua

［4］策略迭代與值迭代的區別

［5］ github開原始碼

強化學習——MDPs求解之動態規劃

誰能給個WOW中在商人處可購買到的裁縫和FM圖紙（聯盟）

學前班數學教課過程怎麼教?

隨便看看

為什麼有些人會得癌症？

腳手架連牆件的做法有哪些？

家庭環境情況，怎麼寫？

配電箱與乙炔瓶安全距離？

強化學習——MDPs求解之動態規劃

誰能給個WOW中在商人處可購買到的裁縫和FM圖紙（聯盟）

學前班數學教課過程怎麼教?

猜你喜歡

如何用基金組合創造穩定的現金流？

2020有哪些免費的策略手遊？《英雄之城2》的玩法選擇多嗎？

【論文筆記 5】Conservative Q-Learning

隨便看看

為什麼有些人會得癌症？

腳手架連牆件的做法有哪些？

家庭環境情況，怎麼寫？

配電箱與乙炔瓶安全距離？