Reinforcement Learning Notes 3 - Python/OpenAI/TensorFlow/ROS - Planning and Games

Posted by zhangrelay on 2021/07/15 03:31:02
[Abstract] Planning: mainly involves Markov Decision Processes (MDP), typically used for solving problems in a known environment; Games: mainly involve Monte Carlo methods, typically used when the environment is unknown. Background material: Markov Decision Processes - MIT https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-825-techniques-in-a...

Planning: mainly involves Markov Decision Processes (MDP), typically used for solving problems in a known environment;

Games: mainly involve Monte Carlo methods, typically used for solving problems when the environment is unknown.
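
For quick reference (a standard textbook definition, not spelled out in the original notes), an MDP is the tuple

    M = (S, A, P, R, \gamma)

where $P(s' \mid s, a)$ is the probability of moving to state $s'$ when action $a$ is taken in state $s$, $R$ is the immediate reward, and $\gamma \in [0, 1]$ is the discount factor. The two settings above differ in whether $P$ and $R$ are known (planning) or must be estimated from sampled experience (Monte Carlo methods).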



Case studies:

Solving the Frozen Lake problem with value iteration
Goal:
Imagine there is a frozen lake between your home and your office, and you have to walk across the frozen surface to reach the office. But watch out! There are holes in the frozen lake, so you must be careful while walking so that you do not get trapped in one. Look at the grid below, where:

S is the starting position (home)
F is the frozen surface, which can be walked on
H is a hole, which must be avoided
G is the goal (office)

Now let us use an agent, instead of you, to find the right way to reach the office. The agent's goal is to find the optimal path from S to G without getting trapped in H. How can the agent achieve this? We give the agent +1 point as a reward if it walks correctly on the frozen lake and 0 points if it falls into a hole, so the agent can determine which actions are correct. The agent will then try to find the optimal policy, which means taking the path that maximizes the agent's reward. If the agent is maximizing its reward, it is evidently learning to avoid the holes and reach the destination.
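
Concretely, the value iteration code below repeatedly applies the Bellman optimality backup to every state (using the same quantities the code reads from the environment, i.e. the transition probability and the reward):

    V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V_k(s')\big]

and, once the values converge, the optimal policy is extracted greedily:

    \pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^{*}(s')\big]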

The implementation code is as follows:


  
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
env.render()

def value_iteration(env, gamma=1.0):
    # initialize the value table with zeros
    value_table = np.zeros(env.observation_space.n)
    # set the number of iterations and the convergence threshold
    no_of_iterations = 100000
    threshold = 1e-20
    for i in range(no_of_iterations):
        # on each iteration, copy the value table to updated_value_table
        updated_value_table = np.copy(value_table)
        # calculate the Q value for every action in each state
        # and update the value of the state with the maximum Q value
        for state in range(env.observation_space.n):
            Q_value = []
            for action in range(env.action_space.n):
                next_states_rewards = []
                for next_sr in env.env.P[state][action]:
                    trans_prob, next_state, reward_prob, _ = next_sr
                    next_states_rewards.append(
                        trans_prob * (reward_prob + gamma * updated_value_table[next_state]))
                Q_value.append(np.sum(next_states_rewards))
            value_table[state] = max(Q_value)
        # check whether we have reached convergence, i.e. whether the difference
        # between the value table and the updated value table is very small;
        # if the difference is less than the threshold, we break the loop and
        # return the value function as the optimal value function
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            print('Value-iteration converged at iteration# %d.' % (i + 1))
            break
    return value_table

def extract_policy(value_table, gamma=1.0):
    # initialize the policy with zeros
    policy = np.zeros(env.observation_space.n)
    for state in range(env.observation_space.n):
        # initialize the Q table for a state
        Q_table = np.zeros(env.action_space.n)
        # compute the Q value for all actions in the state
        for action in range(env.action_space.n):
            for next_sr in env.env.P[state][action]:
                trans_prob, next_state, reward_prob, _ = next_sr
                Q_table[action] += trans_prob * (reward_prob + gamma * value_table[next_state])
        # select the action with the maximum Q value as the optimal action for the state
        policy[state] = np.argmax(Q_table)
    return policy

optimal_value_function = value_iteration(env=env, gamma=1.0)
optimal_policy = extract_policy(optimal_value_function, gamma=1.0)
print(optimal_value_function)
print(optimal_policy)

The output is as follows:

SFFF
FHFH
FFFH
HFFG
Value-iteration converged at iteration# 1373.
[0.82352941 0.82352941 0.82352941 0.82352941
 0.82352941 0.         0.52941176 0.
 0.82352941 0.82352941 0.76470588 0.
 0.         0.88235294 0.94117647 0.        ]
[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]
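
As a quick sanity check (not part of the original post), the extracted policy can be rolled out in the environment to see how often the agent actually reaches the goal. A minimal sketch, assuming env and optimal_policy from the code above are still in scope; the helper name evaluate_policy is my own:

def evaluate_policy(env, policy, n_episodes=1000):
    # run the greedy policy for n_episodes and count how often the goal is reached
    successes = 0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, _ = env.step(int(policy[state]))
        successes += reward  # FrozenLake gives reward 1 only on reaching G
    return successes / n_episodes

print('Success rate of the extracted policy:', evaluate_policy(env, optimal_policy))

Because the lake is slippery, even the optimal policy does not succeed every time; the success rate is a useful single-number summary of the learned behaviour.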

Solving the Frozen Lake problem with policy iteration

The implementation code is as follows:


  
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
env.render()

def compute_value_function(policy, gamma=1.0):
    # initialize the value table with zeros
    value_table = np.zeros(env.env.nS)
    # set the threshold
    threshold = 1e-10
    while True:
        # copy the value table to updated_value_table
        updated_value_table = np.copy(value_table)
        # for each state, select the action given by the policy and compute the value table
        for state in range(env.env.nS):
            action = policy[state]
            # build the value table with the selected action
            value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                                      for trans_prob, next_state, reward_prob, _ in env.env.P[state][action]])
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
    return value_table

def extract_policy(value_table, gamma=1.0):
    # initialize the policy with zeros
    policy = np.zeros(env.observation_space.n)
    for state in range(env.observation_space.n):
        # initialize the Q table for a state
        Q_table = np.zeros(env.action_space.n)
        # compute the Q value for all actions in the state
        for action in range(env.action_space.n):
            for next_sr in env.env.P[state][action]:
                trans_prob, next_state, reward_prob, _ = next_sr
                Q_table[action] += trans_prob * (reward_prob + gamma * value_table[next_state])
        # select the action with the maximum Q value as the optimal action for the state
        policy[state] = np.argmax(Q_table)
    return policy

def policy_iteration(env, gamma=1.0):
    # initialize the policy with zeros
    old_policy = np.zeros(env.observation_space.n)
    no_of_iterations = 200000
    for i in range(no_of_iterations):
        # compute the value function of the current policy
        new_value_function = compute_value_function(old_policy, gamma)
        # extract a new policy from the computed value function
        new_policy = extract_policy(new_value_function, gamma)
        # check whether we have reached convergence, i.e. whether we found the optimal policy,
        # by comparing old_policy and new_policy; if they are the same we break the iteration,
        # otherwise we update old_policy with new_policy
        if np.all(old_policy == new_policy):
            print('Policy-Iteration converged at step %d.' % (i + 1))
            break
        old_policy = new_policy
    return new_policy

print(policy_iteration(env))

The output is as follows:

SFFF
FHFH
FFFH
HFFG
Policy-Iteration converged at step 7.
[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]

Think about it: what are the differences between value iteration and policy iteration?

Key points to master: the Markov property and its applications, immediate rewards and the discount factor, the role of the Bellman equation, deriving the Bellman equation for the Q function, and the relationship between the value function and the Q function.
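
For reference, the relationships listed above can be written out (standard definitions, not derived in the original post): the state-value function is the policy-weighted average of the Q function,

    V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)

and the Bellman equation for the Q function expands one step of the dynamics,

    Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^{\pi}(s')\big]

Substituting the second equation into the first gives the Bellman expectation equation for $V^{\pi}$.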


Estimating the value of pi with Monte Carlo

The code is as follows:


  
import numpy as np
import math
import random
import matplotlib.pyplot as plt
# %matplotlib inline

square_size = 1
points_inside_circle = 0
points_inside_square = 0
sample_size = 1000
arc = np.linspace(0, np.pi/2, 100)

def generate_points(size):
    x = random.random() * size
    y = random.random() * size
    return (x, y)

def is_in_circle(point, size):
    return math.sqrt(point[0]**2 + point[1]**2) <= size

def compute_pi(points_inside_circle, points_inside_square):
    return 4 * (points_inside_circle / points_inside_square)

plt.axes().set_aspect('equal')
plt.plot(1 * np.cos(arc), 1 * np.sin(arc))

for i in range(sample_size):
    point = generate_points(square_size)
    plt.plot(point[0], point[1], 'c.')
    points_inside_square += 1
    if is_in_circle(point, square_size):
        points_inside_circle += 1

print("Approximate value of pi is {}"
      .format(compute_pi(points_inside_circle, points_inside_square)))

With sample_size = 2000:

First run:
Approximate value of pi is 3.148
Second run:
Approximate value of pi is 3.132
Third run:
Approximate value of pi is 3.184

With sample_size = 10000:

Approximate value of pi is 3.1444
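
The estimate only improves slowly with more samples: the standard error of a Monte Carlo estimate shrinks roughly as $1/\sqrt{N}$, which is why 10000 samples give a visibly better value than 2000. As an aside (not in the original post), the same estimate can be computed without plotting using a vectorized NumPy sketch; the function name estimate_pi is my own:

import numpy as np

def estimate_pi(n_samples, seed=None):
    # sample points uniformly in the unit square and count those inside the quarter circle
    rng = np.random.default_rng(seed)
    points = rng.random((n_samples, 2))
    inside = np.count_nonzero(np.sum(points**2, axis=1) <= 1.0)
    return 4 * inside / n_samples

for n in (2000, 10000, 1000000):
    print(n, estimate_pi(n))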

 

The Blackjack game

The code is as follows:
import gym
import numpy as np
from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from collections import defaultdict
from functools import partial
%matplotlib inline

plt.style.use('ggplot')
env = gym.make('Blackjack-v0')

def sample_policy(observation):
    # stick (0) if the player's score is 20 or 21, otherwise hit (1)
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1

def generate_episode(policy, env):
    # initialize the lists for storing states, actions, and rewards
    states, actions, rewards = [], [], []
    # initialize the gym environment
    observation = env.reset()
    while True:
        # append the state to the states list
        states.append(observation)
        # select an action using the given policy (here sample_policy) and append it to the actions list
        action = policy(observation)
        actions.append(action)
        # perform the action in the environment, move to the next state and receive the reward
        observation, reward, done, info = env.step(action)
        rewards.append(reward)
        # break if the state is a terminal state
        if done:
            break
    return states, actions, rewards

def first_visit_mc_prediction(policy, env, n_episodes):
    # initialize the empty value table as a dictionary for storing the value of each state
    value_table = defaultdict(float)
    N = defaultdict(int)
    for _ in range(n_episodes):
        # generate an episode and store its states and rewards
        states, _, rewards = generate_episode(policy, env)
        returns = 0
        # walk backwards through the episode and accumulate the return as a sum of rewards
        for t in range(len(states) - 1, -1, -1):
            R = rewards[t]
            S = states[t]
            returns += R
            # for first-visit MC, only update a state the first time it is visited in the episode,
            # keeping a running average of the returns observed from that state
            if S not in states[:t]:
                N[S] += 1
                value_table[S] += (returns - value_table[S]) / N[S]
    return value_table

value = first_visit_mc_prediction(sample_policy, env, n_episodes=500000)
for i in range(10):
    print(value.popitem())

def plot_blackjack(V, ax1, ax2):
    player_sum = np.arange(12, 21 + 1)
    dealer_show = np.arange(1, 10 + 1)
    usable_ace = np.array([False, True])
    state_values = np.zeros((len(player_sum), len(dealer_show), len(usable_ace)))
    for i, player in enumerate(player_sum):
        for j, dealer in enumerate(dealer_show):
            for k, ace in enumerate(usable_ace):
                state_values[i, j, k] = V[player, dealer, ace]
    X, Y = np.meshgrid(player_sum, dealer_show)
    ax1.plot_wireframe(X, Y, state_values[:, :, 0])
    ax2.plot_wireframe(X, Y, state_values[:, :, 1])
    for ax in ax1, ax2:
        ax.set_zlim(-1, 1)
        ax.set_ylabel('player sum')
        ax.set_xlabel('dealer showing')
        ax.set_zlabel('state-value')

fig, axes = pyplot.subplots(nrows=2, figsize=(5, 8),
                            subplot_kw={'projection': '3d'})
axes[0].set_title('value function without usable ace')
axes[1].set_title('value function with usable ace')
plot_blackjack(value, axes[0], axes[1])
The output (ten entries popped from the value table) is as follows:

((7, 1, False), -0.6560846560846558)
((4, 7, False), -0.4481132075471699)
((18, 6, False), -0.6899690515201155)
((6, 7, False), -0.5341246290801197)
((17, 8, True), -0.3760445682451254)
((20, 6, False), 0.7093462992976795)
((8, 9, False), -0.5497553017944529)
((4, 4, False), -0.5536480686695283)
((13, 1, False), -0.6560495938435249)
((12, 10, True), -0.20648648648648643)
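
As a small follow-up (not in the original post), the same episode generator can be reused to estimate the average return of sample_policy, giving a single-number summary of how well "stick on 20 or 21, otherwise hit" performs. A minimal sketch, assuming env, sample_policy and generate_episode from the code above are in scope; the function name average_return is my own:

def average_return(policy, env, n_episodes=100000):
    # Monte Carlo estimate of the expected (undiscounted) return of the given policy
    total = 0.0
    for _ in range(n_episodes):
        _, _, rewards = generate_episode(policy, env)
        total += sum(rewards)
    return total / n_episodes

print('Average return of sample_policy:', average_return(sample_policy, env))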


 

Source: zhangrelay.blog.csdn.net. Author: zhangrelay. Copyright belongs to the original author; please contact the author for reprints.

Original link: zhangrelay.blog.csdn.net/article/details/91867331
