Reinforcement Learning Notes 5 - Python/OpenAI/TensorFlow/ROS - Stage Review

Posted by zhangrelay on 2021/07/15 03:40:56

So far, four lessons have been completed, with a focus on OpenAI:

  1. Fundamentals: https://blog.csdn.net/zhangrelay/article/details/91361113
  2. Program instructions: https://blog.csdn.net/zhangrelay/article/details/91414600
  3. Planning and games: https://blog.csdn.net/zhangrelay/article/details/91867331
  4. Temporal difference: https://blog.csdn.net/zhangrelay/article/details/92012795

This is a good point to reread the earlier posts that focus on ROS:

  1. Installation and configuration: https://blog.csdn.net/zhangrelay/article/details/89702997
  2. Environment setup: https://blog.csdn.net/zhangrelay/article/details/89817010
  3. Deep learning: https://blog.csdn.net/zhangrelay/article/details/90177162

Working through this series should give you a solid command of both the AI toolkit (OpenAI) and the robotics toolkit (ROS).


Use the following environment to understand the difference between Q-learning and SARSA:

Q-learning - circuit2_turtlebot_lidar_qlearn.py:


  
    #!/usr/bin/env python
    import gym
    from gym import wrappers
    import gym_gazebo
    import time
    import numpy
    import random

    import qlearn
    import liveplot

    def render():
        # Relies on the module-level names env and x set in the main block.
        render_skip = 0 #Skip first X episodes.
        render_interval = 50 #Show render Every Y episodes.
        render_episodes = 10 #Show Z episodes every rendering.

        if (x%render_interval == 0) and (x != 0) and (x > render_skip):
            env.render()
        elif ((x-render_episodes)%render_interval == 0) and (x != 0) and (x > render_skip) and (render_episodes < x):
            env.render(close=True)

    if __name__ == '__main__':

        env = gym.make('GazeboCircuit2TurtlebotLidar-v0')

        outdir = '/tmp/gazebo_gym_experiments'
        env = gym.wrappers.Monitor(env, outdir, force=True)
        plotter = liveplot.LivePlot(outdir)

        last_time_steps = numpy.ndarray(0)

        # Note: this rebinds the name qlearn from the module to the agent instance.
        qlearn = qlearn.QLearn(actions=range(env.action_space.n),
                        alpha=0.2, gamma=0.8, epsilon=0.9)

        initial_epsilon = qlearn.epsilon

        epsilon_discount = 0.9986

        start_time = time.time()
        total_episodes = 10000
        highest_reward = 0

        for x in range(total_episodes):
            done = False

            cumulated_reward = 0 #Should going forward give more reward than L/R ?

            observation = env.reset()

            # Decay the exploration rate once per episode until it reaches 0.05.
            if qlearn.epsilon > 0.05:
                qlearn.epsilon *= epsilon_discount

            #render() #defined above, not env.render()

            # The observation is joined into a string that serves as the Q-table key.
            state = ''.join(map(str, observation))

            for i in range(1500):

                # Pick an action based on the current state
                action = qlearn.chooseAction(state)

                # Execute the action and get feedback
                observation, reward, done, info = env.step(action)
                cumulated_reward += reward

                if highest_reward < cumulated_reward:
                    highest_reward = cumulated_reward

                nextState = ''.join(map(str, observation))

                # Q-learning update: only the next state is needed, not the next action.
                qlearn.learn(state, action, reward, nextState)

                env._flush(force=True)

                if not(done):
                    state = nextState
                else:
                    last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
                    break

            if x%100==0:
                plotter.plot(env)

            m, s = divmod(int(time.time() - start_time), 60)
            h, m = divmod(m, 60)
            print ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+" Time: %d:%02d:%02d" % (h, m, s))

        #Github table content
        print ("\n|"+str(total_episodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |")

        l = last_time_steps.tolist()
        l.sort()

        #print("Parameters: a="+str)
        print("Overall score: {:0.2f}".format(last_time_steps.mean()))
        # sum() replaces reduce(), which is not a builtin in Python 3.
        print("Best 100 score: {:0.2f}".format(sum(l[-100:]) / len(l[-100:])))

        env.close()
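To get a feel for the exploration schedule in the script above (epsilon starts at 0.9 and is multiplied by 0.9986 at the start of each episode while it is still above 0.05), the decay can be simulated in isolation. This is only a sketch: the helper name `decay_epsilon` and the counting loop are illustrative and are not part of the original script.

```python
def decay_epsilon(epsilon, discount=0.9986, floor=0.05):
    """Apply one per-episode decay step, mirroring the check in the script."""
    if epsilon > floor:
        epsilon *= discount
    return epsilon


epsilon = 0.9                # initial exploration rate used in the script
episodes_until_floor = 0     # number of episodes in which epsilon still changes

for episode in range(10000):  # total_episodes in the script
    new_epsilon = decay_epsilon(epsilon)
    if new_epsilon == epsilon:
        # epsilon is at or below the floor, so the script stops decaying it
        break
    epsilon = new_epsilon
    episodes_until_floor += 1

print(episodes_until_floor, round(epsilon, 4))
```

Under these settings the exploration rate stops changing after roughly the first 2,000 of the 10,000 episodes, so most of the training run uses a nearly greedy policy with about 5% random actions.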

SARSA-circuit2_turtlebot_lidar_sarsa.py:


  
    #!/usr/bin/env python
    import gym
    from gym import wrappers
    import gym_gazebo
    import time
    import numpy
    import random

    import liveplot
    import sarsa

    if __name__ == '__main__':

        env = gym.make('GazeboCircuit2TurtlebotLidar-v0')

        outdir = '/tmp/gazebo_gym_experiments'
        env = gym.wrappers.Monitor(env, outdir, force=True)
        plotter = liveplot.LivePlot(outdir)

        last_time_steps = numpy.ndarray(0)

        # Note: this rebinds the name sarsa from the module to the agent instance.
        sarsa = sarsa.Sarsa(actions=range(env.action_space.n),
                        epsilon=0.9, alpha=0.2, gamma=0.9)

        initial_epsilon = sarsa.epsilon

        epsilon_discount = 0.9986

        start_time = time.time()
        total_episodes = 10000
        highest_reward = 0

        for x in range(total_episodes):
            done = False

            cumulated_reward = 0 #Should going forward give more reward than L/R ?

            observation = env.reset()

            # Decay the exploration rate once per episode until it reaches 0.05.
            if sarsa.epsilon > 0.05:
                sarsa.epsilon *= epsilon_discount

            #render() #defined above, not env.render()

            # The observation is joined into a string that serves as the Q-table key.
            state = ''.join(map(str, observation))

            for i in range(1500):

                # Pick an action based on the current state
                action = sarsa.chooseAction(state)

                # Execute the action and get feedback
                observation, reward, done, info = env.step(action)
                cumulated_reward += reward

                if highest_reward < cumulated_reward:
                    highest_reward = cumulated_reward

                nextState = ''.join(map(str, observation))
                # SARSA update: the action chosen for the next state is part of the target.
                # Note: the next loop iteration calls chooseAction again rather than
                # reusing nextAction, so the executed action may differ from it.
                nextAction = sarsa.chooseAction(nextState)

                #sarsa.learn(state, action, reward, nextState)
                sarsa.learn(state, action, reward, nextState, nextAction)

                env._flush(force=True)

                if not(done):
                    state = nextState
                else:
                    last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
                    break

            if x%100==0:
                plotter.plot(env)

            m, s = divmod(int(time.time() - start_time), 60)
            h, m = divmod(m, 60)
            print ("EP: "+str(x+1)+" - [alpha: "+str(round(sarsa.alpha,2))+" - gamma: "+str(round(sarsa.gamma,2))+" - epsilon: "+str(round(sarsa.epsilon,2))+"] - Reward: "+str(cumulated_reward)+" Time: %d:%02d:%02d" % (h, m, s))

        #Github table content
        print ("\n|"+str(total_episodes)+"|"+str(sarsa.alpha)+"|"+str(sarsa.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |")

        l = last_time_steps.tolist()
        l.sort()

        #print("Parameters: a="+str)
        print("Overall score: {:0.2f}".format(last_time_steps.mean()))
        # sum() replaces reduce(), which is not a builtin in Python 3.
        print("Best 100 score: {:0.2f}".format(sum(l[-100:]) / len(l[-100:])))

        env.close()
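The two scripts are nearly identical; the visible difference is the `learn(...)` call, which takes `nextState` for Q-learning and `nextState, nextAction` for SARSA (the scripts also use gamma=0.8 and gamma=0.9 respectively). The `qlearn` and `sarsa` modules themselves are not shown in this post, so the following is only a minimal sketch of the textbook tabular update rules, not the contents of those modules. The function names, the dictionary-based Q-table, and the toy states `"s0"` and `"s1"` are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Off-policy update: bootstrap from the best action value in next_state."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma):
    """On-policy update: bootstrap from the action actually chosen in next_state."""
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])


actions = [0, 1, 2]
q_qlearn = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q_sarsa = defaultdict(float)

# Give both tables the same toy values for the next state "s1".
for table in (q_qlearn, q_sarsa):
    table[("s1", 0)] = 1.0
    table[("s1", 1)] = 5.0

# Same transition for both agents; the SARSA agent happened to pick the
# lower-valued action 0 in "s1" (for example, through exploration).
q_learning_update(q_qlearn, "s0", 0, 1.0, "s1", actions, alpha=0.2, gamma=0.8)
sarsa_update(q_sarsa, "s0", 0, 1.0, "s1", 0, alpha=0.2, gamma=0.8)

print(q_qlearn[("s0", 0)], q_sarsa[("s0", 0)])
```

In this toy transition the Q-learning target uses the best next value (5.0) while the SARSA target uses the value of the action that was actually selected (1.0), so the Q-learning estimate moves further. This is the sense in which Q-learning learns about the greedy policy while SARSA learns about the exploring policy it is following.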

Review (temporal difference): https://blog.csdn.net/zhangrelay/article/details/92012795

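Between the taxi demo in that post and the TurtleBot demo above, you should be able to understand and apply the most basic uses of the two tools, ROS and OpenAI.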


 

Source: zhangrelay.blog.csdn.net. Author: zhangrelay. Copyright belongs to the original author; please contact the author before reposting.

Original link: zhangrelay.blog.csdn.net/article/details/92050001
