Controlling a Mountain Car with the DDPG Algorithm
Objectives
Through this case study and the homework exercises, you will:
- Understand the basic concepts of the DDPG algorithm
- Learn how to train a control-type mini game with DDPG
- Understand the end-to-end workflow of training and inference for a reinforcement learning game
You can also share the ipynb notebook for this case in the AI Gallery Notebook section to earn growth points; see this document for how to share.
Case Overview
MountainCarContinuous is a classic OpenAI Gym control problem with a continuous action space. In this game we push the car to the left or right; if the car reaches the mountaintop, the game is won, and the earlier it gets there, the higher the score. If the car has not reached the top after 999 steps, the game is lost. DDPG (Deep Deterministic Policy Gradient), published by Google DeepMind at ICLR 2016, is designed for problems with continuous action spaces. In this case we show how to use DDPG to train the continuous-action mountain-car task.
Overall workflow: install dependencies -> create the mountain_car environment -> build the DDPG algorithm -> train -> run inference -> visualize the results
Introduction to the DDPG Algorithm
DDPG brings deep neural networks into DPG. Its core improvement over DPG is to approximate the policy function μ and the Q function with neural networks (the policy network and the Q network; the original paper used convolutional networks for pixel inputs) and to train those networks with deep learning techniques. The DDPG pseudocode is shown in the figure below:
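In place of the pseudocode figure, the per-step updates that DDPG performs can be summarized as follows (a brief summary in standard notation, not copied from the figure):

y_i = r_i + \gamma\,(1 - d_i)\,Q'\big(s_{i+1}, \mu'(s_{i+1})\big)
L(\phi) = \frac{1}{N}\sum_i \big(Q_\phi(s_i, a_i) - y_i\big)^2 \quad \text{(critic loss, minimized)}
J(\theta) = \frac{1}{N}\sum_i Q_\phi\big(s_i, \mu_\theta(s_i)\big) \quad \text{(actor objective, maximized)}
\phi' \leftarrow \tau\,\phi + (1 - \tau)\,\phi', \qquad \theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta' \quad \text{(soft target updates)}

Here μ_θ is the policy (actor), Q_φ the critic, μ' and Q' their target copies, and τ the soft-update rate (the "soft_lr" setting later in this notebook).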
The MountainCarContinuous-v0 Environment
MountainCarContinuous-v0 is a classic control environment in Gym. The task is to push the car left or right so that it eventually reaches the flag on the right-hand hill.
- Game mechanics
The car starts between two hills, and its engine is not powerful enough to drive straight up to the flag. The player therefore has to rock the car back and forth in the valley to build up momentum and push it over the goal. The less energy the car uses to reach the goal, the higher the reward.
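As a quick sanity check of the environment interface (a minimal sketch using random actions, independent of the training code below), you can step the environment and inspect the observation and reward: each step is penalized roughly in proportion to the squared action, and reaching the flag yields a bonus of about +100.

import gym

# step MountainCarContinuous with random actions just to inspect the interface
env = gym.make('MountainCarContinuous-v0')
obs = env.reset()                       # observation = [position, velocity]
total_reward = 0.
for _ in range(5):
    action = env.action_space.sample()  # a single force value in [-1, 1]
    obs, reward, done, info = env.step(action)
    total_reward += reward              # per-step cost ~ 0.1 * action**2; ~+100 bonus at the flag
    if done:
        break
env.close()
print(obs, total_reward)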
Notes
- This case runs on Pytorch-1.0.0 and requires a GPU. See the "ModelArts JupyterLab Hardware Specification Guide" for how to switch hardware specifications.
- If this is your first time using JupyterLab, see the "ModelArts JupyterLab User Guide".
- If you hit errors while using JupyterLab, see the "ModelArts JupyterLab FAQ" for troubleshooting.
Steps
1. Initialization
Step 1: Install dependencies
Make sure all dependencies install successfully before running the subsequent code. If a module fails to install due to network issues, simply retry once.
!pip install gym
!pip install pandas
Requirement already satisfied: gym in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages
Requirement already satisfied: pyglet<=1.5.0,>=1.4.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)
Requirement already satisfied: Pillow<=7.2.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)
Requirement already satisfied: numpy>=1.10.4 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)
Requirement already satisfied: scipy in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)
Requirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)
Requirement already satisfied: future in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pyglet<=1.5.0,>=1.4.0->gym)
You are using pip version 9.0.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already satisfied: pandas in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages
Requirement already satisfied: python-dateutil>=2 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: numpy>=1.9.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from python-dateutil>=2->pandas)
You are using pip version 9.0.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Step 2: Import required libraries
%matplotlib inline
import sys
import logging
import imp
import itertools
import copy
import moxing as mox
import numpy as np
np.random.seed(0)
import pandas as pd
import gym
import matplotlib.pyplot as plt
import torch
torch.manual_seed(0)
import torch.nn as nn
import torch.optim as optim
imp.reload(logging)
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s [%(levelname)s] %(message)s',
stream=sys.stdout, datefmt='%H:%M:%S')
INFO:root:Using MoXing-v1.17.3-
INFO:root:Using OBS-Python-SDK-3.20.7
11:28:18 [DEBUG] backend module://ipykernel.pylab.backend_inline version unknown
2. Parameter Settings
This case stops training once the target reward is reached. With "target_reward" set to 90, the model has reached a good level. Training takes about 10 minutes.
opt = {
    "replayer"      : 100000,  # replay buffer capacity
    "gama"          : 0.99,    # discount factor for Q-value estimation
    "learn_start"   : 5000,    # start learning once the buffer holds this many transitions
    "a_lr"          : 0.0001,  # actor learning rate
    "c_lr"          : 0.001,   # critic learning rate
    "soft_lr"       : 0.005,   # soft-update rate for the target networks
    "target_reward" : 90,      # target average reward per episode
    "batch_size"    : 512,     # batch size for experience replay
    "mu"            : 0.,      # mean of the OU noise
    "sigma"         : 0.5,     # volatility of the OU noise
    "theta"         : .15      # mean-reversion rate of the OU noise
}
3. Create the Game Environment
env = gym.make('MountainCarContinuous-v0')
env.seed(0)
for key in vars(env):
    logging.info('%s: %s', key, vars(env)[key])
11:28:18 [INFO] env: <Continuous_MountainCarEnv<MountainCarContinuous-v0>>
11:28:18 [INFO] action_space: Box(-1.0, 1.0, (1,), float32)
11:28:18 [INFO] observation_space: Box(-1.2000000476837158, 0.6000000238418579, (2,), float32)
11:28:18 [INFO] reward_range: (-inf, inf)
11:28:18 [INFO] metadata: {'render.modes': ['human', 'rgb_array'], 'video.frames_per_second': 30}
11:28:18 [INFO] _max_episode_steps: 999
11:28:18 [INFO] _elapsed_steps: None
4. Build the DDPG Algorithm
Build the Actor-Critic networks
class Actor(nn.Module):
    """Policy network: maps a state to a deterministic action in [-1, 1]."""
    def __init__(self, nb_states, nb_actions, hidden1=400, hidden2=300):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(nb_states, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, nb_actions)
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.relu(out)
        out = self.fc3(out)
        out = self.tanh(out)  # squash the action into [-1, 1]
        return out


class Critic(nn.Module):
    """Q network: maps a (state, action) pair to a scalar Q value."""
    def __init__(self, nb_states, nb_actions, hidden1=400, hidden2=300):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(nb_states, hidden1)
        self.fc2 = nn.Linear(hidden1 + nb_actions, hidden2)  # the action enters at the second layer
        self.fc3 = nn.Linear(hidden2, 1)
        self.relu = nn.ReLU()

    def forward(self, xs):
        x, a = xs
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(torch.cat([out, a], 1))
        out = self.relu(out)
        out = self.fc3(out)
        return out
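Before wiring these networks into the agent, a quick shape check can confirm how the critic consumes the (state, action) pair. This is a small optional sketch; the dimensions match MountainCarContinuous, whose state is 2-dimensional and action 1-dimensional.

# shape check with dummy data (batch of 4)
dummy_state = torch.zeros(4, 2)    # state = (position, velocity)
dummy_action = torch.zeros(4, 1)   # a single continuous force

actor_check = Actor(nb_states=2, nb_actions=1)
critic_check = Critic(nb_states=2, nb_actions=1)

print(actor_check(dummy_state).shape)                     # torch.Size([4, 1]), actions in [-1, 1]
print(critic_check([dummy_state, dummy_action]).shape)    # torch.Size([4, 1]), one Q value per sample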
Build the Ornstein-Uhlenbeck (OU) exploration noise
class OrnsteinUhlenbeckProcess:
    """Temporally correlated exploration noise for continuous actions."""
    def __init__(self, x0):
        self.x = x0

    def __call__(self, mu=opt["mu"], sigma=opt["sigma"], theta=opt["theta"], dt=0.01):
        n = np.random.normal(size=self.x.shape)
        self.x += (theta * (mu - self.x) * dt + sigma * np.sqrt(dt) * n)
        return self.x
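The __call__ method above is the Euler discretization of the Ornstein-Uhlenbeck stochastic differential equation; with step size Δt it computes:

x_{t+\Delta t} = x_t + \theta\,(\mu - x_t)\,\Delta t + \sigma\sqrt{\Delta t}\;\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, 1)

Unlike independent Gaussian noise, consecutive samples are correlated, which tends to produce sustained pushes in one direction and helps the car build up momentum during exploration.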
Build the replay buffer
class Replayer:
    """Experience replay buffer backed by a pandas DataFrame."""
    def __init__(self, capacity):
        self.memory = pd.DataFrame(index=range(capacity),
                                   columns=['observation', 'action', 'reward',
                                            'next_observation', 'done'])
        self.i = 0
        self.count = 0
        self.capacity = capacity

    def store(self, *args):
        self.memory.loc[self.i] = args
        self.i = (self.i + 1) % self.capacity            # overwrite the oldest entries when full
        self.count = min(self.count + 1, self.capacity)

    def sample(self, size):
        indices = np.random.choice(self.count, size=size)
        return (np.stack(self.memory.loc[indices, field]) for field in
                self.memory.columns)
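A minimal usage sketch of the buffer, with hypothetical toy transitions rather than values from the actual training run:

# store a few toy transitions and draw a small batch
buffer = Replayer(capacity=1000)
for i in range(10):
    obs = np.array([0.0, 0.0])
    next_obs = np.array([0.01, 0.001])
    buffer.store(obs, np.array([0.5]), -0.025, next_obs, False)

observations, actions, rewards, next_observations, dones = buffer.sample(4)
print(observations.shape, actions.shape, rewards.shape)  # (4, 2) (4, 1) (4,)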
Build the core DDPG training logic
class DDPGAgent:
    def __init__(self, env):
        state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.shape[0]
        self.action_low = env.action_space.low[0]
        self.action_high = env.action_space.high[0]
        self.gamma = opt["gama"]  # discount factor (0.99)
        # download the pretrained weights from OBS
        mox.file.copy_parallel("obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/ddpg_mountaincar/model/", "model/")
        self.replayer = Replayer(opt["replayer"])

        # actor (policy) network and its target copy
        self.actor_evaluate_net = Actor(state_dim, self.action_dim)
        self.actor_evaluate_net.load_state_dict(torch.load('model/actor.pkl'))
        self.actor_optimizer = optim.Adam(self.actor_evaluate_net.parameters(), lr=opt["a_lr"])
        self.actor_target_net = copy.deepcopy(self.actor_evaluate_net)

        # critic (Q) network and its target copy
        self.critic_evaluate_net = Critic(state_dim, self.action_dim)
        self.critic_evaluate_net.load_state_dict(torch.load('model/critic.pkl'))
        self.critic_optimizer = optim.Adam(self.critic_evaluate_net.parameters(), lr=opt["c_lr"])
        self.critic_loss = nn.MSELoss()
        self.critic_target_net = copy.deepcopy(self.critic_evaluate_net)

    def reset(self, mode=None):
        self.mode = mode
        if self.mode == 'train':
            self.trajectory = []
            self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,)))

    def step(self, observation, reward, done):
        if self.mode == 'train' and self.replayer.count < opt["learn_start"]:
            # warm-up phase: fill the replay buffer with random actions before learning starts
            action = np.random.uniform(self.action_low, self.action_high, size=self.action_dim)
        else:
            state_tensor = torch.as_tensor(observation, dtype=torch.float).reshape(1, -1)
            action_tensor = self.actor_evaluate_net(state_tensor)
            action = action_tensor.detach().numpy()[0]
        if self.mode == 'train':
            # add exploration noise and clip to the valid action range
            noise = self.noise(sigma=0.1)
            action = (action + noise).clip(self.action_low, self.action_high)
            self.trajectory += [observation, reward, done, action]
            if len(self.trajectory) >= 8:
                state, _, _, act, next_state, reward, done, _ = self.trajectory[-8:]
                self.replayer.store(state, act, reward, next_state, done)
            if self.replayer.count >= opt["learn_start"]:
                self.learn()
        return action

    def close(self):
        pass

    def update_net(self, target_net, evaluate_net, learning_rate=opt["soft_lr"]):
        # soft update: target <- lr * evaluate + (1 - lr) * target
        for target_param, evaluate_param in zip(
                target_net.parameters(), evaluate_net.parameters()):
            target_param.data.copy_(learning_rate * evaluate_param.data
                                    + (1 - learning_rate) * target_param.data)

    def learn(self):
        # sample a batch of transitions from the replay buffer
        states, actions, rewards, next_states, dones = self.replayer.sample(opt["batch_size"])
        state_tensor = torch.as_tensor(states, dtype=torch.float)
        action_tensor = torch.as_tensor(actions, dtype=torch.float)
        reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
        dones = dones.astype(int)
        done_tensor = torch.as_tensor(dones, dtype=torch.float)
        next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)

        # learn critic: regress Q(s, a) towards r + gamma * (1 - done) * Q'(s', mu'(s') + noise)
        next_action_tensor = self.actor_target_net(next_state_tensor)
        noise_tensor = (0.2 * torch.randn_like(action_tensor, dtype=torch.float))
        noisy_next_action_tensor = (next_action_tensor + noise_tensor).clamp(
            self.action_low, self.action_high)
        next_state_action_tensor = [next_state_tensor, noisy_next_action_tensor]
        next_q_tensor = self.critic_target_net(next_state_action_tensor).squeeze(1)
        critic_target_tensor = reward_tensor + (1. - done_tensor) * self.gamma * next_q_tensor
        critic_target_tensor = critic_target_tensor.detach()
        state_action_tensor = [state_tensor, action_tensor]
        critic_pred_tensor = self.critic_evaluate_net(state_action_tensor).squeeze(1)
        critic_loss_tensor = self.critic_loss(critic_pred_tensor, critic_target_tensor)
        self.critic_optimizer.zero_grad()
        critic_loss_tensor.backward()
        self.critic_optimizer.step()

        # learn actor: maximize Q(s, mu(s)), i.e. minimize its negative mean
        pred_action_tensor = self.actor_evaluate_net(state_tensor)
        pred_action_tensor = pred_action_tensor.clamp(self.action_low, self.action_high)
        pred_state_action_tensor = [state_tensor, pred_action_tensor]
        critic_pred_tensor = self.critic_evaluate_net(pred_state_action_tensor)
        actor_loss_tensor = -critic_pred_tensor.mean()
        self.actor_optimizer.zero_grad()
        actor_loss_tensor.backward()
        self.actor_optimizer.step()

        # soft-update the target networks
        self.update_net(self.critic_target_net, self.critic_evaluate_net)
        self.update_net(self.actor_target_net, self.actor_evaluate_net)
agent = DDPGAgent(env)
11:28:18 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Ready to call (timestamp=1635305298.8828228): obsClient.getObjectMetadata
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Finish calling (timestamp=1635305299.1487627)
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Ready to call (timestamp=1635305299.1495306): obsClient.listObjects
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Finish calling (timestamp=1635305299.1665623)
11:28:19 [DEBUG] Start to copy 2 files from obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/ddpg_mountaincar/model to model.
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Ready to call (timestamp=1635305299.187472): obsClient.getObjectMetadata
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Ready to call (timestamp=1635305299.192197): obsClient.getObjectMetadata
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Finish calling (timestamp=1635305299.2651675)
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Ready to call (timestamp=1635305299.2662349): obsClient.getObject
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Finish calling (timestamp=1635305299.2906594)
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Finish calling (timestamp=1635305299.4189594)
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Ready to call (timestamp=1635305299.4199762): obsClient.getObject
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Finish calling (timestamp=1635305299.430522)
11:28:19 [DEBUG] Copy Successfully.
5. Start Training
Training until the average reward exceeds 90 takes about 10 minutes.
def play_episode(env, agent, max_episode_steps=None, mode=None):
    observation, reward, done = env.reset(), 0., False
    agent.reset(mode=mode)
    episode_reward, elapsed_steps = 0., 0
    while True:
        action = agent.step(observation, reward, done)
        # visualization
        # env.render()
        if done:
            break
        observation, reward, done, _ = env.step(action)
        episode_reward += reward
        elapsed_steps += 1
        if max_episode_steps and elapsed_steps >= max_episode_steps:
            break
    agent.close()
    return episode_reward, elapsed_steps


logging.info('==== train ====')
episode_rewards = []
for episode in itertools.count():
    episode_reward, elapsed_steps = play_episode(env.unwrapped, agent,
            max_episode_steps=env._max_episode_steps, mode='train')
    episode_rewards.append(episode_reward)
    logging.debug('train episode %d: reward = %.2f, steps = %d',
            episode, episode_reward, elapsed_steps)
    if episode > 10 and np.mean(episode_rewards[-10:]) > opt["target_reward"]:  # stop once the last 10 rewards average above 90
        break

plt.plot(episode_rewards)
11:28:19 [INFO] ==== train ====
/home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages/pandas/core/internals.py:826: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
arr_value = np.array(value)
11:28:25 [DEBUG] train episode 0: reward = -43.16, steps = 999
11:28:32 [DEBUG] train episode 1: reward = -38.43, steps = 999
11:28:38 [DEBUG] train episode 2: reward = -51.41, steps = 999
11:28:44 [DEBUG] train episode 3: reward = -48.34, steps = 999
11:28:51 [DEBUG] train episode 4: reward = -50.17, steps = 999
11:28:56 [DEBUG] train episode 5: reward = 93.68, steps = 69
11:29:02 [DEBUG] train episode 6: reward = 93.60, steps = 67
11:29:08 [DEBUG] train episode 7: reward = 93.85, steps = 66
11:29:16 [DEBUG] train episode 8: reward = 92.45, steps = 85
11:29:22 [DEBUG] train episode 9: reward = 93.77, steps = 67
11:29:32 [DEBUG] train episode 10: reward = 91.62, steps = 103
11:29:38 [DEBUG] train episode 11: reward = 94.28, steps = 67
11:29:44 [DEBUG] train episode 12: reward = 93.65, steps = 68
11:29:50 [DEBUG] train episode 13: reward = 93.66, steps = 66
11:29:57 [DEBUG] train episode 14: reward = 93.73, steps = 65
[<matplotlib.lines.Line2D at 0x7fd0f48ef550>]
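After training you may want to keep the improved weights for later inference. The snippet below is a minimal sketch using standard PyTorch saving; the file names are examples, not part of the original case.

# save the current actor/critic weights locally (example paths)
torch.save(agent.actor_evaluate_net.state_dict(), 'model/actor_finetuned.pkl')
torch.save(agent.critic_evaluate_net.state_dict(), 'model/critic_finetuned.pkl')

# to reload them later:
# agent.actor_evaluate_net.load_state_dict(torch.load('model/actor_finetuned.pkl'))
# agent.critic_evaluate_net.load_state_dict(torch.load('model/critic_finetuned.pkl'))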
6. Run Inference with the Model
Visualization in this notebook depends on OpenGL and needs a display window, which the current environment does not support, so rendering cannot be shown here. To see the visualization, download the code to your local machine and uncomment the env.render() line.
logging.info('==== test ====')
episode_rewards = []
for episode in range(100):
    episode_reward, elapsed_steps = play_episode(env, agent)
    episode_rewards.append(episode_reward)
    logging.debug('test episode %d: reward = %.2f, steps = %d',
            episode, episode_reward, elapsed_steps)
logging.info('average episode reward = %.2f ± %.2f',
        np.mean(episode_rewards), np.std(episode_rewards))
env.close()
11:29:57 [INFO] ==== test ====
11:29:57 [DEBUG] test episode 0: reward = 93.60, steps = 66
11:29:57 [DEBUG] test episode 1: reward = 93.45, steps = 68
11:29:57 [DEBUG] test episode 2: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 3: reward = 92.87, steps = 87
11:29:57 [DEBUG] test episode 4: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 5: reward = 93.58, steps = 66
11:29:57 [DEBUG] test episode 6: reward = 92.58, steps = 86
11:29:57 [DEBUG] test episode 7: reward = 93.46, steps = 68
11:29:57 [DEBUG] test episode 8: reward = 93.60, steps = 66
11:29:57 [DEBUG] test episode 9: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 10: reward = 93.58, steps = 66
11:29:57 [DEBUG] test episode 11: reward = 93.59, steps = 66
11:29:57 [DEBUG] test episode 12: reward = 88.82, steps = 126
11:29:57 [DEBUG] test episode 13: reward = 93.53, steps = 67
11:29:57 [DEBUG] test episode 14: reward = 93.17, steps = 72
11:29:57 [DEBUG] test episode 15: reward = 93.32, steps = 70
11:29:57 [DEBUG] test episode 16: reward = 93.40, steps = 69
11:29:57 [DEBUG] test episode 17: reward = 93.53, steps = 67
11:29:57 [DEBUG] test episode 18: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 19: reward = 93.46, steps = 68
11:29:57 [DEBUG] test episode 20: reward = 93.47, steps = 68
11:29:57 [DEBUG] test episode 21: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 22: reward = 93.39, steps = 69
11:29:57 [DEBUG] test episode 23: reward = 93.54, steps = 67
11:29:57 [DEBUG] test episode 24: reward = 93.18, steps = 72
11:29:57 [DEBUG] test episode 25: reward = 93.39, steps = 69
11:29:57 [DEBUG] test episode 26: reward = 93.66, steps = 65
11:29:57 [DEBUG] test episode 27: reward = 92.42, steps = 86
11:29:57 [DEBUG] test episode 28: reward = 93.25, steps = 71
11:29:57 [DEBUG] test episode 29: reward = 93.66, steps = 65
11:29:57 [DEBUG] test episode 30: reward = 93.17, steps = 72
11:29:57 [DEBUG] test episode 31: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 32: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 33: reward = 93.52, steps = 67
11:29:57 [DEBUG] test episode 34: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 35: reward = 91.86, steps = 120
11:29:57 [DEBUG] test episode 36: reward = 88.23, steps = 131
11:29:57 [DEBUG] test episode 37: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 38: reward = 93.31, steps = 70
11:29:57 [DEBUG] test episode 39: reward = 93.31, steps = 70
11:29:57 [DEBUG] test episode 40: reward = 93.31, steps = 70
11:29:57 [DEBUG] test episode 41: reward = 93.60, steps = 66
11:29:57 [DEBUG] test episode 42: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 43: reward = 93.66, steps = 65
11:29:57 [DEBUG] test episode 44: reward = 93.20, steps = 72
11:29:57 [DEBUG] test episode 45: reward = 93.67, steps = 65
11:29:57 [DEBUG] test episode 46: reward = 88.05, steps = 133
11:29:57 [DEBUG] test episode 47: reward = 88.83, steps = 126
11:29:58 [DEBUG] test episode 48: reward = 93.61, steps = 66
11:29:58 [DEBUG] test episode 49: reward = 93.52, steps = 67
11:29:58 [DEBUG] test episode 50: reward = 91.99, steps = 103
11:29:58 [DEBUG] test episode 51: reward = 93.58, steps = 66
11:29:58 [DEBUG] test episode 52: reward = 93.39, steps = 69
11:29:58 [DEBUG] test episode 53: reward = 93.67, steps = 65
11:29:58 [DEBUG] test episode 54: reward = 93.59, steps = 66
11:29:58 [DEBUG] test episode 55: reward = 91.40, steps = 105
11:29:58 [DEBUG] test episode 56: reward = 93.59, steps = 66
11:29:58 [DEBUG] test episode 57: reward = 92.72, steps = 86
11:29:58 [DEBUG] test episode 58: reward = 93.38, steps = 69
11:29:58 [DEBUG] test episode 59: reward = 93.45, steps = 68
11:29:58 [DEBUG] test episode 60: reward = 93.54, steps = 67
11:29:58 [DEBUG] test episode 61: reward = 93.32, steps = 70
11:29:58 [DEBUG] test episode 62: reward = 93.66, steps = 65
11:29:58 [DEBUG] test episode 63: reward = 93.58, steps = 66
11:29:58 [DEBUG] test episode 64: reward = 93.60, steps = 66
11:29:58 [DEBUG] test episode 65: reward = 93.03, steps = 87
11:29:58 [DEBUG] test episode 66: reward = 93.58, steps = 66
11:29:58 [DEBUG] test episode 67: reward = 92.55, steps = 86
11:29:58 [DEBUG] test episode 68: reward = 93.37, steps = 69
11:29:58 [DEBUG] test episode 69: reward = 93.61, steps = 66
11:29:58 [DEBUG] test episode 70: reward = 93.61, steps = 66
11:29:58 [DEBUG] test episode 71: reward = 93.44, steps = 68
11:29:58 [DEBUG] test episode 72: reward = 93.59, steps = 66
11:29:58 [DEBUG] test episode 73: reward = 93.46, steps = 68
11:29:58 [DEBUG] test episode 74: reward = 93.54, steps = 67
11:29:58 [DEBUG] test episode 75: reward = 93.31, steps = 70
11:29:58 [DEBUG] test episode 76: reward = 89.16, steps = 124
11:29:58 [DEBUG] test episode 77: reward = 92.82, steps = 77
11:29:58 [DEBUG] test episode 78: reward = 93.37, steps = 69
11:29:58 [DEBUG] test episode 79: reward = 93.60, steps = 66
11:29:58 [DEBUG] test episode 80: reward = 93.67, steps = 65
11:29:58 [DEBUG] test episode 81: reward = 93.46, steps = 68
11:29:58 [DEBUG] test episode 82: reward = 93.68, steps = 65
11:29:58 [DEBUG] test episode 83: reward = 93.54, steps = 67
11:29:58 [DEBUG] test episode 84: reward = 93.19, steps = 72
11:29:58 [DEBUG] test episode 85: reward = 92.95, steps = 87
11:29:58 [DEBUG] test episode 86: reward = 93.33, steps = 70
11:29:58 [DEBUG] test episode 87: reward = 93.60, steps = 66
11:29:58 [DEBUG] test episode 88: reward = 93.68, steps = 65
11:29:58 [DEBUG] test episode 89: reward = 92.96, steps = 75
11:29:58 [DEBUG] test episode 90: reward = 93.38, steps = 69
11:29:58 [DEBUG] test episode 91: reward = 93.59, steps = 66
11:29:58 [DEBUG] test episode 92: reward = 93.10, steps = 73
11:29:58 [DEBUG] test episode 93: reward = 93.19, steps = 72
11:29:58 [DEBUG] test episode 94: reward = 93.66, steps = 65
11:29:58 [DEBUG] test episode 95: reward = 91.91, steps = 103
11:29:58 [DEBUG] test episode 96: reward = 93.53, steps = 67
11:29:58 [DEBUG] test episode 97: reward = 93.52, steps = 67
11:29:58 [DEBUG] test episode 98: reward = 93.60, steps = 66
11:29:58 [DEBUG] test episode 99: reward = 92.68, steps = 79
11:29:58 [INFO] average episode reward = 93.12 ± 1.12
7. Visualization
The video below shows the inference result with target_reward set to 90; the animation shows the car reaching the goal while spending very little energy.
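If you run the notebook locally but prefer a file over a pop-up window, one possible workaround (a sketch that assumes the optional imageio package is installed, e.g. via pip install imageio) is to grab frames with render(mode='rgb_array') and write them to a GIF:

import imageio  # assumed to be installed separately

frames = []
observation, reward, done = env.reset(), 0., False
agent.reset()
while not done:
    frames.append(env.render(mode='rgb_array'))   # grab a frame instead of opening a window
    action = agent.step(observation, reward, done)
    observation, reward, done, _ = env.step(action)
env.close()

imageio.mimsave('mountain_car.gif', frames, fps=30)  # write the episode as an animated GIF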
8. Homework