Controlling the Mountain Car with the DDPG Algorithm

Posted by HWCloudAI on 2022/11/28 20:28:15


Experiment Goals

Through this case study and the follow-up exercise, you will:

  1. Understand the basic concepts of the DDPG algorithm
  2. Understand how to train a simple control game based on DDPG
  3. Understand the overall workflow of training and inference for a reinforcement-learning game

You can also share the ipynb notebook of this case in the AI Gallery Notebook section to earn growth points; see this document for how to share.

Case Overview

MountainCarContinuous is a control game with a continuous action space and a classic OpenAI Gym problem. In this game we can push the car to the left or to the right; if the car reaches the mountaintop, the game is won, and the earlier it gets there, the higher the score. If it has not reached the top after 999 steps, the game is lost. DDPG, short for Deep Deterministic Policy Gradient, was published by Google DeepMind at ICLR 2016 and is mainly used for problems with continuous action spaces. In this case we show how to train the continuous-action mountain-car task with the DDPG algorithm.

Overall workflow: install basic dependencies -> create the mountain_car environment -> build the DDPG algorithm -> train -> run inference -> visualize the result

Introduction to DDPG

DDPG integrates deep neural networks into DPG. Its core improvement over DPG is to approximate the policy function μ and the Q function with deep neural networks (the policy network and the Q network; in this case both are fully connected networks) and to train these networks with deep learning methods. The pseudocode of the DDPG algorithm is shown in the figure below:
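
For reference, the core updates described by that pseudocode can be summarized as follows (a standard DDPG sketch written with the quantities used later in this notebook: γ is the discount factor, τ the soft-update rate, Q/μ the evaluate networks and Q'/μ' the target networks). The learn() method further below follows these steps, with a small amount of smoothing noise added to the target action:

    Critic target:  y = r + γ · (1 − done) · Q'(s', μ'(s'))
    Critic loss:    L_critic = mean[(Q(s, a) − y)²]          (minimized by the critic optimizer)
    Actor loss:     L_actor  = −mean[Q(s, μ(s))]             (minimized by the actor optimizer)
    Soft update:    w_target ← τ · w_evaluate + (1 − τ) · w_target   (applied to both target networks)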

Overview of the MountainCarContinuous-v0 Environment

MountainCarContinuous-v0 is a classic control environment based on gym. The player's task is to push the car left or right until it finally reaches the flag on the right-hand hill.

  • Game mechanics

The car starts between two hills, and its engine cannot provide enough power to drive straight up to the flag on the right. The player therefore has to rock the car back and forth in the valley to build up momentum so that it can make it over the goal. The less energy the car uses to reach the goal, the higher the reward.
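
To make the mechanics concrete, here is a minimal interaction sketch (an illustrative snippet that is not part of the original notebook; it assumes the same old-style gym API used throughout this case, where env.step returns four values). The state is (position, velocity), the action is a one-dimensional force in [-1, 1], each step is penalized in proportion to the squared action, and reaching the flag yields a large bonus. A random policy almost never reaches the flag, so its return is typically a small negative number.

import gym

demo_env = gym.make('MountainCarContinuous-v0')
demo_env.seed(1)                                  # arbitrary seed for this illustration
print(demo_env.observation_space)                 # 2-D box: car position and velocity
print(demo_env.action_space)                      # 1-D box: push force in [-1.0, 1.0]

observation = demo_env.reset()
episode_reward = 0.
for _ in range(999):                              # the environment's step limit
    action = demo_env.action_space.sample()       # purely random policy
    observation, reward, done, _ = demo_env.step(action)
    episode_reward += reward                      # small penalty per step, bonus on reaching the flag
    if done:
        break
print('random-policy episode reward:', episode_reward)
demo_env.close()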

Notes

  1. This case runs on Pytorch-1.0.0 and requires a GPU. See the "ModelArts JupyterLab Hardware Specification Guide" for how to switch hardware specifications;

  2. If this is your first time using JupyterLab, see the "ModelArts JupyterLab User Guide" to learn how to use it;

  3. If you run into errors while using JupyterLab, see the "ModelArts JupyterLab FAQ" to troubleshoot.

Experiment Steps

1. Program Initialization

Step 1: Install basic dependencies

Make sure all dependencies are installed successfully before running the code that follows. If some modules fail to install because of network issues, simply retry.

!pip install gym
!pip install pandas
Requirement already satisfied: gym in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages

Requirement already satisfied: pyglet<=1.5.0,>=1.4.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)

Requirement already satisfied: Pillow<=7.2.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)

Requirement already satisfied: numpy>=1.10.4 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)

Requirement already satisfied: scipy in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)

Requirement already satisfied: cloudpickle<1.7.0,>=1.2.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from gym)

Requirement already satisfied: future in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pyglet<=1.5.0,>=1.4.0->gym)

You are using pip version 9.0.1, however version 21.3.1 is available.

You should consider upgrading via the 'pip install --upgrade pip' command.

Requirement already satisfied: pandas in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages

Requirement already satisfied: python-dateutil>=2 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pandas)

Requirement already satisfied: pytz>=2011k in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pandas)

Requirement already satisfied: numpy>=1.9.0 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from pandas)

Requirement already satisfied: six>=1.5 in /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages (from python-dateutil>=2->pandas)

You are using pip version 9.0.1, however version 21.3.1 is available.

You should consider upgrading via the 'pip install --upgrade pip' command.

Step 2: Import the required libraries

%matplotlib inline

import sys
import logging
import imp
import itertools
import copy
import moxing as mox

import numpy as np
np.random.seed(0)
import pandas as pd
import gym
import matplotlib.pyplot as plt
import torch
torch.manual_seed(0)
import torch.nn as nn
import torch.optim as optim

imp.reload(logging)
logging.basicConfig(level=logging.DEBUG,
        format='%(asctime)s [%(levelname)s] %(message)s',
        stream=sys.stdout, datefmt='%H:%M:%S')
INFO:root:Using MoXing-v1.17.3-

INFO:root:Using OBS-Python-SDK-3.20.7


11:28:18 [DEBUG] backend module://ipykernel.pylab.backend_inline version unknown

2. Parameter Settings

This case uses a target reward value to decide when training ends; with "target_reward" set to 90, the model has reached a fairly good level. Training takes about 10 minutes.

opt={
    "replayer" : 100000,   # replay buffer capacity
    "gama" : 0.99,         # discount factor (gamma) for Q-value estimation
    "learn_start" : 5000,  # start learning once the buffer holds this many transitions
    "a_lr" : 0.0001,       # actor learning rate
    "c_lr" : 0.001,        # critic learning rate
    "soft_lr" : 0.005,     # soft-update rate for the target networks
    "target_reward" : 90,  # target average reward per episode
    "batch_size" : 512,    # batch size for experience replay
    "mu" : 0.,             # mean of the OU noise
    "sigma" : 0.5,         # volatility of the OU noise
    "theta" : .15          # mean-reversion rate of the OU noise
}

3. Create the Game Environment

env = gym.make('MountainCarContinuous-v0')
env.seed(0)
for key in vars(env):
    logging.info('%s: %s', key, vars(env)[key])
11:28:18 [INFO] env: <Continuous_MountainCarEnv<MountainCarContinuous-v0>>

11:28:18 [INFO] action_space: Box(-1.0, 1.0, (1,), float32)

11:28:18 [INFO] observation_space: Box(-1.2000000476837158, 0.6000000238418579, (2,), float32)

11:28:18 [INFO] reward_range: (-inf, inf)

11:28:18 [INFO] metadata: {'render.modes': ['human', 'rgb_array'], 'video.frames_per_second': 30}

11:28:18 [INFO] _max_episode_steps: 999

11:28:18 [INFO] _elapsed_steps: None

4. Build the DDPG Algorithm

Build the Actor-Critic networks

class Actor(nn.Module):
    def __init__(self, nb_states, nb_actions, hidden1=400, hidden2=300):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(nb_states, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, nb_actions)
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.relu(out)
        out = self.fc3(out)
        out = self.tanh(out)
        return out

class Critic(nn.Module):
    def __init__(self, nb_states, nb_actions, hidden1=400, hidden2=300):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(nb_states, hidden1)
        self.fc2 = nn.Linear(hidden1+nb_actions, hidden2)
        self.fc3 = nn.Linear(hidden2, 1)
        self.relu = nn.ReLU()
    
    def forward(self, xs):
        x, a = xs
        out = self.fc1(x)
        out = self.relu(out)
        # debug()
        out = self.fc2(torch.cat([out,a],1))
        out = self.relu(out)
        out = self.fc3(out)
        return out
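
As a quick sanity check of the two networks (a hypothetical snippet that is not part of the original notebook, using the MountainCarContinuous dimensions of 2 state features and 1 action):

# Hypothetical smoke test: feed a dummy batch through both networks.
smoke_actor = Actor(nb_states=2, nb_actions=1)
smoke_critic = Critic(nb_states=2, nb_actions=1)
dummy_states = torch.zeros(4, 2)                          # batch of 4 states
dummy_actions = torch.zeros(4, 1)                         # batch of 4 one-dimensional actions
print(smoke_actor(dummy_states).shape)                    # torch.Size([4, 1]), bounded to (-1, 1) by tanh
print(smoke_critic([dummy_states, dummy_actions]).shape)  # torch.Size([4, 1]), unbounded Q-values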

Build the OU noise process

class OrnsteinUhlenbeckProcess:
    def __init__(self, x0):
        self.x = x0

    def __call__(self, mu=opt["mu"], sigma=opt["sigma"], theta=opt["theta"], dt=0.01):
        n = np.random.normal(size=self.x.shape)
        self.x += (theta * (mu - self.x) * dt + sigma * np.sqrt(dt) * n)
        return self.x
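
The __call__ method above is an Euler discretization of the Ornstein-Uhlenbeck process, x ← x + θ·(μ − x)·dt + σ·√dt·ε with ε ~ N(0, 1), which produces temporally correlated exploration noise that drifts back toward μ instead of i.i.d. jumps. A short illustrative snippet (hypothetical, not part of the original notebook) samples and plots it:

# Hypothetical: sample 1000 steps of OU noise to see its smooth, mean-reverting behaviour.
ou_demo = OrnsteinUhlenbeckProcess(np.zeros(1))
ou_samples = [float(ou_demo()) for _ in range(1000)]   # copy the scalar value at each step
plt.plot(ou_samples)
plt.title('OU exploration noise (mu=%.1f, sigma=%.1f, theta=%.2f)' % (opt["mu"], opt["sigma"], opt["theta"]))
plt.show()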

Build the replay buffer

class Replayer:
    def __init__(self, capacity):
        self.memory = pd.DataFrame(index=range(capacity),
                columns=['observation', 'action', 'reward',
                'next_observation', 'done'])
        self.i = 0
        self.count = 0
        self.capacity = capacity

    def store(self, *args):
        self.memory.loc[self.i] = args
        self.i = (self.i + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def sample(self, size):
        indices = np.random.choice(self.count, size=size)
        return (np.stack(self.memory.loc[indices, field]) for field in
                self.memory.columns)  
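
A brief usage sketch of the buffer defined above (hypothetical, not part of the original notebook):

# Hypothetical: store a few fake transitions and draw a mini-batch.
demo_replayer = Replayer(capacity=1000)
for _ in range(10):
    demo_replayer.store(np.zeros(2), np.zeros(1), 0.0, np.zeros(2), False)   # (s, a, r, s', done)
states, actions, rewards, next_states, dones = demo_replayer.sample(4)
print(states.shape, actions.shape, rewards.shape, dones.shape)               # (4, 2) (4, 1) (4,) (4,)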

Build the core DDPG training logic

class DDPGAgent:
    def __init__(self, env):
        state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.shape[0]
        self.action_low = env.action_space.low[0]
        self.action_high = env.action_space.high[0]
        self.gamma = 0.99
        mox.file.copy_parallel("obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/ddpg_mountaincar/model/", "model/")
        self.replayer = Replayer(opt["replayer"])
        self.actor_evaluate_net = Actor(state_dim,self.action_dim)    
        self.actor_evaluate_net.load_state_dict(torch.load('model/actor.pkl'))
        self.actor_optimizer = optim.Adam(self.actor_evaluate_net.parameters(), lr=opt["a_lr"])
        self.actor_target_net = copy.deepcopy(self.actor_evaluate_net)
        
        self.critic_evaluate_net = Critic(state_dim,self.action_dim)
        self.critic_evaluate_net.load_state_dict(torch.load('model/critic.pkl'))                                  
        self.critic_optimizer = optim.Adam(self.critic_evaluate_net.parameters(), lr=opt["c_lr"])
        self.critic_loss = nn.MSELoss()
        self.critic_target_net = copy.deepcopy(self.critic_evaluate_net)

    def reset(self, mode=None):
        self.mode = mode
        if self.mode == 'train':
            self.trajectory = []
            self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,)))

    def step(self, observation, reward, done):
        if self.mode == 'train' and self.replayer.count < opt["learn_start"]:
            action = np.random.uniform(self.action_low, self.action_high, self.action_dim)  # warm-up exploration: sample a random continuous action (randint would only yield -1 or 0 here)
        else:
            state_tensor = torch.as_tensor(observation, dtype=torch.float).reshape(1, -1)
            action_tensor = self.actor_evaluate_net(state_tensor)
            action = action_tensor.detach().numpy()[0]
        if self.mode == 'train':
            noise = self.noise(sigma=0.1)
            action = (action + noise).clip(self.action_low, self.action_high)

            self.trajectory += [observation, reward, done, action]
            if len(self.trajectory) >= 8:
                state, _, _, act, next_state, reward, done, _ = self.trajectory[-8:]
                self.replayer.store(state, act, reward, next_state, done)

            if self.replayer.count >= opt["learn_start"]:
                self.learn()
        return action

    def close(self):
        pass

    def update_net(self, target_net, evaluate_net, learning_rate=opt["soft_lr"]):
        for target_param, evaluate_param in zip(
                target_net.parameters(), evaluate_net.parameters()):
            target_param.data.copy_(learning_rate * evaluate_param.data
                    + (1 - learning_rate) * target_param.data)

    def learn(self):
        # replay
        states, actions, rewards, next_states, dones = self.replayer.sample(opt["batch_size"])
        state_tensor = torch.as_tensor(states, dtype=torch.float)
        action_tensor = torch.as_tensor(actions, dtype=torch.float)  # actions are continuous; torch.long would truncate them
        reward_tensor = torch.as_tensor(rewards, dtype=torch.float)
        dones = dones.astype(int)     
        done_tensor = torch.as_tensor(dones, dtype=torch.float)
        next_state_tensor = torch.as_tensor(next_states, dtype=torch.float)
        
        # learn critic
        next_action_tensor = self.actor_target_net(next_state_tensor)
        noise_tensor = (0.2 * torch.randn_like(action_tensor, dtype=torch.float))
        noisy_next_action_tensor = (next_action_tensor + noise_tensor).clamp(
                self.action_low, self.action_high)
        next_state_action_tensor = [next_state_tensor, noisy_next_action_tensor,]
        #print(next_state_action_tensor)
        next_q_tensor = self.critic_target_net(next_state_action_tensor).squeeze(1)
        #print(next_q_tensor)
        critic_target_tensor = reward_tensor + (1. - done_tensor) * self.gamma * next_q_tensor
        critic_target_tensor = critic_target_tensor.detach()

        state_action_tensor = [state_tensor.float(), action_tensor.float(),]
        critic_pred_tensor = self.critic_evaluate_net(state_action_tensor).squeeze(1)
        critic_loss_tensor = self.critic_loss(critic_pred_tensor, critic_target_tensor)
        self.critic_optimizer.zero_grad()
        critic_loss_tensor.backward()
        self.critic_optimizer.step()

        # learn actor
        pred_action_tensor = self.actor_evaluate_net(state_tensor)
        pred_action_tensor = pred_action_tensor.clamp(self.action_low, self.action_high)
        pred_state_action_tensor = [state_tensor, pred_action_tensor,]
        critic_pred_tensor = self.critic_evaluate_net(pred_state_action_tensor)
        actor_loss_tensor = -critic_pred_tensor.mean()
        self.actor_optimizer.zero_grad()
        actor_loss_tensor.backward()
        self.actor_optimizer.step()

        self.update_net(self.critic_target_net, self.critic_evaluate_net)
        self.update_net(self.actor_target_net, self.actor_evaluate_net)


agent = DDPGAgent(env)
11:28:18 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Ready to call (timestamp=1635305298.8828228): obsClient.getObjectMetadata

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Finish calling (timestamp=1635305299.1487627)

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Ready to call (timestamp=1635305299.1495306): obsClient.listObjects

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Finish calling (timestamp=1635305299.1665623)

11:28:19 [DEBUG] Start to copy 2 files from obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/ddpg_mountaincar/model to model.

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Ready to call (timestamp=1635305299.187472): obsClient.getObjectMetadata

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Ready to call (timestamp=1635305299.192197): obsClient.getObjectMetadata

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Finish calling (timestamp=1635305299.2651675)

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Ready to call (timestamp=1635305299.2662349): obsClient.getObject

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Finish calling (timestamp=1635305299.2906594)

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Finish calling (timestamp=1635305299.4189594)

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Ready to call (timestamp=1635305299.4199762): obsClient.getObject

11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Finish calling (timestamp=1635305299.430522)

11:28:19 [DEBUG] Copy Successfully.

5. Start Training

Training until the reward reaches 90 or above takes about 10 minutes.

def play_episode(env, agent, max_episode_steps=None, mode=None):
    observation, reward, done = env.reset(), 0., False
    agent.reset(mode=mode)
    episode_reward, elapsed_steps = 0., 0
    while True:
        action = agent.step(observation, reward, done)
        # visualization (uncomment the next line to render; requires a local display)
        # env.render()
        if done:
            break
        observation, reward, done, _ = env.step(action)
        episode_reward += reward
        elapsed_steps += 1
        if max_episode_steps and elapsed_steps >= max_episode_steps:
            break
    agent.close()
    return episode_reward, elapsed_steps


logging.info('==== train ====')
episode_rewards = []
for episode in itertools.count():
    episode_reward, elapsed_steps = play_episode(env.unwrapped, agent,
            max_episode_steps=env._max_episode_steps, mode='train')
    episode_rewards.append(episode_reward)
    logging.debug('train episode %d: reward = %.2f, steps = %d',
            episode, episode_reward, elapsed_steps)
    if episode>10 and np.mean(episode_rewards[-10:]) > opt["target_reward"]: # stop once the mean reward of the last 10 episodes exceeds the target (90)
        break
plt.plot(episode_rewards)
11:28:19 [INFO] ==== train ====


/home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages/pandas/core/internals.py:826: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray

  arr_value = np.array(value)


11:28:25 [DEBUG] train episode 0: reward = -43.16, steps = 999

11:28:32 [DEBUG] train episode 1: reward = -38.43, steps = 999

11:28:38 [DEBUG] train episode 2: reward = -51.41, steps = 999

11:28:44 [DEBUG] train episode 3: reward = -48.34, steps = 999

11:28:51 [DEBUG] train episode 4: reward = -50.17, steps = 999

11:28:56 [DEBUG] train episode 5: reward = 93.68, steps = 69

11:29:02 [DEBUG] train episode 6: reward = 93.60, steps = 67

11:29:08 [DEBUG] train episode 7: reward = 93.85, steps = 66

11:29:16 [DEBUG] train episode 8: reward = 92.45, steps = 85

11:29:22 [DEBUG] train episode 9: reward = 93.77, steps = 67

11:29:32 [DEBUG] train episode 10: reward = 91.62, steps = 103

11:29:38 [DEBUG] train episode 11: reward = 94.28, steps = 67

11:29:44 [DEBUG] train episode 12: reward = 93.65, steps = 68

11:29:50 [DEBUG] train episode 13: reward = 93.66, steps = 66

11:29:57 [DEBUG] train episode 14: reward = 93.73, steps = 65





[<matplotlib.lines.Line2D at 0x7fd0f48ef550>]

(Figure: episode reward per training episode)

6. Model Inference

Visualization in this kernel depends on OpenGL and needs a display window, but the current environment does not support pop-up windows, so the game cannot be rendered here. To see the visualization, download the code to your local machine and uncomment the env.render() line.

logging.info('==== test ====')
episode_rewards = []
for episode in range(100):
    episode_reward, elapsed_steps = play_episode(env, agent)
    episode_rewards.append(episode_reward)
    logging.debug('test episode %d: reward = %.2f, steps = %d',
            episode, episode_reward, elapsed_steps)
logging.info('average episode reward = %.2f ± %.2f',
        np.mean(episode_rewards), np.std(episode_rewards))
env.close()
11:29:57 [INFO] ==== test ====

11:29:57 [DEBUG] test episode 0: reward = 93.60, steps = 66

11:29:57 [DEBUG] test episode 1: reward = 93.45, steps = 68

11:29:57 [DEBUG] test episode 2: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 3: reward = 92.87, steps = 87

11:29:57 [DEBUG] test episode 4: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 5: reward = 93.58, steps = 66

11:29:57 [DEBUG] test episode 6: reward = 92.58, steps = 86

11:29:57 [DEBUG] test episode 7: reward = 93.46, steps = 68

11:29:57 [DEBUG] test episode 8: reward = 93.60, steps = 66

11:29:57 [DEBUG] test episode 9: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 10: reward = 93.58, steps = 66

11:29:57 [DEBUG] test episode 11: reward = 93.59, steps = 66

11:29:57 [DEBUG] test episode 12: reward = 88.82, steps = 126

11:29:57 [DEBUG] test episode 13: reward = 93.53, steps = 67

11:29:57 [DEBUG] test episode 14: reward = 93.17, steps = 72

11:29:57 [DEBUG] test episode 15: reward = 93.32, steps = 70

11:29:57 [DEBUG] test episode 16: reward = 93.40, steps = 69

11:29:57 [DEBUG] test episode 17: reward = 93.53, steps = 67

11:29:57 [DEBUG] test episode 18: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 19: reward = 93.46, steps = 68

11:29:57 [DEBUG] test episode 20: reward = 93.47, steps = 68

11:29:57 [DEBUG] test episode 21: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 22: reward = 93.39, steps = 69

11:29:57 [DEBUG] test episode 23: reward = 93.54, steps = 67

11:29:57 [DEBUG] test episode 24: reward = 93.18, steps = 72

11:29:57 [DEBUG] test episode 25: reward = 93.39, steps = 69

11:29:57 [DEBUG] test episode 26: reward = 93.66, steps = 65

11:29:57 [DEBUG] test episode 27: reward = 92.42, steps = 86

11:29:57 [DEBUG] test episode 28: reward = 93.25, steps = 71

11:29:57 [DEBUG] test episode 29: reward = 93.66, steps = 65

11:29:57 [DEBUG] test episode 30: reward = 93.17, steps = 72

11:29:57 [DEBUG] test episode 31: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 32: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 33: reward = 93.52, steps = 67

11:29:57 [DEBUG] test episode 34: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 35: reward = 91.86, steps = 120

11:29:57 [DEBUG] test episode 36: reward = 88.23, steps = 131

11:29:57 [DEBUG] test episode 37: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 38: reward = 93.31, steps = 70

11:29:57 [DEBUG] test episode 39: reward = 93.31, steps = 70

11:29:57 [DEBUG] test episode 40: reward = 93.31, steps = 70

11:29:57 [DEBUG] test episode 41: reward = 93.60, steps = 66

11:29:57 [DEBUG] test episode 42: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 43: reward = 93.66, steps = 65

11:29:57 [DEBUG] test episode 44: reward = 93.20, steps = 72

11:29:57 [DEBUG] test episode 45: reward = 93.67, steps = 65

11:29:57 [DEBUG] test episode 46: reward = 88.05, steps = 133

11:29:57 [DEBUG] test episode 47: reward = 88.83, steps = 126

11:29:58 [DEBUG] test episode 48: reward = 93.61, steps = 66

11:29:58 [DEBUG] test episode 49: reward = 93.52, steps = 67

11:29:58 [DEBUG] test episode 50: reward = 91.99, steps = 103

11:29:58 [DEBUG] test episode 51: reward = 93.58, steps = 66

11:29:58 [DEBUG] test episode 52: reward = 93.39, steps = 69

11:29:58 [DEBUG] test episode 53: reward = 93.67, steps = 65

11:29:58 [DEBUG] test episode 54: reward = 93.59, steps = 66

11:29:58 [DEBUG] test episode 55: reward = 91.40, steps = 105

11:29:58 [DEBUG] test episode 56: reward = 93.59, steps = 66

11:29:58 [DEBUG] test episode 57: reward = 92.72, steps = 86

11:29:58 [DEBUG] test episode 58: reward = 93.38, steps = 69

11:29:58 [DEBUG] test episode 59: reward = 93.45, steps = 68

11:29:58 [DEBUG] test episode 60: reward = 93.54, steps = 67

11:29:58 [DEBUG] test episode 61: reward = 93.32, steps = 70

11:29:58 [DEBUG] test episode 62: reward = 93.66, steps = 65

11:29:58 [DEBUG] test episode 63: reward = 93.58, steps = 66

11:29:58 [DEBUG] test episode 64: reward = 93.60, steps = 66

11:29:58 [DEBUG] test episode 65: reward = 93.03, steps = 87

11:29:58 [DEBUG] test episode 66: reward = 93.58, steps = 66

11:29:58 [DEBUG] test episode 67: reward = 92.55, steps = 86

11:29:58 [DEBUG] test episode 68: reward = 93.37, steps = 69

11:29:58 [DEBUG] test episode 69: reward = 93.61, steps = 66

11:29:58 [DEBUG] test episode 70: reward = 93.61, steps = 66

11:29:58 [DEBUG] test episode 71: reward = 93.44, steps = 68

11:29:58 [DEBUG] test episode 72: reward = 93.59, steps = 66

11:29:58 [DEBUG] test episode 73: reward = 93.46, steps = 68

11:29:58 [DEBUG] test episode 74: reward = 93.54, steps = 67

11:29:58 [DEBUG] test episode 75: reward = 93.31, steps = 70

11:29:58 [DEBUG] test episode 76: reward = 89.16, steps = 124

11:29:58 [DEBUG] test episode 77: reward = 92.82, steps = 77

11:29:58 [DEBUG] test episode 78: reward = 93.37, steps = 69

11:29:58 [DEBUG] test episode 79: reward = 93.60, steps = 66

11:29:58 [DEBUG] test episode 80: reward = 93.67, steps = 65

11:29:58 [DEBUG] test episode 81: reward = 93.46, steps = 68

11:29:58 [DEBUG] test episode 82: reward = 93.68, steps = 65

11:29:58 [DEBUG] test episode 83: reward = 93.54, steps = 67

11:29:58 [DEBUG] test episode 84: reward = 93.19, steps = 72

11:29:58 [DEBUG] test episode 85: reward = 92.95, steps = 87

11:29:58 [DEBUG] test episode 86: reward = 93.33, steps = 70

11:29:58 [DEBUG] test episode 87: reward = 93.60, steps = 66

11:29:58 [DEBUG] test episode 88: reward = 93.68, steps = 65

11:29:58 [DEBUG] test episode 89: reward = 92.96, steps = 75

11:29:58 [DEBUG] test episode 90: reward = 93.38, steps = 69

11:29:58 [DEBUG] test episode 91: reward = 93.59, steps = 66

11:29:58 [DEBUG] test episode 92: reward = 93.10, steps = 73

11:29:58 [DEBUG] test episode 93: reward = 93.19, steps = 72

11:29:58 [DEBUG] test episode 94: reward = 93.66, steps = 65

11:29:58 [DEBUG] test episode 95: reward = 91.91, steps = 103

11:29:58 [DEBUG] test episode 96: reward = 93.53, steps = 67

11:29:58 [DEBUG] test episode 97: reward = 93.52, steps = 67

11:29:58 [DEBUG] test episode 98: reward = 93.60, steps = 66

11:29:58 [DEBUG] test episode 99: reward = 92.68, steps = 79

11:29:58 [INFO] average episode reward = 93.12 ± 1.12

7. Visualization

The video below shows the inference result when target_reward is set to 90; the animation shows the car reaching the goal while using very little energy.
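
To reproduce such an animation on a local machine with a display, one option is to roll out a single episode while capturing frames via env.render(mode='rgb_array') and save them as a GIF. This is a hedged, local-only sketch rather than part of the original notebook; it assumes the optional imageio package is installed.

# Hypothetical local-only snippet: record one inference episode and save it as a GIF.
import imageio                                      # assumed installed locally: pip install imageio

local_env = gym.make('MountainCarContinuous-v0')
frames = []
observation, reward, done = local_env.reset(), 0., False
agent.reset(mode=None)                              # inference mode: no exploration noise, no learning
while not done:
    frames.append(local_env.render(mode='rgb_array'))
    action = agent.step(observation, reward, done)
    observation, reward, done, _ = local_env.step(action)
local_env.close()
imageio.mimsave('mountain_car.gif', frames, fps=30)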

8. Assignment

  1. Adjust the training parameters in Step 2 and retrain the model so that it performs better in the game.