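Playing "Super Mario Bros." with the PPO Algorithm

HWCloudAI, posted on 2022/11/22 20:01:21

[Abstract] In this tutorial, we use the PPO algorithm to play Super Mario Bros. So far, the agent learns to clear the vast majority of levels within 1,500 episodes. You can enter the game level and the training hyperparameters you want in the hyperparameter section.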

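Basic structure of the PPO algorithm

PPO has two main variants: PPO-Penalty and PPO-Clip (PPO2). Here we discuss PPO-Clip, the primary variant used at OpenAI. The main characteristics of PPO are as follows:

PPO is an on-policy algorithm.

PPO works with both discrete and continuous action spaces.

Loss function: the core idea of PPO-Clip is a probability ratio that measures the difference between the new and old policies. The hyperparameter ϵ clips this ratio to limit the step size of each policy update.

Policy update: the policy parameters are updated by maximizing the expected clipped objective over samples collected with the current policy.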
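For reference, the clipped objective and the update described above are usually written as follows. The notation ($r_t$, $\hat{A}_t$, $\theta_k$) is taken from the PPO paper rather than from this tutorial, so treat it as an assumed reconstruction.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]

\theta_{k+1} = \arg\max_{\theta}\; L^{\mathrm{CLIP}}(\theta)
```

Here $\theta_k$ denotes the parameters of the old policy that collected the data, and $\hat{A}_t$ is the estimated advantage at step $t$.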

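Exploration: PPO explores by sampling actions from its stochastic policy.

The advantage function measures how much better it is to take action a in state s than to take the other actions. If it is greater than 0, the current action is better than the average action; otherwise it is worse.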
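To make the interaction between the ratio, the clip range, and the sign of the advantage concrete, here is a minimal per-sample sketch in plain Python. The function name `clipped_surrogate` is hypothetical and is not part of the tutorial code; a real implementation would operate on tensors and average over a batch.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: once the ratio exceeds 1 + epsilon, the objective stops growing,
# so there is no incentive to push the new policy further from the old one.
capped = clipped_surrogate(ratio=1.5, advantage=2.0)

# Negative advantage: once the ratio drops below 1 - epsilon, the objective is clipped too.
floored = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Inside the clip range the objective is simply ratio * advantage.
unclipped = clipped_surrogate(ratio=1.1, advantage=2.0)

print(capped, floored, unclipped)
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: the clip only ever removes incentive to move further, never adds it.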

PPO paper

1. Program initialization

Step 1: Install the base dependencies

!pip install -U pip
!pip install gym==0.19.0
!pip install tqdm==4.48.0
!pip install nes-py==8.1.0
!pip install gym-super-mario-bros==7.3.2

import os
import shutil
import subprocess as sp
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.multiprocessing as _mp
from torch.distributions import Categorical
import torch.multiprocessing as mp
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym.spaces import Box
from gym import Wrapper
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT, COMPLEX_MOVEMENT, RIGHT_ONLY
import cv2
import matplotlib.pyplot as plt
from IPython import display

import moxing as mox

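2. Initialize the training parameters

You can tune the parameters in this section to obtain better training results.

opt = {
    "world": 1,                # world to play: 1, 2, 3, 4, 5, 6, 7, 8
    "stage": 1,                # stage within the world: 1, 2, 3, 4
    "action_type": "simple",   # action set: "simple", "right_only", "complex"
    'lr': 1e-4,                # suggested learning rates: 1e-3, 1e-4, 1e-5, 7e-5
    'gamma': 0.9,              # reward discount factor
    'tau': 1.0,                # GAE parameter
    'beta': 0.01,              # entropy coefficient
    'epsilon': 0.2,            # PPO clip coefficient
    'batch_size': 16,          # batch_size for experience replay
    'max_episode': 10,         # maximum number of training episodes
    'num_epochs': 10,          # number of times each experience is replayed
    "num_local_steps": 512,    # maximum number of steps per episode
    "num_processes": 8,        # number of training processes, usually the number of CPU cores on the training machine
    "save_interval": 5,        # save the model every {} episodes
    "log_path": "./log",       # directory for logs
    "saved_path": "./model",   # directory for saved models
    "pretrain_model": True,    # whether to load the pretrained model; one is currently provided only for level 1-1, other levels must be trained from scratch
    "episode": 5               # episode number of the checkpoint that infer() loads
}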
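The `gamma` and `tau` entries are the discount factor and the GAE parameter. This tutorial does not show how they are consumed, so the following is only a sketch that assumes the standard GAE recursion, with `tau` playing the role usually written as lambda. The function `gae_advantages` and its arguments are hypothetical, and episode boundaries (done flags) are ignored for brevity.

```python
def gae_advantages(rewards, values, next_value, gamma=0.9, tau=1.0):
    """Generalized advantage estimates, computed backwards over one rollout.

    rewards[t] and values[t] are the reward and critic value at step t;
    next_value is the critic value of the state after the last step.
    """
    advantages = []
    gae = 0.0
    for reward, value in zip(reversed(rewards), reversed(values)):
        # One-step TD error at this step.
        delta = reward + gamma * next_value - value
        # Exponentially weighted sum of this and all later TD errors.
        gae = delta + gamma * tau * gae
        advantages.append(gae)
        next_value = value
    advantages.reverse()
    return advantages


# With tau = 1.0 the estimate reduces to (discounted return - value).
print(gae_advantages(rewards=[1.0, 1.0], values=[0.0, 0.0], next_value=0.0, gamma=0.5))
```

Lower values of `tau` rely more on the critic and reduce variance at the cost of bias, while `tau = 1.0` (the default in the table above) uses the full discounted return.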

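3. Create the environment

Termination conditions:

Win: Mario reaches the end of the level
Loss: Mario is hurt by an enemy, falls off a cliff, or runs out of time

Reward function:

Points gained: collecting coins, stomping enemies, capturing the flag at the end
Points lost: being hurt by an enemy, falling off a cliff, ending without capturing the flag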
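The arithmetic that the `CustomReward` wrapper in this section applies to each step can be summarized in a small standalone function. This is a simplified sketch: `shaped_reward` is a hypothetical name, and the extra early-termination rules for levels 4-4 and 7-4 are left out.

```python
def shaped_reward(env_reward, score_delta, done, flag_get):
    """Mirror the general-case reward shaping of CustomReward.step.

    env_reward:  reward returned by the underlying environment
    score_delta: increase of info["score"] since the previous step
    done:        whether the episode ended on this step
    flag_get:    whether Mario captured the flag
    """
    reward = env_reward + score_delta / 40.0
    if done:
        # Large bonus for finishing the level, large penalty for any other ending.
        reward += 50 if flag_get else -50
    # Scale down so that rewards stay in a small numeric range.
    return reward / 10.0


print(shaped_reward(env_reward=0, score_delta=200, done=True, flag_get=True))    # finished the level
print(shaped_reward(env_reward=0, score_delta=0, done=True, flag_get=False))     # died or timed out
print(shaped_reward(env_reward=1, score_delta=0, done=False, flag_get=False))    # ordinary step
```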

# Create the environment
def create_train_env(world, stage, actions, output_path=None):
    # Create the base environment
    env = gym_super_mario_bros.make("SuperMarioBros-{}-{}-v0".format(world, stage))

    env = JoypadSpace(env, actions)
    # Apply the custom wrappers
    env = CustomReward(env, world, stage, monitor=None)
    env = CustomSkipFrame(env)
    return env


# Modify the original environment to obtain better training results
class CustomReward(Wrapper):
    def __init__(self, env=None, world=None, stage=None, monitor=None):
        super(CustomReward, self).__init__(env)
        self.observation_space = Box(low=0, high=255, shape=(1, 84, 84))
        self.curr_score = 0
        self.current_x = 40
        self.world = world
        self.stage = stage
        if monitor:
            self.monitor = monitor
        else:
            self.monitor = None

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        if self.monitor:
            self.monitor.record(state)
        state = process_frame(state)
        reward += (info["score"] - self.curr_score) / 40.
        self.curr_score = info["score"]
        if done:
            if info["flag_get"]:
                reward += 50
            else:
                reward -= 50
        if self.world == 7 and self.stage == 4:
            if (506 <= info["x_pos"] <= 832 and info["y_pos"] > 127) or (
                    832 < info["x_pos"] <= 1064 and info["y_pos"] < 80) or (
                    1113 < info["x_pos"] <= 1464 and info["y_pos"] < 191) or (
                    1579 < info["x_pos"] <= 1943 and info["y_pos"] < 191) or (
                    1946 < info["x_pos"] <= 1964 and info["y_pos"] >= 191) or (
                    1984 < info["x_pos"] <= 2060 and (info["y_pos"] >= 191 or info["y_pos"] < 127)) or (
                    2114 < info["x_pos"] < 2440 and info["y_pos"] < 191) or info["x_pos"] < self.current_x - 500:
                reward -= 50
                done = True
        if self.world == 4 and self.stage == 4:
            if (info["x_pos"] <= 1500 and info["y_pos"] < 127) or (
                    1588 <= info["x_pos"] < 2380 and info["y_pos"] >= 127):
                reward = -50
                done = True

        self.current_x = info["x_pos"]
        return state, reward / 10., done, info

    def reset(self):
        self.curr_score = 0
        self.current_x = 40
        return process_frame(self.env.reset())


class MultipleEnvironments:
    def __init__(self, world, stage, action_type, num_envs, output_path=None):
        self.agent_conns, self.env_conns = zip(*[mp.Pipe() for _ in range(num_envs)])
        if action_type == "right_only":
            actions = RIGHT_ONLY
        elif action_type == "simple":
            actions = SIMPLE_MOVEMENT
        else:
            actions = COMPLEX_MOVEMENT
        self.envs = [create_train_env(world, stage, actions, output_path=output_path) for _ in range(num_envs)]
        self.num_states = self.envs[0].observation_space.shape[0]
        self.num_actions = len(actions)
        for index in range(num_envs):
            process = mp.Process(target=self.run, args=(index,))
            process.start()
            self.env_conns[index].close()

    def run(self, index):
        self.agent_conns[index].close()
        while True:
            request, action = self.env_conns[index].recv()
            if request == "step":
                self.env_conns[index].send(self.envs[index].step(action.item()))
            elif request == "reset":
                self.env_conns[index].send(self.envs[index].reset())
            else:
                raise NotImplementedError


def process_frame(frame):
    if frame is not None:
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        frame = cv2.resize(frame, (84, 84))[None, :, :] / 255.
        return frame
    else:
        return np.zeros((1, 84, 84))
    

class CustomSkipFrame(Wrapper):
    def __init__(self, env, skip=4):
        super(CustomSkipFrame, self).__init__(env)
        self.observation_space = Box(low=0, high=255, shape=(skip, 84, 84))
        self.skip = skip
        self.states = np.zeros((skip, 84, 84), dtype=np.float32)

    def step(self, action):
        total_reward = 0
        last_states = []
        for i in range(self.skip):
            state, reward, done, info = self.env.step(action)
            total_reward += reward
            if i >= self.skip / 2:
                last_states.append(state)
            if done:
                self.reset()
                return self.states[None, :, :, :].astype(np.float32), total_reward, done, info
        max_state = np.max(np.concatenate(last_states, 0), 0)
        self.states[:-1] = self.states[1:]
        self.states[-1] = max_state
        return self.states[None, :, :, :].astype(np.float32), total_reward, done, info

    def reset(self):
        state = self.env.reset()
        self.states = np.concatenate([state for _ in range(self.skip)], 0)
        return self.states[None, :, :, :].astype(np.float32)
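The frame handling inside `CustomSkipFrame.step` can be hard to read at first: the same action is repeated `skip` times, only frames from the second half of that window are kept, and their element-wise maximum becomes the newest entry of the frame stack. The sketch below reproduces only that selection and max step on flat lists of pixel values; `skip_and_max` is a hypothetical helper, not part of the tutorial code.

```python
def skip_and_max(frames, skip=4):
    """Element-wise max over the frames from the second half of one skip window.

    frames is a list of `skip` frames, each a flat list of pixel values.
    """
    # Same condition as in CustomSkipFrame.step: keep frames with i >= skip / 2.
    kept = [frame for i, frame in enumerate(frames[:skip]) if i >= skip / 2]
    # zip(*kept) groups the values of the same pixel across the kept frames.
    return [max(pixels) for pixels in zip(*kept)]


window = [[9, 9], [8, 8], [1, 5], [4, 2]]
print(skip_and_max(window))  # only the last two frames contribute
```

Taking the maximum over consecutive frames is a common way to reduce flicker, since sprites on NES-era hardware are not always drawn on every frame.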

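4. Define the neural network

The network consists of four convolutional layers followed by one fully connected layer. The extracted features feed the critic and actor layers, which output the state value and the action logits (turned into an action probability distribution by softmax) respectively.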

class Net(nn.Module):
    def __init__(self, num_inputs, num_actions):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.linear = nn.Linear(32 * 6 * 6, 512)
        self.critic_linear = nn.Linear(512, 1)
        self.actor_linear = nn.Linear(512, num_actions)
        self._initialize_weights()

    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.orthogonal_(module.weight, nn.init.calculate_gain('relu'))
                nn.init.constant_(module.bias, 0)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.linear(x.view(x.size(0), -1))
        return self.actor_linear(x), self.critic_linear(x)
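The `32 * 6 * 6` input size of `self.linear` follows from applying four 3x3 convolutions with stride 2 and padding 1 to an 84x84 input. The short check below uses the standard convolution output-size formula; `conv_out` is a hypothetical helper written for this illustration.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 84
sizes = [size]
for _ in range(4):  # conv1 .. conv4 all use the same kernel, stride and padding
    size = conv_out(size)
    sizes.append(size)

print(sizes)             # spatial size after each layer
print(32 * size * size)  # flattened feature count fed to self.linear
```

If you change the input resolution or the number of convolutional layers, the first argument of `nn.Linear` has to be recomputed in the same way.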

6. Train the model

Training for 10 episodes takes about 5 minutes.

train(opt)

Loading the pretrained model
Episode: 1. Total loss: 1.1230244636535645
Episode: 2. Total loss: 2.553663730621338
Episode: 3. Total loss: 1.768389344215393
Episode: 4. Total loss: 1.6962862014770508
Episode: 5. Total loss: 1.0912611484527588
Episode: 6. Total loss: 1.6626232862472534
Episode: 7. Total loss: 1.9952025413513184
Episode: 8. Total loss: 1.2410558462142944
Episode: 9. Total loss: 1.3711413145065308
Episode: 10. Total loss: 1.2155205011367798

7. Run inference with the trained model

Define the inference function

def infer(opt):
    if torch.cuda.is_available():
        torch.cuda.manual_seed(123)
    else:
        torch.manual_seed(123)
    if opt['action_type'] == "right_only":
        actions = RIGHT_ONLY
    elif opt['action_type'] == "simple":
        actions = SIMPLE_MOVEMENT
    else:
        actions = COMPLEX_MOVEMENT
    env = create_train_env(opt['world'], opt['stage'], actions)
    model = Net(env.observation_space.shape[0], len(actions))
    if torch.cuda.is_available():
        model.load_state_dict(torch.load("{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'],opt['world'], opt['stage'],opt['episode'])))
        model.cuda()
    else:
        model.load_state_dict(torch.load("{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'], opt['world'], opt['stage'],opt['episode']),
                                         map_location=torch.device('cpu')))
    model.eval()
    state = torch.from_numpy(env.reset())
    
    plt.figure(figsize=(10,10))
    img = plt.imshow(env.render(mode='rgb_array'))
    
    while True:
        if torch.cuda.is_available():
            state = state.cuda()
        logits, value = model(state)
        policy = F.softmax(logits, dim=1)
        action = torch.argmax(policy).item()
        state, reward, done, info = env.step(action)
        state = torch.from_numpy(state)
        
        img.set_data(env.render(mode='rgb_array')) # just update the data
        display.display(plt.gcf())
        display.clear_output(wait=True)
        
        if info["flag_get"]:
            print("World {} stage {} completed".format(opt['world'], opt['stage']))
            break
            
        if done and info["flag_get"] is False:
            print('Game Failed')
            break

infer(opt)
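During inference the agent acts greedily: `infer` applies softmax to the actor logits and picks the action with the highest probability, instead of sampling as it would during training. The plain-Python sketch below illustrates that selection step; `softmax` and `greedy_action` are hypothetical helpers that stand in for `F.softmax` and `torch.argmax`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(logits):
    """Index of the most probable action under softmax(logits)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


logits = [0.1, 2.0, -1.0]
print(greedy_action(logits))
```

Because softmax preserves the ordering of its inputs, the greedy action is the same as the index of the largest logit; the softmax matters only when actions are sampled.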

8. Exercise

Adjust the training parameters from step 2 and retrain a model so that it performs better in the game.
