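Starting a Training Job from an Image Saved in Notebook
Key points:
1. A Notebook image contains multiple conda environments and multiple Python interpreters, so you need to locate the Python environment you actually intend to use.
2. A training job sets a series of environment variables; be comfortable using shell commands such as 'env' to check the state of the environment.
3. Training can automatically fetch code from OBS and map it to the corresponding directory in the container.
4. Training can automatically fetch training data from OBS and append its location to the command as an argument such as '--data-path'.
Migration steps:
1. Notebook part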
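First, because a Notebook contains multiple conda environments, confirm the path of the Python interpreter in the environment where you validated your code.
Method: enter "!which python" in an ipynb cell and record the resulting path, as shown in the screenshot.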
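As a cross-check, note that "!which python" reports whatever `python` resolves to on the shell's PATH, which may not be the interpreter the current kernel runs on. The following is a minimal sketch that prints the kernel's own interpreter path from inside a cell. The helper name `interpreter_info` is hypothetical, and `CONDA_DEFAULT_ENV` is assumed to be set only when a conda environment has been activated.

```python
import os
import sys


def interpreter_info():
    """Return the interpreter path and conda env name of the current process."""
    return {
        # Absolute path of the Python binary running this kernel or script.
        "executable": sys.executable,
        # Name of the active conda env, or None if conda did not set it.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = interpreter_info()
print(info["executable"])
print(info["conda_env"])
```

If the two methods disagree, the path printed here is the one the notebook code was actually validated with.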
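Next, use the command line to confirm that the job's command and parameters actually run.
Method: in a terminal, launch the job using the absolute path of the Python interpreter; the command must be the same as the one used in the training job. Note that the Python script must be able to parse an argument of the form "--data-path=xxx". Details are shown in the screenshot: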
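One way to satisfy the "--data-path=xxx" requirement is with the standard-library `argparse` module. The sketch below is illustrative only: the function name `build_parser` and the sample path `/tmp/demo-data` are hypothetical, and a real script would define its other options alongside this one.

```python
import argparse


def build_parser():
    """Build a parser that accepts the --data-path argument."""
    parser = argparse.ArgumentParser(description="hypothetical training entry point")
    # argparse exposes --data-path as args.data_path; both "--data-path=xxx"
    # and "--data-path xxx" forms are accepted.
    parser.add_argument("--data-path", type=str, default=None,
                        help="local directory holding the training data")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--data-path=/tmp/demo-data"])
print(args.data_path)
```

In the real script, calling `parse_args()` with no argument list reads from the command line.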
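Finally, save the Notebook image (saving before validation is recommended, so that intermediate files do not bloat the image). The procedure is shown in the screenshot: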
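Note!
The '/home/ma-user/work' and '/data' directories of a Notebook instance are mounted volumes, so their contents cannot be saved into the image.
A training job mounts the local disk (which has more space) at '/cache', so anything the image holds at that path may be overwritten.
The built image must be smaller than 10G; otherwise it may fail to start. Do not put data in the image; start from a preset Notebook image and install only the Python libraries you need.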
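2. Creating the training job
First, this capability is available by whitelist only. If your account does not have it, ask your Huawei Cloud contact to apply. The creation procedure is shown in the screenshot:
Next, you can prepend special commands to the job's command and inspect the running instance in the logs. For example, "whoami && env && " prints the current user name and all environment variables, and "ls -l xxx &&" lists the files in the given directory.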
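The same information can also be collected from inside the training script, which helps when the start command is hard to edit. This is a minimal sketch of that idea, not part of ModelArts: the helper name `describe_runtime` is hypothetical, and the directory passed in is whatever you want to inspect.

```python
import getpass
import os


def describe_runtime(directory):
    """Collect the user name, environment variables, and a directory listing."""
    try:
        user = getpass.getuser()
    except Exception:
        # Some containers have no passwd entry for the current uid.
        user = "unknown"
    lines = ["user: " + user]
    # Sorted so the same variables appear in the same place across runs.
    for name in sorted(os.environ):
        lines.append("env: {}={}".format(name, os.environ[name]))
    if os.path.isdir(directory):
        for entry in sorted(os.listdir(directory)):
            lines.append("file: " + os.path.join(directory, entry))
    else:
        lines.append("missing directory: " + directory)
    return lines


# Print everything to stdout so that it ends up in the job log.
for line in describe_runtime(os.getcwd()):
    print(line)
```

Because this dumps every environment variable into the log, it is best used for debugging only and removed once the environment is confirmed.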
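The OBS path configured as the code directory is synced to "${MA_JOB_DIR}/code/" in the training container, and the training input directory is synced to "/home/ma-user/modelarts/inputs/xxx_0". The actual path can be found in the logs. A sample log follows: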
[ModelArts Service Log][INFO][2022/02/24 22:26:04]: caching the content of [data-path] inputs
[ModelArts Service Log]2022-02-24 22:26:04,849 - modelarts-downloader.py[line:264] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=True, src='s3://ma-sa/yangzilong/demos/dog_cat_split/dog_cat_1w/', trace=False, type='common', verbose=False)
[ModelArts Service Log]2022-02-24 22:26:05,204 - file_io.py[line:1276] - INFO: Listing OBS: 1000
[ModelArts Service Log]2022-02-24 22:26:05,324 - file_io.py[line:1276] - INFO: Listing OBS: 2000
[ModelArts Service Log]2022-02-24 22:26:05,432 - file_io.py[line:1276] - INFO: Listing OBS: 3000
[ModelArts Service Log]2022-02-24 22:26:05,534 - file_io.py[line:1276] - INFO: Listing OBS: 4000
[ModelArts Service Log]2022-02-24 22:26:05,636 - file_io.py[line:1276] - INFO: Listing OBS: 5000
[ModelArts Service Log]2022-02-24 22:26:05,733 - file_io.py[line:1276] - INFO: Listing OBS: 6000
[ModelArts Service Log]2022-02-24 22:26:05,839 - file_io.py[line:1276] - INFO: Listing OBS: 7000
[ModelArts Service Log]2022-02-24 22:26:05,943 - file_io.py[line:1276] - INFO: Listing OBS: 8000
[ModelArts Service Log]2022-02-24 22:26:06,050 - file_io.py[line:1276] - INFO: Listing OBS: 9000
[ModelArts Service Log]2022-02-24 22:26:06,059 - file_io.py[line:1276] - INFO: Listing OBS: 10000
[ModelArts Service Log]2022-02-24 22:26:06,061 - consumer.py[line:107] - INFO: MoXing Local Track Mode.
[ModelArts Service Log]2022-02-24 22:28:29,179 - file_io.py[line:1276] - INFO: Listing OBS: 1000
[ModelArts Service Log]2022-02-24 22:28:29,261 - file_io.py[line:1276] - INFO: Listing OBS: 2000
[ModelArts Service Log]2022-02-24 22:28:29,351 - file_io.py[line:1276] - INFO: Listing OBS: 3000
[ModelArts Service Log]2022-02-24 22:28:29,438 - file_io.py[line:1276] - INFO: Listing OBS: 4000
[ModelArts Service Log]2022-02-24 22:28:29,527 - file_io.py[line:1276] - INFO: Listing OBS: 5000
[ModelArts Service Log]2022-02-24 22:28:29,609 - file_io.py[line:1276] - INFO: Listing OBS: 6000
[ModelArts Service Log]2022-02-24 22:28:29,691 - file_io.py[line:1276] - INFO: Listing OBS: 7000
[ModelArts Service Log]2022-02-24 22:28:29,781 - file_io.py[line:1276] - INFO: Listing OBS: 8000
[ModelArts Service Log]2022-02-24 22:28:29,862 - file_io.py[line:1276] - INFO: Listing OBS: 9000
[ModelArts Service Log]2022-02-24 22:28:29,869 - file_io.py[line:1276] - INFO: Listing OBS: 10000
[ModelArts Service Log]2022-02-24 22:28:30,940 - file_io.py[line:2465] - INFO: pid: None. 1000/10000
[ModelArts Service Log]2022-02-24 22:28:31,829 - file_io.py[line:2465] - INFO: pid: None. 2000/10000
[ModelArts Service Log]2022-02-24 22:28:32,715 - file_io.py[line:2465] - INFO: pid: None. 3000/10000
[ModelArts Service Log]2022-02-24 22:28:33,672 - file_io.py[line:2465] - INFO: pid: None. 4000/10000
[ModelArts Service Log]2022-02-24 22:28:34,569 - file_io.py[line:2465] - INFO: pid: None. 5000/10000
[ModelArts Service Log]2022-02-24 22:28:35,485 - file_io.py[line:2465] - INFO: pid: None. 6000/10000
[ModelArts Service Log]2022-02-24 22:28:36,423 - file_io.py[line:2465] - INFO: pid: None. 7000/10000
[ModelArts Service Log]2022-02-24 22:28:37,363 - file_io.py[line:2465] - INFO: pid: None. 8000/10000
[ModelArts Service Log]2022-02-24 22:28:38,228 - file_io.py[line:2465] - INFO: pid: None. 9000/10000
[ModelArts Service Log]2022-02-24 22:28:39,102 - file_io.py[line:2465] - INFO: pid: None. 10000/10000
[ModelArts Service Log][INFO][2022/02/24 22:28:39]: cache the content of [data-path] inputs successfully
[ModelArts Service Log][INFO][2022/02/24 22:28:39]: it can be accessed at local dir [/home/ma-user/modelarts/inputs/data-path_0]
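When a job has several inputs, it can be convenient to pull the local paths out of a saved log rather than read them by eye. The sketch below assumes the log line keeps the exact wording shown in the sample above ("it can be accessed at local dir [...]"); the names `LOCAL_DIR_PATTERN` and `find_local_input_dirs` are hypothetical.

```python
import re

# Matches the "it can be accessed at local dir [...]" line shown in the log above.
LOCAL_DIR_PATTERN = re.compile(r"it can be accessed at local dir \[([^\]]+)\]")


def find_local_input_dirs(log_text):
    """Return every local input directory announced in the log text."""
    return LOCAL_DIR_PATTERN.findall(log_text)


sample = (
    "[ModelArts Service Log][INFO][2022/02/24 22:28:39]: "
    "it can be accessed at local dir [/home/ma-user/modelarts/inputs/data-path_0]"
)
print(find_local_input_dirs(sample))
```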
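Note: if a training input is configured, the training environment by default appends an argument to the end of the command (for example, " --data-path=xxxxx"). Make sure your script either handles this argument or ignores it. This behavior may later change so that the path is supplied through an environment variable instead.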
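If the script has no use for the appended argument, one way to keep it from causing an "unrecognized arguments" error is `argparse`'s `parse_known_args`, which returns unrecognised options instead of exiting. The sketch below is an illustration under that assumption; the function name `parse_training_args` is hypothetical and `--epochs` stands in for the script's real options.

```python
import argparse


def parse_training_args(argv):
    """Parse the script's own options and set aside anything unrecognised."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    # parse_known_args returns (namespace, leftovers) instead of exiting
    # when it meets an option the parser does not define.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_training_args(
    ["--epochs", "5", "--data-path=/home/ma-user/modelarts/inputs/data-path_0"]
)
print(args.epochs)
print(unknown)
```

The alternative is to declare `--data-path` explicitly, which is preferable whenever the script does read the training data from that directory.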
Finally, a command that runs successfully is:
/home/ma-user/anaconda3/envs/PyTorch-1.8/bin/python ${MA_JOB_DIR}/code/main.py -a resnet50 -b 128 --epochs 5
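Because this job has an input directory configured, the training job automatically appends "--data-path=/home/ma-user/modelarts/inputs/data-path_0" to the command. (This behavior may be revised later.)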