【ModelArts Course7-1】Custom Image to Train & Deploy from GitHub

Emma_Liu | Posted on 2023/03/14 19:32:02
Abstract: This article explains how to use an open-source algorithm from GitHub to develop on the Huawei Cloud ModelArts platform in the cloud. - Part 1

ModelArts - Using a Custom Image to Train and Deploy a Model from GitHub

This article explains how to use an open-source algorithm from GitHub to develop on the Huawei Cloud ModelArts platform in the cloud. The following mind map shows the main steps of the development process.

image

Part 1: Create Training Job Using Custom Image

The following sections explain each step in detail.

1. Prepare the Python Environment

Click this link to open the ModelArts management console, click DevEnviron -> Notebook to enter the notebook list page, then click Create in the upper left corner, and set the parameters as shown below.

image

After configuring the parameters, click Next, confirm the product specifications, and then click Submit to create the notebook.

Return to the notebook list page. After the status of the new notebook changes to Running, click Open in the Operation column to access it.

On the notebook page, click Launcher -> Terminal, as shown below.

image

You can run the command conda info -e to view the information about the installed Python environment.

Click this link: CodeHub - DINO. The following uses this open-source algorithm as an example to show how to quickly run it in a Huawei Cloud notebook. For more information about the algorithm, see README.md.

  1. Run the following command in the terminal to clone the repository:

    git clone https://codehub.devcloud.cn-north-4.huaweicloud.com/DINO00002/DINO.git
    cd DINO
    

    image

    As shown above, the code clone is complete. Click the refresh button at the top left of the file browser to view the code.

  2. Check the PyTorch version:

    pip list | grep torch
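
The same check can be run from Python; this small sketch uses only the standard library to report the installed torch version, if any.

```python
from importlib.metadata import version, PackageNotFoundError

def torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return version('torch')
    except PackageNotFoundError:
        return None

print(torch_version() or 'torch is not installed')
```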
    
  3. Install the other required packages:

    pip install -r requirements.txt
    
  4. Compile the CUDA operators:

    cd models/dino/ops
    python setup.py build install
    # Unit test (all numerical gradient checks should be True)
    python test.py
    cd ../../..  # Back to the code home directory
    

    image
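
The "numerical gradient" check that test.py performs compares analytic gradients against finite-difference estimates. Stripped of torch and of the real deformable-attention operator, the underlying idea is a central-difference check like this illustrative sketch (the toy function and its gradient are stand-ins, not the real operator):

```python
def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Toy stand-ins for the compiled operator and its analytic gradient.
f = lambda x: x ** 3
grad_f = lambda x: 3 * x ** 2

x = 1.5
ok = abs(numerical_grad(f, x) - grad_f(x)) < 1e-4
print(ok)  # the unit test expects every such check to report True
```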

2. Prepare the Data and Pretrained Parameter Files

  1. Return to the Huawei Cloud ModelArts management console, move the cursor to the left sidebar, and in the menu that pops up, click Service List -> Storage -> Object Storage Service, as shown below.

    image

    Click Create Bucket. The Create Bucket page is displayed.

    image

    Create the OBS bucket with the following parameters:
    ● Region: AP-Singapore
    ● Bucket Name: user-defined, which will be used in subsequent steps.
    ● Data Redundancy Policy: Single-AZ storage
    ● Default Storage Class: Standard
    ● Bucket Policy: Private
    ● Default Encryption: Keep Default

    Click Create Now -> OK to complete the creation of the bucket.

    image

  2. We will use a subset of COCO 2017. This micro-dataset includes 100 training images, 100 validation images, and the annotation files.

    COCO2017_subset100/
      ├── train2017/
      ├── val2017/
      └── annotations/
      	├── instances_train2017.json
      	└── instances_val2017.json
    

    It is stored in a public-read OBS bucket at obs://modelarts-case-dev-sg/opensource/coco/coco2017_subset/.
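
After copying, the layout shown above can be sanity-checked with a short sketch (EXPECTED and missing_entries are illustrative helpers, not part of the DINO repository):

```python
from pathlib import Path

# Paths that should exist under the dataset root, per the tree above.
EXPECTED = [
    'train2017',
    'val2017',
    'annotations/instances_train2017.json',
    'annotations/instances_val2017.json',
]

def missing_entries(root):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

An empty result from missing_entries('./coco2017_subset') means the subset is laid out as expected.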

    Return to the notebook page, create an .ipynb file, copy the following code into a cell, fill in the two paths, and run it. After the execution completes, click the refresh button above the file browser to view the imported dataset, as shown below.

    import moxing as mox
    mox.file.copy_parallel('${obs_path}', '${notebook_path}')
    

    NOTICE:

    ● ${obs_path} indicates the location of the public-read OBS dataset.
    ● ${notebook_path} indicates the dataset storage path (./coco2017_subset) in the notebook, at the same level as the DINO directory.

    image

  3. Open the terminal and run the following command to download the DINO model checkpoint0011_4scale.pth. After the download completes, click the refresh button on the left to view the ckpts folder containing the downloaded checkpoint.

    wget -P ckpts https://sandbox-expriment-files.obs.cn-north-1.myhuaweicloud.com:443/20221228/checkpoint0011_4scale.pth
    

    image

3. Evaluate the Pretrained Model

Run the following command to evaluate the pretrained model. You can expect a final AP of about 49.

bash scripts/DINO_eval.sh /path/to/your/COCODIR /path/to/your/checkpoint

NOTICE:

/path/to/your/COCODIR is the path where the dataset is stored in the notebook.

/path/to/your/checkpoint is the path where the model checkpoint is stored in the notebook.

As shown below:

image

The following figure shows the running result.

image
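
If you prefer to assemble the evaluation command programmatically (for example, to reuse it across notebooks), a sketch is below; eval_command is a hypothetical helper, and the example paths match the ones used earlier in this walkthrough.

```python
import shlex

def eval_command(coco_dir, checkpoint):
    """Build the DINO_eval.sh invocation from the two notebook paths."""
    return f'bash scripts/DINO_eval.sh {shlex.quote(coco_dir)} {shlex.quote(checkpoint)}'

print(eval_command('./coco2017_subset', 'ckpts/checkpoint0011_4scale.pth'))
# bash scripts/DINO_eval.sh ./coco2017_subset ckpts/checkpoint0011_4scale.pth
```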

4. Save the Image

In 1. Prepare the Python Environment, we created a notebook instance from a built-in system image and installed custom dependencies on it. Now save the running instance as a container image.

In the saved image, the installed dependency packages are retained, but the data stored in /home/ma-user/work (the persistent storage directory) is not. For remote development through VS Code, plug-ins installed on the server are retained in the saved image.

  1. Return to the ModelArts management console and choose DevEnviron -> Notebook in the navigation pane on the left to switch to the new-version notebook page.

  2. In the notebook list, select the target notebook instance and choose Save Image from the More drop-down list in the Operation column. The Save Image dialog box is displayed.

    image

  3. In the Save Image dialog box, configure parameters. Click OK to save the image.

    image

    Choose an organization from the Organization drop-down list. If no organization is available, click Create on the right to create one. Users in an organization can share all images in the organization. For details about how to create an organization, refer to Creating an Organization.

  4. The image will be saved as a snapshot, which takes about 5 minutes. During this time, do not perform any operations on the instance. (You can still perform operations on the opened JupyterLab page and in the local IDE.)

    image

    NOTICE:

    The time required for saving an image as a snapshot will be counted in the instance running duration. If the instance running duration expires before the snapshot is saved, saving the image will fail.

  5. After the image is saved, the instance status changes to Running. View the image on the Image Management page.

  6. Click the name of the image to view its details including the image version/ID, status, resource type, image size, and SWR address.

  7. Choose Image Management in the navigation pane on the left to view the image list and details, as shown in the following figure.

    image

5. Upload Dataset and Training Code

Return to the notebook page and enter the following code in the .ipynb file created earlier to upload the coco2017_subset dataset to the OBS bucket.

mox.file.copy_parallel('coco2017_subset', 'obs://${your bucket_name}/coco2017_subset')

NOTICE:

● ${your bucket_name} indicates the name of the OBS bucket created earlier.

Insert a cell below and enter the following code to upload the code directory to the OBS bucket.

mox.file.copy_parallel("./DINO/","obs://${your bucket_name}/DINO")

image

6. Create Training Job

  1. Choose Training Management -> Training Jobs in the navigation pane on the left to open the training job list page, and click Create Training Job in the upper right corner, as shown below.

    image

  2. Parameter Configuration

    Created By: Custom algorithms

    Boot Mode: Custom images

    Image: Click Select. In the Select Image dialog box, select the image you created and its version, and then click OK.

    image

    Code Directory: Click Select on the right, select the DINO code directory uploaded to OBS, and then click OK.

    image

    Boot Command: the command for booting the image, executed automatically after the code directory is downloaded. If the training startup script is a .py file, train.py for example, the boot command can be python train.py. If it is a .sh file, main.sh for example, the boot command can be bash main.sh. Semicolons (;) and ampersands (&&) can be used to combine multiple boot commands, but line breaks are not supported.

    python main.py -c config/DINO/DINO_4scale.py --options dn_scalar=100 embed_init_tgt=TRUE dn_label_coef=1.0 dn_bbox_coef=1.0 use_ema=False dn_box_noise_scale=1.0
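
The .py/.sh rule described above can be restated in code; boot_command is purely illustrative, not a ModelArts API.

```python
def boot_command(script):
    """Map a startup script to its boot command per the rule above."""
    if script.endswith('.py'):
        return f'python {script}'
    if script.endswith('.sh'):
        return f'bash {script}'
    raise ValueError(f'unsupported startup script: {script}')

print(boot_command('train.py'))  # python train.py
print(boot_command('main.sh'))   # bash main.sh
```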
    

    Local Code Directory: Default

    Work Directory: click Select -> Yes

    image

    Input: Select the data for training. It will be downloaded to the training container, and the data path will then be parsed from the job parameters.

    Click Add Training Input, set the input name to coco_path, click Data path, choose the coco2017_subset dataset path in OBS, and set Obtained from to Hyperparameters.

    Output: Select an OBS path for storing the training output. An empty directory is recommended.

    Click Add Training Output, set the output name to output_dir, click Data path -> Create Folder to create a folder named training_output in OBS for storing the training output, and set Obtained from to Hyperparameters.

    image
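
Because both channels are obtained from hyperparameters, the training script receives them as ordinary command-line options. This sketch shows the matching argparse arguments (DINO's main.py already defines --coco_path and --output_dir; the container paths below are illustrative examples, not guaranteed ModelArts paths):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--coco_path', type=str)   # filled from the coco_path input
parser.add_argument('--output_dir', type=str)  # filled from the output_dir output

# Simulate the options appended to the boot command at job start.
args = parser.parse_args(['--coco_path', '/home/ma-user/input/coco_path',
                          '--output_dir', '/home/ma-user/output/output_dir'])
print(args.coco_path, args.output_dir)
```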

    Resource Pool: Public resource pools

    Resource Type: GPU

    Instance Flavor: GPU: 1*NVIDIA-V100-pcie-32gb(32GB) | CPU: 8 vCPUs 64GB 780GB

    Compute Nodes: 1

    Persistent Log Saving: Optional

    • Job Log Path: select an empty OBS path for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.

    Event Notification: Optional. Whether to subscribe to event notifications. After this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via an SMS or email. If you enable this function, configure the following parameters as required:

    • Topic: topic of event notifications. You can create a topic on the SMN console.
    • Event: type of the event to subscribe to. Options: JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged.

    Auto Stop: Enable; 1 hour; check the agreement. After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.

    Retain Default for other parameters.

    image

  3. After the parameters are set, click Submit, confirm the training information, and click OK.

    image

    A training job generally runs for a period of time. To view the real-time status and basic information of a training job, switch to the training job list.

    • In the training job list, Status of the newly created training job is Pending.
    • When the status of a training job changes to Completed, the training job is complete, and the generated model is stored in the corresponding training output path.
    • If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting. For details, see Training Job Details.

    image

    image

  4. After the training job is complete, you can edit the code online in the code directory. After saving the code, you can train the model again, as shown below.

    image

7. Training Output

After the training task is complete, you can view the training result in the configured OBS training output path.

image

[Copyright Notice] This article is original content by a Huawei Cloud community user. When reproducing it, cite the source (Huawei Cloud community), the article link, and the author; otherwise the author and the community reserve the right to pursue liability. To report suspected plagiarism in this community, send an email with supporting evidence to cloudbbs@huaweicloud.com; confirmed infringing content will be removed immediately.