【ModelArts Course7-1】Custom Image to Train & Deploy from GitHub

Emma_Liu | Posted on 2023/03/14 19:32:02
Abstract: This article explains how to use an open-source algorithm from GitHub to develop on the Huawei Cloud ModelArts platform in the cloud. - Part 1

ModelArts - Using a Custom Image to Train and Deploy a Model from GitHub

This article explains how to use an open-source algorithm from GitHub to develop on the Huawei Cloud ModelArts platform in the cloud. The following mind map shows the main steps of the development process.

image

Part 1: Create Training Job Using Custom Image

The following sections explain each step in detail.

1. Prepare the Python Environment

Click this link to open the ModelArts management console, click DevEnviron -> Notebook to enter the notebook list page, then click Create in the upper left corner, and set the parameters as shown below.

image

After configuring the parameters, click Next, confirm the product specifications, and then click Submit to create the notebook.

Return to the notebook list page. After the status of the new notebook changes to Running, click Open in the Operation column to access it.

On the notebook page, click Launcher -> Terminal, as shown below.

image

You can run the command conda info -e to view the information about the installed Python environment.

Click this link: CodeHub - DINO. The following uses this open-source algorithm as an example to show how to quickly run it in a Huawei Cloud notebook. For more information about the algorithm, see README.md.

  1. Run the following command in the terminal to clone the repository:

    git clone https://codehub.devcloud.cn-north-4.huaweicloud.com/DINO00002/DINO.git
    cd DINO
    

    image

    As shown above, the code clone is complete. Click the refresh button at the top left of the file browser to view the code.

  2. Check the PyTorch version:

    pip list | grep torch
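
The same check can be run from Python; this small sketch uses only the standard library to report the installed torch version, if any.

```python
from importlib.metadata import version, PackageNotFoundError

def torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return version('torch')
    except PackageNotFoundError:
        return None

print(torch_version() or 'torch is not installed')
```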
    
  3. Install the other required packages:

    pip install -r requirements.txt
    
  4. Compile the CUDA operators:

    cd models/dino/ops
    python setup.py build install
    # Unit test (all numerical gradient checks should be True)
    python test.py
    cd ../../..  # Back to the code home directory
    

    image
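
The "numerical gradient" check that test.py performs compares analytic gradients against finite-difference estimates. Stripped of torch and of the real deformable-attention operator, the underlying idea is a central-difference check like this illustrative sketch (the toy function and its gradient are stand-ins, not the real operator):

```python
def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Toy stand-ins for the compiled operator and its analytic gradient.
f = lambda x: x ** 3
grad_f = lambda x: 3 * x ** 2

x = 1.5
ok = abs(numerical_grad(f, x) - grad_f(x)) < 1e-4
print(ok)  # the unit test expects every such check to report True
```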

2. Prepare the Data and Pretrained Parameter Files

  1. Return to the Huawei Cloud ModelArts management console, move the cursor to the left sidebar, and in the menu that pops up, click Service List -> Storage -> Object Storage Service, as shown below.

    image

    Click Create Bucket. The Create Bucket page is displayed.

    image

    Create the OBS bucket with the following parameters:
    ● Region: AP-Singapore
    ● Bucket Name: user-defined, which will be used in subsequent steps.
    ● Data Redundancy Policy: Single-AZ storage
    ● Default Storage Class: Standard
    ● Bucket Policy: Private
    ● Default Encryption: Keep Default

    Click Create Now -> OK to complete the creation of the bucket.

    image

  2. We will use a subset of COCO 2017. This micro-dataset includes 100 training images, 100 validation images, and the annotation files.

    COCO2017_subset100/
      ├── train2017/
      ├── val2017/
      └── annotations/
      	├── instances_train2017.json
      	└── instances_val2017.json
    

    It is stored in a public-read OBS bucket at obs://modelarts-case-dev-sg/opensource/coco/coco2017_subset/.
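
After copying, the layout shown above can be sanity-checked with a short sketch (EXPECTED and missing_entries are illustrative helpers, not part of the DINO repository):

```python
from pathlib import Path

# Paths that should exist under the dataset root, per the tree above.
EXPECTED = [
    'train2017',
    'val2017',
    'annotations/instances_train2017.json',
    'annotations/instances_val2017.json',
]

def missing_entries(root):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

An empty result from missing_entries('./coco2017_subset') means the subset is laid out as expected.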

    Return to the notebook page, create an .ipynb file, copy the following code into a cell, fill in the two paths, and run it. After the execution completes, click the refresh button above the file browser to view the imported dataset, as shown below.

    import moxing as mox
    mox.file.copy_parallel('${obs_path}', '${notebook_path}')
    

    NOTICE:

    ● ${obs_path} indicates the location of the public-read OBS dataset.
    ● ${notebook_path} indicates the dataset storage path (./coco2017_subset) in the notebook, at the same level as the DINO directory.

    image

  3. Open the terminal and run the following command to download the DINO model checkpoint0011_4scale.pth. After the download completes, click the refresh button on the left to view the ckpts folder containing the downloaded checkpoint.

    wget -P ckpts https://sandbox-expriment-files.obs.cn-north-1.myhuaweicloud.com:443/20221228/checkpoint0011_4scale.pth
    

    image

3. Evaluate the Pretrained Model

Run the following command to evaluate the pretrained model. You can expect a final AP of about 49.

bash scripts/DINO_eval.sh /path/to/your/COCODIR /path/to/your/checkpoint

NOTICE:

/path/to/your/COCODIR is the path where the dataset is stored in the notebook.

/path/to/your/checkpoint is the path where the model checkpoint is stored in the notebook.

As shown below:

image

The following figure shows the running result.

image
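
If you prefer to assemble the evaluation command programmatically (for example, to reuse it across notebooks), a sketch is below; eval_command is a hypothetical helper, and the example paths match the ones used earlier in this walkthrough.

```python
import shlex

def eval_command(coco_dir, checkpoint):
    """Build the DINO_eval.sh invocation from the two notebook paths."""
    return f'bash scripts/DINO_eval.sh {shlex.quote(coco_dir)} {shlex.quote(checkpoint)}'

print(eval_command('./coco2017_subset', 'ckpts/checkpoint0011_4scale.pth'))
# bash scripts/DINO_eval.sh ./coco2017_subset ckpts/checkpoint0011_4scale.pth
```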

4. Save the Image

In 1. Prepare the Python Environment, we created a notebook instance from a built-in system image and installed custom dependencies on it. Now save the running instance as a container image.

In the saved image, the installed dependency packages are retained, but the data stored in /home/ma-user/work (the persistent storage directory) is not. For remote development through VS Code, plug-ins installed on the server are retained in the saved image.

  1. Return to the ModelArts management console and choose DevEnviron -> Notebook in the navigation pane on the left to switch to the new-version notebook page.

  2. In the notebook list, select the target notebook instance and choose Save Image from the More drop-down list in the Operation column. The Save Image dialog box is displayed.

    image

  3. In the Save Image dialog box, configure parameters. Click OK to save the image.

    image

    Choose an organization from the Organization drop-down list. If no organization is available, click Create on the right to create one. Users in an organization can share all images in the organization. For details about how to create an organization, refer to Creating an Organization.

  4. The image will be saved as a snapshot, which takes about 5 minutes. During this time, do not perform any operations on the instance. (You can still perform operations on the opened JupyterLab page and in the local IDE.)

    image

    NOTICE:

    The time required for saving an image as a snapshot will be counted in the instance running duration. If the instance running duration expires before the snapshot is saved, saving the image will fail.

  5. After the image is saved, the instance status changes to Running. View the image on the Image Management page.

  6. Click the name of the image to view its details including the image version/ID, status, resource type, image size, and SWR address.

  7. Choose Image Management in the navigation pane on the left to view the image list and details, as shown in the following figure.

    image

5. Upload Dataset and Training Code

Return to the notebook page and enter the following code in the .ipynb file created earlier to upload the coco2017_subset dataset to the OBS bucket.

mox.file.copy_parallel('coco2017_subset', 'obs://${your bucket_name}/coco2017_subset')

NOTICE:

● ${your bucket_name} indicates the name of the OBS bucket created earlier.

Insert a cell below and enter the following code to upload the code directory to the OBS bucket.

mox.file.copy_parallel("./DINO/","obs://${your bucket_name}/DINO")

image

6. Create Training Job

  1. Choose Training Management -> Training Jobs in the navigation pane on the left to open the training job list page, and click Create Training Job in the upper right corner, as shown below.

    image

  2. Parameter Configuration

    Created By: Custom algorithms

    Boot Mode: Custom images

    Image: Click Select. In the Select Image dialog box, select the image you created and its version, and then click OK.

    image

    Code Directory: Click Select on the right, select the DINO code directory uploaded to OBS, and then click OK.

    image

    Boot Command: the command for booting the image, executed automatically after the code directory is downloaded. If the training startup script is a .py file, train.py for example, the boot command can be python train.py. If it is a .sh file, main.sh for example, the boot command can be bash main.sh. Semicolons (;) and ampersands (&&) can be used to combine multiple boot commands, but line breaks are not supported.

    python main.py -c config/DINO/DINO_4scale.py --options dn_scalar=100 embed_init_tgt=TRUE dn_label_coef=1.0 dn_bbox_coef=1.0 use_ema=False dn_box_noise_scale=1.0
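
The .py/.sh rule described above can be restated in code; boot_command is purely illustrative, not a ModelArts API.

```python
def boot_command(script):
    """Map a startup script to its boot command per the rule above."""
    if script.endswith('.py'):
        return f'python {script}'
    if script.endswith('.sh'):
        return f'bash {script}'
    raise ValueError(f'unsupported startup script: {script}')

print(boot_command('train.py'))  # python train.py
print(boot_command('main.sh'))   # bash main.sh
```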
    

    Local Code Directory: Default

    Work Directory: click Select -> Yes

    image

    Input: Select the data for training. It will be downloaded to the training container, and the data path will then be parsed from the job parameters.

    Click Add Training Input, set the input name to coco_path, click Data path, choose the coco2017_subset dataset path in OBS, and set Obtained from to Hyperparameters.

    Output: Select an OBS path for storing the training output. An empty directory is recommended.

    Click Add Training Output, set the output name to output_dir, click Data path -> Create Folder to create a folder named training_output in OBS for storing the training output, and set Obtained from to Hyperparameters.

    image
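
Because both channels are obtained from hyperparameters, the training script receives them as ordinary command-line options. This sketch shows the matching argparse arguments (DINO's main.py already defines --coco_path and --output_dir; the container paths below are illustrative examples, not guaranteed ModelArts paths):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--coco_path', type=str)   # filled from the coco_path input
parser.add_argument('--output_dir', type=str)  # filled from the output_dir output

# Simulate the options appended to the boot command at job start.
args = parser.parse_args(['--coco_path', '/home/ma-user/input/coco_path',
                          '--output_dir', '/home/ma-user/output/output_dir'])
print(args.coco_path, args.output_dir)
```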

    Resource Pool: Public resource pools

    Resource Type: GPU

    Instance Flavor: GPU: 1*NVIDIA-V100-pcie-32gb(32GB) | CPU: 8 vCPUs 64GB 780GB

    Compute Nodes: 1

    Persistent Log Saving: Optional

    • Job Log Path: select an empty OBS path for storing training logs. Ensure that you have read and write permissions to the selected OBS directory.

    Event Notification: Optional. Whether to subscribe to event notifications. After this function is enabled, you will be notified of specific events, such as job status changes or suspected suspensions, via an SMS or email. If you enable this function, configure the following parameters as required:

    • Topic: topic of event notifications. You can create a topic on the SMN console.
    • Event: type of the event to subscribe to. Options: JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged.

    Auto Stop: Enable; 1 hour; check the agreement. After this parameter is enabled and the auto stop time is set, a training job automatically stops at the specified time.

    Retain Default for other parameters.

    image

  3. After the parameters are set, click Submit, confirm the training information, and click OK.

    image

    A training job generally runs for a period of time. To view the real-time status and basic information of a training job, switch to the training job list.

    • In the training job list, Status of the newly created training job is Pending.
    • When the status of a training job changes to Completed, the training job is complete, and the generated model is stored in the corresponding training output path.
    • If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting. For details, see Training Job Details.

    image

    image

  4. After the training job is complete, you can edit the code online in the code directory. After saving the code, you can train the model again, as shown below.

    image

7. Training Output

After the training task is complete, you can view the training result in the configured OBS training output path.

image

[Copyright Notice] This article is original content by a Huawei Cloud community user. When reproducing it, cite the source (Huawei Cloud community), the article link, and the author; otherwise the author and the community reserve the right to pursue liability. To report suspected plagiarism in this community, send an email with supporting evidence to cloudbbs@huaweicloud.com; confirmed infringing content will be removed immediately.