Experiment Tracking and Hyperparameter Tuning with TensorBoard in PyTorch

Posted by Q神 on 2023/07/02 15:37:54

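Introduction

Tracking experiments and tuning hyperparameters with TensorBoard in PyTorch

Experiment tracking means recording and monitoring the data produced by machine learning experiments, and TensorBoard is a useful tool for visualizing and analyzing that data. It helps researchers understand how an experiment behaves, compare models, and make informed decisions.

Hyperparameter tuning is the process of finding the best values for the configuration settings that govern how a model learns. Examples include the learning rate, the batch size, and the number of hidden layers. Proper tuning can improve model performance and generalization.

Hyperparameter tuning strategies include manual search, grid search, random search, Bayesian optimization, and automated techniques. These methods systematically explore and evaluate different hyperparameter values.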
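
To make the difference between two of these strategies concrete, here is a minimal sketch that contrasts grid search with random search. The search space and the `toy_score` function are assumptions made up for illustration; in a real run the score would come from training and evaluating a model.

```python
import itertools
import math
import random

# Hypothetical search space, chosen only for illustration.
LEARNING_RATES = [1e-4, 1e-3, 1e-2]
DROPOUTS = [0.0, 0.25, 0.5]


def toy_score(lr, dropout):
    """Stand-in for a validation score; higher is better.

    A real run would train a model and return a metric instead.
    """
    return -((math.log10(lr) + 3) ** 2) - (dropout - 0.25) ** 2


def grid_search():
    """Evaluate every combination of the listed values."""
    trials = list(itertools.product(LEARNING_RATES, DROPOUTS))
    best = max(trials, key=lambda cfg: toy_score(*cfg))
    return trials, best


def random_search(n_trials, seed=0):
    """Sample configurations from continuous ranges instead of a fixed grid."""
    rng = random.Random(seed)
    trials = [
        (10 ** rng.uniform(-4, -2), rng.uniform(0.0, 0.5))
        for _ in range(n_trials)
    ]
    best = max(trials, key=lambda cfg: toy_score(*cfg))
    return trials, best


grid_trials, grid_best = grid_search()
random_trials, random_best = random_search(n_trials=5)
print(f"grid: {len(grid_trials)} trials, best = {grid_best}")
print(f"random: {len(random_trials)} trials, best = {random_best}")
```

Grid search costs one trial per combination, so its budget grows multiplicatively with each added hyperparameter, while random search lets you fix the number of trials up front.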

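During tuning, you can judge model performance with an evaluation metric such as accuracy or mean squared error. Effective hyperparameter tuning improves a model's results on unseen data.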
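
As a quick illustration of the two metrics named above, the following sketch computes them in plain Python. The function names and the sample values are assumptions for this example only; the rest of this post relies on torchmetrics instead.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match their target labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Classification: 2 of 4 labels match.
print(accuracy([1, 0, 2, 2], [1, 1, 2, 0]))  # 0.5
# Regression: the squared errors are 1.0 and 4.0.
print(mean_squared_error([2.0, 4.0], [1.0, 6.0]))  # 2.5
```

Accuracy suits classification tasks such as the one in this post, while mean squared error is the usual choice for regression.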

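In this blog, we will tune hyperparameters using grid search, the FashionMNIST dataset, and a custom VGG model. Stay tuned for future blogs on other tuning algorithms.

Let's get started!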



Install and import dependencies

First, open a new Python notebook in Jupyter or Google Colab. Run these commands in a code cell to install and import the dependencies.

%pip install -q torchinfo torchmetrics tensorboard

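import torch
import torchvision
import os
from datetime import datetime  # used to timestamp the TensorBoard log directories
from torch import nn  # used to define the model
from torchvision.transforms import Resize, Compose, ToTensor
import matplotlib.pyplot as plt
from torchinfo import summary
import torchmetrics
from tqdm.auto import tqdm
from torch.utils.tensorboard import SummaryWriter

# Train on a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")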



Load the datasets and DataLoaders

BATCH_SIZE = 64

if not os.path.exists("data"): os.mkdir("data")

train_transform = Compose([Resize((64,64)),
                           ToTensor()
                           ])
test_transform = Compose([Resize((64,64)),
                          ToTensor()
                          ])

training_dataset = torchvision.datasets.FashionMNIST(root = "data",
                                                     download = True,
                                                     train = True,
                                                     transform = train_transform)

test_dataset = torchvision.datasets.FashionMNIST(root = "data",
                                                 download = True,
                                                 train = False,
                                                 transform = test_transform)

train_dataloader = torch.utils.data.DataLoader(training_dataset,
                                          batch_size=BATCH_SIZE,
                                          shuffle=True,
                                          )

test_dataloader = torch.utils.data.DataLoader(test_dataset,
                                              batch_size = BATCH_SIZE,
                                              shuffle = False,
                                              )

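  • Here we set the batch size to 64. In general, you will want the largest batch size your GPU can handle without raising a cuda out of memory error.
  • We define transforms that resize each image to 64×64 and convert it to a tensor.
  • We instantiate the training and test datasets from the FashionMNIST dataset built into torchvision. We set root to the data folder, download to True because we want to download the dataset, and train to True for the training data and False for the test data.
  • Next, we define the training and test data loaders.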
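
The batch size also determines how many batches each pass over the data takes. The sketch below works that out for the FashionMNIST split sizes (60,000 training and 10,000 test images). The helper `batches_per_epoch` is a hypothetical name for this example; the `drop_last` flag mirrors the DataLoader option of the same name, which defaults to False.

```python
import math

BATCH_SIZE = 64
N_TRAIN = 60_000  # FashionMNIST training images
N_TEST = 10_000   # FashionMNIST test images


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a loader yields in one pass over n_samples items."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


train_batches = batches_per_epoch(N_TRAIN, BATCH_SIZE)
test_batches = batches_per_epoch(N_TEST, BATCH_SIZE)
# The final training batch is smaller because 60,000 is not a multiple of 64.
last_train_batch = N_TRAIN - (train_batches - 1) * BATCH_SIZE
print(train_batches, test_batches, last_train_batch)  # 938 157 32
```

Doubling the batch size roughly halves the number of optimizer steps per epoch, which is one reason batch size and learning rate are often tuned together.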

We can use these commands to see how many images the training and test datasets contain.

print(f"Number of Images in test dataset is {len(test_dataset)}")
print(f"Number of Images in training dataset is {len(training_dataset)}")

[!output]
Number of Images in test dataset is 10000
Number of Images in training dataset is 60000


Create the TinyVGG model

I am using this custom model to demonstrate experiment tracking, but you can use any model of your choice.

class TinyVGG(nn.Module):
    """
    A small VGG-like network for image classification.

    Args:
        in_channels (int): The number of input channels.
        n_classes (int): The number of output classes.
        hidden_units (int): The number of hidden units in each convolutional block.
        n_conv_blocks (int): The number of convolutional blocks.
        dropout (float): The dropout rate.
    """

    def __init__(self, in_channels, n_classes, hidden_units, n_conv_blocks, dropout):
        super().__init__()
        self.in_channels = in_channels
        self.out_features = n_classes
        self.dropout = dropout
        self.hidden_units = hidden_units

        # Input block
        self.input_block = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=hidden_units, kernel_size=3, padding=0, stride=1),
            nn.Dropout(dropout),
            nn.ReLU(),
        )

        # Convolutional blocks
        self.conv_blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels=hidden_units, out_channels=hidden_units, kernel_size=3, padding=0, stride=1),
                nn.Dropout(dropout),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),
            ) for _ in range(n_conv_blocks)
        ])

        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(out_features=256),
            nn.Dropout(dropout),
            nn.Linear(in_features=256, out_features=64),
            nn.Linear(in_features=64, out_features=n_classes),
        )

    def forward(self, x):
        """
        Forward pass of the network.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The output tensor.
        """

        x = self.input_block(x)
        for conv_block in self.conv_blocks:
            x = conv_block(x)
        x = self.classifier(x)
        return x
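
The model relies on nn.LazyLinear to infer how many features reach the classifier. As a rough check, the sketch below traces the spatial size by hand, assuming 64×64 inputs and the layer settings shown above (3×3 convolutions without padding, 2×2 max pooling). The helper names `conv_out` and `flattened_features` are hypothetical.

```python
def conv_out(size, kernel_size=3, padding=0, stride=1):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_features(image_size, hidden_units, n_conv_blocks):
    """Trace the spatial size through TinyVGG and return the Flatten output size."""
    size = conv_out(image_size)  # input block: 3x3 conv, no padding
    for _ in range(n_conv_blocks):
        size = conv_out(size)  # 3x3 conv, no padding
        size = conv_out(size, kernel_size=2, stride=2)  # 2x2 max pool
    return hidden_units * size * size


for blocks in (1, 2, 3):
    print(blocks, flattened_features(64, hidden_units=128, n_conv_blocks=blocks))
# 1 115200
# 2 25088
# 3 4608
```

Each extra convolutional block shrinks the classifier input considerably, so the number of blocks changes the parameter count of the first linear layer as well as the depth of the network.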


Define the training and test functions

def train_step(dataloader, model, optimizer, criterion, device, train_acc_metric):
    """
    Perform a single training step.

    Args:
        dataloader (torch.utils.data.DataLoader): The dataloader for the training data.
        model (torch.nn.Module): The model to train.
        optimizer (torch.optim.Optimizer): The optimizer for the model.
        criterion (torch.nn.Module): The loss function for the model.
        device (torch.device): The device to train the model on.
        train_acc_metric (torchmetrics.Accuracy): The accuracy metric for the model.

    Returns:
        The accuracy of the model on the training data.
    """

    # tqdm was imported with `from tqdm.auto import tqdm`, so call it directly.
    for (X, y) in tqdm(dataloader):
        # Move the data to the device.
        X = X.to(device)
        y = y.to(device)

        # Forward pass.
        y_preds = model(X)

        # Calculate the loss.
        loss = criterion(y_preds, y)

        # Calculate the accuracy.
        train_acc_metric.update(y_preds, y)

        # Backpropagate the loss.
        loss.backward()

        # Update the parameters.
        optimizer.step()

        # Zero the gradients.
        optimizer.zero_grad()

    return train_acc_metric.compute()


def test_step(dataloader, model, device, test_acc_metric):
    """
    Perform a single test step.

    Args:
        dataloader (torch.utils.data.DataLoader): The dataloader for the test data.
        model (torch.nn.Module): The model to test.
        device (torch.device): The device to test the model on.
        test_acc_metric (torchmetrics.Accuracy): The accuracy metric for the model.

    Returns:
        The accuracy of the model on the test data.
    """

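    # Gradients are not needed for evaluation, so skip tracking them.
    with torch.no_grad():
        # tqdm was imported with `from tqdm.auto import tqdm`, so call it directly.
        for (X, y) in tqdm(dataloader):
            # Move the data to the device.
            X = X.to(device)
            y = y.to(device)

            # Forward pass.
            y_preds = model(X)

            # Calculate the accuracy.
            test_acc_metric.update(y_preds, y)

    return test_acc_metric.compute()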
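
Both functions feed the metric one batch at a time with update() and read the result with compute(). A torchmetrics metric keeps accumulating state across update() calls until reset() is called. The sketch below imitates that pattern with a simplified pure-Python stand-in (the `RunningAccuracy` class is hypothetical and is not part of torchmetrics) to show what happens when the state is not reset between epochs.

```python
class RunningAccuracy:
    """Simplified stand-in for an accumulating accuracy metric (not torchmetrics)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, targets):
        """Add one batch of predicted and true labels to the running totals."""
        self.correct += sum(p == t for p, t in zip(predictions, targets))
        self.total += len(targets)

    def compute(self):
        """Accuracy over everything seen since the last reset."""
        return self.correct / self.total if self.total else 0.0

    def reset(self):
        """Forget all accumulated state."""
        self.correct = 0
        self.total = 0


metric = RunningAccuracy()
metric.update([0, 1, 1, 0], [0, 1, 0, 1])  # "epoch 1": 2 of 4 correct
epoch_1 = metric.compute()
metric.update([0, 1, 1, 0], [0, 1, 1, 0])  # "epoch 2": 4 of 4 correct
blended = metric.compute()                 # mixes both epochs: 6 of 8

metric.reset()
metric.update([0, 1, 1, 0], [0, 1, 1, 0])
epoch_2_only = metric.compute()            # only the second epoch: 4 of 4
print(epoch_1, blended, epoch_2_only)      # 0.5 0.75 1.0
```

Without a reset, the value logged for a later epoch is an average over all epochs so far, which makes the learning curve look flatter than it is.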



TensorBoard SummaryWriter

def create_writer(
    experiment_name: str, model_name: str, conv_layers, dropout, hidden_units
) -> SummaryWriter:
    """
    Create a SummaryWriter object for logging the training and test results.

    Args:
        experiment_name (str): The name of the experiment.
        model_name (str): The name of the model.
        conv_layers (int): The number of convolutional layers in the model.
        dropout (float): The dropout rate used in the model.
        hidden_units (int): The number of hidden units in the model.

    Returns:
        SummaryWriter: The SummaryWriter object.
    """

    timestamp = str(datetime.now().strftime("%d-%m-%Y_%H-%M-%S"))
    log_dir = os.path.join(
        "runs",
        timestamp,
        experiment_name,
        model_name,
        f"{conv_layers}",
        f"{dropout}",
        f"{hidden_units}",
    ).replace("\\", "/")
    return SummaryWriter(log_dir=log_dir)
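
To see what directory layout this produces without writing any event files, the sketch below rebuilds only the path logic. The helper `build_log_dir` and its `now` parameter are hypothetical additions for this example, and the fixed time is made up so that the printed path is reproducible.

```python
import os
from datetime import datetime


def build_log_dir(experiment_name, model_name, conv_layers, dropout, hidden_units, now=None):
    """Return the log directory path that create_writer assembles."""
    now = now or datetime.now()
    timestamp = now.strftime("%d-%m-%Y_%H-%M-%S")
    return os.path.join(
        "runs", timestamp, experiment_name, model_name,
        f"{conv_layers}", f"{dropout}", f"{hidden_units}",
    ).replace("\\", "/")


# A fixed, made-up time so the printed path is reproducible.
example = build_log_dir(
    "1", "tiny_vgg", 2, 0.25, 128, now=datetime(2023, 7, 2, 15, 37, 54)
)
print(example)  # runs/02-07-2023_15-37-54/1/tiny_vgg/2/0.25/128
```

Because the timestamp is the first level under runs, every run gets its own subtree, and pointing TensorBoard at runs picks all of them up.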


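Hyperparameter tuning

There are several hyperparameters here: the learning rate, the number of epochs, the optimizer type, the number of convolutional layers, the dropout rate, and the number of hidden units. We can first fix the learning rate and the number of epochs and look for the best number of convolutional layers, dropout rate, and number of hidden units. Once we have those, we can tune the number of epochs and the learning rate.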

# Fixed hyperparameters
EPOCHS = 10
LEARNING_RATE = 0.0007

"""
This code performs hyperparameter tuning for a TinyVGG model.

The hyperparameters that are tuned are the number of convolutional layers, the dropout rate, and the number of hidden units.

The results of the hyperparameter tuning are logged to a TensorBoard file.
"""

experiment_number = 0

# hyperparameters to tune
hparams_config = {
    "n_conv_layers": [1, 2, 3],
    "dropout": [0.0, 0.25, 0.5],
    "hidden_units": [128, 256, 512],
}

for n_conv_layers in hparams_config["n_conv_layers"]:
    for dropout in hparams_config["dropout"]:
        for hidden_units in hparams_config["hidden_units"]:
            experiment_number += 1
            print(
                f"\nTuning Hyper Parameters || Conv Layers: {n_conv_layers} || Dropout: {dropout} || Hidden Units: {hidden_units} \n"
            )

            # create the model
            model = TinyVGG(
                in_channels=1,
                n_classes=len(training_dataset.classes),
                hidden_units=hidden_units,
                n_conv_blocks=n_conv_layers,
                dropout=dropout,
            ).to(device)

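            # run one dummy batch so that nn.LazyLinear infers its input size
            # before the optimizer is created (images are resized to 64x64)
            with torch.no_grad():
                model(torch.zeros(1, 1, 64, 64, device=device))

            # create the optimizer and loss function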
            optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
            criterion = torch.nn.CrossEntropyLoss()

            # create the accuracy metrics
            train_acc_metric = torchmetrics.Accuracy(
                task="multiclass", num_classes=len(training_dataset.classes)
            ).to(device)
            test_acc_metric = torchmetrics.Accuracy(
                task="multiclass", num_classes=len(training_dataset.classes)
            ).to(device)

            # create the TensorBoard writer
            writer = create_writer(
                experiment_name=f"{experiment_number}",
                model_name="tiny_vgg",
                conv_layers=n_conv_layers,
                dropout=dropout,
                hidden_units=hidden_units,
            )
            # train the model
            for epoch in range(EPOCHS):
                model.train()  # enable dropout while training
                train_acc = train_step(
                    train_dataloader,
                    model,
                    optimizer,
                    criterion,
                    device,
                    train_acc_metric,
                ).item()
                model.eval()  # disable dropout while evaluating
                test_acc = test_step(
                    test_dataloader, model, device, test_acc_metric
                ).item()
                writer.add_scalar(
                    tag="Training Accuracy",
                    scalar_value=train_acc,
                    global_step=epoch,
                )
                writer.add_scalar(
                    tag="Test Accuracy",
                    scalar_value=test_acc,
                    global_step=epoch,
                )
                # reset the metrics so that each epoch is measured on its own
                train_acc_metric.reset()
                test_acc_metric.reset()

            # add the hyperparameters and final-epoch metrics to TensorBoard
            writer.add_hparams(
                {
                    "conv_layers": n_conv_layers,
                    "dropout": dropout,
                    "hidden_units": hidden_units,
                },
                {
                    "train_acc": train_acc,
                    "test_acc": test_acc,
                },
            )
            writer.close()

This will take a while to run, depending on your hardware.
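
To get a feel for how much work the grid involves, the sketch below enumerates the same search space with itertools.product and counts the runs. It reuses the `hparams_config` values and `EPOCHS` from the code above; using itertools.product in place of the three nested loops is an alternative shown here for illustration.

```python
import itertools

hparams_config = {
    "n_conv_layers": [1, 2, 3],
    "dropout": [0.0, 0.25, 0.5],
    "hidden_units": [128, 256, 512],
}
EPOCHS = 10

# Every combination of the three lists, as one flat sequence of dicts.
names = list(hparams_config)
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*hparams_config.values())
]

n_runs = len(combinations)
total_epochs = n_runs * EPOCHS
print(f"{n_runs} runs, {total_epochs} training epochs in total")
print(combinations[0], combinations[-1])
```

With three values for each of three hyperparameters, the grid holds 27 configurations, so the full search trains for 270 epochs. Adding a fourth hyperparameter with three values would triple that.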


Check the results in TensorBoard

If you are using Google Colab or a Jupyter notebook, you can view the TensorBoard dashboard with these commands.

%load_ext tensorboard
%tensorboard --logdir=runs
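
If the dashboard comes up empty, it can help to confirm that event files were actually written. The sketch below walks the log directory and lists them. The helper `find_event_files` is a hypothetical name, and it assumes the event files keep the default `events.out.tfevents` filename prefix.

```python
import os


def find_event_files(log_root="runs"):
    """Return paths of TensorBoard event files found under log_root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(log_root):
        for name in filenames:
            if name.startswith("events.out.tfevents"):
                found.append(os.path.join(dirpath, name).replace("\\", "/"))
    return sorted(found)


event_files = find_event_files("runs")
print(f"Found {len(event_files)} event file(s) under 'runs'")
for path in event_files[:5]:
    print(path)
```

An empty result usually means the training loop has not run yet or that it logged to a different directory than the one passed to --logdir.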

Parallel coordinates view

Hyperparameter view

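From these views, you can now find the best hyperparameters.

That's it. This is how to tune hyperparameters with TensorBoard. We used grid search here for simplicity, but you can take a similar approach with other tuning algorithms and use TensorBoard to watch them run in real time.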

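[Copyright notice] This article is original content from a Huawei Cloud Community user. When reposting, you must credit the source (Huawei Cloud Community), the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that appears to be plagiarized, please report it by email with supporting evidence; once verified, the community will remove the infringing content immediately. Report email: cloudbbs@huaweicloud.com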