Lemon Classification: A Complete Competition Walkthrough (Part 1)
In this course we use the lemon quality classification competition as a running example to walk through the complete workflow of an image classification contest: from data processing to model building, choosing the loss function and optimizer, learning-rate scheduling, model training, and finally inference and output. Every module comes with plenty of tricks, and for each one I will introduce the theory and then put it into practice in code. By the end of this course you will know how to build an image classification network and have what it takes to enter image classification competitions. (PS: we guarantee you'll learn it; but if you don't, there is still no refund!!! "No refunds" is our service motto.)
Contents
- Building a pipeline for image tasks (modular design)
- General tuning tricks and common approaches
- Hands-on tuning for the lemon classification competition
- Suggestions and summary
Tools for the full image classification competition workflow
- Programming language
python
- Deep learning ("alchemy") framework
PaddlePaddle 2.0
- Image preprocessing libraries
OpenCV
PIL (Pillow)
- General-purpose libraries
NumPy
Pandas
Scikit-Learn
Matplotlib
The typical workflow for an image classification competition
- Data EDA (Pandas, Matplotlib)
- Data preprocessing (OpenCV, PIL, Pandas, NumPy, Scikit-Learn)
- Define data loading for the task, i.e. the Dataset and DataLoader (PaddlePaddle 2.0)
- Pick an image classification model and train it (PaddlePaddle 2.0)
- Predict on the test set and submit the results (PaddlePaddle 2.0, Pandas)
1. EDA (Exploratory Data Analysis) and Data Preprocessing
1.1 Data EDA
Exploratory Data Analysis (EDA) means exploring the raw data at hand and probing its structure and regularities through plots, tables, curve fitting, summary statistics, and similar means. When we first get a dataset we often have no idea where to start, and this is exactly where EDA proves effective.
For an image classification task, the usual first step is to count the number of samples in each class and inspect the distribution of the training set. Analyzing the data distribution informs how we read the problem and shapes the solution. (Insight into the nature of the data matters.)
Some advice on data analysis
1. Write down a list of your own hypotheses, then follow them up with deeper analysis.
2. Keep a record of your analysis process so nothing is forgotten.
3. Show intermediate results to peers so they can offer broader feedback and suggestions (i.e. be open to everybody).
4. Visualize the analysis results.
!cd data/data71799/ && unzip -q lemon_lesson.zip
!cd data/data71799/lemon_lesson && unzip -q train_images.zip
!cd data/data71799/lemon_lesson && unzip -q test_images.zip
# Import the required libraries
import os
import pandas as pd
import numpy as np
from PIL import Image
import paddle
import paddle.nn as nn
from paddle.io import Dataset
import paddle.vision.transforms as T
import paddle.nn.functional as F
from paddle.metric import Accuracy
import warnings
warnings.filterwarnings("ignore")
# Data EDA: inspect the class distribution of the training labels
df = pd.read_csv('data/data71799/lemon_lesson/train_images.csv')
d = df['class_num'].hist().get_figure()   # histogram of samples per class
# d.savefig('2.jpg')
The histogram above shows how the lemon dataset is distributed across its four classes.
Knowledge point: common difficulties in image classification competitions
- Class imbalance
- One-shot and few-shot classification
- Fine-grained classification
Difficulties specific to the lemon classification competition
- The model size is restricted
- The dataset is small (1102 training images)
1.2 Data preprocessing
Compose chains a list of preprocessing transforms and applies them to the dataset in order.
# Define the preprocessing pipeline
data_transforms = T.Compose([
    T.Resize(size=(32, 32)),
    T.Transpose(),    # HWC -> CHW
    T.Normalize(
        mean=[0, 0, 0],        # divide by 255: min-max normalization to [0, 1]
        std=[255, 255, 255],
        to_rgb=True)
])
Image standardization and normalization
The two most common image preprocessing methods are standardization and normalization. Standardization rescales data into a specific range: subtracting the mean centers the data, and dividing by the standard deviation leaves it with zero mean and unit variance. Normalization is a typical special case of this idea, mapping the data uniformly onto the interval [0, 1].
Benefits
- Helps weight initialization behave well
- Avoids numerical problems in gradient updates
- Makes the learning rate easier to tune
- Speeds up convergence toward the optimum
(Original figures: the standardization formula, the normalization formula, and a plot contrasting the search path toward the optimum before and after normalization.)
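The formulas those figures showed are the standard ones (a reconstruction, not specific to this dataset):

$x' = \dfrac{x - \mu}{\sigma}$ (standardization),  $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$ (normalization)

For 8-bit images, dividing by 255 (the mean=[0, 0, 0], std=[255, 255, 255] Normalize above) is exactly min-max normalization onto [0, 1], while mean and std of 127.5 standardize the pixels onto [-1, 1]; both appear in the code cells below.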
# What is a numerical problem?
421*0.00243 == 0.421*2.43
False

The two products are mathematically identical, but floating-point rounding makes them differ in the last bits, so the comparison returns False. Keeping features on a consistent scale helps avoid exactly this kind of numerical trouble in gradient updates.
import numpy as np
from PIL import Image
from paddle.vision.transforms import Normalize

# Standardize to [-1, 1]: (x - 127.5) / 127.5
normalize_std = Normalize(mean=[127.5, 127.5, 127.5],
                          std=[127.5, 127.5, 127.5],
                          data_format='HWC')

fake_img = Image.fromarray((np.random.rand(300, 320, 3) * 255.).astype(np.uint8))
fake_img = normalize_std(fake_img)
# print(fake_img.shape)
print(fake_img)
[[[ 0.8666667 0.78039217 -0.9137255 ]
[-0.46666667 0.14509805 -0.08235294]
[ 0.16078432 0.25490198 0.34117648]
...
[ 0.38039216 0.8666667 0.827451 ]
[-0.16862746 0.49803922 0.3019608 ]
[ 0.06666667 -0.49019608 -0.7019608 ]]
[[ 0.8509804 -0.05882353 0.00392157]
[-0.8666667 0.9137255 0.67058825]
[ 0.16078432 -0.6862745 0.88235295]
...
[ 0.41960785 -0.49803922 0.29411766]
[-0.2627451 0.7019608 0.60784316]
[ 0.13725491 -0.6627451 -0.09803922]]
[[-0.3647059 -0.77254903 0.60784316]
[-0.79607844 0.7647059 -0.23921569]
[ 0.9607843 -0.8901961 0.75686276]
...
[-0.96862745 0.94509804 0.8352941 ]
[ 0.75686276 -0.8745098 0.7176471 ]
[-0.7490196 0.654902 -0.01960784]]
...
[[ 0.5137255 0.41960785 0.67058825]
[-0.06666667 0.5294118 -0.28627452]
[-0.8666667 -0.3254902 0.4117647 ]
...
[-0.1764706 0.6392157 0.75686276]
[-0.27058825 -0.9843137 0.39607844]
[ 0.33333334 -0.05098039 0.75686276]]
[[-0.827451 0.16862746 0.6313726 ]
[-0.99215686 -0.9607843 0.94509804]
[ 0.77254903 0.16862746 -0.94509804]
...
[ 0.81960785 -0.5372549 -0.75686276]
[-0.06666667 -0.81960785 -0.5137255 ]
[ 0.34901962 -0.15294118 0.39607844]]
[[ 0.7176471 0.18431373 0.7411765 ]
[ 0.5372549 0.46666667 -0.4117647 ]
[ 0.01960784 0.23137255 -0.28627452]
...
[ 0.44313726 0.06666667 -0.62352943]
[-0.78039217 0.88235295 -0.34117648]
[ 0.92156863 0.16862746 -0.7254902 ]]]
import numpy as np
from PIL import Image
from paddle.vision.transforms import Normalize

# Normalize to [0, 1]: x / 255
normalize = Normalize(mean=[0, 0, 0],
                      std=[255, 255, 255],
                      data_format='HWC')

fake_img = Image.fromarray((np.random.rand(300, 320, 3) * 255.).astype(np.uint8))
fake_img = normalize(fake_img)
# print(fake_img.shape)
print(fake_img)
[[[0.6313726 0.93333334 0.60784316]
[0.6666667 0.67058825 0.72156864]
[0.7647059 0.83137256 0.99215686]
...
[0.12156863 0.07450981 0.75686276]
[0.33333334 0.93333334 0.7058824 ]
[0.8862745 0.42745098 0.8666667 ]]
[[0.49411765 0.58431375 0.41568628]
[0.6509804 0.99215686 0.15294118]
[0.73333335 0.09019608 0.77254903]
...
[0.56078434 0.74509805 0.04313726]
[0.91764706 0.74509805 0.64705884]
[0.92941177 0.80784315 0.57254905]]
[[0.12156863 0.3137255 0.9372549 ]
[0.42352942 0.6862745 0.0627451 ]
[0.62352943 0.6 0.30980393]
...
[0.09411765 0.01176471 0.9372549 ]
[0.57254905 0.7294118 0.5254902 ]
[0.40784314 0.43137255 0.2627451 ]]
...
[[0.21176471 0.3372549 0.04705882]
[0.5647059 0.42352942 0.36862746]
[0.3254902 0.99607843 0.3254902 ]
...
[0.9607843 0.48235294 0.5921569 ]
[0.04705882 0.13725491 0.8 ]
[0.9254902 0.54509807 0.77254903]]
[[0.79607844 0.2509804 0.09411765]
[0.6392157 0.09019608 0.64705884]
[0.2901961 0.07843138 0.45882353]
...
[0.30588236 0.01176471 0.29803923]
[0.09803922 0.6784314 0.03529412]
[0.69803923 0.89411765 0.75686276]]
[[0.35686275 0.7294118 0.24705882]
[0.8392157 0.18431373 0.9647059 ]
[0.3372549 0.92941177 0.5294118 ]
...
[0.79607844 0.9254902 0.5921569 ]
[0.24705882 0.03921569 0.12941177]
[0.52156866 0.34117648 0.00392157]]]
Splitting the dataset
# Read the label file
train_images = pd.read_csv('data/data71799/lemon_lesson/train_images.csv', usecols=['id','class_num'])

# Split into a training set and a validation set (80/20, in file order)
all_size = len(train_images)
print(all_size)
train_size = int(all_size * 0.8)
train_image_path_list = train_images[:train_size]
val_image_path_list = train_images[train_size:]

print(len(train_image_path_list))
print(len(val_image_path_list))
1102
881
221
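The slice above splits in file order; if the CSV rows are grouped by class, the class ratios in the two splits can drift apart. A minimal alternative sketch (not part of the original run) using scikit-learn, which is already in our toolbox, to shuffle and stratify by class_num:

# Alternative: shuffled, stratified 80/20 split (sketch, assumes train_images from above)
from sklearn.model_selection import train_test_split

train_split, val_split = train_test_split(
    train_images,
    test_size=0.2,                         # same 80/20 ratio as the slice above
    random_state=42,                       # reproducible shuffle
    stratify=train_images['class_num'])    # keep the class ratios equal in both splits

print(len(train_split), len(val_split))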
# Build the Dataset
class MyDataset(paddle.io.Dataset):
    """
    Step 1: inherit from paddle.io.Dataset
    """
    def __init__(self, train_list, val_list, mode='train'):
        """
        Step 2: implement the constructor and define how data is read
        """
        super(MyDataset, self).__init__()
        self.data = []
        # DataFrames produced by the pandas split above
        self.train_images = train_list
        self.test_images = val_list
        if mode == 'train':
            # training split of train_images.csv
            for row in self.train_images.itertuples():
                self.data.append(['data/data71799/lemon_lesson/train_images/'+getattr(row, 'id'), getattr(row, 'class_num')])
        else:
            # validation split of train_images.csv
            for row in self.test_images.itertuples():
                self.data.append(['data/data71799/lemon_lesson/train_images/'+getattr(row, 'id'), getattr(row, 'class_num')])

    def load_img(self, image_path):
        # Read the image with Pillow and make sure it is RGB
        image = Image.open(image_path).convert('RGB')
        return image

    def __getitem__(self, index):
        """
        Step 3: implement __getitem__, which returns a single sample
        (the transformed image and its label) for a given index
        """
        image = self.load_img(self.data[index][0])
        label = self.data[index][1]
        return data_transforms(image), np.array(label, dtype='int64')

    def __len__(self):
        """
        Step 4: implement __len__, which returns the total dataset size
        """
        return len(self.data)
Defining the data loaders
# train_loader
train_dataset = MyDataset(train_list=train_image_path_list, val_list=val_image_path_list, mode='train')
train_loader = paddle.io.DataLoader(train_dataset, places=paddle.CPUPlace(), batch_size=128, shuffle=True, num_workers=0)

# val_loader
val_dataset = MyDataset(train_list=train_image_path_list, val_list=val_image_path_list, mode='test')
val_loader = paddle.io.DataLoader(val_dataset, places=paddle.CPUPlace(), batch_size=128, shuffle=False, num_workers=0)  # no need to shuffle validation data
print('=============train dataset=============')
for image, label in train_dataset:
    print('image shape: {}, label: {}'.format(image.shape, label))
    break

for batch_id, data in enumerate(train_loader()):
    x_data = data[0]
    y_data = data[1]
    print(x_data)
    print(y_data)
    break
2. Choosing a Baseline
Ideally, the bigger the model, the stronger its fitting capacity, and the larger the input image, the more information is preserved. In practice, a more complex model takes longer to train, and so does a larger input size.
At the start of a competition, prefer the simplest workable model (such as ResNet) and get the whole training-and-prediction pipeline running quickly. Choose the classification model according to the difficulty of the task: the model with the highest benchmark accuracy is not necessarily the best fit for the competition.
In a real competition we can grow the image size step by step: first let the model converge at 64 x 64, then keep training at 128 x 128, and finally at 224 x 224. This progressive resizing can speed up convergence; a sketch follows below.
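A minimal sketch of that loop, assuming the MyDataset, transform, and prepared model definitions from this notebook; sizes and epoch counts are placeholders, and note that a head with fixed in_features (like MyNet below) would need to be replaced by a size-agnostic head (e.g. global pooling before the Linear) for the larger sizes to work:

# Progressive resizing (sketch): fine-tune the same model at increasing input sizes.
for size in [64, 128, 224]:
    data_transforms = T.Compose([          # rebuild the preprocessing at the new size
        T.Resize(size=(size, size)),
        T.Transpose(),                     # HWC -> CHW
        T.Normalize(mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True),
    ])
    train_dataset = MyDataset(train_list=train_image_path_list,
                              val_list=val_image_path_list, mode='train')
    train_loader = paddle.io.DataLoader(train_dataset, batch_size=128, shuffle=True)
    model.fit(train_loader, epochs=5, verbose=1)   # weights carry over between stages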
A baseline should follow a few principles:
- Low complexity and a simple code structure.
- The loss converges correctly and the evaluation metric (accuracy/AUC and the like) improves accordingly.
- Fast iteration: no fancy model structures, loss functions, or image preprocessing methods.
- A correct, simple test script that yields a valid score once the submission is uploaded.
Knowledge point
Ways to build the network
PaddlePaddle supports building models either with Sequential or by subclassing (SubClass); pick whichever suits the scenario. For a purely sequential, feed-forward structure, Sequential gets the network assembled fastest. For more complex structures, subclass paddle.nn.Layer instead: declare the layers in the __init__ constructor and use those layer members in forward to compute the forward pass. This way we can build much more flexible architectures.
Building the network with SubClass
# Define the convolutional neural network
class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=4):
        super(MyNet, self).__init__()
        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=1, padding=1)
        # self.pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        # self.pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        self.conv4 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=1)
        self.flatten = paddle.nn.Flatten()
        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)   # 1024 = 64 channels * 4 * 4 for 32x32 input
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        # x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        # x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x
Building the network with Sequential
# Sequential version of the same network (named separately so it does not shadow the MyNet class used below)
MySequentialNet = nn.Sequential(
    nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=1, padding=1),
    nn.ReLU(),
    nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3), stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(in_features=1024, out_features=64),   # 1024 = 64 * 4 * 4 for 32x32 input
    nn.ReLU(),
    nn.Linear(in_features=64, out_features=4)
)
# Wrap the network with the high-level Model API
model = paddle.Model(MyNet())
Visualizing the network structure
Use summary to print the basic structure and parameter counts of the network.
model.summary((1, 3, 32, 32))
---------------------------------------------------------------------------
Layer (type) Input Shape Output Shape Param #
===========================================================================
Conv2D-1 [[1, 3, 32, 32]] [1, 32, 32, 32] 896
Conv2D-2 [[1, 32, 32, 32]] [1, 64, 15, 15] 18,496
Conv2D-3 [[1, 64, 15, 15]] [1, 64, 7, 7] 36,928
Conv2D-4 [[1, 64, 7, 7]] [1, 64, 4, 4] 36,928
Flatten-1 [[1, 64, 4, 4]] [1, 1024] 0
Linear-1 [[1, 1024]] [1, 64] 65,600
Linear-2 [[1, 64]] [1, 4] 260
===========================================================================
Total params: 159,108
Trainable params: 159,108
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.40
Params size (MB): 0.61
Estimated Total Size (MB): 1.02
---------------------------------------------------------------------------
{'total_params': 159108, 'trainable_params': 159108}
Knowledge point: computing feature-map sizes
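The figure for this knowledge point is not reproduced here; the standard convolution output-size formula it illustrates is

$W_{out} = \lfloor (W_{in} - K + 2P) / S \rfloor + 1$

where $W_{in}$ is the input width (or height), $K$ the kernel size, $P$ the padding, and $S$ the stride. Checking it against the summary above: Conv2D-2 maps 32 to $\lfloor(32-3+0)/2\rfloor+1 = 15$, Conv2D-3 maps 15 to 7, and Conv2D-4 (padding 1) maps 7 to $\lfloor(7-3+2)/2\rfloor+1 = 4$, which is where Linear-1's in_features = 64 * 4 * 4 = 1024 comes from.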
# Model wrapping (already done above)
# model = MyNet(num_classes=2)
# model = paddle.Model(model)

# Define the optimizer
optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())

# Configure the model with optimizer, loss, and metric
model.prepare(
    optim,
    paddle.nn.CrossEntropyLoss(),
    Accuracy()
)
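The introduction promised learning-rate scheduling strategies; as a minimal, hedged sketch (not used in the training run below), Paddle 2.0 lets a scheduler object stand in for the constant learning rate:

# Optional: cosine-annealing schedule instead of a constant learning rate (sketch).
# T_max=5 matches the 5 training epochs below; this cell is illustrative only.
scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=0.001, T_max=5)
optim = paddle.optimizer.Adam(learning_rate=scheduler, parameters=model.parameters())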
# Paddle also ships a built-in VisualDL callback that saves logs to a directory:
# callback = paddle.callbacks.VisualDL(log_dir='visualdl_log_dir')
from visualdl import LogReader, LogWriter

args = {
    'logdir': './vdl',
    'file_name': 'vdlrecords.model.log',
    'iters': 0,
}

# Configure VisualDL
write = LogWriter(logdir=args['logdir'], file_name=args['file_name'])
# Initialize the step counter to 0
iters = args['iters']
# Custom callback
class Callbk(paddle.callbacks.Callback):
    def __init__(self, write, iters=0):
        self.write = write
        self.iters = iters

    def on_train_batch_end(self, step, logs):
        self.iters += 1
        # Log the loss
        self.write.add_scalar(tag="loss", step=self.iters, value=logs['loss'][0])
        # Log the accuracy
        self.write.add_scalar(tag="acc", step=self.iters, value=logs['acc'])
`./vdl/vdlrecords.model.log` is exists, VisualDL will add logs to it.
# Train and evaluate the model
model.fit(train_loader,
          val_loader,
          log_freq=1,
          epochs=5,
          callbacks=Callbk(write=write, iters=iters),
          verbose=1,
          )
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/5
step 7/7 [==============================] - loss: 0.9400 - acc: 0.4926 - 948ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 2/2 [==============================] - loss: 0.9144 - acc: 0.5837 - 715ms/step
Eval samples: 221
Epoch 2/5
step 7/7 [==============================] - loss: 0.6421 - acc: 0.6913 - 847ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 2/2 [==============================] - loss: 0.7182 - acc: 0.7240 - 729ms/step
Eval samples: 221
Epoch 3/5
step 7/7 [==============================] - loss: 0.4278 - acc: 0.7911 - 810ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 2/2 [==============================] - loss: 0.5868 - acc: 0.7873 - 724ms/step
Eval samples: 221
Epoch 4/5
step 7/7 [==============================] - loss: 0.3543 - acc: 0.8547 - 792ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 2/2 [==============================] - loss: 0.4153 - acc: 0.8597 - 738ms/step
Eval samples: 221
Epoch 5/5
step 7/7 [==============================] - loss: 0.2989 - acc: 0.8956 - 831ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 2/2 [==============================] - loss: 0.4725 - acc: 0.8235 - 724ms/step
Eval samples: 221
# Save the model
# model.save('Hapi_MyCNN')        # save for resuming training
model.save('Hapi_MyCNN', False)   # save for inference (the name loaded in Part 3 below)
Extra knowledge point: visualizing the training process
Launch the VisualDL tool from the command line with: visualdl --logdir ./vdl --port 8080
(the logdir must match the one configured above), then open http://127.0.0.1:8080 in a browser to see the training information:
Tune, train, log the curves, analyze the results.
3. Model Prediction
import os, time
import matplotlib.pyplot as plt
import paddle
from PIL import Image
import numpy as np

def load_image(img_path):
    '''
    Preprocess a single image for prediction
    '''
    img = Image.open(img_path).convert('RGB')
    plt.imshow(img)   # draw the image array
    plt.show()        # display it
    # Resize with bilinear interpolation
    img = img.resize((32, 32), Image.BILINEAR)
    img = np.array(img).astype('float32')
    # HWC to CHW
    img = img.transpose((2, 0, 1))
    # Normalize pixel values to [0, 1], matching the training pipeline
    img = img / 255
    # mean = [0.31169346, 0.25506335, 0.12432463]
    # std = [0.34042713, 0.29819837, 0.1375536]
    # img[0] = (img[0] - mean[0]) / std[0]
    # img[1] = (img[1] - mean[1]) / std[1]
    # img[2] = (img[2] - mean[2]) / std[2]
    return img
def infer_img(path, model_file_path, use_gpu):
    '''
    Run inference with the saved model
    '''
    paddle.set_device('gpu:0') if use_gpu else paddle.set_device('cpu')
    model = paddle.jit.load(model_file_path)
    model.eval()  # switch to evaluation mode

    # Preprocess the image to be predicted
    infer_imgs = []
    infer_imgs.append(load_image(path))
    infer_imgs = np.array(infer_imgs)
    # Grade labels from the dataset (Japanese): premium / good / for processing / off-spec
    label_list = ['0:優良', '1:良', '2:加工品', '3:規格外']

    for i in range(len(infer_imgs)):
        data = infer_imgs[i]
        dy_x_data = np.array(data).astype('float32')
        dy_x_data = dy_x_data[np.newaxis, :, :, :]
        img = paddle.to_tensor(dy_x_data)
        out = model(img)
        print(out[0])
        print(paddle.nn.functional.softmax(out)[0])  # drop this line if the model already ends in softmax
        lab = np.argmax(out.numpy())  # argmax: index of the highest score
        print("Sample {} predicted as: {}".format(path, label_list[lab]))
    print("*********************************************")
image_path = []
for root, dirs, files in os.walk('work/'):
    # Collect the images under work/
    for f in files:
        image_path.append(os.path.join(root, f))

for i in range(len(image_path)):
    infer_img(path=image_path[i], use_gpu=True, model_file_path="Hapi_MyCNN")
    # time.sleep(0.5)  # avoid interleaved output
    break
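The loop above only demonstrates single-image prediction. As a hedged sketch of the last step in the workflow (writing a submission file with pandas), with the file layout and column names assumed from the dataset rather than confirmed by the competition:

# Sketch: predict every test image and write a submission CSV (assumed format).
# Reuses load_image from above (which also displays each image via plt.show)
# and the pandas import from the top of the notebook.
test_dir = 'data/data71799/lemon_lesson/test_images/'
model = paddle.jit.load("Hapi_MyCNN")
model.eval()

ids, preds = [], []
for name in sorted(os.listdir(test_dir)):
    img = load_image(os.path.join(test_dir, name))
    out = model(paddle.to_tensor(img[np.newaxis]))
    ids.append(name)
    preds.append(int(np.argmax(out.numpy())))

pd.DataFrame({'id': ids, 'class_num': preds}).to_csv('submission.csv', index=False)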
Tips for choosing a baseline
- Model: a low-complexity model lets you iterate fast.
- Optimizer: Adam is recommended, or SGD.
- Loss function: cross entropy for multi-class classification.
- Metric: follow the competition's own evaluation metric.
- Data augmentation: it can be empty at first, or just a single HorizontalFlip (see the sketch after this list).
- Image resolution: start small, e.g. 224*224.
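What "just a HorizontalFlip" looks like in the training transforms used earlier (a sketch; the flip probability is Paddle's default, and the validation pipeline should stay deterministic):

# Training-time transforms with a single horizontal-flip augmentation (sketch).
train_transforms = T.Compose([
    T.Resize(size=(32, 32)),
    T.RandomHorizontalFlip(),   # flips each image with probability 0.5 by default
    T.Transpose(),              # HWC -> CHW
    T.Normalize(mean=[0, 0, 0], std=[255, 255, 255], to_rgb=True),
])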
How to get better at building baselines
- A robust baseline is a good starting point, and a good start is half the battle.
- Read the open source code of top solutions; keep the essence, discard the dross.
- Accumulate experience, practice a lot, imitate others, and end up with a style of your own.