Generative AI Research Spotlight: Demystifying Diffusion-Based Models
Introduction: From "Noise" to "Masterpiece"
Diffusion models have reshaped the generative-AI landscape over the past three years: Stable Diffusion, DALL·E 2, Imagen, Sora... behind these household names lies a single unifying mathematical framework, a Markov chain of gradual denoising. This article walks through one complete, runnable code path for training a 64×64 image diffusion model while unpacking the theory behind it. By the end you will:
- understand the mathematical derivation of the forward and reverse processes;
- master the core implementations of DDPM, DDIM, and Classifier-Free Guidance;
- have reproducible PyTorch source code (converges in ~30 min on a single A100).
1. Diffusion Model Fundamentals: Forward Process, Reverse Process, and the ELBO
1.1 Forward Process (Adding Noise)
Given a data distribution $x_0 \sim q(x)$, the forward process is defined as:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr)
$$
Via reparameterization, $x_t$ can be sampled in a single step:

$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)
$$
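To make the one-step sampling concrete, here is a minimal sketch; the linear β schedule and tensor shapes below are illustrative choices, not fixed by the derivation:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule, for illustration only
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{\alpha}_t

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in one step via reparameterization."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 64, 64)                   # stand-in batch
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```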
1.2 Reverse Process (Denoising)
The reverse process learns a parameterized Gaussian transition:

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\bigl(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\bigr)
$$
DDPM fixes $\Sigma$ to a constant and learns only $\mu_\theta$, which is equivalent to predicting the noise $\epsilon_\theta(x_t, t)$.
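Concretely, the learned mean is tied to the predicted noise by the standard DDPM parameterization:

$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \qquad \alpha_t = 1-\beta_t
$$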
1.3 Training Objective
The variational lower bound (ELBO) simplifies to:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\epsilon,t}\bigl[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\bigr]
$$
2. Hands-On: Complete Code for a 64×64 Image Diffusion Model
2.1 Environment and Data
```bash
pip install torch==2.3.0 torchvision==0.18.0 diffusers==0.29.0 accelerate
```
We use CIFAR-10 (50,000 images at 32×32) and upsample to 64×64:
```python
from torchvision.datasets import CIFAR10
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(64),                      # upsample 32x32 -> 64x64
    transforms.ToTensor(),
    transforms.Normalize((0.5,)*3, (0.5,)*3),   # scale pixels to [-1, 1]
])
dataset = CIFAR10(root='./data', download=True, transform=transform)
```
2.2 Network Architecture: U-Net with Attention
```python
import torch.nn as nn
from diffusers import UNet2DModel

unet = UNet2DModel(
    sample_size=64,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,                          # ResNet blocks per resolution level
    block_out_channels=(128, 256, 512, 1024),
    attention_head_dim=8,
    norm_num_groups=32,
).cuda()                                         # move the model to GPU before training
```
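A quick smoke test (an illustrative check, not part of the training script) confirms the parameter count and that the output matches the input resolution:

```python
import torch

n_params = sum(p.numel() for p in unet.parameters())
print(f'U-Net parameters: {n_params / 1e6:.1f}M')

with torch.no_grad():
    out = unet(torch.randn(1, 3, 64, 64, device='cuda'), 10).sample
print(out.shape)  # torch.Size([1, 3, 64, 64]): predicted noise at input resolution
```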
2.3 DDPM Scheduler and Training Loop
```python
from diffusers import DDPMScheduler
from torch.utils.data import DataLoader
import torch

# 'squaredcos_cap_v2' is diffusers' cosine schedule ('cosine' is not a valid option)
noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule='squaredcos_cap_v2')
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
optimizer = torch.optim.AdamW(unet.parameters(), lr=4e-4)

for epoch in range(100):
    for x0, _ in loader:
        x0 = x0.cuda()
        noise = torch.randn_like(x0)
        timesteps = torch.randint(0, 1000, (x0.size(0),), device=x0.device).long()
        xt = noise_scheduler.add_noise(x0, noise, timesteps)  # closed-form forward sampling
        pred_noise = unet(xt, timesteps).sample
        loss = nn.functional.mse_loss(pred_noise, noise)      # L_simple
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch} | Loss {loss.item():.4f}')
```
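After training, the weights can be persisted with diffusers' standard save/load methods; `ddpm-cifar64` below is an illustrative path:

```python
# Persist the trained weights; reload later with UNet2DModel.from_pretrained(...)
unet.save_pretrained('ddpm-cifar64')
```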
2.4 Sampling and Visualization
```python
from diffusers import DDPMPipeline
import matplotlib.pyplot as plt

pipeline = DDPMPipeline(unet=unet, scheduler=noise_scheduler).to('cuda')
images = pipeline(batch_size=8, num_inference_steps=50).images

fig, ax = plt.subplots(2, 4, figsize=(8, 4))
for i, img in enumerate(images):
    ax[i // 4, i % 4].imshow(img)
    ax[i // 4, i % 4].axis('off')
plt.show()
```
3. Going Deeper: DDIM and Deterministic Sampling
DDIM replaces the Markov chain with a non-Markovian process, compressing sampling from 1000 steps to 20 or fewer. Its core update rule[^1]:

$$
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta + \sigma_t z
$$

When $\sigma_t = 0$, sampling is fully deterministic.
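In practice $\sigma_t$ is controlled by a parameter $\eta$ that interpolates between deterministic DDIM ($\eta = 0$) and ancestral DDPM ($\eta = 1$) sampling:

$$
\sigma_t = \eta\,\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}}\,\sqrt{1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}
$$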
3.1 Code: 20-Step DDIM Sampling
```python
from diffusers import DDIMScheduler

# Reuse the trained U-Net; only the scheduler changes
ddim = DDIMScheduler.from_config(noise_scheduler.config)
pipeline_ddim = DDPMPipeline(unet=unet, scheduler=ddim).to('cuda')
images = pipeline_ddim(batch_size=8, num_inference_steps=20).images
```
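Alternatively, diffusers ships a dedicated `DDIMPipeline` whose `eta` argument exposes $\sigma_t$ directly; a brief sketch:

```python
from diffusers import DDIMPipeline

pipeline_ddim = DDIMPipeline(unet=unet, scheduler=ddim).to('cuda')
images = pipeline_ddim(batch_size=8, num_inference_steps=20, eta=0.0).images  # eta=0: deterministic
```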
4. Advanced: Classifier-Free Guidance for Tighter Conditional Alignment
Classifier-Free Guidance (CFG) jointly trains a conditional and an unconditional model, then combines the two noise predictions linearly at sampling time:

$$
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + s\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset)\bigr)
$$
4.1 Adapting the U-Net to Support Labels
```python
# num_class_embeds=11: CIFAR-10's 10 classes plus one extra index for the null
# (unconditional) label. With the default class_embed_type, the model learns an
# nn.Embedding over these integer indices.
unet_cfg = UNet2DModel(
    sample_size=64, in_channels=3, out_channels=3,
    num_class_embeds=11,
    block_out_channels=(128, 256, 512, 1024),
).cuda()
```
During training, randomly drop 10% of the labels:
```python
# `labels` comes from the loader: for x0, labels in loader: ...
labels = labels.cuda()
cond_drop = torch.rand(x0.size(0), device=x0.device) < 0.1
labels = torch.where(cond_drop, torch.full_like(labels, 10), labels)  # 10 = the null label
pred_noise = unet_cfg(xt, timesteps, class_labels=labels).sample
```
At sampling time, set the guidance scale $s = 7.5$:
```python
from diffusers import DDIMScheduler
scheduler = DDIMScheduler(num_train_timesteps=1000)

from pipeline_cfg import CFGPipeline  # custom pipeline; a sketch of its core loop follows below
pipe = CFGPipeline(unet=unet_cfg, scheduler=scheduler)
img = pipe('cat', num_inference_steps=20, guidance_scale=7.5).images[0]
```
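`CFGPipeline` is a custom component left undefined here; a minimal sketch of what its core denoising loop might look like (the function name, label-to-index handling, and fixed batch size are assumptions, not the article's implementation):

```python
import torch

@torch.no_grad()
def cfg_sample(unet, scheduler, label, guidance_scale=7.5, steps=20, null_id=10):
    """Minimal classifier-free-guidance sampling loop (illustrative sketch)."""
    scheduler.set_timesteps(steps)
    x = torch.randn(1, 3, 64, 64, device='cuda')
    cond = torch.tensor([label], device='cuda')
    uncond = torch.tensor([null_id], device='cuda')
    for t in scheduler.timesteps:
        eps_c = unet(x, t, class_labels=cond).sample    # conditional prediction
        eps_u = unet(x, t, class_labels=uncond).sample  # unconditional prediction
        eps = eps_u + guidance_scale * (eps_c - eps_u)  # linear extrapolation
        x = scheduler.step(eps, t, x).prev_sample
    return x
```

For CIFAR-10, 'cat' corresponds to class index 3, so a call might look like `cfg_sample(unet_cfg, scheduler, label=3)`.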
5. Performance and Scalability: Latent Diffusion & DiT
- Latent Diffusion (LDM): a VAE first compresses 256×256×3 images into 32×32×4 latents, shrinking the spatial grid by 64×; Stable Diffusion follows exactly this design[^2].
- Diffusion Transformer (DiT): replaces the U-Net with a ViT; DiT-XL/2 reaches FID 2.27 on ImageNet with only 675M parameters[^3].
5.1 Latent-Space Training Pseudocode
```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained('stabilityai/sd-vae-ft-mse').cuda()
with torch.no_grad():
    z0 = vae.encode(x0).latent_dist.sample() * 0.18215  # SD latent scaling factor
# Train the U-Net in latent space on z0 instead of x0
```
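To inspect results, the latents are decoded back through the same VAE (a sketch mirroring the encode step, with the scaling factor divided out):

```python
# Decode latents back to pixel space
with torch.no_grad():
    x_rec = vae.decode(z0 / 0.18215).sample
```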
Conclusion: The Next Frontier for Diffusion Models
Diffusion models are evolving along three axes: higher resolution (4K/8K), longer video (≥1 min), and multimodal control (pose/depth/audio). With the advent of **Flow Matching**[^4] and **Rectified Flow Distillation**, sampling can already be compressed to 1–4 steps, putting real-time generation within reach.