The First Level of Understanding the EM Algorithm: Expectation E and Maximization M (Part 2)

人工智障研究员, published 2022/06/20 23:04:56
[Abstract] This article presents the second part of the first level of understanding the EM algorithm: a formal understanding of the expectation and maximization steps. Using the classic three-coin problem as an example, it derives the form and content of the EM algorithm mathematically.

Preface: I first came across the EM algorithm as a student just getting into machine learning, and even then I felt there had to be a body of mathematical logic underpinning its validity. Recently I happened to read Shi Bo's "The Nine Levels of Understanding the EM Algorithm" on Zhihu [1], and immediately realized how limited my own view had been. After a period of re-study and reflection, I now understand the EM algorithm more deeply. Here I follow the path of levels laid out there and see how far I can explore.

Previous post: The First Level of Understanding the EM Algorithm: Expectation E and Maximization M (Part 1)

4. Understanding the Form of the EM Algorithm via Mathematical Derivation

Next, we work through a rigorous mathematical derivation to understand what the EM formula means. Continuing from the EM update given earlier:

$$\theta^{(t+1)}=\argmax_{\theta}Q(\theta|\theta^{(t)})=\argmax_{\theta}E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)]$$

Let us derive it step by step.
First, the latent variable is $Z=[z_1, z_2, ..., z_n]$ with $z_i\in\{0,1\}$: component $z_i=1$ means round $i$ used coin $A$, and $z_i=0$ means round $i$ used coin $B$, with $n$ rounds of tosses in total. The sample space of $Z$ is therefore $\{0, 1\}^n$, a discrete space of $2^n$ points. Thus we have:

$$E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)]=\sum_{Z \in \{0, 1\}^n}[P(Z|X, \theta^{(t)}) * \log L(\theta;X,Z)]$$

Here the latent variables $z_i$ are independent and identically distributed, so:

$$P(Z|X, \theta^{(t)}) = \prod_{i=1}^{n}P(z_i|x_i, \theta^{(t)})$$

For each round $i$, given the observed count of heads $x_i \in \{0, 1, \dots, \delta\}$ and the current coin parameters $\theta^{(t)}$ (i.e., $\theta_A^{(t)}$, $\theta_B^{(t)}$, $\theta_C^{(t)}$), we have:

$$\begin{aligned} P(z_i|x_i, \theta^{(t)}) & = \frac{P(z_i, x_i | \theta^{(t)})}{P(x_i| \theta^{(t)})} \\ & = \frac{P(z_i, x_i|\theta^{(t)})}{\sum_{z_i\in\{0,1\}} P(z_i, x_i| \theta^{(t)})} \end{aligned}$$

Given $\theta^{(t)}$, the joint distribution of $(z_i, x_i)$ can be written down directly. $P(z_i=1, x_i | \theta^{(t)})$ is the probability that round $i$ used coin $A$ and produced $x_i$ heads:

$$\begin{aligned} P(z_i=1, x_i | \theta^{(t)}) & = P(z_i=1 | \theta^{(t)}) * P(x_i | z_i=1, \theta^{(t)})\\ & = P(z_i=1 | \theta^{(t)})* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} \\ & = \theta_C^{(t)} * (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} \end{aligned}$$

$P(z_i=0, x_i | \theta^{(t)})$ is the probability that round $i$ used coin $B$ and produced $x_i$ heads:

$$\begin{aligned} P(z_i=0, x_i| \theta^{(t)}) & = P(z_i=0|\theta^{(t)}) * P(x_i|z_i=0, \theta^{(t)})\\ & = P(z_i=0| \theta^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i} \\ & = (1-\theta_C^{(t)}) * (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i} \end{aligned}$$
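As a quick numerical sanity check, the two joint probabilities can be computed directly. This is a minimal sketch; the function name and the parameter values in the comments are mine, not from the original. Following the text's convention, the binomial coefficient is omitted (it cancels in the posterior anyway):

```python
def joint(z, x, delta, theta_A, theta_B, theta_C):
    """P(z_i=z, x_i=x | theta) for the three-coin model.

    delta flips per round, x heads observed; theta_A/theta_B are the
    heads probabilities of coins A and B, and theta_C is the probability
    that coin C selects coin A (i.e. z_i = 1).
    """
    if z == 1:  # round used coin A
        return theta_C * theta_A**x * (1 - theta_A)**(delta - x)
    else:       # round used coin B
        return (1 - theta_C) * theta_B**x * (1 - theta_B)**(delta - x)
```

The two branches are exactly the two displayed formulas; summing them over $z \in \{0,1\}$ gives the marginal $P(x_i|\theta^{(t)})$ used next.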

Next, given $x_i$ and $\theta^{(t)}$, the probability that round $i$ used coin A, $P(z_i=1|x_i, \theta^{(t)})$, and the probability that it used coin B, $P(z_i=0|x_i, \theta^{(t)})$, are:

$$\begin{aligned} P(z_i=1|x_i, \theta^{(t)}) & = \frac{P(z_i=1, x_i | \theta^{(t)})}{P(z_i=1, x_i | \theta^{(t)}) + P(z_i=0, x_i | \theta^{(t)})} \\ & = \frac{\theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}}{\theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}+(1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}} \end{aligned}$$

$$\begin{aligned} P(z_i=0|x_i, \theta^{(t)}) & = \frac{P(z_i=0, x_i|\theta^{(t)})}{P(z_i=1, x_i | \theta^{(t)}) + P(z_i=0, x_i | \theta^{(t)})} \\ & = \frac{(1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}}{\theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}+(1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}} \end{aligned}$$

Both expressions share the same denominator, which we denote

$$\eta_i=\theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}+(1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}$$

Then:

$$P(z_i=1|x_i, \theta^{(t)})=\frac{1}{\eta_i}\, \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}$$

$$P(z_i=0|x_i, \theta^{(t)})=\frac{1}{\eta_i}\, (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}$$
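This posterior is the core of the E-step. A minimal sketch (function and variable names are mine), following the Bayes computation above where the $z_i=1$ branch carries the coin-A term $\theta_C^{(t)}(\theta_A^{(t)})^{x_i}(1-\theta_A^{(t)})^{\delta-x_i}$:

```python
def e_step_round(x, delta, theta_A, theta_B, theta_C):
    """Return (P(z_i=1 | x_i, theta^(t)), P(z_i=0 | x_i, theta^(t)))
    for one round with x heads out of delta flips."""
    a = theta_C * theta_A**x * (1 - theta_A)**(delta - x)        # coin-A branch
    b = (1 - theta_C) * theta_B**x * (1 - theta_B)**(delta - x)  # coin-B branch
    eta = a + b   # the shared denominator eta_i from the text
    return a / eta, b / eta
```

The two returned weights always sum to 1, matching the probability-sum identity used later in the simplification.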

We have now expressed the distribution of $(Z|X, \theta^{(t)})$. The next step is the log-likelihood of the complete data, $\log L(\theta;X,Z)$.
For the likelihood of a single round we have:

$$\begin{aligned} L(\theta; x_i, z_i) & = P(x_i|z_i, \theta) * P(z_i|\theta) * P(\theta) \\ &= (z_i*\theta_A+(1-z_i)\theta_B)^{x_i}*(1-z_i*\theta_A-(1-z_i)\theta_B)^{(\delta-x_i)} * \theta_C^{z_i} * (1-\theta_C)^{1-z_i} * 1 \end{aligned}$$

Likewise, since the rounds are independent and identically distributed, we have:

$$\begin{aligned} L(\theta;X,Z) & =\prod_{i=1}^{n} L(\theta; x_i, z_i)\\ &=\prod_{i=1}^{n}(z_i*\theta_A+(1-z_i)\theta_B)^{x_i}*(1-z_i*\theta_A-(1-z_i)\theta_B)^{(\delta-x_i)}*\theta_C^{z_i} * (1-\theta_C)^{1-z_i} \end{aligned}$$

Taking the $\log$, we get:

$$\begin{aligned} \log L(\theta;X,Z) &= \log \prod_{i=1}^{n}(z_i*\theta_A+(1-z_i)\theta_B)^{x_i}*(1-z_i*\theta_A-(1-z_i)\theta_B)^{(\delta-x_i)}*\theta_C^{z_i} * (1-\theta_C)^{1-z_i}\\ &= \sum_{i=1}^{n}[x_i*\log(z_i*\theta_A+(1-z_i)\theta_B) + (\delta-x_i)*\log(1-z_i*\theta_A-(1-z_i)\theta_B)+z_i*\log \theta_C+(1-z_i)*\log (1-\theta_C)] \end{aligned}$$
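The complete-data log-likelihood above can be evaluated directly once $Z$ is fixed. A minimal sketch (names are mine); since each $z_i \in \{0,1\}$, the mixture term $z_i\theta_A + (1-z_i)\theta_B$ simply selects the coin used in round $i$:

```python
from math import log

def complete_loglik(X, Z, delta, theta_A, theta_B, theta_C):
    """log L(theta; X, Z): complete-data log-likelihood over n rounds,
    with z_i in {0, 1} selecting coin A or coin B."""
    total = 0.0
    for x, z in zip(X, Z):
        p = z * theta_A + (1 - z) * theta_B            # heads prob. of the coin used
        total += x * log(p) + (delta - x) * log(1 - p) # flip outcomes term
        total += z * log(theta_C) + (1 - z) * log(1 - theta_C)  # coin-choice term
    return total
```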

OK, at this point both $\log L(\theta;X,Z)$ and $P(Z|X, \theta^{(t)})$ are in explicit form; what remains is to take the expectation:

$$\begin{aligned} E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)] & =\sum_{Z \in \{0, 1\}^n}[P(Z|X, \theta^{(t)}) * \log L(\theta;X,Z)] \\ &=\sum_{Z \in \{0, 1\}^n} \prod_{i=1}^{n} P(z_i|x_i, \theta^{(t)})*\sum_{j=1}^{n}l(x_j, z_j, \theta) \end{aligned}$$

Note that to avoid clashing index variables, the summation index $i$ inside $\log L(\theta;X,Z)$ has been renamed to $j$. Here

$$l(x_j, z_j, \theta) = x_j*\log(z_j*\theta_A+(1-z_j)\theta_B) + (\delta-x_j)*\log(1-z_j*\theta_A-(1-z_j)\theta_B)+z_j*\log \theta_C+(1-z_j)*\log (1-\theta_C)$$

$$P(z_i=1|x_i, \theta^{(t)})=\frac{1}{\eta_i}\, \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}$$

$$P(z_i=0|x_i, \theta^{(t)})=\frac{1}{\eta_i}\, (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}$$

The expression for $E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)]$ looks intimidating. Using the properties of expectations of independent variables, it collapses to a simple form immediately; here, instead of invoking those properties, we simplify it directly with some algebraic manipulation.

Split the space $Z \in \{0, 1\}^n$: write $z_1$ and $(z_2, z_3, ..., z_n)$ as $z_1$ and $Z_2^n$ respectively, and partition the sum into the $z_1=1$ and $z_1=0$ parts. The expression then decomposes as:

$$\begin{aligned} E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)] &= \sum_{z_1=1,\, Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=1}^{n} P(z_i|x_i, \theta^{(t)})*\sum_{j=1}^{n}l(x_j, z_j, \theta) \\ &+ \sum_{z_1=0,\, Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=1}^{n} P(z_i|x_i, \theta^{(t)})*\sum_{j=1}^{n}l(x_j, z_j, \theta) \\ &= \sum_{z_1=1,\, Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)})*P(z_1=1|x_1, \theta^{(t)})*[l(x_1, z_1=1, \theta)+\sum_{j=2}^{n}l(x_j, z_j, \theta)] \\ &+ \sum_{z_1=0,\, Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)})*P(z_1=0|x_1, \theta^{(t)})*[l(x_1, z_1=0, \theta)+\sum_{j=2}^{n}l(x_j, z_j, \theta)] \\ &= P(z_1=1|x_1, \theta^{(t)}) * l(x_1, z_1=1, \theta) * \sum_{Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)}) \\ &+ P(z_1=1|x_1, \theta^{(t)}) * \sum_{Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)}) * \sum_{j=2}^{n}l(x_j, z_j, \theta)\\ &+ P(z_1=0|x_1, \theta^{(t)}) * l(x_1, z_1=0, \theta) * \sum_{Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)}) \\ &+ P(z_1=0|x_1, \theta^{(t)}) * \sum_{Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)}) * \sum_{j=2}^{n}l(x_j, z_j, \theta) \end{aligned}$$

Since a probability distribution sums to 1, we have:

$$P(z_1=1|x_1, \theta^{(t)})+P(z_1=0|x_1, \theta^{(t)})=1$$

$$\sum_{Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)})=\sum_{Z_2^n \in \{0, 1\}^{n-1}}P(Z_2^n|X_2^n, \theta^{(t)})=1$$

Therefore:

$$\begin{aligned} E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)] &= P(z_1=1|x_1, \theta^{(t)}) * l(x_1, z_1=1, \theta) \\ &+ P(z_1=0|x_1, \theta^{(t)}) * l(x_1, z_1=0, \theta) \\ &+ \sum_{Z_2^n \in \{0, 1\}^{n-1}} \prod_{i=2}^{n} P(z_i|x_i, \theta^{(t)}) * \sum_{j=2}^{n}l(x_j, z_j,\theta)\\ &= P(z_1=1|x_1, \theta^{(t)}) * l(x_1, z_1=1, \theta) \\ &+ P(z_1=0|x_1, \theta^{(t)}) * l(x_1, z_1=0, \theta) \\ &+ P(z_2=1|x_2, \theta^{(t)}) * l(x_2, z_2=1, \theta) \\ &+ P(z_2=0|x_2, \theta^{(t)}) * l(x_2, z_2=0, \theta) \\ &+ \sum_{Z_3^n \in \{0, 1\}^{n-2}} \prod_{i=3}^{n} P(z_i|x_i, \theta^{(t)}) * \sum_{j=3}^{n}l(x_j, z_j, \theta)\\ &= \cdots \\ &= \sum_{i=1}^n P(z_i=1|x_i, \theta^{(t)}) * l(x_i, z_i=1, \theta) + \sum_{i=1}^n P(z_i=0|x_i, \theta^{(t)}) * l(x_i, z_i=0, \theta) \end{aligned}$$
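The peeling-off simplification above can be verified numerically: for small $n$, a brute-force sum over all $2^n$ assignments of $Z$ agrees with the per-round form. A sketch with hypothetical data (all names and values are mine):

```python
from itertools import product
from math import log

def post(z, x, delta, tA, tB, tC):
    """Posterior P(z_i=z | x_i=x, theta^(t))."""
    a = tC * tA**x * (1 - tA)**(delta - x)
    b = (1 - tC) * tB**x * (1 - tB)**(delta - x)
    return (a if z == 1 else b) / (a + b)

def l_term(x, z, delta, tA, tB, tC):
    """Per-round complete-data log-likelihood l(x_j, z_j, theta)."""
    p = z * tA + (1 - z) * tB
    return (x * log(p) + (delta - x) * log(1 - p)
            + z * log(tC) + (1 - z) * log(1 - tC))

def q_brute(X, delta, th, th_t):
    """Sum over all Z in {0,1}^n of prod_i P(z_i|x_i) * sum_j l(x_j, z_j)."""
    total = 0.0
    for Z in product([0, 1], repeat=len(X)):
        w = 1.0
        for x, z in zip(X, Z):
            w *= post(z, x, delta, *th_t)
        total += w * sum(l_term(x, z, delta, *th) for x, z in zip(X, Z))
    return total

def q_simplified(X, delta, th, th_t):
    """The per-round form after the simplification."""
    return sum(post(1, x, delta, *th_t) * l_term(x, 1, delta, *th)
               + post(0, x, delta, *th_t) * l_term(x, 0, delta, *th)
               for x in X)
```

For, say, $n=3$ rounds the two functions agree to floating-point precision, which is exactly the identity the derivation establishes.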

We finally obtain the expression to be maximized:

$$E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)] = \sum_{i=1}^n P(z_i=1|x_i, \theta^{(t)}) * l(x_i, z_i=1, \theta) + \sum_{i=1}^n P(z_i=0|x_i, \theta^{(t)}) * l(x_i, z_i=0, \theta)$$

where

$$l(x_j, z_j, \theta) = x_j*\log(z_j*\theta_A+(1-z_j)\theta_B) + (\delta-x_j)*\log(1-z_j*\theta_A-(1-z_j)\theta_B)+z_j*\log \theta_C+(1-z_j)*\log (1-\theta_C)$$

$$P(z_i=1|x_i, \theta^{(t)})=\frac{1}{\eta_i}\, \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}$$

$$P(z_i=0|x_i, \theta^{(t)})=\frac{1}{\eta_i}\, (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}$$

Substituting these in, we get:

$$\begin{aligned} E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)] & = \sum_{i=1}^n P(z_i=1|x_i, \theta^{(t)}) * l(x_i, z_i=1, \theta) + \sum_{i=1}^n P(z_i=0|x_i, \theta^{(t)}) * l(x_i, z_i=0, \theta) \\ &= \sum_{i=1}^n \frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} * [x_i*\log \theta_A + (\delta-x_i)*\log(1-\theta_A)+\log \theta_C] \\ &+\sum_{i=1}^n \frac{1}{\eta_i} (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i} * [x_i*\log \theta_B + (\delta-x_i)*\log(1-\theta_B)+\log (1-\theta_C)] \end{aligned}$$

where

$$\eta_i=\theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}+(1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}$$

In this expression, all latent variables $Z$ have been summed out, so we can now carry out the maximization. We need to solve

$$\begin{aligned} \theta^{(t+1)}&=\argmax_{\theta}Q(\theta|\theta^{(t)})=\argmax_{\theta}E_{Z|X,\theta^{(t)}}[\log L(\theta;X,Z)]\\ &=\argmax_{\theta} \sum_{i=1}^n \frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} * [x_i*\log \theta_A + (\delta-x_i)*\log(1-\theta_A)+\log \theta_C] \\ &+\sum_{i=1}^n \frac{1}{\eta_i} (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i} * [x_i*\log \theta_B + (\delta-x_i)*\log(1-\theta_B)+\log (1-\theta_C)] \end{aligned}$$

Here $(\theta_A, \theta_B, \theta_C)$ are the variables to solve for, while $(\theta_A^{(t)}, \theta_B^{(t)}, \theta_C^{(t)})$ and $\eta_i$ are all constants. We solve the system of partial-derivative equations:

$$\left\{\begin{aligned} \frac{\partial Q}{\partial \theta_A} &= 0 \\ \frac{\partial Q}{\partial \theta_B} &= 0 \\ \frac{\partial Q}{\partial \theta_C} &= 0 \end{aligned}\right.$$

Taking them one at a time:

$$\begin{aligned} \frac{\partial Q}{\partial \theta_A} &= \sum_{i=1}^n \frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} * [x_i*\frac{1}{\theta_A} - (\delta-x_i)*\frac{1}{1-\theta_A}] \\ &=\sum_{i=1}^n \frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} * \frac{x_i-\delta*\theta_A}{\theta_A*(1-\theta_A)} \end{aligned}$$

Setting $\frac{\partial Q}{\partial \theta_A}=0$, we obtain:

$$\theta_A^{(t+1)} = \frac{\sum_{i=1}^{n}\frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}*x_i}{\sum_{i=1}^{n}\frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}*\delta}$$

Similarly:

$$\theta_B^{(t+1)} = \frac{\sum_{i=1}^{n}\frac{1}{\eta_i} (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}*x_i}{\sum_{i=1}^{n}\frac{1}{\eta_i} (1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}*\delta}$$

$$\begin{aligned} \theta_C^{(t+1)} &= \frac{\sum_{i=1}^{n}\frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}}{\sum_{i=1}^{n}\frac{1}{\eta_i} [\theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i}+(1-\theta_C^{(t)})* (\theta_B^{(t)})^{x_i}*(1-\theta_B^{(t)})^{\delta - x_i}]} \\ &=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\eta_i} \theta_C^{(t)}* (\theta_A^{(t)})^{x_i}*(1-\theta_A^{(t)})^{\delta - x_i} \end{aligned}$$
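The three closed-form updates, together with the E-step posteriors, make up one full EM iteration. A minimal sketch (function names, variable names, and any test data are mine, not from the original):

```python
def em_step(X, delta, theta_A, theta_B, theta_C):
    """One EM iteration for the three-coin problem:
    E-step posteriors followed by the closed-form M-step updates."""
    num_A = den_A = 0.0   # posterior-weighted heads / flips for coin A
    num_B = den_B = 0.0   # posterior-weighted heads / flips for coin B
    sum_w1 = 0.0          # sum of P(z_i=1 | x_i, theta^(t))
    for x in X:
        a = theta_C * theta_A**x * (1 - theta_A)**(delta - x)
        b = (1 - theta_C) * theta_B**x * (1 - theta_B)**(delta - x)
        w1 = a / (a + b)   # posterior that this round used coin A
        w0 = 1.0 - w1      # posterior that this round used coin B
        num_A += w1 * x
        den_A += w1 * delta
        num_B += w0 * x
        den_B += w0 * delta
        sum_w1 += w1
    # theta_A, theta_B: posterior-weighted heads ratios;
    # theta_C: average posterior weight of coin A, i.e. (1/n) * sum_i w1_i.
    return num_A / den_A, num_B / den_B, sum_w1 / len(X)
```

Iterating `em_step` from an initial guess implements the full algorithm; each update has exactly the shape of the three formulas derived above.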

Comparing with the intuitive results from earlier, the forms are very close:

$$\hat\theta_A^{(t+1)} = \frac{\sum_{i=1}^{n}x_i * P(z_i=1|*)}{\sum_{i=1}^{n}\delta * P(z_i=1|*)}$$

$$\hat\theta_B^{(t+1)} = \frac{\sum_{i=1}^{n}x_i * P(z_i=0|*)}{\sum_{i=1}^{n}\delta * P(z_i=0|*)}$$

$$\hat\theta_C^{(t+1)} = \frac{\sum_{i=1}^{n} P(z_i=1|*)}{\sum_{i=1}^{n} [P(z_i=0|*)+P(z_i=1|*)]}=\frac{1}{n}\sum_{i=1}^{n} P(z_i=1|*)$$

Closing Remarks

This concludes the first level of understanding the EM algorithm: through both intuition and derivation we have understood, at a formal level, what EM is, answering the question of What. Next we need to answer the question of Why, i.e., the necessity and feasibility of the EM algorithm, which is the subject of the second level:

  • Necessity: the original problem cannot be solved directly, or is very hard to solve.
  • Feasibility: EM's iterations "can" reach the theoretical optimum.

References

[1] 史博. EM算法理解的九层境界. Zhihu. https://www.zhihu.com/question/40797593/answer/275171156
[2] Do C B, Batzoglou S. What is the expectation maximization algorithm?[J]. Nature biotechnology, 2008, 26(8): 897-899.

[Copyright notice] This article is original content by a Huawei Cloud community user. When reprinting, the source (Huawei Cloud community), the article link, the author, and other basic information must be credited; otherwise the author and the community reserve the right to pursue liability. If you find suspected plagiarism in the community, please report it by email with supporting evidence to cloudbbs@huaweicloud.com; confirmed infringing content will be removed immediately.