EM for Gaussian Mixture Models (GMM): Principles, Formula Derivation, and Code

Author: 不正经 | 2024-05-19 20:59:53

This post derives how the EM algorithm estimates the parameters $\theta$ of a Gaussian mixture model.

1 Introduction

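The EM (Expectation-Maximization) algorithm computes a numerical solution by iteration when the likelihood equations have no closed-form (analytical) solution.

Core idea: EM is an iterative algorithm for maximum likelihood estimation of the parameters of probabilistic models that contain latent (hidden) variables. Before reading on, you should be familiar with the basics of Bayesian probability.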

2 Maximum likelihood estimation

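(1) Problem background
We want to survey the height distribution of all students at a school, so we randomly sample 100 male and 100 female students, 200 in total. (We assume the school has equal numbers of male and female students; otherwise we would sample in proportion to the gender ratio, but that is not the focus here.) We want to estimate the height distribution of the male students.
(2) Assumptions
① Male and female heights each follow a Gaussian (normal) distribution. (Non-Gaussian data would need a different model, which is outside the scope of this post.)
② Within each group, the heights are independent and identically distributed (i.i.d.).
(3) Model
We now have the data $X=\{x_1,x_2,...,x_n\}$, where $x_i$ is the height of the $i$-th male student and $n$ is the number of male students. The parameters to estimate are $\theta=\{\mu ,\sigma\}$.
If we draw one student A from all male students, the probability (density) that his height is $x_A$ can be written $p(x_A \mid \theta)$. Drawing 100 students from all male students, the probability of observing exactly the data set $X$ is: $L(\theta)=L\left(x_1, \cdots, x_n ; \theta\right)=\prod_{i=1}^n p\left(x_i ; \theta\right), \theta \in \Theta$
This expression gives the probability of obtaining the sample $X$ when the parameters of the probability density function are $\theta$. Only $\theta$ is unknown in it, so it is called the likelihood function $L(\theta)$ of the parameter $\theta$. Our goal is to find the $\theta$ that maximizes $L(\theta)$, that is, to find the Gaussian distribution under which randomly drawing 100 students is most likely to produce exactly these 100 male students. The maximum likelihood estimate of $\theta$ is written $\hat{\theta}=\arg\max_{\theta} L(\theta)$
For convenience we take the logarithm of $L(\theta)$:
$\ln L(\theta)=\log \prod_{i=1}^n p\left(x_i ; \theta\right)=\sum_{i=1}^n \log p\left(x_i ; \theta\right)$
Because $p$ is the Gaussian probability density function (PDF), taking the logarithm turns the exponent of the exp into an additive term. Differentiating and setting the derivatives to 0 gives the likelihood equations; solving them yields $\theta_{MLE}$, the optimal estimate of $\theta$. For the Gaussian case this gives $\hat{\mu}=\frac{1}{n}\sum_{i=1}^n x_i$ and $\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (x_i-\hat{\mu})^2$.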
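The following is a minimal sketch of the maximum likelihood estimate for a single 1-D Gaussian, using the standard closed-form solution of the likelihood equations (sample mean, and standard deviation with a divisor of $n$). The function names `gaussian_mle` and `gaussian_log_likelihood` and the synthetic height data are assumptions made for illustration only; they do not come from the original post.

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form maximum likelihood estimates (mu, sigma) for 1-D Gaussian data."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    # The MLE of the variance divides by n (not n - 1).
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
    return mu_hat, sigma_hat

def gaussian_log_likelihood(x, mu, sigma):
    """Log-likelihood: the sum over samples of log p(x_i; mu, sigma)."""
    x = np.asarray(x, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

# Hypothetical sample: 100 male heights in cm drawn from N(175, 6^2).
rng = np.random.default_rng(0)
heights = rng.normal(loc=175.0, scale=6.0, size=100)

mu_hat, sigma_hat = gaussian_mle(heights)
print("mu_hat =", mu_hat, "sigma_hat =", sigma_hat)
print("log-likelihood at the MLE:", gaussian_log_likelihood(heights, mu_hat, sigma_hat))
```

Moving either parameter away from the estimate should lower the log-likelihood, which is a quick way to sanity-check the result.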

3 Deriving the EM algorithm

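Consider a probabilistic model with parameters $\theta$ and a latent variable $Z$, sampled $n$ times. Let the observed values of the random variable $x$ be $X=\{x_1,x_2,...,x_n\}$, and let the latent variable $Z$ take $m$ possible values $Z=\{z_1,z_2,...,z_m\}$. (If $z$ is confusing, think of it for now as the probabilities of the different component models, which sum to 1. Section 4 on GMMs gives a more intuitive explanation.)
Write out its log-likelihood, which marginalizes over the latent variable:
$\log P(X \mid \theta)=\log \int_z P(X, z \mid \theta)\, dz$
The EM update is:
$\theta^{(t+1)} = \arg\max_{\theta} \int_z \log P(X, z | \theta) \cdot P(z|X, \theta^{(t)}) dz$

Here $\log P(X, z | \theta)$ is the log joint probability of the data $X$ and the latent variable $z$.
The integral $\int_z \log P(X, z | \theta) \cdot P(z|X, \theta^{(t)}) dz$ can be written as the expectation $\mathbb{E}_{z|X,\theta^{(t)}}[\log P(X, z|\theta)]$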

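Proof of convergence of the EM algorithm

Goal: show that the parameters produced by the next iteration are at least as good as the current ones, that is, the update $\theta^{(t)} \to \theta^{(t+1)}$ never decreases the log-likelihood:
$\log P(X | \theta^{(t+1)})\ge\log P(X | \theta^{(t)})\text{ }(1)$
The log-likelihood sequence is then monotonically non-decreasing, so it converges whenever it is bounded above. By the definition of conditional probability: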

$\begin{aligned} P(X \mid \theta) &= \frac{P(X, \theta)}{P(\theta)} \\ &= \frac{P(X, \theta)}{P(X, Z, \theta)} \cdot \frac{P(X, Z, \theta)}{P(\theta)} \\ &= \frac{P(X, Z \mid \theta)}{P(Z \mid X, \theta)} \end{aligned}$

Therefore:

$\log P(X | \theta)=\log\frac{ P(X,z | \theta)}{P(z | X,\theta)}=\log P(X,z | \theta)-\log P(z | X,\theta) \text{ }(2)$

Take the expectation of both sides of (2) with respect to $P(z|X, \theta^{(t)})$:

$\mathbb{E}_{z|X,\theta^{(t)}}[\log P(X|\theta)]=\mathbb{E}_{z|X,\theta^{(t)}}[\log P(X, z|\theta)-\log P(z |X,\theta)]$

$\begin{aligned} \text{LHS} &= \int_z P(z|X, \theta^{(t)}) \cdot \log P(X|\theta) dz \\ &= \log P(X|\theta) \int_z P(z|X, \theta^{(t)}) dz \\ &= \log P(X|\theta) \end{aligned}$

Note: $\int_z P(z|X, \theta^{(t)}) dz=1$. It can be read as the proportions of the different component models, which sum to 1.

$\begin{aligned} \text{RHS} &= \int_z P(z|X, \theta^{(t)}) \cdot \log P(X, z|\theta) dz - \int_z P(z|X, \theta^{(t)}) \cdot \log P(z|X, \theta) dz \end{aligned}$

Let

$Q(\theta, \theta^{(t)})= \int_z P(z|X, \theta^{(t)}) \cdot \log P(X, z|\theta) dz$

and

$H(\theta, \theta^{(t)})=\int_z P(z|X, \theta^{(t)}) \cdot \log P(z|X, \theta) dz$

Then:

$\log P(X|\theta)=Q(\theta, \theta^{(t)})-H(\theta, \theta^{(t)}) \text{ }(3)$

By (3), proving (1) amounts to proving:

$Q(\theta^{(t)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \leq Q(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t+1)}, \theta^{(t)})\text{ }(4)$

First, the $Q$ term:
$Q(\theta^{(t)}, \theta^{(t)}) = \int_z P(z|X, \theta^{(t)}) \cdot \log P(X, z|\theta^{(t)}) dz$
$Q(\theta^{(t+1)}, \theta^{(t)}) = \int_z P(z|X, \theta^{(t)}) \cdot \log P(X, z|\theta^{(t+1)}) dz$
By definition $\theta^{(t+1)} = \arg\max_{\theta} \int_z P(z|X, \theta^{(t)}) \cdot \log P(X, z|\theta) dz$, so $Q(\theta^{(t+1)}, \theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)})$.
Next, the $H$ term:
$\begin{aligned} H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) &= \int_z P(z|X, \theta^{(t)}) \cdot \log P(z|X, \theta^{(t+1)}) dz - \int_z P(z|X, \theta^{(t)}) \cdot \log P(z|X, \theta^{(t)}) dz \\ &= \int_z P(z|X, \theta^{(t)}) \cdot \left(\log P(z|X, \theta^{(t+1)}) - \log P(z|X, \theta^{(t)})\right) dz \\ &= \int_z P(z|X, \theta^{(t)}) \cdot \log \frac{P(z|X, \theta^{(t+1)})}{P(z|X, \theta^{(t)})} dz \end{aligned}$
By Jensen's inequality (https://zhuanlan.zhihu.com/p/39315786), a concave function $f$ such as $\log$ satisfies $\mathbb{E}[f(x)] \leq f(\mathbb{E}[x])$. Therefore:
$\begin{aligned} H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) &= \int_z P(z|X, \theta^{(t)}) \cdot \log \frac{P(z|X, \theta^{(t+1)})}{P(z|X, \theta^{(t)})} dz \\ &\leq \log \int_z P(z|X, \theta^{(t)}) \cdot \frac{P(z|X, \theta^{(t+1)})}{P(z|X, \theta^{(t)})} dz \\ &= \log \int_z P(z|X, \theta^{(t+1)}) dz \\ &= \log 1 \\ &= 0 \end{aligned}$
So $H(\theta^{(t+1)}, \theta^{(t)}) \leq H(\theta^{(t)}, \theta^{(t)})$. Combined with the inequality for $Q$, this proves (4) and therefore (1): an EM iteration never decreases the log-likelihood. The EM procedure is:
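(1) Set an initial value $\theta^{(0)}$. Different initial values can lead to different results. As the schematic figure from Li Hang's "Statistical Learning Methods" illustrates, if $\theta^{(t)}$ lies near the left peak, EM converges to the left local optimum and never finds the larger value on the right; it is trapped in a local optimum.
(2) Update $\theta^{(t)}$. This step performs two computations: taking the expectation (E-step) and maximizing it (M-step).
(3) Compare $\theta^{(t)}$ with $\theta^{(t+1)}$. If the change is below a chosen threshold, stop iterating; otherwise return to step (2).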
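The three steps above can be sketched for the simplest latent-variable case, a 1-D mixture of Gaussians. This is an illustrative sketch under stated assumptions, not the author's implementation: the function `em_1d_gmm`, the synthetic data, the initial values, and the tolerance are all hypothetical. The log-likelihood is recorded at every iteration so that inequality (1) can be checked numerically.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2); broadcasts over arrays."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_1d_gmm(x, mu, sigma, alpha, max_iters=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns parameters and the log-likelihood trace."""
    x = np.asarray(x, dtype=float)
    mu, sigma, alpha = (np.array(v, dtype=float) for v in (mu, sigma, alpha))
    trace = []
    for _ in range(max_iters):
        # E-step: posterior P(z = j | x_i, theta^(t)) for every sample and component.
        weighted = alpha * normal_pdf(x[:, None], mu, sigma)  # shape (n, m)
        trace.append(np.log(weighted.sum(axis=1)).sum())      # log P(X | theta^(t))
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizer of Q(theta, theta^(t)).
        nk = resp.sum(axis=0)
        new_alpha = nk / len(x)
        new_mu = (resp * x[:, None]).sum(axis=0) / nk
        new_sigma = np.sqrt((resp * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk)
        # Step (3): stop when the parameters have effectively stopped changing.
        change = max(np.abs(new_mu - mu).max(),
                     np.abs(new_sigma - sigma).max(),
                     np.abs(new_alpha - alpha).max())
        mu, sigma, alpha = new_mu, new_sigma, new_alpha
        if change < tol:
            break
    return mu, sigma, alpha, np.array(trace)

# Hypothetical data: two well-separated 1-D clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 300)])
mu, sigma, alpha, trace = em_1d_gmm(x, mu=[-1.0, 1.0], sigma=[1.0, 1.0], alpha=[0.5, 0.5])
print("means:", mu, "std devs:", sigma, "weights:", alpha)
print("log-likelihood never decreased:", bool(np.all(np.diff(trace) >= -1e-6)))
```

Rerunning with a different initial `mu` is an easy way to see the sensitivity to initialization mentioned in step (1).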

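4 Application to Gaussian mixture models

Return to the problem setting of Section 2 (maximum likelihood estimation). Now we want to know not only the height distribution but also the weight distribution, so each data point is $x_i=\{\text{height}, \text{weight}\}$, and the male and female students are mixed together.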

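4.1 Introducing the latent variable

Consider this question: what is the probability of drawing, from the whole school, a student who is 1.7 m tall and weighs 60 kg? If you are told whether the student is male or female, you can plug the values into the corresponding Gaussian model and compute the probability directly from its density function. If you do not know the student's gender, things get harder. A natural idea is to take a weighted combination, and in this problem the probabilities of male and female are exactly 50% each. These probabilities are the latent variable of the Gaussian mixture model. In mathematical terms: $z_{ij}$ denotes the probability that the $i$-th data point $x_i$ belongs to the $j$-th Gaussian component.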
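As a small numerical sketch of this weighting idea, the snippet below evaluates the mixture density at a student who is 1.7 m tall and weighs 60 kg, and the posterior probability of each gender. All means and covariances are made-up illustrative values (assumptions, not estimates from real data); only the 50/50 weights and the query point come from the text. Since height and weight are continuous, the quantities computed are densities rather than probabilities.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical component parameters (height in m, weight in kg).
components = {
    "male":   {"weight": 0.5, "mean": [1.75, 68.0], "cov": [[0.005, 0.2], [0.2, 80.0]]},
    "female": {"weight": 0.5, "mean": [1.62, 52.0], "cov": [[0.004, 0.15], [0.15, 50.0]]},
}

student = np.array([1.70, 60.0])

# Weighted density of each component at the query point: alpha_j * phi(x | mu_j, Sigma_j).
joint = {
    name: c["weight"] * multivariate_normal(c["mean"], c["cov"]).pdf(student)
    for name, c in components.items()
}

# The mixture density is the sum over components.
mixture_density = sum(joint.values())

# Posterior probability that the student belongs to each component.
posterior = {name: value / mixture_density for name, value in joint.items()}

print("mixture density:", mixture_density)
print("posterior:", posterior)
```

The posterior values are exactly the quantities that the E-step of EM computes for every data point.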

4.2 Deriving EM for the GMM

$X$: observed data $\longrightarrow$ $x_1,x_2,...,x_n$
$Z$: latent data $\longrightarrow$ $z_{ij}$
$x$: observed variable
$z$: latent variable
Suppose the mixture consists of $m$ Gaussian components with parameters $\theta=\{\alpha_1,\alpha_2,...,\alpha_m,\mu_1,\mu_2,...,\mu_m,\Sigma_1,\Sigma_2,...,\Sigma_m\}$. The overall probability density is:
$P(x|\theta)=\sum_{j=1}^{m} \alpha_j \phi(x|\mu_j,\Sigma_j), \quad \text{where } \sum_{j=1}^{m}\alpha_j=1$
Sampling the mixture $n$ times gives $\{x_1,...,x_n\}$. At iteration $k+1$, the objective to maximize is:
$\begin{aligned} & \max _\theta Q\left(\theta, \theta^k\right) \\ = & \max _\theta \sum_{x \in {X}} \sum_{z \in{Z}} P\left(z \mid x, \theta^k\right) \log P(x, z \mid \theta) \\ = & \max _\theta \sum_{x \in {X}} \sum_{z \in {Z}} \frac{P\left(z, x \mid \theta^k\right)}{P\left(x \mid \theta^k\right)} \log P(x, z \mid \theta) \\ = & \max _\theta \sum_{i=1}^n \sum_{j=1}^m \frac{\alpha_j^k \phi\left(x_i \mid \theta_j^k\right)}{\sum_{l=1}^m \alpha_l^k \phi\left(x_i \mid \theta_l^k\right)} \log \left[\alpha_j \phi\left(x_i \mid \theta_j\right)\right] \\ = & \max _\theta \sum_{i=1}^n \sum_{j=1}^m \frac{\alpha_j^k \phi\left(x_i \mid \theta_j^k\right)}{\sum_{l=1}^m \alpha_l^k \phi\left(x_i \mid \theta_l^k\right)} \log \left[ \frac{\alpha_j}{(2 \pi)^{D / 2}|\Sigma_j|^{1 / 2}} \exp \left(-\frac{1}{2}(x_i-\mu_j)^T \Sigma_j^{-1}(x_i-\mu_j)\right) \right] \\ = & \max _\theta \sum_{j=1}^m \sum_{i=1}^n \frac{\alpha_j^k \phi\left(x_i \mid \theta_j^k\right)}{\sum_{l=1}^m \alpha_l^k \phi\left(x_i \mid \theta_l^k\right)}\left[\log \alpha_j-\frac{D}{2}\log (2\pi)-\frac{1}{2}\log |\Sigma_j|-\frac{1}{2}(x_i-\mu_j)^T \Sigma_j^{-1}(x_i-\mu_j)\right] \end{aligned}$
Here $D$ is the dimension of $x$ and $\theta_j=\{\mu_j,\Sigma_j\}$. The term $-\frac{D}{2}\log(2\pi)$ does not depend on $\theta$ and can be ignored when maximizing.

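4.2.1 Solving for $\alpha$

Let $p_{ij}= \frac{\alpha_j^k \phi\left(x_i \mid \theta_j^k\right)}{\sum_{l=1}^m \alpha_l^k \phi\left(x_i \mid \theta_l^k\right)}$, with the sample index $i$ made explicit because the value depends on $x_i$.
Only the $\log \alpha_j$ term of the objective depends on $\alpha$, and the weights must satisfy $\sum_{j=1}^m \alpha_j=1$, so we form the Lagrangian $L(\alpha,\lambda)= \sum_{j=1}^m \sum_{i=1}^n p_{ij}\log \alpha_j-\lambda\left(\sum_{j=1}^{m}\alpha_j-1\right)$
Differentiate with respect to $\alpha_j$ and set the result to 0: $\frac{\partial L}{\partial \alpha_j}= \frac{1}{\alpha_j}\sum_{i=1}^n p_{ij}-\lambda=0$, that is, $\sum_{i=1}^n p_{ij}-\alpha_j\lambda=0\text{ (*)}$. Summing (*) over all $j$ from $1$ to $m$:
$\sum_{j=1}^m\sum_{i=1}^n p_{ij}-\lambda\sum_{j=1}^m\alpha_j=0\text{ (**)}$
Because $\sum_{j=1}^m p_{ij}=1$ for every $i$ and $\sum_{j=1}^m\alpha_j=1$,
(**) gives $\lambda=n$. Substituting into (*) yields $\alpha_j^{k+1}=\frac{1}{n}\sum_{i=1}^{n}p_{ij}=\frac{1}{n}\sum_{i=1}^{n}\frac{\alpha_j^k \phi\left(x_i \mid \theta_j^k\right)}{\sum_{l=1}^m \alpha_l^k \phi\left(x_i \mid \theta_l^k\right)}$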
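In code, the update for $\alpha$ is just the column mean of the responsibility matrix. The sketch below assumes a matrix `resp` of shape (n, m) whose rows sum to 1; the function name `update_mixing_weights` and the numbers are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np

def update_mixing_weights(resp):
    """M-step for alpha: resp[i, j] is the responsibility of component j for sample i."""
    resp = np.asarray(resp, dtype=float)
    n = resp.shape[0]
    # alpha_j = (1 / n) * sum_i resp[i, j]
    return resp.sum(axis=0) / n

# Hypothetical responsibilities for 4 samples and 2 components (each row sums to 1).
resp = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
])

alpha = update_mixing_weights(resp)
print("updated mixing weights:", alpha)
print("sum of weights:", alpha.sum())
```

Because every row of `resp` sums to 1, the updated weights sum to 1 automatically, which is what the Lagrange multiplier $\lambda=n$ enforces in the derivation.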

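4.2.2 Solving for $\mu$ and $\Sigma$

The derivatives for these two parameters are straightforward and are not expanded here. Only the final results are shown: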

$\begin{aligned} & \mu_j \leftarrow \frac{\sum_{i=1}^n p_{ij}\, x_i}{\sum_{i=1}^n p_{ij}} \quad \Sigma_j \leftarrow \frac{\sum_{i=1}^n p_{ij} \cdot \left(x_i-\mu_j\right)\left(x_i-\mu_j\right)^T}{\sum_{i=1}^n p_{ij}} \end{aligned}$

Here $p_{ij}$ denotes the quantity $\frac{\alpha_j^k \phi\left(x_i \mid \theta_j^k\right)}{\sum_{l=1}^m \alpha_l^k \phi\left(x_i \mid \theta_l^k\right)}$ from the first formula of Section 4.2.1, with the sample index $i$ made explicit.
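A vectorized sketch of these two updates is shown below. It assumes a data matrix `X` of shape (n, D) and a responsibility matrix `resp` of shape (n, m); the function name `update_means_and_covariances` and the toy data are illustrative assumptions. With hard (0/1) responsibilities the formulas reduce to the ordinary per-cluster mean and covariance, which the usage example checks.

```python
import numpy as np

def update_means_and_covariances(X, resp):
    """M-step for mu_j and Sigma_j given responsibilities resp[i, j]."""
    X = np.asarray(X, dtype=float)
    resp = np.asarray(resp, dtype=float)
    nk = resp.sum(axis=0)                    # effective number of samples per component
    means = (resp.T @ X) / nk[:, None]       # shape (m, D): weighted means
    covs = []
    for j in range(resp.shape[1]):
        centered = X - means[j]
        # Weighted sum of outer products (x_i - mu_j)(x_i - mu_j)^T.
        covs.append((resp[:, j, None] * centered).T @ centered / nk[j])
    return means, np.stack(covs)

# Toy check with hard assignments: the first 3 samples go to component 0, the rest to 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
resp = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

means, covs = update_means_and_covariances(X, resp)
print("matches per-cluster mean:", np.allclose(means[0], X[:3].mean(axis=0)))
print("matches per-cluster covariance:", np.allclose(covs[0], np.cov(X[:3].T, bias=True)))
```

Note that the covariance update uses the freshly updated mean of the same iteration.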

Python implementation

import numpy as np
import matplotlib.pyplot as plt

# Step 1: set the true parameters of the two Gaussian components
MU1 = np.array([1, 2])
SIGMA1 = np.array([[1, 0], [0, 0.5]])
MU2 = np.array([-1, -1])
SIGMA2 = np.array([[1, 0], [0, 1]])

# Generate 1000 data points from each component
data1 = np.random.multivariate_normal(MU1, SIGMA1, 1000)
data2 = np.random.multivariate_normal(MU2, SIGMA2, 1000)

X = np.concatenate([data1, data2])

# Shuffle X randomly; do not skip this step when reproducing the experiment
np.random.shuffle(X)

# Step 2: draw a 2-D scatter plot of the data.
plt.scatter(data1[:, 0], data1[:, 1], label='Class 1')
plt.scatter(data2[:, 0], data2[:, 1], label='Class 2')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot of Data Points')
plt.legend()
plt.show()

# Step 3: GMM implementation.
def my_fit_GMM(X, K, Max_iters, tol=1e-8):
    N, D = X.shape  # number of samples and data dimension
    alpha = np.ones(K) / K  # initialize the mixing coefficients uniformly
    mu = X[np.random.choice(N, K, False), :]  # initialize the means with K distinct samples
    # initialize every covariance with the covariance of the whole data set
    cov = [np.cov(X.T) for _ in range(K)]
    E = np.zeros((N, K))  # posterior (responsibility) matrix

    while Max_iters > 0:
        Max_iters -= 1
        oldmu = mu.copy()

        # E-step: E[i, j] = P(component j | x_i) under the current parameters
        for i in range(N):
            for j in range(K):
                E[i, j] = alpha[j] * multivariate_gaussian_pdf(X[i], mu[j], cov[j])
        E /= E.sum(axis=1, keepdims=True)

        # M-step: closed-form updates of alpha, mu and cov
        sum_E = E.sum(axis=0)
        alpha = sum_E / N

        for j in range(K):
            mu[j] = E[:, j] @ X / sum_E[j]
            x_mu = X - mu[j]
            cov[j] = (E[:, j, np.newaxis] * x_mu).T @ x_mu / sum_E[j]

        # Stop once the means have effectively stopped moving
        if np.linalg.norm(mu - oldmu) < tol:
            break
    return mu, cov, alpha

# Density of a single multivariate Gaussian
def multivariate_gaussian_pdf(x, mu, cov):
    k = len(mu)  # data dimension
    cov_det = np.linalg.det(cov)
    cov_inv = np.linalg.inv(cov)
    diff = x - mu
    # Gaussian density formula
    pdf = (1.0 / np.sqrt((2 * np.pi) ** k * cov_det)) * np.exp(-0.5 * diff @ cov_inv @ diff)
    return pdf

# Step 4: fit the scattered data with the GMM.
means, cov, weights = my_fit_GMM(X, 2, 100)

# Step 5: plot the probability density function of the fitted GMM.
from scipy.stats import multivariate_normal

def draw_results(m, s, w):
    # Parameters of the two fitted 2-D Gaussian components
    m1 = m[0]
    s1 = s[0]

    m2 = m[1]
    s2 = s[1]

    # Build the grid of evaluation points
    x, y = np.mgrid[-5:5:.01, -5:5:.01]
    pos = np.dstack((x, y))

    # Evaluate each component density on the grid
    rv1 = multivariate_normal(m1, s1)
    rv2 = multivariate_normal(m2, s2)
    z1 = rv1.pdf(pos)
    z2 = rv2.pdf(pos)

    # Combine the two densities using the fitted mixing weights
    z = w[0] * z1 + w[1] * z2

    # Draw the 3-D surface
    fig = plt.figure(figsize=(5, 10))
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(x, y, z, cmap='viridis')

    # Axis labels and title
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Probability Density')
    ax.set_title('3D Plot of Mixture Gaussian Probability Density')
    plt.show()

# Step 6: print the estimated means and covariances; compare them with the true
# values from Step 1. The component order is arbitrary, so "Data1" below may
# correspond to either of the two generated clusters.
print("Data1.Means:")
print(means[0])
print("Data2.Means:")
print(means[1])

print("Data1.Covariances:")
print(cov[0])
print("Data2.Covariances:")
print(cov[1])

print("Weights:")
print(weights)

draw_results(means, cov, weights)
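One practical caveat: an E-step that multiplies raw densities can underflow to zero for points far from every component, which turns the row normalization into 0/0. A common remedy, sketched here as an optional variant and not as part of the original code, is to work with log densities and normalize with `logsumexp`. The function name `e_step_log_domain` and the test points are assumptions; `scipy.stats.multivariate_normal.logpdf` and `scipy.special.logsumexp` are existing SciPy functions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step_log_domain(X, alpha, mu, cov):
    """Return responsibilities of shape (N, K) and the total log-likelihood."""
    N, K = X.shape[0], len(alpha)
    log_weighted = np.empty((N, K))
    for j in range(K):
        # log(alpha_j) + log phi(x_i | mu_j, Sigma_j) for all samples at once
        log_weighted[:, j] = np.log(alpha[j]) + multivariate_normal.logpdf(
            X, mean=mu[j], cov=cov[j])
    # log of the mixture density per sample, computed without underflow
    log_norm = logsumexp(log_weighted, axis=1, keepdims=True)
    resp = np.exp(log_weighted - log_norm)
    return resp, log_norm.sum()

# Two components with the true parameters used above, plus one extreme outlier.
alpha = np.array([0.5, 0.5])
mu = np.array([[1.0, 2.0], [-1.0, -1.0]])
cov = [np.array([[1.0, 0.0], [0.0, 0.5]]), np.eye(2)]
points = np.array([[1.0, 2.0], [-1.0, -1.0], [60.0, 60.0]])

resp, log_likelihood = e_step_log_domain(points, alpha, mu, cov)
print(resp)
print("rows sum to 1:", np.allclose(resp.sum(axis=1), 1.0))
```

The returned log-likelihood can also serve as a convergence monitor, since EM never decreases it.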


References

https://blog.csdn.net/zouxy09/article/details/8537620
https://www.cnblogs.com/qizhou/p/13100817.html
https://www.bilibili.com/video/BV13b411w7Xj/?p=4&vd_source=395b52d7c4e90a9c96a83181871f36bb
