Mutual Information: Principles, Computation, and Applications (Mutual Information Loss Function)

Author: weixin_40725706 | 2024-03-15 05:23:17


Mutual Information

Background

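Entropy

In information theory, the entropy of a distribution is the average code length achieved by the optimal encoding for that distribution^[1]: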

$\begin{aligned} H(X)&=-\mathbb{E}_{x\sim p(x)}[\log{p(x)}]\\ &=-\sum_x{p(x)\log{p(x)}}\\ &=-\int{p(x)\log{p(x)}\,dx} \end{aligned}$
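As a small numerical illustration of the definition above, the following sketch computes the entropy of a few discrete distributions. The function name `entropy` and the example distributions are assumptions made for illustration; the logarithm is taken in base 2, so the results are in bits.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Outcomes with p(x) = 0 contribute nothing (0 * log 0 is taken as 0).
    return -sum(px * math.log(px, base) for px in p if px > 0)

# Hypothetical example distributions.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
uniform_4 = [0.25] * 4

print("fair coin  :", entropy(fair_coin))    # about 1 bit
print("biased coin:", entropy(biased_coin))  # less than 1 bit
print("uniform (4):", entropy(uniform_4))    # about 2 bits
```

The more predictable the outcome, the shorter the optimal code is on average, which is why the biased coin has lower entropy than the fair one.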

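Cross Entropy

From an information-theoretic point of view, the cross entropy is the average code length needed to encode events drawn from $p(x)$ using the code that was designed for $q(x)$: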

$\begin{aligned} H(p,q)&=-\mathbb{E}_{x\sim p(x)}[\log{q(x)}]\\ &=-\sum_x{p(x)\log{q(x)}}\\ &=-\int{p(x)\log{q(x)}\,dx} \end{aligned}$

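Because $p(x)$ is encoded with a code that was designed for a different distribution, $H(p,p)\leq H(p,q)$, where $H(p,p)=H(p)$ and equality holds only when $q=p$. The extra cost incurred by using the other code is defined as the KL divergence, also called relative entropy:

$D_{KL}(p||q)=H(p,q)-H(p)$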

Conditional Entropy

Conditional entropy is the entropy of a variable $X$ once another variable $Z$ is known:

$\begin{aligned} H(X|Z)&=-\mathbb{E}_{x,z\sim p(x,z)}[\log{p(x|z)}]\\ &=-\sum_{x,z}{p(x,z)\log{p(x|z)}}\\ &=-\int{p(x,z)\log{p(x|z)}\,dx\,dz} \end{aligned}$

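Knowing $Z$ cannot increase the uncertainty about $X$, so $H(X)\geq H(X|Z)$, with equality exactly when $X$ and $Z$ are independent.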
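The relation between $H(X)$ and $H(X|Z)$ can be checked on a small joint table. This is only a sketch: the two joint distributions below are made-up examples, and the helper names are assumptions, not part of any library.

```python
import math

def entropy(p):
    """Entropy in nats of a list of probabilities."""
    return -sum(v * math.log(v) for v in p if v > 0)

def conditional_entropy(joint):
    """H(X|Z) = -sum_{x,z} p(x,z) log p(x|z), where joint[x][z] = p(x, z)."""
    pz = [sum(col) for col in zip(*joint)]  # marginal p(z)
    h = 0.0
    for row in joint:
        for z, pxz in enumerate(row):
            if pxz > 0:
                h -= pxz * math.log(pxz / pz[z])  # p(x|z) = p(x,z) / p(z)
    return h

# Hypothetical joint distributions: rows index x, columns index z.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

for name, joint in [("dependent", dependent), ("independent", independent)]:
    px = [sum(row) for row in joint]  # marginal p(x)
    print(name, "H(X) =", entropy(px), "H(X|Z) =", conditional_entropy(joint))
```

In the dependent table, knowing $Z$ makes $X$ easier to predict, so the conditional entropy is smaller; in the independent table the two values coincide.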

KL Divergence

$\begin{aligned} D_{KL}(p(x)||q(x))&=H(p,q)-H(p)\\ &=\sum_x{p(x)\log{\frac{p(x)}{q(x)}}}\\ &=\int{p(x)\log{\frac{p(x)}{q(x)}}\,dx} \end{aligned}$

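The KL divergence can be viewed as a measure of how far apart two probability distributions are. Note that it is not symmetric, so it is not a true distance metric, and that $D_{KL}(p(x)||q(x))\geq 0$ ^[4] (see Appendix A).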
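The identities in this section are easy to verify numerically. The sketch below uses two made-up discrete distributions `p` and `q` (assumed to have full support) and checks that the KL divergence equals the cross entropy minus the entropy, that it is non-negative, and that swapping the arguments changes its value.

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print("H(p,q) - H(p):", cross_entropy(p, q) - entropy(p))
print("D_KL(p||q)   :", kl_divergence(p, q))
print("D_KL(q||p)   :", kl_divergence(q, p))  # differs from D_KL(p||q)
```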

Definition

Mutual information quantifies the dependence between two random variables $X$ and $Z$ ^[2]:

$\begin{aligned} I(X;Z)&=D_{KL}(\mathbb{P}_{XZ}||\mathbb{P}_X\otimes\mathbb{P}_Z)\\ &=\int_{\mathcal{X}\times\mathcal{Z}}{\log{\frac{d\mathbb{P}_{XZ}}{d\mathbb{P}_X\otimes\mathbb{P}_Z}}\,d\mathbb{P}_{XZ}} \end{aligned}$

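Here $\mathbb{P}_{XZ}$ is the joint distribution, and $\mathbb{P}_X=\int_{\mathcal{Z}}{d\mathbb{P}_{XZ}}$ and $\mathbb{P}_Z=\int_{\mathcal{X}}{d\mathbb{P}_{XZ}}$ are the marginal distributions. In other words, mutual information is the KL divergence between the joint distribution and the product of the marginals.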

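Mutual information can also be understood as the coding cost for $X$ that is saved by knowing $Z$:
$I(X;Z)=H(X)-H(X|Z)$
Unlike the correlation coefficient, which only measures linear dependence, mutual information also captures nonlinear relationships. In general, however, mutual information is hard to compute, because the probability distributions of $X$ and $Z$ are difficult to obtain.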
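To make the contrast with the correlation coefficient concrete, here is a minimal sketch on a toy example chosen for illustration: $X$ is uniform on $\{-1,0,1\}$ and $Z=X^2$. The covariance (and hence the correlation) is zero, yet the mutual information is clearly positive. The dictionary layout and function names are assumptions.

```python
import math

def mutual_information(joint):
    """I(X;Z) = sum p(x,z) log[p(x,z) / (p(x) p(z))] for a dict {(x, z): prob}."""
    px, pz = {}, {}
    for (x, z), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal of X
        pz[z] = pz.get(z, 0.0) + p  # marginal of Z
    return sum(p * math.log(p / (px[x] * pz[z]))
               for (x, z), p in joint.items() if p > 0)

def covariance(joint):
    """Cov(X, Z) computed exactly from the joint table."""
    ex = sum(p * x for (x, z), p in joint.items())
    ez = sum(p * z for (x, z), p in joint.items())
    return sum(p * (x - ex) * (z - ez) for (x, z), p in joint.items())

# Toy example: X uniform on {-1, 0, 1}, Z = X**2 (a purely nonlinear dependence).
joint = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}

print("covariance        :", covariance(joint))
print("mutual information:", mutual_information(joint))
```

Because $Z$ is a deterministic function of $X$ here, the mutual information equals $H(Z)$.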

Computation Methods

Variational approach^[3]

Since $D_{KL}(p(x|y)||q(x|y))\geq 0$, we have
$\sum_x{\left[p(x|y)\log{p(x|y)}-p(x|y)\log{q(x|y)}\right]}\geq 0$
from which it follows that

$\begin{aligned} I(x,y)&=H(x)-H(x|y)\\ &=H(x)+\sum_y{p(y)\sum_x{p(x|y)\log{p(x|y)}}}\\ &\geq H(x)+\sum_y{p(y)\sum_x{p(x|y)\log{q(x|y)}}}\\ &=H(x)+\mathbb{E}_{x,y\sim p(x,y)}[\log{q(x|y)}]\\ &\overset{\rm{def}}{=}\tilde{I}(x,y) \end{aligned}$

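By repeatedly pushing up the lower bound $\tilde{I}(x,y)$, we obtain an estimate of the mutual information. Suppose the model to be fitted is $p(y|x,\theta)$. Similar to the EM algorithm, the IM algorithm alternates between the following two steps: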

Fix $q(x|y)$ and update $\theta^{new}=\arg\max_{\theta}{\tilde{I}(x,y)}$
Fix $\theta$ and update $q^{new}=\arg\max_{q(x|y)\in Q}{\tilde{I}(x,y)}$

Here $Q$ is a chosen family of distributions that makes $\mathbb{E}_{x,y\sim p(x,y)}[\log{q(x|y)}]$ convenient to compute; in most cases the Gaussian family is used.
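The lower bound $\tilde{I}(x,y)$ can be illustrated on a small discrete example. The sketch below is not the IM algorithm itself: it only evaluates the bound for two hand-picked decoders $q(x|y)$ on a made-up joint table, assuming tabular distributions instead of the Gaussian family mentioned above. It shows that the bound never exceeds the true mutual information and becomes tight when $q(x|y)=p(x|y)$.

```python
import math

# Hypothetical joint p(x, y): rows index x (3 values), columns index y (2 values).
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.15, 0.15]]

px = [sum(row) for row in joint]        # marginal p(x)
py = [sum(col) for col in zip(*joint)]  # marginal p(y)

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information():
    """Exact I(x, y) from the joint table."""
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

def lower_bound(q):
    """I~(x, y) = H(x) + E_{p(x,y)}[log q(x|y)], with q[i][j] = q(x=i | y=j)."""
    expected_log_q = sum(pxy * math.log(q[i][j])
                         for i, row in enumerate(joint)
                         for j, pxy in enumerate(row) if pxy > 0)
    return entropy(px) + expected_log_q

# Two candidate decoders: the exact posterior p(x|y) and a crude uniform guess.
posterior = [[joint[i][j] / py[j] for j in range(2)] for i in range(3)]
uniform = [[1 / 3, 1 / 3] for _ in range(3)]

print("I(x, y)           :", mutual_information())
print("bound, q = p(x|y) :", lower_bound(posterior))
print("bound, q = uniform:", lower_bound(uniform))
```

In the IM algorithm the second step searches the family $Q$ for the decoder that raises this bound as far as possible, while the first step adjusts $\theta$.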

Mutual Information Neural Estimation, MINE^[5]

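By the Donsker-Varadhan representation of the KL divergence,
$D_{KL}(\mathbb{P}||\mathbb{Q})=\sup_{T:\Omega\rightarrow\mathbb{R}}{\left(\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}\right)}$
For any fixed $T$ the expression inside the supremum is a lower bound on the KL divergence, so maximizing it over $T$ yields an estimate of the KL divergence (see Appendix B).
[Image missing in the source]

Here $\theta$ denotes the parameters of the neural network that represents $T$.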
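The Donsker-Varadhan bound can be explored without a neural network. In the sketch below, which is only an illustration and not MINE itself, the critic $T$ is a plain table of values on a three-point space and the expectations are computed exactly rather than from samples. The distributions `p` and `q` and all names are assumptions.

```python
import math
import random

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_bound(p, q, t):
    """E_P[T] - log E_Q[e^T] for a critic t given as one value per outcome."""
    e_p = sum(pi * ti for pi, ti in zip(p, t))
    e_q = sum(qi * math.exp(ti) for qi, ti in zip(q, t))
    return e_p - math.log(e_q)

# Hypothetical distributions P and Q on three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

# Random critics never exceed the KL divergence.
random.seed(0)
random_critics = [[random.uniform(-2.0, 2.0) for _ in p] for _ in range(1000)]
best_random = max(dv_bound(p, q, t) for t in random_critics)

# The optimal critic T* = log(dP/dQ) + C attains it (C = 3.0 is arbitrary).
optimal = [math.log(pi / qi) + 3.0 for pi, qi in zip(p, q)]

print("D_KL(P||Q)          :", kl_divergence(p, q))
print("best random critic  :", best_random)
print("optimal critic bound:", dv_bound(p, q, optimal))
```

MINE replaces the table with a neural network and the exact expectations with minibatch averages, then climbs the same bound by gradient ascent.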

DEEP INFOMAX^[6]

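The parameters $\omega$ of the network that estimates the mutual information and the parameters $\psi$ of the encoder are optimized jointly:
$(\hat{\omega},\hat{\psi})=\arg\max_{\omega,\psi}{\hat{I}_{\omega}(X,E_{\psi}(X))}$
Because $\omega$ and $\psi$ are optimized together, the encoder $E$ and the mutual information estimator network $T$ can share some neural network layers, i.e. $E_{\psi}=f_{\psi}\circ C_{\psi}$ and $T_{\omega,\psi}=D_{\omega}\circ g\circ(C_{\psi},E_{\psi})$.
The mutual information can also be estimated with divergences other than the KL divergence, for example the Jensen-Shannon estimator
$\hat{I}^{(\rm JSD)}_{\omega,\psi}:=\mathbb{E}_{\mathbb{P}}[-{\rm sp}(-T_{\omega,\psi}(x,E_{\psi}(x)))]-\mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}[{\rm sp}(T_{\omega,\psi}(x^{\prime},E_{\psi}(x)))]$
where ${\rm sp}(z)=\log(1+e^z)$, $x\sim\mathbb{P}$ and $x^{\prime}\sim\tilde{\mathbb{P}}$.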
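Given critic scores for positive pairs $(x,E_{\psi}(x))$ and negative pairs $(x^{\prime},E_{\psi}(x))$, the Jensen-Shannon estimate is just two averages of softplus terms. The sketch below assumes the scores have already been produced by some critic; the score values and function names are hypothetical.

```python
import math

def softplus(z):
    """sp(z) = log(1 + e^z), written in a form that does not overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def jsd_mi_estimate(pos_scores, neg_scores):
    """Mean of -sp(-T) over positive pairs minus mean of sp(T) over negative pairs."""
    pos_term = sum(-softplus(-t) for t in pos_scores) / len(pos_scores)
    neg_term = sum(softplus(t) for t in neg_scores) / len(neg_scores)
    return pos_term - neg_term

# Hypothetical critic outputs: a critic that separates the pairs well,
# and one that outputs 0 for everything.
good = jsd_mi_estimate(pos_scores=[5.0, 6.0, 4.5], neg_scores=[-5.0, -4.0, -6.0])
uninformative = jsd_mi_estimate(pos_scores=[0.0, 0.0], neg_scores=[0.0, 0.0])

print("well-separated scores:", good)
print("uninformative critic :", uninformative)  # equals -2 * log(2)
```

The estimate rises toward 0 as the critic assigns high scores to positive pairs and low scores to negative ones, so maximizing it trains the critic and the encoder in the same direction as the KL-based bound.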

or the Noise-Contrastive Estimation (NCE) estimator
$\hat{I}^{({\rm infoNCE})}_{\omega,\psi}:=\mathbb{E}_{\mathbb{P}}\left[T_{\omega,\psi}(x,E_{\psi}(x))-\mathbb{E}_{\tilde{\mathbb{P}}}\left[\log{\sum_{x^{\prime}}{e^{T_{\omega,\psi}(x^{\prime},E_{\psi}(x))}}}\right]\right]$
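A common way to evaluate an estimator of this form on a minibatch is to arrange the critic outputs in a square score matrix. The sketch below assumes that layout, with `scores[i][j]` standing for $T_{\omega,\psi}(x_j,E_{\psi}(x_i))$ and the positive pairs on the diagonal, so the sum over $x^{\prime}$ runs over one row. Both the layout and the example matrices are assumptions made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def info_nce(scores):
    """Average over i of scores[i][i] - log sum_j exp(scores[i][j])."""
    n = len(scores)
    return sum(scores[i][i] - logsumexp(scores[i]) for i in range(n)) / n

n = 4
# Hypothetical score matrices: one with no signal, one with strong positives.
flat = [[0.0] * n for _ in range(n)]
sharp = [[10.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

print("no signal       :", info_nce(flat))   # equals -log(n)
print("strong positives:", info_nce(sharp))  # close to 0
```

With this layout the value always lies between $-\log n$ (no signal) and $0$, and it increases as the positive pair stands out more clearly among the candidates in its row.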

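Applications

Transfer Learning

The student network is trained not only with the teacher network's logits and the labels, but also with the mutual information between intermediate layers of the two networks^[7]. By the variational bound,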

$\begin{aligned} I(t;s)&=H(t)-H(t|s)\\ &\geq H(t)+\mathbb{E}_{t,s}[\log{q(t|s)}] \end{aligned}$

so the loss function of the model becomes

$\tilde{\mathcal{L}}=\mathcal{L}_S-\sum_{k=1}^{K}{\lambda_k\mathbb{E}_{t^{(k)},s^{(k)}}[\log{q(t^{(k)}|s^{(k)})}]}$
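To see how the extra term enters the loss, the sketch below assumes that $q(t|s)$ is a diagonal Gaussian whose mean is predicted from the student features, in line with the Gaussian choice mentioned for the variational approach. The feature values, the predicted means, the standard deviations and the weights $\lambda_k$ are all made up, and the function names are hypothetical.

```python
import math

def gaussian_nll(t, mu, sigma):
    """-log q(t|s) for a diagonal Gaussian q(t|s) = N(t; mu(s), diag(sigma^2))."""
    return sum(math.log(s) + 0.5 * math.log(2 * math.pi)
               + (ti - mi) ** 2 / (2 * s ** 2)
               for ti, mi, s in zip(t, mu, sigma))

def distillation_loss(task_loss, layer_nlls, lambdas):
    """L~ = L_S - sum_k lambda_k * E[log q(t_k|s_k)], written with NLL = -log q."""
    return task_loss + sum(lam * nll for lam, nll in zip(lambdas, layer_nlls))

# Hypothetical teacher features for one layer and two candidate student predictions.
teacher = [1.0, -0.5, 2.0]
close_prediction = [0.9, -0.4, 2.1]
far_prediction = [-1.0, 1.5, 0.0]
sigma = [1.0, 1.0, 1.0]

nll_close = gaussian_nll(teacher, close_prediction, sigma)
nll_far = gaussian_nll(teacher, far_prediction, sigma)

print("loss, close student:", distillation_loss(0.7, [nll_close], [0.1]))
print("loss, far student  :", distillation_loss(0.7, [nll_far], [0.1]))
```

Minimizing the added term raises $\mathbb{E}[\log q(t|s)]$ and therefore the lower bound on $I(t;s)$; the term $H(t)$ is dropped because it does not depend on the student.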

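Reinforcement Learning + Transfer Learning^[8]

One obstacle to transfer in reinforcement learning is that the action spaces, state spaces, and so on of two different tasks do not match. Mutual information can be used to map between the spaces of the different tasks, which makes transfer between reinforcement learning tasks possible.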


The final loss function consists of three parts:

$\begin{aligned} \mathcal{L}_{\rm coupling}&=-\frac{1}{N_{\pi}}\sum_{j=1}^{N_{\pi}}{\log{p_{\theta}^j}}-\frac{1}{N_V}\sum_{j=1}^{N_V}{\log{p_{\psi}^j}}\\ \mathcal{L}_{\rm PPO}&=\mathcal{L}_{\rm PPO}^{\theta}+\mathcal{L}_{\rm PPO}^{\psi}\\ \mathcal{L}_{{\rm MI}(\phi,\omega)}&=-\mathbb{E}_{s\sim\rho_{\pi_{\theta}}}[\log{q_{\omega}(s|\phi(s))}] \end{aligned}$

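Self-Supervised Learning^[6]

By using the mutual information between positive and negative sample pairs together with the spatial information of the image, the network learns deep image representations without supervision.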

References

[1] Visual Information Theory. http://colah.github.io/posts/2015-09-Visual-Information/
[2] Mutual Information Neural Estimation. https://arxiv.org/pdf/1801.04062.pdf
[3] The IM Algorithm: A variational approach to Information Maximization. http://aivalley.com/Papers/MI_NIPS_final.pdf
[4] https://zhuanlan.zhihu.com/p/39682125
[5] Mutual Information Neural Estimation. https://arxiv.org/pdf/1801.04062.pdf
[6] Learning Deep Representation By Mutual Information Estimation and Maximization. https://arxiv.org/pdf/1808.06670.pdf
[7] Variational Information Distillation for Knowledge Transfer. https://openaccess.thecvf.com/content_CVPR_2019/papers/Ahn_Variational_Information_Distillation_for_Knowledge_Transfer_CVPR_2019_paper.pdf
[8] Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch. https://arxiv.org/pdf/2006.07041.pdf

Appendix

A.

$D_{KL}(p(x)||q(x))≥0$

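Let $f(x)\geq 0$ with $\int{f(x)\,dx}=1$, let $g$ be any real-valued measurable function, and let $\varphi$ be convex. Then Jensen's inequality states
$\varphi\left(\int{g(x)f(x)\,dx}\right)\leq\int{\varphi(g(x))f(x)\,dx}$
Note that $-\ln x$ is strictly convex, and that $q(x)\geq 0$ with $\int{q(x)\,dx}=1$. Setting $\varphi(x)=-\ln x$, $g(x)=\frac{q(x)}{p(x)}$ and $f(x)=p(x)$ gives
$D_{KL}(p||q)=\int{p(x)\left[-\ln{\frac{q(x)}{p(x)}}\right]dx}\geq-\ln{\int{q(x)\,dx}}=0$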

B.

$D_{KL}(\mathbb{P}||\mathbb{Q})=\sup_{T:\Omega\rightarrow\mathbb{R}}{\left(\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}\right)}$

A simple proof goes as follows. For a given function $T$, consider the Gibbs distribution $\mathbb{G}$ defined by $d\mathbb{G}=\frac{1}{Z}e^Td\mathbb{Q}$, where $Z=\mathbb{E}_{\mathbb{Q}}[e^T]$. By construction,
$\mathbb{E}_{\mathbb{P}}[T]-\log{Z}=\mathbb{E}_{\mathbb{P}}\left[\log{\frac{d\mathbb{G}}{d\mathbb{Q}}}\right]\quad(1)$
Let $\Delta$ be the gap,
$\Delta:=D_{KL}(\mathbb{P}||\mathbb{Q})-\left(\mathbb{E}_{\mathbb{P}}[T]-\log{\mathbb{E}_{\mathbb{Q}}[e^T]}\right)\quad(2)$
Using (1), we can write $\Delta$ as a KL-divergence:
$\Delta=\mathbb{E}_{\mathbb{P}}\left[\log{\frac{d\mathbb{P}}{d\mathbb{Q}}}-\log{\frac{d\mathbb{G}}{d\mathbb{Q}}}\right]=\mathbb{E}_{\mathbb{P}}\left[\log{\frac{d\mathbb{P}}{d\mathbb{G}}}\right]=D_{KL}(\mathbb{P}||\mathbb{G})\quad(3)$
The non-negativity of the KL-divergence gives $\Delta\geq 0$. We have thus shown that for any $T$,
$D_{KL}(\mathbb{P}||\mathbb{Q})\geq\mathbb{E}_{\mathbb{P}}[T]-\log{(\mathbb{E}_{\mathbb{Q}}[e^T])}$
and the inequality is preserved upon taking the supremum over the right-hand side. Finally, the identity (3) also shows that this bound is tight whenever $\mathbb{G}=\mathbb{P}$, namely for optimal functions $T^*$ taking the form $T^*=\log{\frac{d\mathbb{P}}{d\mathbb{Q}}}+C$ for some constant $C\in\mathbb{R}$.
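The key step of the proof, identity (3), can be checked numerically on a finite space. The sketch below picks made-up distributions `p` and `q` and an arbitrary function `t`, builds the Gibbs distribution `g`, and compares the gap $\Delta$ with $D_{KL}(\mathbb{P}||\mathbb{G})$. All concrete values are assumptions chosen for illustration.

```python
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions P and Q, and an arbitrary function T on three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
t = [0.3, -1.0, 2.0]

# Gibbs distribution dG = e^T dQ / Z with Z = E_Q[e^T].
z = sum(qi * math.exp(ti) for qi, ti in zip(q, t))
g = [qi * math.exp(ti) / z for qi, ti in zip(q, t)]

# Donsker-Varadhan value E_P[T] - log Z and the gap from definition (2).
dv_value = sum(pi * ti for pi, ti in zip(p, t)) - math.log(z)
gap = kl_divergence(p, q) - dv_value

print("gap        :", gap)
print("D_KL(P||G) :", kl_divergence(p, g))  # matches the gap, as in identity (3)
```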
