The main idea of Principal Component Analysis (PCA) is to map n-dimensional features onto k dimensions. These k dimensions are brand-new orthogonal features, called principal components, constructed from the original n features. PCA works by sequentially finding a set of mutually orthogonal axes in the original space: the first new axis is the direction of largest variance in the original data; the second is the direction of largest variance within the subspace orthogonal to the first axis; the third is the direction of largest variance orthogonal to the first two, and so on, until n such axes are obtained. With axes chosen this way, most of the variance is concentrated in the first k axes, while the remaining axes carry almost no variance. We can therefore keep only the first k dimensions, achieving dimensionality reduction of the data.
There are two ways to compute PCA:
Covariance describes the correlation between two random variables. A quick comparison of variance and covariance:

Mean: $\bar x=\frac1n\sum\limits_{i=1}^n\vec x_i$

Unbiased sample variance: $S^2=\frac1{n-1}\sum\limits_{i=1}^n(\vec x_i-\bar x)^2$

Covariance of samples $X$ and $Y$: $Cov(X,Y)=\frac1{n-1}\sum\limits_{i=1}^n(\vec x_i-\bar x)(\vec y_i-\bar y)$

We can see that variance is computed for a single one-dimensional feature, whereas covariance requires at least two dimensions; variance is a special case of covariance. A positive covariance means $X$ and $Y$ are positively correlated, a negative one means they are negatively correlated, and a covariance of 0 means they are uncorrelated (which does not necessarily imply independence). With several variables, the pairwise covariances form the covariance matrix.

The covariance matrix is closely related to the scatter matrix. From the definition of the scatter matrix, $S=\sum\limits_{i=1}^n(\vec x_i-\bar x)(\vec x_i-\bar x)^T$, we see that for a data set $X$: $S=(n-1)\,Cov(X)$. The two therefore have the same eigenvectors, and their eigenvalues differ only by the constant factor $n-1$.
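As a quick numerical check of the relation $S=(n-1)\,Cov(X)$, here is a small NumPy sketch; the data and sizes are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features (rows = samples)

X_centered = X - X.mean(axis=0)
scatter = X_centered.T @ X_centered    # scatter matrix S
cov = np.cov(X, rowvar=False)          # unbiased sample covariance (divides by n - 1)

print(np.allclose(scatter, (X.shape[0] - 1) * cov))  # True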
For the detailed procedure, see my other article: Eigendecomposition and Singular Value Decomposition.
After computing the n eigenvectors, we sort them by eigenvalue and keep the top k.
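Below is a minimal sketch of this eigendecomposition-based procedure; the helper pca_eig and its variable names are my own for illustration, not part of any library:

import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix.
    X: (n_samples, n_features), k: number of components to keep."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)     # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending by eigenvalue
    components = eigvecs[:, order[:k]]         # top-k eigenvectors as columns
    return X_centered @ components             # projected data, shape (n_samples, k)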
The PCA implementation in the sklearn library uses the SVD-based method described next.
Assume the matrix $X$ is a data set that has already been centered (PCA requires centered data), with each sample stored as a column. The covariance matrix is then

$$\frac1nXX^T$$

Performing SVD on $X$: $X=U\Sigma V^T$, so $X^T=V\Sigma U^T$ (here $\Sigma$ is diagonal, hence symmetric). Therefore:

$$\frac1nXX^T=\frac1nU\Sigma V^TV\Sigma U^T=U\frac{\Sigma^2}{n}U^T$$

So instead of forming the covariance matrix explicitly, we can compute the SVD of $X$: the columns of $U$ are the eigenvectors of the covariance matrix, and the squared singular values (divided by $n$) are the corresponding eigenvalues. (In sklearn, where the rows of $X$ are the samples, the same argument applied to $\frac1nX^TX$ shows that the principal directions are the columns of $V$, i.e. the rows of $V^T$.)
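The equivalence of the two routes can be checked numerically. The following sketch (my own, using the sklearn convention of rows as samples) compares the covariance eigenpairs with the SVD of the centered data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                              # rows are samples, as in sklearn

# Eigendecomposition of the (unbiased) covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# SVD of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigvals, S**2 / (Xc.shape[0] - 1)))  # same eigenvalues
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))      # same principal directions up to sign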
I spent an evening reading the official sklearn source code; the following are rough notes. The source is quite long, so only the key parts are recorded here.
The svd_solver parameter lets us choose different ways of solving for the decomposition, but the basic idea is the same. Here we only look at the case svd_solver == 'full', i.e. a full SVD of all the data.
Implementation of the fit() method (for svd_solver='full' it delegates to _fit_full()):
Here, explained_variance_ holds the variance explained by each component, and explained_variance_ratio_ holds each component's contribution ratio, i.e. its share of the total variance; this ratio can be used to decide how many components n to keep.
Note that the n_components parameter can take several kinds of values, as handled in the code below: an integer (the exact number of components), a float in (0, 1) (the fraction of variance to retain), or the string "mle" (automatic selection).
def _fit_full(self, X, n_components):
    n_samples, n_features = X.shape
    ...  # validation of the input data, omitted here

    # Center data
    self.mean_ = np.mean(X, axis=0)
    X -= self.mean_

    # Obtain the eigenvectors via SVD of the centered data
    U, S, Vt = linalg.svd(X, full_matrices=False)
    # Flip the signs of the singular vectors to enforce a deterministic output
    # (the signs returned by SVD are otherwise arbitrary)
    U, Vt = svd_flip(U, Vt)

    components_ = Vt

    explained_variance_ = (S**2) / (n_samples - 1)
    total_var = explained_variance_.sum()
    explained_variance_ratio_ = explained_variance_ / total_var
    singular_values_ = S.copy()  # Store the singular values.

    # Decide how many components to keep (the value of n) according to n_components
    if n_components == "mle":
        n_components = _infer_dimension(explained_variance_, n_samples)
    elif 0 < n_components < 1.0:
        ratio_cumsum = stable_cumsum(explained_variance_ratio_)
        n_components = np.searchsorted(ratio_cumsum, n_components, side="right") + 1

    # Compute noise covariance using Probabilistic PCA model
    # The sigma2 maximum likelihood (cf. eq. 12.46)
    if n_components < min(n_features, n_samples):
        self.noise_variance_ = explained_variance_[n_components:].mean()
    else:
        self.noise_variance_ = 0.0

    ...  # assignment of the fitted attributes, omitted here
    return U, S, Vt
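For reference, a short usage sketch of the public PCA estimator whose internals are shown above; the data here are synthetic and only standard sklearn attributes are used:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(pca.n_components_)              # number of components actually kept
print(pca.explained_variance_ratio_)  # contribution ratio of each component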
Implementation of the transform() method:
Whitening can be applied here, i.e. the data are decorrelated to reduce redundancy in the input. Whitened data have two properties: the correlations between features are removed, and every feature has unit variance.
def transform(self, X):
    ...  # validation of the estimator and the data, omitted here
    if self.mean_ is not None:
        X = X - self.mean_
    X_transformed = np.dot(X, self.components_.T)
    # Whitening
    if self.whiten:
        X_transformed /= np.sqrt(self.explained_variance_)
    return X_transformed
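A quick check (my own sketch, synthetic data) that whitened outputs are indeed decorrelated with unit variance:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated features

Z = PCA(whiten=True).fit_transform(X)
# Covariance of the whitened data is (numerically) the identity matrix
print(np.allclose(np.cov(Z, rowvar=False), np.eye(Z.shape[1]), atol=1e-8))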
Implementation of the fit_transform() method:
The relevant formulas are thoughtfully written in the comments by the sklearn authors.
def fit_transform(self, X, y=None):
    U, S, Vt = self._fit(X)
    U = U[:, : self.n_components_]

    if self.whiten:
        # X_new = X * V / S * sqrt(n_samples) = U * sqrt(n_samples)
        U *= sqrt(X.shape[0] - 1)
    else:
        # X_new = X * V = U * S * Vt * V = U * S
        U *= S[: self.n_components_]

    return U
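As a sanity check (my own sketch), fit_transform should agree with calling fit followed by transform on the same data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

Z1 = PCA(n_components=3).fit_transform(X)
Z2 = PCA(n_components=3).fit(X).transform(X)
print(np.allclose(Z1, Z2))   # True: both reduce to U * S for the kept components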
The value of n, i.e. how many dimensions to keep, can be determined automatically by computing a log-likelihood (this is the n_components="mle" branch above). The relevant paper can be found at this link.
def _assess_dimension(spectrum, rank, n_samples):
    """Compute the log-likelihood of a rank ``rank`` dataset.

    The dataset is assumed to be embedded in gaussian noise of shape(n, dimf)
    having spectrum ``spectrum``.

    Parameters
    ----------
    spectrum : ndarray of shape (n_features,)
        Data spectrum.
    rank : int
        Tested rank value. It should be strictly lower than n_features,
        otherwise the method isn't specified (division by zero in equation
        (31) from the paper).
    n_samples : int
        Number of samples.

    Returns
    -------
    ll : float
        The log-likelihood.
    """
    n_features = spectrum.shape[0]
    if not 1 <= rank < n_features:
        raise ValueError("the tested rank should be in [1, n_features - 1]")

    eps = 1e-15

    if spectrum[rank - 1] < eps:
        # When the tested rank is associated with a small eigenvalue, there's
        # no point in computing the log-likelihood: it's going to be very
        # small and won't be the max anyway. Also, it can lead to numerical
        # issues below when computing pa, in particular in log((spectrum[i] -
        # spectrum[j]) because this will take the log of something very small.
        return -np.inf

    pu = -rank * log(2.0)
    for i in range(1, rank + 1):
        pu += (
            gammaln((n_features - i + 1) / 2.0)
            - log(np.pi) * (n_features - i + 1) / 2.0
        )

    pl = np.sum(np.log(spectrum[:rank]))
    pl = -pl * n_samples / 2.0

    v = max(eps, np.sum(spectrum[rank:]) / (n_features - rank))
    pv = -np.log(v) * n_samples * (n_features - rank) / 2.0

    m = n_features * rank - rank * (rank + 1.0) / 2.0
    pp = log(2.0 * np.pi) * (m + rank) / 2.0

    pa = 0.0
    spectrum_ = spectrum.copy()
    spectrum_[rank:n_features] = v
    for i in range(rank):
        for j in range(i + 1, len(spectrum)):
            pa += log(
                (spectrum[i] - spectrum[j]) * (1.0 / spectrum_[j] - 1.0 / spectrum_[i])
            ) + log(n_samples)

    ll = pu + pl + pv + pp - pa / 2.0 - rank * log(n_samples) / 2.0

    return ll


def _infer_dimension(spectrum, n_samples):
    """Infers the dimension of a dataset with a given spectrum.

    The returned value will be in [1, n_features - 1].
    """
    ll = np.empty_like(spectrum)
    ll[0] = -np.inf  # we don't want to return n_components = 0
    for rank in range(1, spectrum.shape[0]):
        ll[rank] = _assess_dimension(spectrum, rank, n_samples)
    return ll.argmax()
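A usage sketch (synthetic data, parameters chosen for illustration) of the automatic choice of dimensionality via n_components="mle":

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 3 informative directions plus a little isotropic noise in a 10-dimensional space
latent = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10))
X = latent + 0.01 * rng.normal(size=(300, 10))

pca = PCA(n_components="mle", svd_solver="full").fit(X)
print(pca.n_components_)   # dimensionality chosen by maximizing the log-likelihood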
Model quality is measured by computing the log-likelihood of each sample. The relevant paper can be found at this link.
def get_covariance(self):
    """Compute data covariance with the generative model.

    ``cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)``
    where S**2 contains the explained variances, and sigma2 contains the
    noise variances.

    Returns
    -------
    cov : array of shape=(n_features, n_features)
        Estimated covariance of data.
    """
    components_ = self.components_
    exp_var = self.explained_variance_
    if self.whiten:
        components_ = components_ * np.sqrt(exp_var[:, np.newaxis])
    exp_var_diff = np.maximum(exp_var - self.noise_variance_, 0.0)
    cov = np.dot(components_.T * exp_var_diff, components_)
    cov.flat[:: len(cov) + 1] += self.noise_variance_  # modify diag inplace
    return cov


def get_precision(self):
    """Compute data precision matrix with the generative model.

    Equals the inverse of the covariance but computed with
    the matrix inversion lemma for efficiency.

    Returns
    -------
    precision : array, shape=(n_features, n_features)
        Estimated precision of data.
    """
    n_features = self.components_.shape[1]

    # handle corner cases first
    if self.n_components_ == 0:
        return np.eye(n_features) / self.noise_variance_

    if np.isclose(self.noise_variance_, 0.0, atol=0.0):
        return linalg.inv(self.get_covariance())

    # Get precision using matrix inversion lemma
    components_ = self.components_
    exp_var = self.explained_variance_
    if self.whiten:
        components_ = components_ * np.sqrt(exp_var[:, np.newaxis])
    exp_var_diff = np.maximum(exp_var - self.noise_variance_, 0.0)
    precision = np.dot(components_, components_.T) / self.noise_variance_
    precision.flat[:: len(precision) + 1] += 1.0 / exp_var_diff
    precision = np.dot(components_.T, np.dot(linalg.inv(precision), components_))
    precision /= -(self.noise_variance_**2)
    precision.flat[:: len(precision) + 1] += 1.0 / self.noise_variance_
    return precision


def score_samples(self, X):
    """Return the log-likelihood of each sample.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data.
    """
    ...  # validation of the estimator and the data, omitted here
    Xr = X - self.mean_
    n_features = X.shape[1]
    precision = self.get_precision()
    log_like = -0.5 * (Xr * (np.dot(Xr, precision))).sum(axis=1)
    log_like -= 0.5 * (n_features * log(2.0 * np.pi) - fast_logdet(precision))
    return log_like


def score(self, X, y=None):
    """Return the average log-likelihood of all samples."""
    return np.mean(self.score_samples(X))
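A usage sketch (synthetic data) of these probabilistic scoring methods:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))

pca = PCA(n_components=4).fit(X)
print(pca.score_samples(X[:5]).shape)   # per-sample log-likelihood, shape (5,)
print(pca.score(X))                     # average log-likelihood over all samples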
Update (2023-8-2)
First, two pieces of background knowledge are needed:
Spectral theorem: Let $A\in\mathbb R^{n\times n}$ be symmetric, and let $\lambda_i\in\mathbb R,\ i\in[1,n]$ be the eigenvalues of $A$. There exists a set of orthonormal vectors $\vec u_i\in\mathbb R^n,\ i\in[1,n]$ such that $A\vec u_i=\lambda_i\vec u_i$. Equivalently, there exists an orthogonal matrix $U=[\vec u_1,\cdots,\vec u_n]$ (i.e. $UU^T=U^TU=I_n$) such that
$$A=U\Lambda U^T=\sum_{i=1}^n\lambda_i\vec u_i\vec u_i^T,\qquad \Lambda=\begin{bmatrix}\lambda_1&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&\lambda_n\end{bmatrix},\qquad \vec u_i=\begin{bmatrix}u_{i1}\\ \vdots\\ u_{in}\end{bmatrix}$$
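A small numerical illustration of the spectral theorem (my own sketch with a random symmetric matrix):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                         # make A symmetric

eigvals, U = np.linalg.eigh(A)            # orthonormal eigenvectors in the columns of U
Lam = np.diag(eigvals)

print(np.allclose(U @ Lam @ U.T, A))      # A = U Λ U^T
print(np.allclose(U @ U.T, np.eye(4)))    # U is orthogonal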
Rayleigh quotient bounds: Given a symmetric matrix $A\in S^n$,
$$\lambda_{\min}(A)\le\frac{\vec x^TA\vec x}{\vec x^T\vec x}\le\lambda_{\max}(A),\quad\forall\vec x\ne0$$
$$\lambda_{\max}(A)=\max_{\Vert\vec x\Vert_2=1}\vec x^TA\vec x,\qquad \lambda_{\min}(A)=\min_{\Vert\vec x\Vert_2=1}\vec x^TA\vec x$$
The maximum and minimum are attained at $\vec x=\vec u_1$ and $\vec x=\vec u_n$ respectively, where $\vec u_1$ and $\vec u_n$ are the eigenvectors associated with the largest and smallest eigenvalues of $A$. Here $\vec x$ must be normalized, i.e. $\Vert\vec x\Vert_2=1$.
Proof. Applying the spectral theorem to $\vec x^TA\vec x$:
$$\vec x^TA\vec x=\vec x^TU\Lambda U^T\vec x=\vec x^{\prime T}\Lambda\vec x^\prime=\sum_{i=1}^n\lambda_ix_i^{\prime2},\qquad \vec x^\prime=U^T\vec x=\begin{bmatrix}x_1^\prime\\ \vdots\\ x_n^\prime\end{bmatrix}$$
Since $U$ is an orthogonal matrix:
$$\sum_{i=1}^nx_i^{\prime2}=\vec x^{\prime T}\vec x^\prime=(U^T\vec x)^T(U^T\vec x)=\vec x^TUU^T\vec x=\vec x^T\vec x=\sum_{i=1}^nx_i^2$$
Bounding every $\lambda_i$ by $\lambda_{\min}$ and $\lambda_{\max}$ in the sum, we therefore obtain:
$$\lambda_{\min}\,\vec x^T\vec x\le\vec x^TA\vec x\le\lambda_{\max}\,\vec x^T\vec x$$
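The bounds can also be checked numerically; the following sketch (my own) verifies them for a random symmetric matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
A = (A + A.T) / 2

eigvals, U = np.linalg.eigh(A)            # eigenvalues in ascending order

x = rng.normal(size=5)
rayleigh = x @ A @ x / (x @ x)
print(eigvals[0] <= rayleigh <= eigvals[-1])        # True

u_max = U[:, -1]                                     # unit eigenvector of λ_max
print(np.isclose(u_max @ A @ u_max, eigvals[-1]))    # maximum attained at that eigenvector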
Next, we can begin the derivation of PCA itself. First, we need to address the following three questions:
Now we prove, one by one, that the answers to the questions above are correct:
Kernel PCA applies a nonlinear transformation to the data through a kernel function, which gives PCA the ability to perform nonlinear decomposition.
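Before the derivation, here is a brief usage sketch of sklearn's KernelPCA on a standard toy dataset (concentric circles); the kernel choice and gamma value are illustrative:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel lets PCA separate the two circles, which linear PCA cannot do
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)   # (400, 2)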
The derivation is as follows:
Also published on: SP-FA 的博客