不正经

这个屌丝很懒，什么也没留下！

热门标签

7. 吴恩达机器学习课程-作业7-Kmeans and PCA

作者：不正经 | 2024-02-07 21:46:29

踩

fork了别人的项目，自己重新填写，我的代码如下

https://gitee.com/fakerlove/machine-learning/tree/master/code
1

代码原链接

文章目录

7. 吴恩达机器学习课程-作业7-Kmeans and PCA

7. 吴恩达机器学习课程-作业7-Kmeans and PCA

在本练习中，您将实现K-means聚类算法和应用它来压缩图像。

在第二部分中，您将使用PCA，找到一个低维的人脸图像表示。

在开始编程练习之前，我们强烈建议大家先看视频课程，并完成相关课程的复习题的话题。

要开始练习，您需要下载启动器代码并将其内容解压缩到您希望完成的目录锻炼。

如果需要，使用Octave/MATLAB中的cd命令更改为这个目录，然后开始这个练习。你也可以在课程网站的“环境设置说明”中找到安装Octave/MATLAB的说明。

7.1 k-means

7.1.1 题目介绍

在本练习中，您将实现K-means算法并使用它图像压缩。

您将首先从一个示例2D数据集开始在本练习中，您将实现K-means算法并使用它图像压缩。您将首先从一个示例2D数据集开始将帮助您对K-means算法如何工作有一个直观的了解。

后你将使用K-means算法进行图像压缩一幅图像中出现的颜色的数量仅为那些颜色最多的在这幅图中很常见。你将使用ex7。M代表这部分练习。

7.1.2 算法流程

输入： $n$ 个样本的集合 $X$

输出：样本集合的聚类 $C^*$

初始化,令 $t = 0$ ,随机选择 $k$ 个样本点作为初始聚类中心

$m^{(0)}=(m_1^{(0)},\dots,m_l^{(0)},\dots,m_k^{(0)})$
对样本进行聚类，对固定的类中心 $m^{(t)}=(m_1^{(t)},\dots,m_l^{(t)},\dots,m_k^{(t)})$ ,其中 $m_k^{(t)}$ 为类 $C_l$ 的中心，

计算每个样本到类中心的距离，将每个样本指派到与其最近的中心的类中，构成聚类结果 $C^(t)$
计算新的类中心，对聚类结果 $C^{(t)}$ ，计算当前各个类中的样本的均值，作为新的类中心

$m^{(t+1)}=(m_1^{(t+1)},\dots,m_l^{(t+1)},\dots,m_k^{(t+1)})$
如果迭代收敛或符合停止条件，输出 $C^*=C^{(t)}$ ,后者令 $t = t + 1$

7.1.3 实现k-means

首先需要实现，为每个点找到最近的簇中心，作为当前点的标签

$C^{(i)}=j\ that \ minizizes \ ||x^{(i)}-\mu_j||^2$

import re
import scipy.io as sio
import numpy as np
import matplotlib.pyplot as plt
from time import *

def find_closet_centroids(X, centroids):
    """
    寻找所属簇,centroids重心的意思，意思就是这个点离所有的中心点最近

    :param X: ndarray,所有点
    :param centroids: ndarray,上一步计算出或初始化的簇中心
    :return: ndarray,每个点所属于的簇
    """
    # 多加了一个0
    res = np.zeros((1,))
    for x in X:
        res = np.append(res, np.argmin(np.sqrt(np.sum((centroids - x) ** 2, axis=1))))

    # 返回结果时，去掉多加的一个0
    return res[1:]


if __name__ == '__main__':
    begin_time = time()
    print("========程序开始============")

    data = sio.loadmat("data\\ex7data2.mat")
    X = np.array(data['X']) # (300,2)
    # 随机初始化中心店
    init_centroids = np.array([[3, 3], [6, 2], [8, 5]])
    idx = find_closet_centroids(X, init_centroids)
    print(idx[0:10]) # [0. 2. 1.]

    end_time = time()
    run_time = end_time - begin_time
    print("========程序结束============")
    print('该循环程序运行时间：', run_time)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38

接着实现第二个部分，重新计算簇中心

$\mu_l=\frac{1}{C_k}\sum_{i\in C_k}x^{(i)}$

def compute_centroids(X, idx):
    """
    计算新的簇中心
    :param X: ndarray,所有点
    :param idx: ndarray,每个点对应的簇号
    :return: ndarray,所有新簇中心
    """
    # K是一共分成的类数
    K = int(np.max(idx)) + 1
    # m为X的行数
    m = X.shape[0]
    n = X.shape[-1]
    centroids = np.zeros((K, n))
    counts = np.zeros((K, n))

    for i in range(m):
        centroids[int(idx[i])] += X[i]
        counts[int(idx[i])] += 1
    centroids = centroids / counts
    return centroids
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

将这两步封装成一个完整的K-means算法，这里用到的随机初始化簇中心，在1.3实现，我直接先用一下，具体的函数看1.3。而且算法的迭代轮数可以指定一个很大的轮数保证计算完后算法收敛。我选择每次计算当前的目标值，当目标值不变时，算法收敛退出。
需要先实现损失函数

def cost(X, idx, centrodis):
    c = 0
    for i in range(len(X)):
        c += np.sum((X[i] - centrodis[int(idx[i])]) ** 2)
    c /= len(X)
    return c
1
2
3
4
5
6

定义聚类算法

def random_initialization(X, K):
    """
    随机选择K组数据，作为簇中心
    :param X: ndarray,所有点
    :param K: int,聚类的类数
    :return: ndarray,簇中心
    """
    res = np.zeros((1, X.shape[-1]))
    m = X.shape[0]
    rl = []
    while True:
        index = random.randint(0, m)
        if index not in rl:
            rl.append(index)
        if len(rl) >= K:
            break
    for index in rl:
        res = np.concatenate((res, X[index].reshape(1, -1)), axis=0)
    return res[1:]

def k_means(X, K):
    """
    k-means聚类算法，
    :param X: ndarray,所有的数据
    :param K: int,分成聚类的类数
    :return: tuple,(idx, centroids_all)
                idx,ndarray 为每个数据所属类标签
                centroids_all,[ndarray,...]计算过程中每轮的簇中心
    """
    centroids = random_initialization(X, K)
    centroids_all = [centroids]
    idx = np.zeros((1,))
    last_c = -1
    now_c = -2
    # iterations = 200
    # for i in range(iterations):
    while now_c != last_c:  # 当收敛时结束算法，或者可以利用指定迭代轮数
        # 算出每个点所属簇类
        idx = find_closet_centroids(X, centroids)
        last_c = now_c
        # 计算损失函数
        now_c = cost(X, idx, centroids)
        # 根据重新规划后的簇类，重新规划簇类中心点
        centroids = compute_centroids(X, idx)
        # 记录训练过程中所有的中心点
        centroids_all.append(centroids)
        
    return idx, centroids_all
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48

7.1.4 测试

    data = sio.loadmat("data\\ex7data2.mat")
    X = data['X']  # (300,2)
    idx, centroids_all = k_means(X, 3)
1
2
3

这里的聚类算中返回了每轮计算的簇中心，就是为了后面的可视化簇中心变化过程，画出聚类结果和簇中心的移动路线

def visualizing(X, idx, centroids_all):
    """
    可视化聚类结果和簇中心的移动过程
    :param X: ndarray,所有的数据
    :param idx: ndarray,每个数据所属类标签
    :param centroids_all: [ndarray,...]计算过程中每轮的簇中心
    :return: None
    """
    plt.scatter(X[..., 0], X[..., 1], c=idx)
    xx = []
    yy = []
    for c in centroids_all:
        xx.append(c[..., 0])
        yy.append(c[..., 1])

    plt.plot(xx, yy, 'rx--')
    plt.show()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

所有代码

import random
import re
import scipy.io as sio
import numpy as np
import matplotlib.pyplot as plt
from time import *


def find_closet_centroids(X, centroids):
    """
    寻找所属簇,centroids重心的意思，意思就是这个点离所有的中心点最近

    :param X: ndarray,所有点
    :param centroids: ndarray,上一步计算出或初始化的簇中心
    :return: ndarray,每个点所属于的簇
    """
    # 多加了一个0
    res = np.zeros((1,))
    for x in X:
        res = np.append(res, np.argmin(np.sqrt(np.sum((centroids - x) ** 2, axis=1))))

    # 返回结果时，去掉多加的一个0
    return res[1:]


def compute_centroids(X, idx):
    """
    计算新的簇中心
    :param X: ndarray,所有点
    :param idx: ndarray,每个点对应的簇号
    :return: ndarray,所有新簇中心
    """
    # K是一共分成的类数
    K = int(np.max(idx)) + 1
    # m为X的行数
    m = X.shape[0]
    n = X.shape[-1]
    centroids = np.zeros((K, n))
    counts = np.zeros((K, n))

    for i in range(m):
        centroids[int(idx[i])] += X[i]
        counts[int(idx[i])] += 1
    centroids = centroids / counts
    return centroids


def cost(X, idx, centrodis):
    """
    计算损失函数
    :param X:
    :param idx:
    :param centrodis:
    :return:
    """
    c = 0
    for i in range(len(X)):
        c += np.sum((X[i] - centrodis[int(idx[i])]) ** 2)
    c /= len(X)
    return c


def random_initialization(X, K):
    """
    随机选择K组数据，作为簇中心
    :param X: ndarray,所有点
    :param K: int,聚类的类数
    :return: ndarray,簇中心
    """
    res = np.zeros((1, X.shape[-1]))
    m = X.shape[0]
    rl = []
    while True:
        index = random.randint(0, m)
        if index not in rl:
            rl.append(index)
        if len(rl) >= K:
            break
    for index in rl:
        res = np.concatenate((res, X[index].reshape(1, -1)), axis=0)
    return res[1:]


def k_means(X, K):
    """
    k-means聚类算法，
    :param X: ndarray,所有的数据
    :param K: int,分成聚类的类数
    :return: tuple,(idx, centroids_all)
                idx,ndarray 为每个数据所属类标签
                centroids_all,[ndarray,...]计算过程中每轮的簇中心
    """
    centroids = random_initialization(X, K)
    centroids_all = [centroids]
    idx = np.zeros((1,))
    last_c = -1
    now_c = -2
    # iterations = 200
    # for i in range(iterations):
    while now_c != last_c:  # 当收敛时结束算法，或者可以利用指定迭代轮数
        # 算出每个点所属簇类
        idx = find_closet_centroids(X, centroids)
        last_c = now_c
        # 计算损失函数
        now_c = cost(X, idx, centroids)
        # 根据重新规划后的簇类，重新规划簇类中心点
        centroids = compute_centroids(X, idx)
        # 记录训练过程中所有的中心点
        centroids_all.append(centroids)

    return idx, centroids_all

def visualizing(X, idx, centroids_all):
    """
    可视化聚类结果和簇中心的移动过程
    :param X: ndarray,所有的数据
    :param idx: ndarray,每个数据所属类标签
    :param centroids_all: [ndarray,...]计算过程中每轮的簇中心
    :return: None
    """
    plt.scatter(X[..., 0], X[..., 1], c=idx)
    xx = []
    yy = []
    for c in centroids_all:
        xx.append(c[..., 0])
        yy.append(c[..., 1])

    plt.plot(xx, yy, 'rx--')
    plt.show()


if __name__ == '__main__':
    begin_time = time()
    print("========程序开始============")

    data = sio.loadmat("data\\ex7data2.mat")
    X = np.array(data['X'])  # (300,2)

    idx, centroids_all = k_means(X, 3)
    visualizing(X, idx, centroids_all)
    end_time = time()
    run_time = end_time - begin_time
    print("========程序结束============")
    print('该循环程序运行时间：', run_time)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144

7.2 图像压缩与k-均值

一般簇中心的初始化是从数据中随机选择K组

def random_initialization(X, K):
    """
    随机选择K组数据，作为簇中心
    :param X: ndarray,所有点
    :param K: int,聚类的类数
    :return: ndarray,簇中心
    """
    res = np.zeros((1, X.shape[-1]))
    m = X.shape[0]
    rl = []
    while True:
        index = random.randint(0, m)
        if index not in rl:
            rl.append(index)
        if len(rl) >= K:
            break
    for index in rl:
        res = np.concatenate((res, X[index].reshape(1, -1)), axis=0)
    return res[1:]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

接下来将聚类算法用于图片压缩上，先讲一讲思路
图片都是由若干像素点构成的，每个像素点都有颜色，这个颜色一般是RGB编码，意思是所有颜色都可以通过Red,Green,Blue来表示，RGB编码下三种颜色每个通过一个8比特的整数来表示强度，因此一个像素点需要24bit来表示颜色。
图片上的每个像素点都需要使用24bit存储颜色，我们的压缩方法是，通过聚类将图片上的颜色分为16种，我们只存储这十六种颜色，每个像素点上只需要4比特存储对应颜色的序号，即可达到有损压缩图片的目的。

进行图片压缩，将rgb颜色作为聚类的点数据，每个颜色的表示是一个列表包含三个数。这里聚类结束后构造新的图片矩阵，不是很规范，应该是直接存储bit。

def compress(image, colors_num):
    """
    压缩图片
    :param image: ndarray,原始图片
    :param colors_num: int,压缩后的颜色数量
    :return: (ndarray,ndarray),第一个每个像素点存储一个值，第二个为颜色矩阵
    """
    # _代表3，d1和d2代表图片的长和宽
    d1, d2, _ = image.shape
    # 变成二维数组
    raw_image = image.reshape(d1 * d2, -1)  # 展开成二维数组
    print(raw_image.shape)
    #
    idx, centroids_all = k_means(raw_image, colors_num)
    # colors 表示最后一次的中心点
    colors = centroids_all[-1]
    # 构造压缩后的图片格式
    compressed_image = np.zeros((1, 1))
    for i in range(d1 * d2):
        compressed_image = np.concatenate((compressed_image, idx[i].reshape(1, -1)), axis=0)
    compressed_image = compressed_image[1:].reshape(d1, d2, -1)
    return compressed_image, colors
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

为了可视化效果，还需要将压缩后的图片格式转化为可以显示的标准格式

def compressed_format_to_normal_format(compressed_image, colors):
    """
    将压缩后的图片转为正常可以显示的图片格式
    :param compressed_image: ndarray,压缩后的图片,存储颜色序号
    :param colors: ndarray,颜色列表
    :return: ndarray,正常的rgb格式图片
    """
    d1, d2, _ = compressed_image.shape
    normal_format_image = np.zeros((1, len(colors[0])))
    compressed_image = compressed_image.reshape(d1 * d2, -1)
    for i in range(d1 * d2):
        normal_format_image = np.concatenate((normal_format_image, colors[int(compressed_image[i][0])].reshape(1, -1)),
                                             axis=0)
    normal_format_image = normal_format_image[1:].reshape(d1, d2, -1)
    return normal_format_image
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

看一看效果

    image = plt.imread("data\\bird_small.png")  # (128,128,3)
    plt.subplot(1, 2, 1)
    plt.imshow(image)
    plt.axis('off')
    plt.title("raw image")
    plt.subplot(1, 2, 2)
    compressed_image, colors = compress(image, 16)
    print(compressed_image.shape, colors.shape)
    plt.imshow(compressed_format_to_normal_format(compressed_image, colors))
    plt.axis('off')
    plt.title("compressed image")
    plt.show()
1
2
3
4
5
6
7
8
9
10
11
12

运行时间有点长

========程序开始============
(128, 128, 3)
(16384, 3)
(128, 128, 1) (16, 3)
========程序结束============
该循环程序运行时间： 94.86905074119568
1
2
3
4
5
6

总代码如下

import random
import re
import scipy.io as sio
import numpy as np
import matplotlib.pyplot as plt
from time import *


def find_closet_centroids(X, centroids):
    """
    寻找所属簇,centroids重心的意思，意思就是这个点离所有的中心点最近

    :param X: ndarray,所有点
    :param centroids: ndarray,上一步计算出或初始化的簇中心
    :return: ndarray,每个点所属于的簇
    """
    # 多加了一个0
    res = np.zeros((1,))
    for x in X:
        res = np.append(res, np.argmin(np.sqrt(np.sum((centroids - x) ** 2, axis=1))))

    # 返回结果时，去掉多加的一个0
    return res[1:]


def compute_centroids(X, idx):
    """
    计算新的簇中心
    :param X: ndarray,所有点
    :param idx: ndarray,每个点对应的簇号
    :return: ndarray,所有新簇中心
    """
    # K是一共分成的类数
    K = int(np.max(idx)) + 1
    # m为X的行数
    m = X.shape[0]
    n = X.shape[-1]
    centroids = np.zeros((K, n))
    counts = np.zeros((K, n))

    for i in range(m):
        centroids[int(idx[i])] += X[i]
        counts[int(idx[i])] += 1
    centroids = centroids / counts
    return centroids


def cost(X, idx, centrodis):
    """
    计算损失函数
    :param X:
    :param idx:
    :param centrodis:
    :return:
    """
    c = 0
    for i in range(len(X)):
        c += np.sum((X[i] - centrodis[int(idx[i])]) ** 2)
    c /= len(X)
    return c


def random_initialization(X, K):
    """
    随机选择K组数据，作为簇中心
    :param X: ndarray,所有点
    :param K: int,聚类的类数
    :return: ndarray,簇中心
    """
    res = np.zeros((1, X.shape[-1]))
    m = X.shape[0]
    rl = []
    while True:
        index = random.randint(0, m)
        if index not in rl:
            rl.append(index)
        if len(rl) >= K:
            break
    for index in rl:
        res = np.concatenate((res, X[index].reshape(1, -1)), axis=0)
    return res[1:]


def compress(image, colors_num):
    """
    压缩图片
    :param image: ndarray,原始图片
    :param colors_num: int,压缩后的颜色数量
    :return: (ndarray,ndarray),第一个每个像素点存储一个值，第二个为颜色矩阵
    """
    # _代表3，d1和d2代表图片的长和宽
    d1, d2, _ = image.shape
    print(image.shape)
    # 变成二维数组,
    # (a,b)
    # #改变维度为m行、d列 （-1表示列数自动计算，d= a*b /m ）
    raw_image = image.reshape(d1 * d2, -1)  # 展开成二维数组
    print(raw_image.shape)
    #
    idx, centroids_all = k_means(raw_image, colors_num)
    # colors 表示最后一次的中心点
    colors = centroids_all[-1]
    # 构造压缩后的图片格式
    compressed_image = np.zeros((1, 1))
    for i in range(d1 * d2):
        compressed_image = np.concatenate((compressed_image, idx[i].reshape(1, -1)), axis=0)
    compressed_image = compressed_image[1:].reshape(d1, d2, -1)
    return compressed_image, colors


def compressed_format_to_normal_format(compressed_image, colors):
    """
    将压缩后的图片转为正常可以显示的图片格式
    :param compressed_image: ndarray,压缩后的图片,存储颜色序号
    :param colors: ndarray,颜色列表
    :return: ndarray,正常的rgb格式图片
    """
    d1, d2, _ = compressed_image.shape
    normal_format_image = np.zeros((1, len(colors[0])))
    compressed_image = compressed_image.reshape(d1 * d2, -1)
    for i in range(d1 * d2):
        normal_format_image = np.concatenate((normal_format_image, colors[int(compressed_image[i][0])].reshape(1, -1)),
                                             axis=0)
    normal_format_image = normal_format_image[1:].reshape(d1, d2, -1)
    return normal_format_image


def k_means(X, K):
    """
    k-means聚类算法，
    :param X: ndarray,所有的数据
    :param K: int,分成聚类的类数
    :return: tuple,(idx, centroids_all)
                idx,ndarray 为每个数据所属类标签
                centroids_all,[ndarray,...]计算过程中每轮的簇中心
    """
    centroids = random_initialization(X, K)
    centroids_all = [centroids]
    idx = np.zeros((1,))
    last_c = -1
    now_c = -2
    # iterations = 200
    # for i in range(iterations):
    while now_c != last_c:  # 当收敛时结束算法，或者可以利用指定迭代轮数
        # 算出每个点所属簇类
        idx = find_closet_centroids(X, centroids)
        last_c = now_c
        # 计算损失函数
        now_c = cost(X, idx, centroids)
        # 根据重新规划后的簇类，重新规划簇类中心点
        centroids = compute_centroids(X, idx)
        # 记录训练过程中所有的中心点
        centroids_all.append(centroids)

    return idx, centroids_all


def visualizing(X, idx, centroids_all):
    """
    可视化聚类结果和簇中心的移动过程
    :param X: ndarray,所有的数据
    :param idx: ndarray,每个数据所属类标签
    :param centroids_all: [ndarray,...]计算过程中每轮的簇中心
    :return: None
    """
    plt.scatter(X[..., 0], X[..., 1], c=idx)
    xx = []
    yy = []
    for c in centroids_all:
        xx.append(c[..., 0])
        yy.append(c[..., 1])

    plt.plot(xx, yy, 'rx--')
    plt.show()


if __name__ == '__main__':
    begin_time = time()
    print("========程序开始============")

    # data = sio.loadmat("data\\ex7data2.mat")
    # X = np.array(data['X'])  # (300,2)

    image = plt.imread("data\\bird_small.png")  # (128,128,3)
    plt.subplot(1, 2, 1)
    plt.imshow(image)
    plt.axis('off')
    plt.title("raw image")
    plt.subplot(1, 2, 2)
    compressed_image, colors = compress(image, 16)
    print(compressed_image.shape, colors.shape)
    plt.imshow(compressed_format_to_normal_format(compressed_image, colors))
    plt.axis('off')
    plt.title("compressed image")
    plt.show()
    end_time = time()
    run_time = end_time - begin_time
    print("========程序结束============")
    print('该循环程序运行时间：', run_time)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199

7.3 PCA

主成分分析法是干什么用的？
数据降维，话句话说就是将数据地特征数量变少，但又不是简单地删除特征。数据降维地目的可以是压缩数据，减少数据的存储空间，让算法提速；也可以是将数据降到二维或者三维进行可视化
主成分分析法在做什么？
上面说到主成分分析法用于数据降维，大概理解一下它怎么做的。

现在我们数据维度为n，我想通过降维让数据变成k维。
那么PCA做的就是对于n维空间的数据，寻找一个K维的“面”，让这些数据到这个“面”的距离最短，这个距离又叫做投影误差。
找到这个“面”后，将n维空间的点投影到这个“面”，因此所有点都投影到了k维空间，因此可以特征数量变为了k。
假设n=2,k=1，那么就是将二维平面上的点投影到一个向量上。假设n=3,k=2，那么就是将三维空间的点投影到一个平面上。
主成分分析法具体怎么做呢？
对于数据要从n维降到k维
首先对数据进行feature scaling/mean normalization，也就是归一化
其次计算协方差矩阵

$\sum=\frac{1}{m}X^TX$

接着计算sigma矩阵的“特征向量”，这里使用奇异值分解(single value decomposition)。

$[U,S,V]=svd(\sum)$

利用U计算新的特征，若要下降到K维，取U的前K列构成新矩阵

$Z=U_{reduce}^TX$

7.3.1 题目介绍

在本练习中，您将使用主成分分析(PCA)来执行降维。您将首先尝试一个2D示例数据集来获得关于PCA如何工作的直觉，然后将其用于更大的5000个人脸图像数据集。

7.3.2 算法流程

求解步骤

去除平均值
计算协方差矩阵
计算协方差矩阵的特征值和特征向量
将特征值排序
保留前N个最大的特征值对应的特征向量
将原始特征转换到上面得到的N个特征向量构建的新空间中（最后两步，实现了特征压缩）

7.3.3 代码

1) 示例数据集

    data = sio.loadmat("data\\ex7data1.mat")
    X = np.array(data['X'])  # (50,2)
    plt.scatter(X[..., 0], X[..., 1], marker='x', c='b')
    plt.show()
1
2
3
4

2) 实现PCA

实现PCA首先要做的就是对数据的处理进行归一化，注意这里的方差的计算，默认ddof为0，通常情况下是使用ddof=1，就是方差计算中最后除以m还是m-1的不同。

def data_preprocess(X):
    """
    数据归一化
    :param X: ndarray,原始数据
    :return: (ndarray.ndarray,ndarray),处理后的数据,每个特征均值，每个特征方差
    """
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0, ddof=1)  # 默认ddof=0, 这里一定要修改
    return (X - mean) / std, mean, std
1
2
3
4
5
6
7
8
9

numpy中有奇异值分解的功能，直接使用

def data_preprocess(X):
    """
    数据归一化
    :param X: ndarray,原始数据
    :return: (ndarray.ndarray,ndarray),处理后的数据,每个特征均值，每个特征方差
    """
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0, ddof=1)  # 默认ddof=0, 这里一定要修改
    return (X - mean) / std, mean, std
1
2
3
4
5
6
7
8
9

主成分分析降维,使用PCA进行降维

def project_data(X, U, K):
    """
    数据降维
    :param X: ndarray,原始数据
    :param U: ndarray,奇异值分解后的U
    :param K: int,目标维度
    :return: ndarray,降维后的数据
    """
    return X.dot(U[..., :K])
1
2
3
4
5
6
7
8
9

逆向思维，还可以进行降维后的升维

def reconstruct_data(Z, U, K):
    """
    数据升维
    :param Z: ndarray,降维后的数据
    :param U: ndarray,奇异值分解后的U
    :param K: int,降维的维度
    :return: ndarray,原始数据
    """
    return Z.dot(U[..., :K].T)
1
2
3
4
5
6
7
8
9

测试一下

    data = sio.loadmat("data\\ex7data1.mat")
    X = data['X']  # (50,2)
    normalized_X, _, _ = data_preprocess(X)
    u, _, _ = pca(normalized_X)  # (2,2)
    Z = project_data(normalized_X, u, 1)
    print(Z[0]) # [1.48127391]
    rec_X = reconstruct_data(Z, u, 1)
    print(rec_X[0]) # [-1.04741883 -1.04741883]
1
2
3
4
5
6
7
8

将这个投影可视化

    plt.scatter(normalized_X[..., 0], normalized_X[..., 1], marker='x', c='b', label='normalized x')
    plt.scatter(rec_X[..., 0], rec_X[..., 1], marker='x', c='r', label='reconstructed x')
    plt.title("Visualizing the projections")
    for i in range(len(normalized_X)):
        plt.plot([normalized_X[i][0], rec_X[i][0]], [normalized_X[i][1], rec_X[i][1]], 'k--')
    plt.xlim((-3, 2))
    plt.ylim((-3, 2))
    plt.legend()
    plt.show()
1
2
3
4
5
6
7
8
9

3) 脸图像数据集

将PCA应用到人类数据集上，当前的每张人脸图片为1024像素，因此为1024维。我们的目标是将数据降维到36像素，也就是36维。

使用PCA进行降维

    data = sio.loadmat("data\\ex7faces.mat")
    X = data['X']  # (5000,1024)
    nor_X, _, _ = data_preprocess(X)
    u, _, _ = pca(nor_X)
    Z = project_data(nor_X, u, 36)
    rec_X = reconstruct_data(Z, u, 36)
1
2
3
4
5
6

将前后的图片可视化对比一下，记得要想人脸位置为正向需要转置一下

def visualizing_images(X, d):
    """
    可视化图片
    :param X: ndarray,图片
    :param d: int,一行展示多少张图片
    :return: None
    """
    m = len(X)
    n = X.shape[-1]
    s = int(np.sqrt(n))
    for i in range(1, m + 1):
        plt.subplot(m / d, d, i)
        plt.axis('off')
        plt.imshow(X[i - 1].reshape(s, s).T, cmap='Greys_r')  # 要把脸摆正需要转置
    plt.show()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

提取特征后

所有代码

import random
import re
import scipy.io as sio
import numpy as np
import matplotlib.pyplot as plt
from time import *

def data_preprocess(X):
    """
    数据归一化
    :param X: ndarray,原始数据
    :return: (ndarray.ndarray,ndarray),处理后的数据,每个特征均值，每个特征方差
    """
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0, ddof=1)  # 默认ddof=0, 这里一定要修改
    return (X - mean) / std, mean, std


def pca(X):
    """
    numpy中有奇异值分解的功能，直接使用
    :param X:
    :return:
    """
    sigma = X.T.dot(X) / len(X)  # (n,m)x(m,n) (n,n)
    u, s, v = np.linalg.svd(sigma)  # u(n,n) s(n,), v(n,n)
    return u, s, v

def project_data(X, U, K):
    """
    数据降维
    :param X: ndarray,原始数据
    :param U: ndarray,奇异值分解后的U
    :param K: int,目标维度
    :return: ndarray,降维后的数据
    """
    return X.dot(U[..., :K])

def reconstruct_data(Z, U, K):
    """
    数据升维
    :param Z: ndarray,降维后的数据
    :param U: ndarray,奇异值分解后的U
    :param K: int,降维的维度
    :return: ndarray,原始数据
    """
    return Z.dot(U[..., :K].T)

def visualizing_images(X, d):
    """
    可视化图片
    :param X: ndarray,图片
    :param d: int,一行展示多少张图片
    :return: None
    """
    m = len(X)
    n = X.shape[-1]
    s = int(np.sqrt(n))
    for i in range(1, m + 1):
        plt.subplot(m / d, d, i)
        plt.axis('off')
        plt.imshow(X[i - 1].reshape(s, s).T, cmap='Greys_r')  # 要把脸摆正需要转置
    plt.show()


if __name__ == '__main__':
    begin_time = time()
    print("========程序开始============")

    data = sio.loadmat("data\\ex7faces.mat")
    X = data['X']  # (5000,1024)
    nor_X, _, _ = data_preprocess(X)
    u, _, _ = pca(nor_X)
    Z = project_data(nor_X, u, 36)
    rec_X = reconstruct_data(Z, u, 36)
    visualizing_images(X[:25], 5)
    visualizing_images(rec_X[:25], 5)



    end_time = time()
    run_time = end_time - begin_time
    print("========程序结束============")
    print('该循环程序运行时间：', run_time)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/不正经/article/detail/67566