This series is part of a machine learning course and focuses on unsupervised algorithms, including hierarchical and density-based clustering.
It walks through the complete process of applying an algorithm in a specific industry:
understanding the business + choosing a suitable algorithm + data processing + model training + model tuning + model ensembling
+ model evaluation + continuous tuning + implementing an engineering-ready interface
On the definition of machine learning, this statement by Tom Michael Mitchell is widely quoted:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
from scipy.cluster.hierarchy import dendrogram, ward, single
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

X = load_iris().data[:150]   # Iris feature matrix (all 150 samples)
linkage_matrix = ward(X)     # Ward linkage matrix for hierarchical clustering
dendrogram(linkage_matrix)   # draw the dendrogram from the linkage matrix
plt.show()
The output is a dendrogram of the 150 iris samples (figure not reproduced here).
This Python script uses SciPy and scikit-learn to draw a hierarchical-clustering dendrogram. Line by line:
from scipy.cluster.hierarchy import dendrogram, ward, single: dendrogram draws the dendrogram, ward builds the linkage matrix with Ward's method, and single is the single-linkage criterion (imported here but not actually used).
from sklearn.datasets import load_iris: imports the load_iris function, which loads the well-known Iris dataset.
import matplotlib.pyplot as plt: imports pyplot, the plotting module of matplotlib, under the alias plt.
X = load_iris().data[:150]: loads the Iris dataset, takes its feature matrix (data), and keeps the first 150 samples, which is the whole dataset.
linkage_matrix = ward(X): clusters X with Ward's method and stores the resulting linkage matrix in linkage_matrix.
dendrogram(linkage_matrix): plots the dendrogram described by that linkage matrix.
plt.show(): displays the figure.
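The script above only draws the tree; it does not assign cluster labels. As a minimal sketch (not part of the original article), the same Ward linkage matrix can be cut into a fixed number of flat clusters with scipy's fcluster; the choice of 3 clusters below is an assumption based on Iris having three species:

from scipy.cluster.hierarchy import ward, fcluster
from sklearn.datasets import load_iris

X = load_iris().data
linkage_matrix = ward(X)                                      # same Ward linkage as above
labels = fcluster(linkage_matrix, t=3, criterion='maxclust')  # cut the tree into at most 3 flat clusters
print(labels[:10])                                            # cluster ids (1..3) for the first 10 samples

With criterion='maxclust', fcluster returns at most t clusters, which gives a flat labeling comparable to what k-means or DBSCAN would produce.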
# DBSCAN clustering algorithm
print(__doc__)

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data: three Gaussian blobs, then standardize the features
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.1, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# Plot the result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    # Core samples are drawn with large markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    # Non-core samples (border points and noise) with small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
The output is:
Automatically created module for IPython interactive environment
Estimated number of clusters: 12
Homogeneity: 0.313
Completeness: 0.249
V-measure: 0.277
Adjusted Rand Index: 0.024
Adjusted Mutual Information: 0.267
Silhouette Coefficient: -0.366
The first line of the output, "Automatically created module for IPython interactive environment", is simply what print(__doc__) prints when the script is run inside IPython. The remaining lines report the quality of the clustering. Briefly:
Homogeneity: each cluster contains only members of a single true class (1.0 is best).
Completeness: all members of a given true class end up in the same cluster (1.0 is best).
V-measure: the harmonic mean of homogeneity and completeness.
Adjusted Rand Index: agreement between the true and predicted assignments, corrected for chance (values near 0 mean almost random labeling).
Adjusted Mutual Information: mutual information between the two assignments, corrected for chance.
Silhouette Coefficient: how well each point matches its own cluster compared with the nearest other cluster, ranging from -1 to 1; it does not use the true labels.
With eps=0.1 the algorithm finds 12 small clusters, every score is low and the silhouette is negative, which suggests that eps is too small for this standardized data. A common way to pick a better value is a k-distance plot, sketched below.
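As a minimal sketch (an illustration, not part of the original script), the k-distance plot sorts every point's distance to its k-th nearest neighbor; the "elbow" of the curve is a candidate eps for DBSCAN.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# X is the standardized blob data from the DBSCAN example above
k = 10                                   # match min_samples
neigh = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = neigh.kneighbors(X)       # distances to the k returned neighbors (the first is the point itself)
k_dist = np.sort(distances[:, -1])       # distance to the farthest of those neighbors, sorted ascending
plt.plot(k_dist)
plt.xlabel('points sorted by k-distance')
plt.ylabel('distance to %d-th nearest neighbor' % k)
plt.show()

If the elbow sits above the 0.1 used here, re-running DBSCAN with that larger eps should merge the small fragments into a few clusters and improve the scores.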
For students with no prior background:
1. Identify the application domains of machine learning.
2. Find out which algorithms are used in those applications.
3. Decide on the domain you want to study and its corresponding algorithms.
4. Pin down the concrete techniques through job postings, papers, and similar sources.
5. Understand the business process and find data.
6. Reproduce classic algorithms.
7. Keep improving, and try to discuss what you have learned with people at the relevant companies.
8. The companies give feedback.