The k-prototypes algorithm is a classic clustering algorithm for mixed-type data. To make it easier for researchers to perform mixed-type clustering analysis in Python, the key parameters and usage of the Python kmodes package are reproduced below.
The following content is reproduced from the author's GitHub:
https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py
The kmodes package provides a Python implementation of the k-prototypes algorithm; its usage is similar to that of the k-means algorithm in sklearn.
Training example:
kp = KPrototypes(n_clusters=i, max_iter=80, n_init=8, n_jobs=5, verbose=2).fit(x_train2, categorical=[3,4,5,7,8,9])
The parameters are as follows (Parameters corresponds to the arguments inside the first pair of parentheses in the example):
Parameters
-----------
n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
max_iter : int, default: 100
Maximum number of iterations of the k-modes algorithm for a single run.
num_dissim : func, default: euclidean_dissim
Dissimilarity function used by the algorithm for numerical variables.
Defaults to the Euclidean distance function.
cat_dissim : func, default: matching_dissim
Dissimilarity function used by the k-modes algorithm for categorical variables.
Defaults to the matching dissimilarity function.
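The matching dissimilarity named here is simply the count of categorical attributes on which two points disagree. A minimal sketch (not the package's own implementation) makes this concrete:

```python
# Minimal sketch of matching dissimilarity for categorical data:
# the number of attributes on which two points differ.
import numpy as np

def matching_dissim(a, b):
    """Count of mismatched categorical attributes between a and b."""
    return int(np.sum(a != b))

x = np.array(['red', 'small', 'round'])
y = np.array(['red', 'large', 'round'])
print(matching_dissim(x, y))  # 1: only the second attribute differs
```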
n_init : int, default: 10
Number of time the k-modes algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of cost.
init : {'Huang', 'Cao', 'random' or a list of ndarrays}, default: 'Cao'
Method for initialization:
'Huang': Method in Huang [1997, 1998]
'Cao': Method in Cao et al. [2009]
'random': choose 'n_clusters' observations (rows) at random from
data for the initial centroids.
If a list of ndarrays is passed, it should be of length 2, with
shapes (n_clusters, n_features) for numerical and categorical
data respectively. These are the initial centroids.
gamma : float, default: None
Weighing factor that determines relative importance of numerical vs.
categorical attributes (see discussion in Huang [1997]). By default,
automatically calculated from data.
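Reading the source file linked above, when gamma is None it appears to be set to half the standard deviation of the numerical part of the data; this reading is an assumption, so verify against the linked file. A small sketch of that computation:

```python
# Sketch of the assumed default gamma computation:
# gamma = 0.5 * Xnum.std() when gamma is None (verify against the source).
import numpy as np

Xnum = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

gamma = 0.5 * Xnum.std()  # std taken over all numerical entries
print(gamma)
```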
verbose : integer, optional
Verbosity mode.
random_state : int, RandomState instance or None, optional, default: None
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
n_jobs : int, default: 1
The number of jobs to use for the computation. This works by computing
each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is
used at all, which is useful for debugging. For n_jobs below -1,
(n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
are used.
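The n_jobs rules above can be expressed as a small helper function (illustrative only; the actual dispatch inside kmodes is handled by joblib):

```python
# Illustrative mapping from an n_jobs setting to a worker count,
# following the rules described above (not part of kmodes itself).
def effective_n_jobs(n_jobs, n_cpus):
    """Map an n_jobs setting to the number of parallel workers."""
    if n_jobs == -1:
        return n_cpus               # use all CPUs
    if n_jobs < -1:
        return n_cpus + 1 + n_jobs  # e.g. -2 -> all CPUs but one
    return n_jobs                   # positive values are taken literally

print(effective_n_jobs(-2, n_cpus=8))  # 7: all CPUs but one
```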
The parameters passed inside the parentheses of fit during training are as follows:
Parameters
----------
X : array-like, shape=[n_samples, n_features]
categorical : Index of columns that contain categorical data
Example code for retrieving the training results:
label = kp.labels_
Other available attributes are listed below:
Attributes
----------
cluster_centroids_ : array, [n_clusters, n_features]
Categories of cluster centroids
labels_ :
Labels of each point
cost_ : float
Clustering cost, defined as the sum distance of all points to
their respective cluster centroids.
n_iter_ : int
The number of iterations the algorithm ran for.
epoch_costs_ :
The cost of the algorithm at each epoch from start to completion.
gamma : float
The (potentially calculated) weighing factor.
Notes
-----
See:
Huang, Z.: Extensions to the k-modes algorithm for clustering large
data sets with categorical values, Data Mining and Knowledge
Discovery 2(3), 1998.
The original author also provides an official benchmark example:
#!/usr/bin/env python
import timeit

import numpy as np

from kmodes.kprototypes import KPrototypes

# number of clusters
K = 20
# no. of points
N = int(1e5)
# no. of dimensions
M = 10
# no. of numerical dimensions
MN = 5
# no. of times test is repeated
T = 3

data = np.random.randint(1, 1000, (N, M))


def huang():
    KPrototypes(n_clusters=K, init='Huang', n_init=1, verbose=2)\
        .fit_predict(data, categorical=list(range(M - MN, M)))


def cao():
    KPrototypes(n_clusters=K, init='Cao', verbose=2)\
        .fit_predict(data, categorical=list(range(M - MN, M)))


if __name__ == '__main__':
    for cm in ('huang', 'cao'):
        print(cm.capitalize() + ': {:.2} seconds'.format(
            timeit.timeit(cm + '()',
                          setup='from __main__ import ' + cm,
                          number=T)))