In machine learning, we generally should not train on the entire dataset directly; instead, we use cross-validation. By adding randomness and averaging out noise, it reduces overfitting, lets us extract more complete information from limited data, and improves the model's generalization. The splitters most commonly used in sklearn are KFold, StratifiedKFold, StratifiedShuffleSplit, and GroupKFold. Below, each one is explained in turn on a small example df. (Typically n_splits=5 or 10.)
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, \
    StratifiedShuffleSplit, GroupKFold, GroupShuffleSplit

df2 = pd.DataFrame([[6.5, 1, 2], [8, 1, 0], [61, 2, 1],
                    [54, 0, 1], [78, 0, 1], [119, 2, 2],
                    [111, 1, 2], [23, 0, 0], [31, 2, 0]],
                   columns=['h', 'w', 'class'])
df2
h w class
0 6.5 1 2
1 8.0 1 0
2 61.0 2 1
3 54.0 0 1
4 78.0 0 1
5 119.0 2 2
6 111.0 1 2
7 23.0 0 0
8 31.0 2 0
X = df2.drop(['class'], axis=1)
y = df2['class']
folder = KFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in folder.split(X, y):
    print("KFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    # print(X.iloc[train_idx], y.iloc[train_idx], '\n', X.iloc[test_idx], y.iloc[test_idx])
===================================================================
KFold Splitting:
Train index: [0 1 3 5 6 8] | test index: [2 4 7]
KFold Splitting:
Train index: [0 2 3 4 7 8] | test index: [1 5 6]
KFold Splitting:
Train index: [1 2 4 5 6 7] | test index: [0 3 8]
Note that the split yields row indices into the data. Looking just at the test indices, the folds are not evenly split with respect to class: the first test fold [2, 4, 7] corresponds to classes 1, 1, 0. The same holds for the train indices (2, 0, 1, 2, 2, 0). This is often unacceptable, because we usually want the target classes in each train dataset / valid dataset to be evenly distributed.
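To see this concretely, here is a minimal check (a sketch, assuming the folder, X, and y defined above) that counts the classes in each test fold:

# Count the classes in each KFold test fold: the per-fold class mix is uneven
for _, test_idx in folder.split(X, y):
    print(y.iloc[test_idx].value_counts().to_dict())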
Interestingly, if you try n_splits=8 or 9, you will see that the number of test indices differs across folds. For example with n_splits=8, the first fold's test index size is n_samples // n_splits + 1 = 2, while the rest are 1.
The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples. —— KFold docs
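A quick verification on the 9-row df2 above (a sketch): 9 % 8 = 1, so exactly one fold gets 9 // 8 + 1 = 2 test samples and the remaining seven get 1.

# Test-fold sizes for n_splits=8 on 9 samples
sizes = [len(test_idx) for _, test_idx in KFold(n_splits=8).split(X)]
print(sizes)  # [2, 1, 1, 1, 1, 1, 1, 1]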
So we now know KFold cannot split evenly by target class. What if the dataset must be split by target class? That is what StratifiedKFold is for.
sfolder = StratifiedKFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in sfolder.split(X, y):
    print("StratifiedKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
======================================================
StratifiedKFold Splitting:
Train index: [0 3 4 5 7 8] | test index: [1 2 6]
StratifiedKFold Splitting:
Train index: [1 2 3 5 6 8] | test index: [0 4 7]
StratifiedKFold Splitting:
Train index: [0 1 2 4 6 7] | test index: [3 5 8]
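As a quick sanity check (a sketch using the sfolder above), each test fold now contains exactly one sample of every class:

for _, test_idx in sfolder.split(X, y):
    print(sorted(y.iloc[test_idx].tolist()))  # [0, 1, 2] for every fold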
This time the first test index is [1 2 6], and the train indices check out as well: each split now has an even target-class distribution. But sometimes a feature column, such as w in our df, also represents a category, and we want rows sharing the same value in that column to stay together in one group, in the same sense as df.groupby. That is what GroupKFold does.
gfolder = GroupKFold(n_splits=3)
for train_idx, test_idx in gfolder.split(X, y, groups=X['w']):
    print("GroupKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
========================================================================
GroupKFold Splitting:
Train index: [0 1 3 4 6 7] | test index: [2 5 8]
GroupKFold Splitting:
Train index: [2 3 4 5 7 8] | test index: [0 1 6]
GroupKFold Splitting:
Train index: [0 1 2 5 6 8] | test index: [3 4 7]
Here the first test index is [2 5 8], whose w values are all 2, and [0 1 6] all have w equal to 1, so the split is indeed by group. Try setting groups=y and see what happens; a sketch of that experiment follows.
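With the target itself as the grouping key, every test fold holds exactly one class, so the corresponding train fold never sees that class at all:

# Using y as the groups: each test fold is one whole class
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=y):
    print('test classes:', y.iloc[test_idx].unique())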
StratifiedShuffleSplit is a blend of StratifiedKFold and ShuffleSplit. Its biggest difference from StratifiedKFold is that samples can reappear across splits: the first test index is [1 5 4] and the second is [8 0 4], so two folds may share indices; the docs state it does "not guarantee that all folds will be different".
# test_size must be at least the number of classes, so each class can appear in the test set
shuffle_split = StratifiedShuffleSplit(n_splits=3, random_state=2020, test_size=3)
for train_idx, test_idx in shuffle_split.split(X, y):
    print("StratifiedShuffleSplit Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
====================================================================
StratifiedShuffleSplit Splitting:
Train index: [8 2 3 0 6 7] | test index: [1 5 4]
StratifiedShuffleSplit Splitting:
Train index: [3 1 6 2 7 5] | test index: [8 0 4]
StratifiedShuffleSplit Splitting:
Train index: [1 8 2 6 0 4] | test index: [7 3 5]
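To make the possible overlap concrete, here is a sketch (using the shuffle_split above) that intersects the test folds:

# Unlike StratifiedKFold, the same index can land in several test sets
test_sets = [set(test_idx) for _, test_idx in shuffle_split.split(X, y)]
print(test_sets[0] & test_sets[1])  # {4} with the splits shown above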
Many real datasets are highly imbalanced, and training may require a split that is uniform both over some grouping feature and over the target column. Stratified Group KFold addresses this; it can be seen as a blend of GroupKFold and StratifiedKFold.
The code below comes from stratifiedgroupkfold and uses the sklearn iris dataset, with an extra ID column added so that groups=df['ID'], while keeping the distribution of y in the train/valid splits the same as in the original dataset.
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import GroupKFold
from collections import Counter, defaultdict
from sklearn.datasets import load_iris

def read_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target
    # Define a new ID column
    list_id = ['A', 'B', 'C', 'D', 'E']
    df['ID'] = np.random.choice(list_id, len(df))
    features = iris.feature_names
    return df, features

df, features = read_data()
print(df.sample(6))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target ID
133 6.3 2.8 5.1 1.5 2 C
21 5.1 3.7 1.5 0.4 0 A
84 5.4 3.0 4.5 1.5 1 A
62 6.0 2.2 4.0 1.0 1 D
5 5.4 3.9 1.7 0.4 0 B
132 6.4 2.8 5.6 2.2 2 E
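Before hand-rolling it, note that recent scikit-learn releases (1.0 and later) ship this splitter as sklearn.model_selection.StratifiedGroupKFold. A minimal sketch on the df above, aliased to avoid clashing with the function defined below:

from sklearn.model_selection import StratifiedGroupKFold as SGKF

for train_idx, test_idx in SGKF(n_splits=3).split(df[features], df['target'], groups=df['ID']):
    print('test IDs:', df['ID'].iloc[test_idx].unique())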
A step-by-step implementation of StratifiedGroupKFold:
def count_y(y, groups):
    """Count how many of each y label fall in each group."""
    unique_num = np.max(y) + 1
    # Missing keys default to np.zeros(unique_num)
    y_counts_per_group = defaultdict(lambda: np.zeros(unique_num))
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
    # defaultdict(<function __main__.<lambda>>,
    #             {'A': array([5., 9., 8.]),
    #              'B': array([11., 12., 10.]),
    #              'C': array([13., 8., 8.]),
    #              'D': array([9., 11., 11.]),
    #              'E': array([12., 10., 13.])})
    return y_counts_per_group

def StratifiedGroupKFold(X, y, groups, features, k, seed=None):
    """
    Split the data in a stratified-group fashion, yielding train/validation indices.
    :param X: feature DataFrame
    :param y: target
    :param groups: group labels whose distribution drives the split
    :param features: feature column names
    :param k: n_split
    :param seed: random seed
    """
    max_y = np.max(y)
    # Per-group counts of each y label
    y_counts_per_group = count_y(y, groups)
    gf = GroupKFold(n_splits=k)
    for train_idx, val_idx in gf.split(X, y, groups):
        # Get the train/val partitions and the ID categories present in each
        x_train = X.iloc[train_idx, :]
        id_train = x_train['ID'].unique()
        x_train = x_train[features]
        x_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
        id_val = x_val['ID'].unique()
        x_val = x_val[features]
        # Count each y label in the training and validation sets
        y_counts_train = np.zeros(max_y + 1)
        y_counts_val = np.zeros(max_y + 1)
        for id in id_train:
            y_counts_train += y_counts_per_group[id]
        for id in id_val:
            y_counts_val += y_counts_per_group[id]
        # Ratio of each label count in the training set to its largest label count
        numratio_train = y_counts_train / np.max(y_counts_train)
        # Stratified counts: the validation count at the train-majority label,
        # scaled by numratio_train and rounded up
        stratified_count = np.ceil(y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)
        val_idx = np.array([])
        np.random.seed(seed)
        for num in range(max_y + 1):
            val_idx = np.append(val_idx, np.random.choice(y_val[y_val == num].index, stratified_count[num]))
        val_idx = val_idx.astype(int)
        yield train_idx, val_idx
Let's look at the resulting splits:
def get_distribution(y_vals):
    """Return the proportion of each class in y."""
    y_distribut = Counter(y_vals)
    y_vals_sum = sum(y_distribut.values())
    return [f'{y_distribut[i] / y_vals_sum:.2%}' for i in range(np.max(y_vals) + 1)]

X = df.drop('target', axis=1)
y = df['target']
groups = df['ID']
distribution = [get_distribution(y)]
index = ['all dataset']

# Inspect the splits
for fold, (train_idx, val_idx) in enumerate(StratifiedGroupKFold(X, y, groups, features, k=3, seed=2020)):
    print(f'Train ID - fold {fold:1d}:{groups[train_idx].unique()} '
          f'Test ID - fold {fold:1d}:{groups[val_idx].unique()}')
    distribution.append(get_distribution(y[train_idx]))
    index.append(f'train set - fold{fold:1d}')
    distribution.append(get_distribution(y[val_idx]))
    index.append(f'valid set - fold{fold:1d}')

print(pd.DataFrame(distribution, index=index,
                   columns=[f'Label {l:2d}' for l in range(np.max(y) + 1)]))
Train ID - fold 0:['B' 'A' 'C' 'D'] Test ID - fold 0:['E']
Train ID - fold 1:['A' 'D' 'E'] Test ID - fold 1:['B' 'C']
Train ID - fold 2:['B' 'C' 'E'] Test ID - fold 2:['A' 'D']
Label 0 Label 1 Label 2
all dataset 33.33% 33.33% 33.33%
train set - fold0 32.48% 31.62% 35.90%
valid set - fold0 33.33% 33.33% 33.33%
train set - fold1 34.44% 33.33% 32.22%
valid set - fold1 33.93% 33.93% 32.14%
train set - fold2 33.33% 35.48% 31.18%
valid set - fold2 33.33% 35.42% 31.25%
A general implementation:
def stratified_group_k_fold(X, y, groups, k, seed=None):
    labels_num = np.max(y) + 1
    # Per-group counts of each label, plus the overall label distribution
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        # Tentatively add the group's counts to the fold and measure how far
        # each label's share drifts from the global distribution
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label] for i in range(k)])
            std_per_label.append(label_std)
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)

    groups_and_y_counts = list(y_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_y_counts)
    # Assign groups, most label-skewed first, to whichever fold disturbs the
    # overall label distribution least
    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]
        train_indices = [i for i, g in enumerate(groups) if g in train_groups]
        test_indices = [i for i, g in enumerate(groups) if g in test_groups]
        yield train_indices, test_indices
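A usage sketch on the same iris df (assuming the X, y, groups, and get_distribution defined above):

for fold, (train_idx, val_idx) in enumerate(stratified_group_k_fold(X, y, groups, k=3, seed=2020)):
    print(f'fold {fold}: val groups = {sorted(set(groups[val_idx]))}, '
          f'val class ratios = {get_distribution(y[val_idx])}')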