5. Differences between KFold, StratifiedKFold, StratifiedShuffleSplit, and GroupKFold, plus a Stratified Group KFold implementation

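In machine learning, a model is usually not trained and evaluated on one fixed pass over the whole dataset. Instead, cross-validation repeatedly splits the data into training and validation parts. Averaging over several splits reduces the influence of any single lucky or unlucky split, makes overfitting easier to detect, and lets all of a limited dataset contribute to the assessment, which gives a better estimate of how well the model generalizes. The splitters used most often in sklearn are KFold, StratifiedKFold, StratifiedShuffleSplit, and GroupKFold. The sections below explain how they differ, one by one, using a small DataFrame. (In practice n_splits is usually 5 or 10.)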

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold,\
            StratifiedShuffleSplit, GroupKFold, GroupShuffleSplit
df2 = pd.DataFrame([[6.5, 1, 2],
            [8, 1, 0],
            [61, 2, 1],
            [54, 0, 1],
            [78, 0, 1],
            [119, 2, 2],
            [111, 1, 2],
            [23, 0, 0],
            [31, 2, 0]], columns=['h', 'w', 'class'])
       h  w  class
0    6.5  1      2
1    8.0  1      0
2   61.0  2      1
3   54.0  0      1
4   78.0  0      1
5  119.0  2      2
6  111.0  1      2
7   23.0  0      0
8   31.0  2      0
1. Using KFold
X = df2.drop(['class'], axis=1)
y = df2['class']
folder = KFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in folder.split(X, y):
    print("KFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    # print(X.iloc[train_idx], y.iloc[train_idx], '\n', X.iloc[test_idx], y.iloc[test_idx])
KFold Splitting:
Train index: [0 1 3 5 6 8] | test index: [2 4 7]
KFold Splitting:
Train index: [0 2 3 4 7 8] | test index: [1 5 6]
KFold Splitting:
Train index: [1 2 4 5 6 7] | test index: [0 3 8]
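Note that a split returns indices into the data, not the data itself. Looking only at the test index for now, the splits are clearly not balanced across the classes in the class column: the first test fold [2, 4, 7] has classes 1, 1, 0, so class 2 is missing. The same holds for its train index, whose classes are 2, 0, 1, 2, 2, 0. This is often unacceptable, because we usually want the target classes to be evenly represented in both the training and validation sets of every split.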

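Interestingly, if you try n_splits=8 or n_splits=9, the test folds do not always have the same size. With n_splits=8 and 9 samples, the first fold has a test index of size n_samples // n_splits + 1 = 2 and the remaining seven folds have size 1. With n_splits=9 every fold has size 1.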
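The fold-size rule is easy to check directly. The sketch below is only an illustration: the helper name kfold_test_sizes and the dummy feature matrix are made up for this example, while KFold and its split method are the real sklearn API.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_test_sizes(n_samples, n_splits):
    """Return the size of each test fold produced by KFold (hypothetical helper)."""
    X = np.arange(n_samples).reshape(-1, 1)  # dummy feature matrix, values are irrelevant
    kf = KFold(n_splits=n_splits)
    return [len(test_idx) for _, test_idx in kf.split(X)]

sizes_8 = kfold_test_sizes(9, 8)
sizes_9 = kfold_test_sizes(9, 9)
print(sizes_8)  # [2, 1, 1, 1, 1, 1, 1, 1]
print(sizes_9)  # [1, 1, 1, 1, 1, 1, 1, 1, 1]
```

With 9 samples and 8 folds, 9 % 8 = 1 fold gets the extra sample, which matches the rule described above.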

The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.

(from the scikit-learn KFold documentation)

So KFold does not balance the target classes across folds. What if the split must respect the class distribution of the target? That is what StratifiedKFold is for.

2. Using StratifiedKFold
sfolder = StratifiedKFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in sfolder.split(X, y):
    print("StratifiedKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
StratifiedKFold Splitting:
Train index: [0 3 4 5 7 8] | test index: [1 2 6]
StratifiedKFold Splitting:
Train index: [1 2 3 5 6 8] | test index: [0 4 7]
StratifiedKFold Splitting:
Train index: [0 1 2 4 6 7] | test index: [3 5 8]
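This time the first test index is [1 2 6], which contains one sample of each class (0, 1, 2), and the train index can be checked the same way: every split preserves the class proportions of the target. Some data has more structure, though. Suppose the feature column w in df2 is also categorical and we want all rows with the same w value to stay together in one fold, much like df.groupby. That is what GroupKFold does.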
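To make the difference measurable, the following sketch counts the classes in every test fold for both splitters. It assumes the same target values as df2['class']; the helper class_counts_per_test_fold is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Same target column as df2['class']; the feature values do not affect the split.
y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])
X = np.zeros((len(y), 1))

def class_counts_per_test_fold(splitter, X, y):
    """Count how many samples of each class land in every test fold (hypothetical helper)."""
    n_classes = y.max() + 1
    return [np.bincount(y[test_idx], minlength=n_classes).tolist()
            for _, test_idx in splitter.split(X, y)]

kfold_counts = class_counts_per_test_fold(
    KFold(n_splits=3, shuffle=True, random_state=2020), X, y)
strat_counts = class_counts_per_test_fold(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=2020), X, y)

print("KFold:          ", kfold_counts)  # depends on the shuffle, not necessarily balanced
print("StratifiedKFold:", strat_counts)  # one sample per class in every test fold
```

Because each class has exactly three samples and there are three folds, StratifiedKFold places one sample of every class in each test fold, whatever the random_state.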

3. Using GroupKFold
gfolder = GroupKFold(n_splits=3)
for train_idx, test_idx in gfolder.split(X, y, groups=X['w']):
    print("GroupKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
GroupKFold Splitting:
Train index: [0 1 3 4 6 7] | test index: [2 5 8]
GroupKFold Splitting:
Train index: [2 3 4 5 7 8] | test index: [0 1 6]
GroupKFold Splitting:
Train index: [0 1 2 5 6 8] | test index: [3 4 7]
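Here the first test index [2 5 8] consists of exactly the rows with w = 2, [0 1 6] of the rows with w = 1, and [3 4 7] of the rows with w = 0. Each group falls entirely into one test fold and never appears in both the train and test sets of the same split. Try groups=y as well and see what happens.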
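The "no group on both sides" property can be verified with a short check. This is a minimal sketch that assumes the same w values as df2; the variable leaks is introduced only for the illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

groups = np.array([1, 1, 2, 0, 0, 2, 1, 0, 2])  # same values as df2['w']
y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])       # same values as df2['class']
X = np.zeros((len(y), 1))                        # features do not affect the split

leaks = []  # groups that appear in both train and test of a split
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    shared = set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    leaks.append(shared)
    print("test groups:", sorted(set(groups[test_idx].tolist())),
          "| shared with train:", shared)
```

Every shared set comes out empty, which is the guarantee GroupKFold exists to provide.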

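4. Using StratifiedShuffleSplit

StratifiedShuffleSplit is a hybrid of StratifiedKFold and ShuffleSplit. The biggest difference from StratifiedKFold is that every split is drawn independently, so a sample can land in the test set of more than one split. In the output below the first test index is [1 5 4] and the second is [8 0 4]: index 4 appears in both. Two splits could even be identical, because the splitter does not guarantee that all folds will be different.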

# test_size must be at least the number of classes; the test sets of different splits may overlap
shuffle_split = StratifiedShuffleSplit(n_splits=3, random_state=2020, test_size=3)
for train_idx, test_idx in shuffle_split.split(X, y):
    print("StratifiedShuffleSplit Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
StratifiedShuffleSplit Splitting:
Train index: [8 2 3 0 6 7] | test index: [1 5 4]
StratifiedShuffleSplit Splitting:
Train index: [3 1 6 2 7 5] | test index: [8 0 4]
StratifiedShuffleSplit Splitting:
Train index: [1 8 2 6 0 4] | test index: [7 3 5]
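The overlap between test sets becomes obvious when more splits are drawn. The sketch below assumes the same target values as df2['class'] and uses 20 splits and random_state=0, both chosen arbitrarily for the illustration.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])  # same values as df2['class']
X = np.zeros((len(y), 1))

sss = StratifiedShuffleSplit(n_splits=20, test_size=3, random_state=0)
test_hits = Counter()   # how often each sample index is used for testing
class_counts = []       # class composition of every test set
for _, test_idx in sss.split(X, y):
    test_hits.update(test_idx.tolist())
    class_counts.append(np.bincount(y[test_idx], minlength=3).tolist())

print(dict(sorted(test_hits.items())))               # indices are reused across splits
print(all(c == [1, 1, 1] for c in class_counts))     # every test set stays stratified
```

Twenty test sets of size 3 need 60 draws from only 9 samples, so reuse is unavoidable, while each individual test set still holds one sample per class.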
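Many datasets today are heavily imbalanced, and training may require a split that keeps groups (defined by some feature) intact while also balancing the target column. Stratified Group KFold was created for this; it can be seen as a hybrid of GroupKFold and StratifiedKFold.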

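The code below is taken from stratifiedgroupkfold and uses the iris dataset from sklearn. An extra ID column is added and used as the groups (groups=df['ID']); after splitting, the distribution of y in the train and validation sets should still be close to that of the full dataset.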

import numpy as np
import pandas as pd
import random
from sklearn.model_selection import GroupKFold
from collections import Counter, defaultdict
from sklearn.datasets import load_iris

def read_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target

    list_id = ['A', 'B', 'C', 'D', 'E']
    df['ID'] = np.random.choice(list_id, len(df))

    features = iris.feature_names
    return df, features

df, features = read_data()
Stratified Group KFold, implemented step by step:

def count_y(y, groups):
    """Count the samples of each class y inside every group."""
    unique_num = np.max(y) + 1
    # a missing key defaults to np.zeros(unique_num)
    y_counts_per_group = defaultdict(lambda: np.zeros(unique_num))

    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1

    # Example result:
    # defaultdict(<function __main__.<lambda>>,
    # {'A': array([5., 9., 8.]),
    # 'B': array([11., 12., 10.]),
    # 'C': array([13., 8., 8.]),
    # 'D': array([9., 11., 11.]),
    # 'E': array([12., 10., 13.])})
    return y_counts_per_group
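A tiny hand-checkable input shows what count_y returns. The function is repeated here only so that the snippet runs on its own; the six labels and two groups are made up for the illustration.

```python
import numpy as np
from collections import defaultdict

def count_y(y, groups):
    """Count the samples of each class inside every group (same logic as above)."""
    unique_num = np.max(y) + 1
    counts = defaultdict(lambda: np.zeros(unique_num))
    for label, g in zip(y, groups):
        counts[g][label] += 1
    return counts

y = np.array([0, 1, 1, 2, 2, 2])
groups = ['A', 'A', 'B', 'B', 'B', 'A']
counts = count_y(y, groups)
print({g: c.tolist() for g, c in counts.items()})
# {'A': [1.0, 1.0, 1.0], 'B': [0.0, 1.0, 2.0]}
```

Group A holds one sample of every class, while group B holds one sample of class 1 and two of class 2.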

def StratifiedGroupKFold(X, y, groups, features, k, seed=None):
    """Group-wise k-fold split whose validation indices are resampled to follow the training class ratio.

    :param X: dataset X (must still contain the 'ID' column)
    :param y: target y
    :param groups: group label of every sample; a group is never split across train and validation
    :param features: feature column names
    :param k: number of folds (n_splits)
    :param seed: seed for the random resampling of the validation indices
    """
    rng = np.random.RandomState(seed)
    max_y = np.max(y)
    # dictionary with the per-class sample counts of every group
    y_counts_per_group = count_y(y, groups)
    gf = GroupKFold(n_splits=k)
    for train_idx, val_idx in gf.split(X, y, groups):
        # collect the train / validation data and the IDs (groups) present in each
        x_train = X.iloc[train_idx, :]
        id_train = x_train['ID'].unique()
        x_train = x_train[features]

        x_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
        id_val = x_val['ID'].unique()
        x_val = x_val[features]

        # count the samples of every class in the training and validation sets
        y_counts_train = np.zeros(max_y + 1)
        y_counts_val = np.zeros(max_y + 1)
        for group_id in id_train:
            y_counts_train += y_counts_per_group[group_id]
        for group_id in id_val:
            y_counts_val += y_counts_per_group[group_id]

        # class counts of the training set relative to its largest class
        numratio_train = y_counts_train / np.max(y_counts_train)
        # stratified counts: validation count of the class that is largest in training,
        # multiplied by numratio_train and rounded up
        stratified_count = np.ceil(y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)

        # draw that many validation indices per class
        # (choice samples with replacement by default, so an index can repeat)
        val_idx = np.array([])
        for num in range(max_y + 1):
            val_idx = np.append(val_idx, rng.choice(y_val[y_val == num].index, stratified_count[num]))
        val_idx = val_idx.astype(int)

        yield train_idx, val_idx
def get_distribution(y_vals):
    y_distribut = Counter(y_vals)
    y_vals_sum = sum(y_distribut.values())
    return [f'{y_distribut[i]/y_vals_sum:.2%}' for i in range(np.max(y_vals) + 1)]

X = df.drop('target', axis=1)
y = df['target']
groups = df['ID']

distribution = [get_distribution(y)]
index = ['all dataset']

for fold, (train_idx, val_idx) in enumerate(StratifiedGroupKFold(X, y, groups, features, k=3, seed=2020)):
    print(f'Train ID - fold {fold:1d}:{groups.iloc[train_idx].unique()}'
          f'       Test ID - fold {fold:1d}:{groups.iloc[val_idx].unique()}')

    # record the class distribution of this fold's train and validation sets
    distribution.append(get_distribution(y.iloc[train_idx]))
    index.append(f'train set - fold{fold:1d}')
    distribution.append(get_distribution(y.iloc[val_idx]))
    index.append(f'valid set - fold{fold:1d}')
# use a list for the column labels: a set has no defined order and can mislabel the columns
print(pd.DataFrame(distribution, index=index, columns=[f' Label{l:2d}' for l in range(np.max(y) + 1)]))
Train ID - fold 1:['A' 'D' 'E']       Test ID - fold 1:['B' 'C']
Train ID - fold 2:['B' 'C' 'E']       Test ID - fold 2:['A' 'D']
                   Label 0  Label 1  Label 2
all dataset         33.33%   33.33%   33.33%
train set - fold0   32.48%   31.62%   35.90%
valid set - fold0   33.33%   33.33%   33.33%
train set - fold1   34.44%   33.33%   32.22%
valid set - fold1   33.93%   33.93%   32.14%
train set - fold2   33.33%   35.48%   31.18%
valid set - fold2   33.33%   35.42%   31.25%

def stratified_group_k_fold(X, y, groups, k, seed=None):
    labels_num = np.max(y) + 1
    # per-group class counts and the overall class counts
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        # score the folds as if this group were added to `fold`: lower means more balanced
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label] for i in range(k)])
            std_per_label.append(label_std)
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)

    groups_and_y_counts = list(y_counts_per_group.items())
    # shuffle first so that `seed` decides the order of groups with equal std
    random.Random(seed).shuffle(groups_and_y_counts)

    # greedily assign groups, most imbalanced first, to the fold that keeps the folds most balanced
    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]

        train_indices = [idx for idx, g in enumerate(groups) if g in train_groups]
        test_indices = [idx for idx, g in enumerate(groups) if g in test_groups]

        yield train_indices, test_indices
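The score that eval_y_counts_per_fold minimizes can be isolated as a small standalone function. This is a minimal sketch of that score only, not of the full splitter; the name fold_balance_score and the two example fold layouts are hypothetical.

```python
import numpy as np

def fold_balance_score(y_counts_per_fold, y_distr):
    """Mean over labels of the std (across folds) of each fold's share of that label."""
    counts = np.asarray(y_counts_per_fold, dtype=float)  # shape (k, n_labels)
    shares = counts / np.asarray(y_distr, dtype=float)   # share of each label held by each fold
    return float(np.mean(np.std(shares, axis=0)))

y_distr = [30, 30, 30]                                    # total samples per label
balanced = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]    # every fold holds a third of each label
skewed = [[30, 0, 0], [0, 30, 0], [0, 0, 30]]            # every fold holds a single label

balanced_score = fold_balance_score(balanced, y_distr)
skewed_score = fold_balance_score(skewed, y_distr)
print(balanced_score)  # close to 0: all folds hold the same share of every label
print(skewed_score)    # clearly larger: the labels are concentrated in single folds
```

A group is assigned to whichever fold gives the lowest score, so the folds drift toward the balanced layout as far as the group boundaries allow.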
