
(11-3-04) Detecting Illicit Accounts on the Ethereum Blockchain: Train-Test Split (Splitting the Dataset)


11.3.4  Train-Test Split (Splitting the Dataset)

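"Train-Test Split" is a dataset-splitting method widely used in machine learning and data analysis to evaluate a model's performance and its ability to generalize. Its main purpose is to divide the original dataset into two mutually exclusive subsets: a training set and a test set.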
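To make the idea of two mutually exclusive subsets concrete, here is a minimal sketch on toy data. The array contents, the 80/20 ratio, and the random_state value are assumptions chosen for illustration; they are not taken from the Ethereum dataset used in this section.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each, alternating labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The two subsets share no rows and together cover the whole dataset
train_rows = {tuple(r) for r in X_train}
test_rows = {tuple(r) for r in X_test}
print(X_train.shape, X_test.shape)
print(train_rows.isdisjoint(test_rows))
```

Because every row lands in exactly one subset, a score measured on the test set reflects data the model never saw during training.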

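(1) Import the train_test_split function from sklearn (Scikit-Learn) and display the first few rows of the dataset. train_test_split is the standard tool for dividing a dataset into a training set and a test set according to a given ratio, so that a machine learning model can be trained on one part and evaluated on the other. The code is as follows.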

    from sklearn.model_selection import train_test_split
    dataset.head()

Running this prints:

    Address FLAG Avg min between sent tnx Avg min between received tnx Time Diff between first and last (Mins) Sent tnx Received Tnx Number of Created Contracts Unique Received From Addresses Unique Sent To Addresses ... max val sent to contract total Ether sent total ether balance Total ERC20 tnxs ERC20 total Ether received ERC20 total ether sent ERC20 total Ether sent contract ERC20 uniq sent addr.1 ERC20 uniq rec contract addr ERC20 min val rec
    0 0x00009277775ac7d0d59eaad8fee3d10ac6c805e8 0 844.26 1093.71 704785.63 721 89 0 40 118 ... 0.0 865.691093 -279.224419 265.0 3.558854e+07 3.560317e+07 0.0 0.0 58.0 0.0
    1 0x0002b44ddb1476db43c868bd494422ee4c136fed 0 12709.07 2958.44 1218216.73 94 8 0 5 14 ... 0.0 3.087297 -0.001819 8.0 4.034283e+02 2.260809e+00 0.0 0.0 7.0 0.0
    2 0x0002bda54cb772d040f779e88eb453cac0daa244 0 246194.54 2434.02 516729.30 2 10 0 10 2 ... 0.0 3.588616 0.000441 8.0 5.215121e+02 0.000000e+00 0.0 0.0 8.0 0.0
    3 0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e 0 10219.60 15785.09 397555.90 25 9 0 7 13 ... 0.0 1750.045862 -854.646303 14.0 1.711105e+04 1.141223e+04 0.0 0.0 11.0 0.0
    4 0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89 0 36.61 10707.77 382472.42 4598 20 1 7 19 ... 0.0 104.318883 -50.896986 42.0 1.628297e+05 1.235399e+05 0.0 0.0 27.0 0.0

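(2) First, store the target (response) variable in y and the feature variables in X, dropping the "FLAG" and "Address" columns from the features. Then define a function named train_val_test_split that divides the dataset into training, validation, and test sets by calling train_test_split twice. Finally, call train_val_test_split to split the data into a training set (80%), a validation set (10%), and a test set (10%), stored in X_train, X_val, X_test, y_train, y_val, and y_test. The code is as follows.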

    # Put the response variable in y and the feature variables in X
    y = dataset['FLAG']
    X = dataset.drop(['FLAG', 'Address'], axis=1)

    # Define a function that splits the dataset
    def train_val_test_split(X, y, train_size, val_size, test_size):
        # First split the test set off from the full dataset
        X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=test_size)
        # Rescale the training fraction relative to the rows that remain
        relative_train_size = train_size / (val_size + train_size)
        X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val,
                                                          train_size=relative_train_size, test_size=1-relative_train_size)
        return X_train, X_val, X_test, y_train, y_val, y_test

    # Split the dataset into training, validation, and test sets
    X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y, 0.8, 0.1, 0.1)
    X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_val.shape, y_val.shape

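The last line returns the shapes of the six subsets. Checking them confirms that the dimensions are correct, and they serve as a reference for the training, validation, and testing steps that follow.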
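The two-stage split works because the training fraction is rescaled after the test set is removed: with 80/10/10, the second split takes 0.8 / (0.8 + 0.1) ≈ 0.889 of the remaining 90%, which is 80% of the original data. The sketch below illustrates this on synthetic data. It is a variation, not the code used above: the stratify and random_state arguments, the seed parameter, the function name stratified_train_val_test_split, and the synthetic columns f1 to f3 are all assumptions added for illustration. Stratifying keeps the share of flagged accounts roughly equal across the three subsets, which can matter when the positive class is a minority.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_train_val_test_split(X, y, train_size, val_size, test_size, seed=42):
    # First cut: hold out the test fraction of the full dataset
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    # Second cut: rescale train_size relative to the rows that remain,
    # e.g. 0.8 / (0.8 + 0.1) of the remaining 90% is about 80% overall
    relative_train_size = train_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tv, y_tv, train_size=relative_train_size,
        stratify=y_tv, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic stand-in for the real data: 1000 rows, 20% positive labels
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([1] * 200 + [0] * 800, name="FLAG")

X_tr, X_va, X_te, y_tr, y_va, y_te = stratified_train_val_test_split(
    X_demo, y_demo, 0.8, 0.1, 0.1)

# Subset sizes and the share of positive labels in each subset
print(len(X_tr), len(X_va), len(X_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

Fixing random_state also makes the split reproducible, so the column rankings computed later do not change simply because the rows were shuffled differently.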

(3) Get the column names of the training set X_train. The code is as follows.

    X_train.columns

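Running this returns the names of the feature columns in the training set (the target column is excluded). The output below lists only 18 columns, in the same order as col_x in step (5), which suggests it was captured after the feature selection in step (6) had been applied: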

    Index([' Total ERC20 tnxs', ' ERC20 uniq rec contract addr',
           'total ether balance', 'Time Diff between first and last (Mins)',
           'max value received ', 'avg val received',
           ' ERC20 total Ether received', ' ERC20 min val rec',
           'Unique Received From Addresses', 'Received Tnx',
           'Avg min between received tnx', 'min value received',
           'Avg min between sent tnx', 'total Ether sent', 'avg val sent',
           'max val sent', 'Sent tnx', 'Unique Sent To Addresses'],
          dtype='object')

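(4) Use mutual information to estimate how important each feature is for the target, and plot the 18 features with the largest information gain. The code is as follows.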

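    # Notebook shell command; the code below only uses scikit-learn
    !pip install skfeature-chappers
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.feature_selection import mutual_info_classif

    # Estimate the mutual information between each feature and the target
    importance = mutual_info_classif(X_train, y_train)
    feat_importances = pd.Series(importance, index=X_train.columns)

    # Horizontal bar chart of the 18 highest-scoring features
    plt.figure(figsize=[30, 15])
    feat_importances.nlargest(18).plot(kind='barh', color='teal')
    plt.show()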
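To show what the mutual information scores express, here is a small sketch on synthetic data. The column names informative and noise, the sample size, and the noise scale are hypothetical choices made for illustration. A feature that tracks the label should receive a clearly higher score than one that is independent of it.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)  # synthetic binary labels

X = pd.DataFrame({
    # Follows the label closely, with a little Gaussian noise added
    "informative": y + rng.normal(scale=0.3, size=n),
    # Drawn independently of the label
    "noise": rng.normal(size=n),
})

# random_state makes the nearest-neighbor based estimate repeatable
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False))
```

The estimator is randomized, so without a fixed random_state the scores, and therefore the ranking of closely matched features, can vary slightly between runs.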

(5) Get the column names of the 18 features with the largest information gain and store them in a variable named col_x. The code is as follows.

    col_x = feat_importances.nlargest(18).index
    col_x

Running this returns the names of these features, i.e. the features with the highest estimated mutual information with the target variable.

    Index([' Total ERC20 tnxs', ' ERC20 uniq rec contract addr',
           'total ether balance', 'Time Diff between first and last (Mins)',
           'max value received ', 'avg val received',
           ' ERC20 total Ether received', ' ERC20 min val rec',
           'Unique Received From Addresses', 'Received Tnx',
           'Avg min between received tnx', 'min value received',
           'Avg min between sent tnx', 'total Ether sent', 'avg val sent',
           'max val sent', 'Sent tnx', 'Unique Sent To Addresses'],
          dtype='object')

(6) Keep only these 18 features in the training set X_train, the validation set X_val, and the test set X_test, overwriting each dataset with its reduced version. The code is as follows.

    X_train = X_train[col_x]
    X_val = X_val[col_x]
    X_test = X_test[col_x]
    feat_importances

Running this prints the mutual information score of every feature:

    Avg min between sent tnx 0.096649
    Avg min between received tnx 0.102166
    Time Diff between first and last (Mins) 0.237711
    Sent tnx 0.068052
    Received Tnx 0.109679
    ####### part of the output omitted
    ERC20 total Ether sent contract 0.005287
    ERC20 uniq sent addr.1 0.001419
    ERC20 uniq rec contract addr 0.254201
    ERC20 min val rec 0.141128
    dtype: float64
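Steps (4) to (6) rank the features on the training set only and then apply the same column list to the other subsets. As an alternative sketch of that pattern, scikit-learn's SelectKBest can wrap the same scoring function. This is not the approach used above; the synthetic column names, k=2, and the row-based split are assumptions for illustration.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(1)
n = 1500
y = pd.Series(rng.integers(0, 2, size=n))  # synthetic binary labels
X = pd.DataFrame({
    "signal_a": y + rng.normal(scale=0.3, size=n),      # depends on the label
    "signal_b": 2 * y + rng.normal(scale=0.5, size=n),  # depends on the label
    "noise_a": rng.normal(size=n),                      # independent of the label
    "noise_b": rng.normal(size=n),                      # independent of the label
})

# Simple row-based split for the demo: first 1000 rows train, the rest test
X_train_demo, X_test_demo = X.iloc[:1000], X.iloc[1000:]
y_train_demo = y.iloc[:1000]

# Fit the selector on the training rows only, so the test rows do not
# influence which columns are kept
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=2).fit(X_train_demo, y_train_demo)

# get_support() returns a boolean mask over the original columns
selected = X_train_demo.columns[selector.get_support()]
X_train_sel = X_train_demo[selected]
X_test_sel = X_test_demo[selected]
print(list(selected))
```

Note that the mask preserves the original column order, whereas nlargest(18).index in step (5) orders the columns by descending score.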

To be continued
