
2022 Xuelang Cloud Competition, Problem 2: A Summary (Xuelang Cloud — Automotive Whole-Plant Production Scheduling Optimization)

This post records my first end-to-end competition experience. Between Datawhale's open-source study materials and discussions with many strong competitors, I learned a great deal.

Competition site: https://www.xuelangyun.com/#/sign-up-statistics?tab=0

Final ranking: 55/263

Task: pass/fail inspection of products on a valve-body assembly line (binary classification of time-series data / anomaly detection)

My approach was as follows.

Reading the CSV data and building features

Statistical features: minimum, maximum, standard deviation, variance, range (peak-to-peak), mean, root mean square;

Time-domain features: total energy, skewness, kurtosis, waveform factor, crest factor, impulse factor, margin factor;

Frequency-domain features: total energy, maximum, minimum, standard deviation, variance, range, mean, skewness, and RMS of the data after an FFT (fast Fourier transform).

For detailed code for these time- and frequency-domain features, see: http://t.csdn.cn/qP8cx
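
For reference, here is a minimal sketch of how the time-domain factors and FFT-based features listed above can be computed for a one-dimensional signal. The function name time_freq_features is mine and the formulas follow the common textbook definitions; treat it as an illustration, not my submitted code.

import numpy as np

def time_freq_features(x):
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))            # root mean square
    peak = np.max(np.abs(x))                  # peak amplitude
    feats = {
        'energy':   np.sum(x ** 2),           # total energy
        'rms':      rms,
        'boxing':   rms / np.mean(np.abs(x)),             # waveform factor
        'fengzhi':  peak / rms,                           # crest factor
        'maichong': peak / np.mean(np.abs(x)),            # impulse factor
        'yudu':     peak / (np.mean(np.sqrt(np.abs(x))) ** 2),  # margin factor
    }
    spec = np.abs(np.fft.rfft(x))             # FFT magnitude spectrum
    feats.update({
        'fft_energy': np.sum(spec ** 2),
        'fft_max': spec.max(), 'fft_min': spec.min(),
        'fft_std': spec.std(), 'fft_var': spec.var(),
        'fft_pk': spec.max() - spec.min(),
        'fft_mean': spec.mean(),
    })
    return feats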

# Feature-extraction function
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

def get_each_sample_feature(ori_path):
    data = {}
    feature_name_list = []
    # boxing / fengzhi / maichong / yudu = waveform / crest / impulse / margin factor
    columns_feature_name = ['sum', 'min', 'max', 'std', 'var', 'pk', 'mean',
                            'skew', 'kurt', 'rms', 'boxing', 'fengzhi',
                            'maichong', 'yudu', '25', '50', '75']
    default_parameters = [float('nan')] * len(columns_feature_name)
    columns = ['id']
    # every sample id appears under station P1000, so use it to enumerate samples
    sample_list = os.listdir(os.path.join(ori_path, 'P1000', 'Report_P1000_Time'))
    for sample in sample_list:
        sample_name = sample.replace('.csv', '')
        data[sample_name] = []
    # one feature group per (station, sensor) directory
    station_list = os.listdir(ori_path)
    for station in station_list:
        for sensor in os.listdir(os.path.join(ori_path, station)):
            feature_name_list.append(ori_path + '/%s/%s' % (station, sensor))
    for feature_name in tqdm(feature_name_list):
        tmp_name = '_'.join(feature_name.split('/')[-2:])
        for k in columns_feature_name:
            columns.append(tmp_name + '_' + k)
        for key in data.keys():
            sample_path = feature_name + '/' + key + '.csv'
            if os.path.exists(sample_path):
                sample_onesensor_pd = pd.read_csv(sample_path)
                if sample_onesensor_pd.shape[0] == 0:
                    sample_ = default_parameters
                else:
                    s = sample_onesensor_pd.iloc[:, 0]   # the single value column
                    rms = np.sqrt(np.mean(s ** 2))       # root mean square
                    pk = s.max() - s.min()               # peak-to-peak range
                    sample_ = [s.sum(), s.min(), s.max(), s.std(), s.var(), pk,
                               s.mean(), s.skew(), s.kurt(), rms,
                               rms / s.mean(),           # waveform factor
                               pk / rms,                 # crest factor
                               pk / s.mean(),            # impulse factor
                               pk / (np.mean(np.sqrt(np.abs(s))) ** 2),  # margin factor
                               np.percentile(s, 25),
                               np.percentile(s, 50),
                               np.percentile(s, 75)]
            else:
                sample_ = default_parameters
            for i in sample_:
                data[key].append(i)
    return data, columns

The code above extends the feature set building on Datawhale's open-source baseline; their walkthrough and code are here:

https://datawhaler.feishu.cn/docx/T3Stdh8nFo4FSwxpTX8cFI0rnnd

Next, extract features from the NG and OK samples and save them locally as OK.csv and NG.csv for later use.

OK, columns = get_each_sample_feature('train/OK')
pd.DataFrame(list(OK.values()), columns=columns[1:]).to_csv('OK.csv', index=False)
NG, columns = get_each_sample_feature('train/NG')
pd.DataFrame(list(NG.values()), columns=columns[1:]).to_csv('NG.csv', index=False)

This turns the per-station folders of sample CSVs into a single feature CSV.

Loading and preprocessing the feature data, then selecting features

Loading the feature data
# Load the OK feature data, drop rows/columns that are entirely NaN, and assign label 0
OK = pd.read_csv('OK.csv')
OK = OK.dropna(axis=0, how='all')
OK = OK.dropna(axis=1, how='all')
OK['label'] = 0
OK
NG = pd.read_csv('NG.csv')
NG = NG.dropna(axis=0, how='all')
NG = NG.dropna(axis=1, how='all')
NG['label'] = 1
NG
# Concatenate the OK and NG feature data
data = pd.concat([OK, NG])
data
# Inspect the number of missing values
print(data.isna().sum())
print(data.isna().sum().sum())

Some of my engineered features had a lot of missing values, so features with too many missing values need to be dropped; a feature that is, say, 50% missing can be considered excessive.

# Fraction of missing values per column
data.isna().mean()

# Find the columns whose missing-value fraction exceeds a threshold.
# ratio controls the cutoff: ratio=0.01 means a feature is dropped once more than 99% of it is missing.
def filter_col_by_nan(df, ratio=0.05):
    cols = []
    for col in df.columns:
        if df[col].isna().mean() >= (1 - ratio):
            cols.append(col)
    return cols

# e.g. to drop every column with a missing-value fraction above 0.02
nanfeature = filter_col_by_nan(data, ratio=0.98)
nanfeature
data = data.drop(labels=nanfeature, axis=1)
data
Missing-value imputation

The usual choices are filling with the mean, median, or mode (or even a constant).

But this scenario is different: for an NG sample, once a sensor at an inspection station detects an anomaly, the valve body is judged defective and the remaining measurements are skipped. So some valve bodies have missing values, i.e., no complete set of valid features. For this case I fill each missing value with a value drawn at random from the OK samples, on the assumption that a feature that was never measured would have passed the inspection station by default.

# Fill missing values in a column by sampling from the OK distribution
def fill_with_random(df, column):
    df[column] = df[column].apply(
        lambda x: np.random.choice(OK[column].dropna().values) if np.isnan(x) else x)
    return df

for i in tqdm(data.columns):
    data = fill_with_random(data, i)
# Check again that no missing values remain
print(data.isna().sum())
Feature selection
Variance filtering + Pearson correlation (captures only linear relationships)
train_data = data.drop(labels='label', axis=1)
train_data
from sklearn.feature_selection import VarianceThreshold
var = VarianceThreshold(threshold=np.median(train_data.var().values))
# var = VarianceThreshold(threshold=0)
var.fit_transform(train_data)
is_select = var.get_support()
var_feature = train_data.iloc[:, is_select]
var_feature
x_array = np.array(var_feature)
x_array
y_array = np.array(data['label'])
y_array
from scipy.stats import pearsonr

# Absolute Pearson correlation of each column with the target
def multivariate_pearsonr(X, Y):
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], Y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

from sklearn.feature_selection import SelectKBest
m_pearsonr = SelectKBest(score_func=multivariate_pearsonr, k=50)
X_pearson = m_pearsonr.fit_transform(x_array, y_array)
print(m_pearsonr.scores_)
# use a new name so we don't shadow scipy's pearsonr
pearson_df = pd.DataFrame(m_pearsonr.scores_, columns=["pearsonr"], index=var_feature.columns)
pearson_df = pearson_df.reset_index()
pearson_df.sort_values('pearsonr', ascending=False)
Random-forest feature importance (with so few negative samples, the importances may be unreliable)
x = data.iloc[:, :-1]
y = data['label']
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt

rfc_ = rfc(n_estimators=88, random_state=2023)
importances = rfc_.fit(x, y).feature_importances_
max_fi = max(importances)
max_fi
threshold = np.linspace(0, max_fi, 20)
threshold
scores = []
for i in threshold:
    X_embedded = SelectFromModel(rfc_, threshold=i).fit_transform(x, y)
    score = cross_val_score(rfc_, X_embedded, y, cv=5).mean()
    scores.append(score)
plt.plot(threshold, scores)
plt.show()
print("Best cross-validation score:", max(scores))
print("Threshold at the best score:", threshold[scores.index(max(scores))])
a = pd.DataFrame({'feature': x.columns.to_list()})
a
b = pd.DataFrame({'feature_importances_': importances})
b
c = pd.concat([a, b], axis=1)
c = c.sort_values(by='feature_importances_', ascending=False)
# c = c[:60]
c
LightGBM feature importance
from sklearn.model_selection import StratifiedKFold
import lightgbm

def select_by_lgb(train_data, train_label, random_state=2023, n_splits=5,
                  metric='auc', num_round=10000, early_stopping_rounds=100):
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    feature_importances = pd.DataFrame()
    feature_importances['feature'] = train_data.columns
    fold = 0
    for train_idx, val_idx in kfold.split(train_data, train_label):
        random_state += 1
        train_x = train_data.iloc[train_idx]
        train_y = train_label.iloc[train_idx]
        test_x = train_data.iloc[val_idx]
        test_y = train_label.iloc[val_idx]
        clf = lightgbm
        train_matrix = clf.Dataset(train_x, label=train_y)
        test_matrix = clf.Dataset(test_x, label=test_y)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'learning_rate': 0.1,
            'metric': metric,
            'seed': 2020,
            'nthread': -1,
            'verbose': -1}
        # early_stopping_rounds as a train() argument requires lightgbm<4;
        # newer versions pass lightgbm.early_stopping(...) via callbacks instead
        model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                          early_stopping_rounds=early_stopping_rounds)
        feature_importances['fold_{}'.format(fold + 1)] = model.feature_importance()
        fold += 1
    feature_importances['average'] = feature_importances[
        ['fold_{}'.format(i) for i in range(1, n_splits + 1)]].mean(axis=1)
    return feature_importances

train_data = data.drop(labels='label', axis=1)
train_data
target = data['label']
target
feature_importances = select_by_lgb(train_data, target)
feature_importances = feature_importances.sort_values(by='average', ascending=False)
feature_importances
Mutual-information feature selection (captures both linear and nonlinear relationships)
X = data.iloc[:, :-1]
y = data['label']
from sklearn.feature_selection import mutual_info_classif as MIC
result = MIC(X, y)
# number of features with strictly positive mutual information
k = result.shape[0] - sum(result <= 0)
a = pd.DataFrame({'feature': X.columns.to_list()})
a
b = pd.DataFrame({'feature_importances_': result})
b
c = pd.concat([a, b], axis=1)
c = c.sort_values(by='feature_importances_', ascending=False)
c

A brief survey of feature-selection methods: https://cloud.tencent.com/developer/article/1848218#

Since the competition allows at most 50 features, I also wrote a snippet that strips the statistic suffix from each selected feature name and counts the distinct underlying sensor channels, to check that the 50-feature limit is not exceeded.

# Strip the statistic suffix from each selected feature name to recover the
# underlying sensor channel, then deduplicate while preserving ranking order
suffixes = ['_sum', '_min', '_max', '_std', '_var', '_pk', '_mean', '_skew',
            '_kurt', '_rms', '_boxing', '_fengzhi', '_maichong', '_yudu',
            '_25', '_50', '_75']
feature_list = []
for feature in list(c['feature'][:100]):
    for s in suffixes:
        if feature.endswith(s):
            feature = feature[:-len(s)]
            break
    feature_list.append(feature)
new_feature_list = sorted(set(feature_list), key=feature_list.index)
new_feature_list
len(new_feature_list)   # must stay within the 50-feature limit
allfeature = list(c['feature'][:100])
allfeature

Model building

Any of the usual machine-learning models would work here: random forest, LightGBM, XGBoost, CatBoost, and so on. But because my features overfit the validation set, those models performed poorly, so I switched to a GBDT+LR pipeline for training and prediction. The results were still mediocre, so I skipped hyperparameter tuning and the rest. I won't include that code here; if you're interested in GBDT+LR, see http://t.csdn.cn/t6ZKI — the code is in the GitHub repo linked at the end.
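
For reference, a minimal sketch of the GBDT+LR idea (not my competition code): train a gradient-boosted tree model, one-hot encode the leaf index each sample falls into in every tree, and fit logistic regression on that encoding. X_train, y_train, X_test stand for any train/test split, e.g. one fold from the cross-validation loop below.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=2023)
gbdt.fit(X_train, y_train)

# apply() returns the leaf index of every sample in every tree;
# flatten to (n_samples, n_estimators) before encoding
train_leaves = gbdt.apply(X_train).reshape(len(X_train), -1)
test_leaves = gbdt.apply(X_test).reshape(len(X_test), -1)
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_leaves)

lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(train_leaves), y_train)
test_pred = lr.predict(enc.transform(test_leaves))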

The k-fold cross-validation is still worth recording, since I had previously only used it to tune a random forest. Because the classes are imbalanced, I did not use train_test_split or plain KFold; instead I used StratifiedKFold, which splits the data while preserving the label ratio.

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, recall_score
import xgboost as xgb

label = data['label']   # same labels as `target` above
kf = StratifiedKFold(n_splits=5, shuffle=True)
f1_scores1 = []
recall_scores1 = []
for train_index, test_index in kf.split(train_data, label):
    samples = np.array(train_data)
    labels = np.array(label)
    X_train, X_test = samples[train_index], samples[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    ng_nums = y_test.sum()
    print('Test OK nums: {}'.format(len(y_test) - ng_nums))
    print('Test NG nums: {}'.format(ng_nums))
    model1 = xgb.XGBClassifier(max_depth=8, learning_rate=0.05, n_estimators=100,
                               objective='binary:logistic',
                               eval_metric=['logloss', 'auc', 'error'],
                               use_label_encoder=False)
    model1.fit(X_train, y_train)
    test_pred1 = model1.predict(X_test)
    f1_scores1.append(f1_score(y_test, test_pred1))
    recall_scores1.append(recall_score(y_test, test_pred1, average='binary'))
print(f1_scores1)
print(recall_scores1)

Model-ensembling code:

from sklearn.ensemble import VotingClassifier
# rfc, cbc and xgb here are previously instantiated RandomForest / CatBoost /
# XGBoost classifier objects (see the sketch that follows)
voting_clf = VotingClassifier(estimators=[('rfc', rfc), ('cbc', cbc), ('xgb', xgb)],
                              voting='soft')
for clf in (rfc, cbc, xgb, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
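
For completeness, a self-contained sketch of how the three base learners might be instantiated and the soft-voting ensemble evaluated. The parameter values are illustrative placeholders, not my tuned settings, and x_train/y_train/x_test/y_test stand for any train/test split as above.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# NOTE: this rebinds the name xgb, used earlier as the xgboost module alias
rfc = RandomForestClassifier(n_estimators=100, random_state=2023)
cbc = CatBoostClassifier(iterations=200, verbose=0, random_state=2023)
xgb = XGBClassifier(n_estimators=100, eval_metric='logloss', use_label_encoder=False)

voting_clf = VotingClassifier(estimators=[('rfc', rfc), ('cbc', cbc), ('xgb', xgb)],
                              voting='soft')
voting_clf.fit(x_train, y_train)
print('voting F1:', f1_score(y_test, voting_clf.predict(x_test)))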

Common Docker operations

# Check disk space on the Linux host
df -h
# Install dependencies
pip install -r requirements.txt
# Build the Docker image
chmod +x build.sh
sudo ./build.sh
# Mount data into a container
sudo docker run -it -v /home/wyp/data:/code/components/data components-demo:1.0.0 /bin/bash
sudo docker run -it -v /home/wyp/train/NG:/code/components/data components-demo:1.0.0 /bin/bash
# Run the code
chmod +x run.sh
./run.sh

Summary

First, because I did not understand the business background well enough, and never fully understood what makes a sample NG, my feature construction amounted to mindlessly piling on features: some were unnecessary, some redundant, and I put them through selection and used them without really examining them. After the competition I saw that the top players had done EDA and dug out useful facts, such as some features being identical, and built their features from that analysis. That drove home the importance of feature engineering: good features are the foundation of success. I also learned new ways to handle and filter missing values, stopped sticking to a single model, tried several models, ran stratified k-fold cross-validation, and experimented with a voting ensemble. Overall I learned a lot and gained a great deal.

Second, don't stare at the leaderboard during a competition; comparing yourself to others is just demoralizing. Do your own work: if you have something worth submitting, submit it and see. Wrecking your health over a competition is not worth it either. Competitions really do involve luck: late on, through a happy accident my score jumped from 50-something to 75, which would not have happened without some luck on my side.

Finally, thanks to 飞哥 for the discussions throughout the competition, to the experts in the group for answering questions, to 军哥 for his pointers, and to my teammate 凯凯 for the collaboration.

Next up: the 2023 CCF DCIC fraud-risk identification problem (risk control, binary classification) and the 6th National Industrial Internet Data Innovation Application Competition product measurement value prediction problem (regression). Onward!
