Notes on 《Web安全之深度学习实战》 (Web Security: Deep Learning in Practice): Chapter 14, Malware Classification

This section uses the MIST dataset as a running example of malware classification. Features are extracted with a 2-gram bag-of-words model weighted by TF-IDF, and the classifiers covered include support vector machines, XGBoost, multilayer perceptrons, and 1D/2D convolutional networks.
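Before the full pipeline, here is a minimal sketch of what the 2-gram plus TF-IDF step does to text; the two instruction traces are made up purely for illustration:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # Two made-up instruction traces standing in for MIST reports
    docs = ["mov eax ebx push eax",
            "push ebp mov ebp esp"]
    cv = CountVectorizer(ngram_range=(2, 2), token_pattern=r'\b\w+\b')
    counts = cv.fit_transform(docs)              # one row of bigram counts per document
    tfidf = TfidfTransformer().fit_transform(counts)
    print(cv.get_feature_names_out())            # bigrams such as 'mov eax', 'push eax'
    print(tfidf.toarray().round(2))              # TF-IDF-weighted feature vectors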

1. Malware

Malware is commonly identified through static file signatures and high-risk dynamic behavior patterns, but such signature bases grow in step with the exponential growth of malware itself. Traditional rule-based detection can no longer cover all malware, so endpoint-security vendors have invested heavily in sandboxing and machine-learning techniques in the hope of meaningfully improving detection rates.

2. Dataset

The test data comes from Marco Ramilli's MIST dataset (Malware Instruction Set for behaviour analysis). MIST is built by analyzing a large number of malware samples and extracting both static file features and dynamic behavioral features; the feature-acquisition process is shown in Figure 14-2.

The loading code is shown below:

    def load_files():
        # The four malware families in the dataset; the index is the class label
        malware_class = ['APT1', 'Crypto', 'Locker', 'Zeus']
        x = []
        y = []
        for i, family in enumerate(malware_class):
            dir = "../data/malware/MalwareTrainingSets-master/trainingSets/%s/*" % family
            print("Load files from %s index %d" % (dir, i))
            v = load_files_from_dir(dir)
            x += v
            y += [i] * len(v)
        print("Loaded files %d" % len(x))
        return x, y

Each file is collapsed into a single string, and the family names are scrubbed out so that the labels cannot leak into the features:

    import glob
    import re

    def load_files_from_dir(dir):
        files = glob.glob(dir)
        result = []
        for file in files:
            with open(file) as f:
                lines = f.readlines()
            # Collapse the file into one long string
            lines_to_line = " ".join(lines)
            # Scrub the family names (case-insensitively) so the labels
            # cannot leak into the features
            lines_to_line = re.sub(r"APT1|Crypto|Locker|Zeus", ' ',
                                   lines_to_line, flags=re.I)
            result.append(lines_to_line)
        return result

3. Feature Extraction

(1) Ngram-TFIDF

Bigram counts are extracted with CountVectorizer and reweighted with TF-IDF; each sample becomes a dense 1000-dimensional vector, and the samples are split 60/40 into training and test sets:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import train_test_split

    def get_feature_text():
        x, y = load_files()
        max_features = 1000
        # Binary bag of 2-grams, capped at the 1000 most frequent bigrams
        vectorizer = CountVectorizer(
            decode_error='ignore',
            ngram_range=(2, 2),
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1,
            token_pattern=r'\b\w+\b',
            binary=True)
        print(vectorizer)
        x = vectorizer.fit_transform(x)
        transformer = TfidfTransformer(smooth_idf=False)
        x = transformer.fit_transform(x)
        # Important: convert the sparse matrix to a dense array
        x = x.toarray()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
        return x_train, x_test, y_train, y_test
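As a quick sanity check, the split sizes line up with the 2857/1905 training and validation counts that appear in the run logs at the end of this section (a sketch):

    x_train, x_test, y_train, y_test = get_feature_text()
    # With 4762 samples and test_size=0.4: about (2857, 1000) and (1905, 1000)
    print(x_train.shape, x_test.shape)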

(2) Ngram-2D

Here 1024 bigram counts are kept as raw frequencies (binary=False) so that each 1024-dimensional sample can be reshaped into a 32×32 single-channel "image" for the 2D CNN:

    import numpy as np

    def get_feature_pe_picture():
        # Load the raw files
        x, y = load_files()
        max_features = 1024
        vectorizer = CountVectorizer(
            decode_error='ignore',
            ngram_range=(2, 2),
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1,
            dtype=np.int64,
            token_pattern=r'\b\w+\b',
            binary=False)
        print(vectorizer)
        x = vectorizer.fit_transform(x)
        # Important: convert the sparse matrix to a dense array
        x = x.toarray()
        x_pic = []
        for i in range(len(x)):  # 4762 samples in the full dataset
            # Reshape each (1024,) vector into a (32, 32, 1) "image"
            pic = np.reshape(x[i], (32, 32, 1))
            x_pic.append(pic)
            #save_image(pic,i)
        # Randomly split into training and test sets
        x_train, x_test, y_train, y_test = train_test_split(x_pic, y, test_size=0.4)
        return x_train, x_test, y_train, y_test
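To eyeball one of these pseudo-images, a short matplotlib sketch can be used (matplotlib is an assumption here; the original relies on the commented-out save_image helper instead):

    import numpy as np
    import matplotlib.pyplot as plt

    x_train, x_test, y_train, y_test = get_feature_pe_picture()
    # Show the first training sample as a 32x32 grayscale image
    plt.imshow(np.squeeze(x_train[0]), cmap='gray')
    plt.title("bigram counts, class %d" % y_train[0])
    plt.show()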

4. Model Construction

(1) XGBoost

    import xgboost as xgb
    from sklearn.metrics import classification_report

    def do_xgboost(x_train, x_test, y_train, y_test):
        xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
        y_pred = xgb_model.predict(x_test)
        print(classification_report(y_test, y_pred))

(2) SVM

    from sklearn import svm

    def do_svm(x_train, x_test, y_train, y_test):
        # Linear-kernel SVM
        clf = svm.SVC(kernel='linear', C=1.0)
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        print(classification_report(y_test, y_pred))

(3) MLP

    from sklearn.neural_network import MLPClassifier
    from sklearn import metrics

    def do_mlp(x_train, x_test, y_train, y_test):
        # Two hidden layers of 10 and 4 units, trained with L-BFGS
        clf = MLPClassifier(solver='lbfgs',
                            alpha=1e-5,
                            hidden_layer_sizes=(10, 4),
                            random_state=1)
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        print(classification_report(y_test, y_pred))
        print(metrics.confusion_matrix(y_test, y_pred))

(4) CNN_2d

    import tflearn
    from tflearn.data_utils import to_categorical
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.conv import conv_2d, max_pool_2d
    from tflearn.layers.normalization import local_response_normalization
    from tflearn.layers.estimator import regression

    def do_cnn_2d(trainX, testX, trainY, testY):
        print("text feature and cnn 2d")
        # Convert labels to one-hot vectors
        trainY = to_categorical(trainY, nb_classes=4)
        testY = to_categorical(testY, nb_classes=4)
        # Build the convolutional network: two conv/pool blocks,
        # two small fully connected layers, then a 4-way softmax
        network = input_data(shape=[None, 32, 32, 1], name='input')
        network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
        network = max_pool_2d(network, 2)
        network = local_response_normalization(network)
        network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
        network = max_pool_2d(network, 2)
        network = local_response_normalization(network)
        network = fully_connected(network, 16, activation='tanh')
        network = dropout(network, 0.1)
        network = fully_connected(network, 16, activation='tanh')
        network = dropout(network, 0.1)
        network = fully_connected(network, 4, activation='softmax')
        network = regression(network, optimizer='adam', learning_rate=0.01,
                             loss='categorical_crossentropy', name='target')
        # Training
        model = tflearn.DNN(network, tensorboard_verbose=0)
        model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY),
                  show_metric=True, run_id="malware")

(5) CNN_1d

    import tensorflow as tf
    from tflearn.layers.conv import conv_1d, global_max_pool
    from tflearn.layers.merge_ops import merge

    def do_cnn_1d(trainX, testX, trainY, testY):
        print("text feature and cnn")
        # Convert labels to one-hot vectors
        trainY = to_categorical(trainY, nb_classes=4)
        testY = to_categorical(testY, nb_classes=4)
        # Build the convolutional network
        network = input_data(shape=[None, 1000], name='input')
        network = tflearn.embedding(network, input_dim=1000000, output_dim=128,
                                    validate_indices=False)
        # Three parallel convolutions with kernel sizes 3, 4 and 5
        branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
        branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
        branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
        network = merge([branch1, branch2, branch3], mode='concat', axis=1)
        network = tf.expand_dims(network, 2)
        network = global_max_pool(network)
        network = dropout(network, 0.8)
        network = fully_connected(network, 4, activation='softmax')
        network = regression(network, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy', name='target')
        # Training
        model = tflearn.DNN(network, tensorboard_verbose=0)
        model.fit(trainX, trainY,
                  n_epoch=5, shuffle=True, validation_set=(testX, testY),
                  show_metric=True, batch_size=100, run_id="malware")
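The excerpt does not include a driver, so the following is a minimal sketch of how the pieces plug together to produce the results below (the exact main routine is an assumption):

    if __name__ == "__main__":
        # 1D text features: XGBoost, SVM, MLP and the 1D CNN
        x_train, x_test, y_train, y_test = get_feature_text()
        do_xgboost(x_train, x_test, y_train, y_test)
        do_svm(x_train, x_test, y_train, y_test)
        do_mlp(x_train, x_test, y_train, y_test)
        do_cnn_1d(x_train, x_test, y_train, y_test)

        # 2D pseudo-image features for the 2D CNN
        x_train, x_test, y_train, y_test = get_feature_pe_picture()
        do_cnn_2d(x_train, x_test, y_train, y_test)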

5. Results

Results on the 1D text features. Judging by the numbers, the CNN performs quite poorly:

    xgboost
                  precision    recall  f1-score   support

               0       0.98      0.94      0.96       113
               1       0.97      0.95      0.96       803
               2       0.96      0.90      0.93       205
               3       0.94      0.98      0.96       784

        accuracy                           0.95      1905
       macro avg       0.96      0.94      0.95      1905
    weighted avg       0.96      0.95      0.95      1905

    svm
                  precision    recall  f1-score   support

               0       0.96      0.92      0.94       113
               1       0.95      0.96      0.95       803
               2       0.91      0.87      0.89       205
               3       0.94      0.94      0.94       784

        accuracy                           0.94      1905
       macro avg       0.94      0.92      0.93      1905
    weighted avg       0.94      0.94      0.94      1905

    cnn
    | Adam | epoch: 001 | loss: 1.09685 - acc: 0.4432 | val_loss: 1.13283 - val_acc: 0.4089 -- iter: 2857/2857
    | Adam | epoch: 002 | loss: 1.09272 - acc: 0.4425 | val_loss: 1.12148 - val_acc: 0.4089 -- iter: 2857/2857
    | Adam | epoch: 003 | loss: 1.11942 - acc: 0.4117 | val_loss: 1.11967 - val_acc: 0.4089 -- iter: 2857/2857
    | Adam | epoch: 004 | loss: 1.12596 - acc: 0.4221 | val_loss: 1.12072 - val_acc: 0.4089 -- iter: 2857/2857
    | Adam | epoch: 005 | loss: 1.11561 - acc: 0.4272 | val_loss: 1.12084 - val_acc: 0.4089 -- iter: 2857/2857

The 2D CNN's performance, shown below, is not much better:

    | Adam | epoch: 001 | loss: 1.23541 - acc: 0.4109 | val_loss: 1.11576 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 002 | loss: 1.16763 - acc: 0.4203 | val_loss: 1.11529 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 003 | loss: 1.12465 - acc: 0.4194 | val_loss: 1.11524 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 004 | loss: 1.11964 - acc: 0.4281 | val_loss: 1.11697 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 005 | loss: 1.11276 - acc: 0.4242 | val_loss: 1.11429 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 006 | loss: 1.11595 - acc: 0.4346 | val_loss: 1.11510 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 007 | loss: 1.10915 - acc: 0.4170 | val_loss: 1.10926 - val_acc: 0.4247 -- iter: 2857/2857
    | Adam | epoch: 008 | loss: 1.11696 - acc: 0.4282 | val_loss: 1.10626 - val_acc: 0.4268 -- iter: 2857/2857
    | Adam | epoch: 009 | loss: 1.14538 - acc: 0.4108 | val_loss: 1.08093 - val_acc: 0.4268 -- iter: 2857/2857
    | Adam | epoch: 010 | loss: 1.09208 - acc: 0.4215 | val_loss: 1.08760 - val_acc: 0.4241 -- iter: 2857/2857

XGBoost still performs well, however. CNNs generally excel on images, but they do quite poorly on this malware-classification task, plausibly because the bigram features have no meaningful spatial arrangement: reshaping them into a 32×32 "image" (or pushing TF-IDF floats through an embedding layer designed for integer token indices) leaves the convolutions little structure to exploit.
