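This section uses the MIST data set as an example to introduce malware classification techniques. Features are extracted with 2-gram and TF-IDF models, and the classification algorithms covered include support vector machines, XGBoost, and multilayer perceptrons.
Common malware detection methods rely mainly on static file signatures and high-risk dynamic behavior features, and the number of such signatures grows exponentially along with the malware itself. Traditional rule-based detection can no longer cover all malware, so endpoint security vendors have invested heavily in sandboxes and machine learning in the hope of improving their ability to identify malware.
The test data comes from Marco Ramilli's MIST data set (Malware Instruction Set for Behaviour Analysis). MIST analyzes a large number of malware samples and extracts static file features as well as dynamic program behavior features; the feature extraction process is shown in Figure 14-2.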
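The 2-gram and TF-IDF features used in this section can be illustrated on toy data. The sketch below is pure Python and only mirrors the idea. The helper names `bigrams` and `tfidf` and the two sample strings are made up for illustration, and the weighting `tf * (ln(n/df) + 1)` followed by L2 normalization is the form scikit-learn documents for `TfidfTransformer(smooth_idf=False)`.

```python
import math
from collections import Counter

def bigrams(text):
    """Split on whitespace and return adjacent token pairs (2-grams)."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf(docs):
    """Return one {bigram: weight} dict per document, L2-normalized."""
    counts = [Counter(bigrams(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each bigram occurs
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        raw = {t: tf * (math.log(n / df[t]) + 1.0) for t, tf in c.items()}
        # Guard against documents that contain no bigram at all
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical behaviour traces, not taken from the MIST data set
docs = ["open read write close", "open read send close"]
weights = tfidf(docs)
```

In this toy example the 2-gram `open read` occurs in both documents, so it receives a lower weight than `read write`, which occurs in only one.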
The source code is as follows:
- def load_files():
-     malware_class = ['APT1', 'Crypto', 'Locker', 'Zeus']
-     x = []
-     y = []
-     for i, family in enumerate(malware_class):
-         # 'path' is used instead of 'dir' so the built-in dir() is not shadowed
-         path = "../data/malware/MalwareTrainingSets-master/trainingSets/%s/*" % family
-         print("Load files from %s index %d" % (path, i))
-         v = load_files_from_dir(path)
-         x += v
-         # The label of every sample is the index of its family
-         y += [i] * len(v)
-     print("Loaded files %d" % len(x))
-     return x, y
Each file is processed as follows:
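- import glob
- import re
-
- def load_files_from_dir(path):
-     files = glob.glob(path)
-     result = []
-     for file in files:
-         # print("Load file %s" % file)
-         with open(file) as f:
-             lines = f.readlines()
-         lines_to_line = " ".join(lines)
-         # Strip the family names so the label does not leak into the features.
-         # Alternation matches whole names; a [...] character class would
-         # match single letters instead.
-         lines_to_line = re.sub(r"APT1|Crypto|Locker|Zeus", ' ', lines_to_line, flags=re.I)
-         result.append(lines_to_line)
-     return result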
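- from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
- from sklearn.model_selection import train_test_split
-
- def get_feature_text():
-     x, y = load_files()
-     max_features = 1000
-
-     # Binary 2-gram bag of words, capped at max_features terms
-     vectorizer = CountVectorizer(
-         decode_error='ignore',
-         ngram_range=(2, 2),
-         strip_accents='ascii',
-         max_features=max_features,
-         stop_words='english',
-         max_df=1.0,
-         min_df=1,
-         token_pattern=r'\b\w+\b',
-         binary=True)
-     print(vectorizer)
-     x = vectorizer.fit_transform(x)
-
-     # Re-weight the 2-gram features with TF-IDF
-     transformer = TfidfTransformer(smooth_idf=False)
-     x = transformer.fit_transform(x)
-
-     # Important: convert the sparse matrix to a dense array
-     x = x.toarray()
-
-     # Hold out 40% of the samples as the test set
-     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
-     return x_train, x_test, y_train, y_test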
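In the file-loading step, the family names are stripped from each sample so that the label does not leak into the features. When writing such a pattern, it helps to remember that in Python's `re` syntax square brackets define a character class, which matches single characters, whereas alternation matches whole names. A minimal sketch of the difference, on a made-up sample line:

```python
import re

sample = "Zeus trojan calls CreateProcess"  # hypothetical line, for illustration

# Character class: every single letter in the set (and '|') is replaced.
as_class = re.sub(r"[APT|Crypto|Locker|Zeus]", " ", sample, flags=re.I)

# Alternation: only the complete family names are replaced.
as_names = re.sub(r"APT1|Crypto|Locker|Zeus", " ", sample, flags=re.I)

print(repr(as_class))
print(repr(as_names))
```

With the character class, most letters of ordinary words are blanked out, so the 2-gram features are built from fragments; with alternation only the family name disappears.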
- import numpy as np
-
- def get_feature_pe_picture():
-     # Load the raw files
-     x, y = load_files()
-     max_features = 1024
-     vectorizer = CountVectorizer(
-         decode_error='ignore',
-         ngram_range=(2, 2),
-         strip_accents='ascii',
-         max_features=max_features,
-         stop_words='english',
-         max_df=1.0,
-         min_df=1,
-         dtype=np.int64,  # the np.int alias no longer exists in recent NumPy
-         token_pattern=r'\b\w+\b',
-         binary=False)
-     print(vectorizer)
-     x = vectorizer.fit_transform(x)
-     # Important: convert the sparse matrix to a dense array
-     x = x.toarray()
-     x_pic = []
-     # Iterate over every sample
-     for i in range(len(x)):
-         # Reshape each 1024-dimensional vector into a 32x32x1 array
-         pic = np.reshape(x[i], (32, 32, 1))
-         x_pic.append(pic)
-         # save_image(pic, i)
-     # Randomly split the samples into training and test sets
-     x_train, x_test, y_train, y_test = train_test_split(x_pic, y, test_size=0.4)
-     return x_train, x_test, y_train, y_test
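The reshape step treats each 1024-dimensional count vector as a 32x32 single-channel image, filled row by row. A small pure-Python sketch of that row-major layout follows; the helper `to_grid` is hypothetical and merely stands in for `np.reshape`, and the input vector is a placeholder rather than real feature data.

```python
def to_grid(vec, rows=32, cols=32):
    """Row-major reshape of a flat list into rows x cols, like np.reshape."""
    if len(vec) != rows * cols:
        raise ValueError("expected %d values, got %d" % (rows * cols, len(vec)))
    return [vec[r * cols:(r + 1) * cols] for r in range(rows)]

vec = list(range(1024))   # stand-in for one 1024-feature count vector
grid = to_grid(vec)
# grid[r][c] holds the feature at flat index r * 32 + c
```

The length check matters because `max_features` is only an upper bound: if the corpus yields fewer than 1024 distinct 2-grams, each vector is shorter and the reshape fails.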
- import xgboost as xgb
- from sklearn import metrics
- from sklearn.metrics import classification_report
- from sklearn.neural_network import MLPClassifier
- from sklearn.svm import SVC
-
- def do_xgboost(x_train, x_test, y_train, y_test):
-     xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
-     y_pred = xgb_model.predict(x_test)
-     print(classification_report(y_test, y_pred))
-
- def do_svm(x_train, x_test, y_train, y_test):
-     # Linear-kernel support vector machine
-     clf = SVC(kernel='linear', C=1.0)
-     clf.fit(x_train, y_train)
-     y_pred = clf.predict(x_test)
-     print(classification_report(y_test, y_pred))
-
- def do_mlp(x_train, x_test, y_train, y_test):
-     # Two hidden layers with 10 and 4 units
-     clf = MLPClassifier(solver='lbfgs',
-                         alpha=1e-5,
-                         hidden_layer_sizes=(10, 4),
-                         random_state=1)
-     clf.fit(x_train, y_train)
-     y_pred = clf.predict(x_test)
-     print(classification_report(y_test, y_pred))
-     print(metrics.confusion_matrix(y_test, y_pred))
-
- import tflearn
- from tflearn.data_utils import to_categorical
- from tflearn.layers.core import input_data, dropout, fully_connected
- from tflearn.layers.conv import conv_2d, max_pool_2d
- from tflearn.layers.normalization import local_response_normalization
- from tflearn.layers.estimator import regression
-
- def do_cnn_2d(trainX, testX, trainY, testY):
-     print("text feature and cnn 2d")
-     # Converting labels to binary vectors
-     trainY = to_categorical(trainY, nb_classes=4)
-     testY = to_categorical(testY, nb_classes=4)
-     # Building convolutional network
-     network = input_data(shape=[None, 32, 32, 1], name='input')
-     network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
-     network = max_pool_2d(network, 2)
-     network = local_response_normalization(network)
-     network = conv_2d(network, 16, 3, activation='relu', regularizer="L2")
-     network = max_pool_2d(network, 2)
-     network = local_response_normalization(network)
-     network = fully_connected(network, 16, activation='tanh')
-     network = dropout(network, 0.1)
-     network = fully_connected(network, 16, activation='tanh')
-     network = dropout(network, 0.1)
-     network = fully_connected(network, 4, activation='softmax')
-     network = regression(network, optimizer='adam', learning_rate=0.01,
-                          loss='categorical_crossentropy', name='target')
-
-     # Training
-     model = tflearn.DNN(network, tensorboard_verbose=0)
-     model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY),
-               show_metric=True, run_id="malware")
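The `classification_report` and `confusion_matrix` outputs printed for the classical models are closely related: per-class precision, recall and F1 can be read off the confusion matrix. A minimal sketch, assuming rows are true classes and columns are predicted classes (the scikit-learn convention); the helper `per_class_scores` is hypothetical and the 3-class matrix is invented for illustration.

```python
def per_class_scores(cm):
    """Return (precision, recall, f1) per class from a square confusion matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores

# Made-up 3-class confusion matrix, for illustration only
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 1, 9]]
scores = per_class_scores(cm)
```

Precision answers "of the samples predicted as family k, how many really are k", while recall answers "of the real family-k samples, how many were found", which is why the two can differ noticeably for the smaller families.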
- import tensorflow as tf
- import tflearn
- from tflearn.data_utils import to_categorical
- from tflearn.layers.core import input_data, dropout, fully_connected
- from tflearn.layers.conv import conv_1d, global_max_pool
- from tflearn.layers.merge_ops import merge
- from tflearn.layers.estimator import regression
-
- def do_cnn_1d(trainX, testX, trainY, testY):
-     print("text feature and cnn")
-     # Converting labels to binary vectors
-     trainY = to_categorical(trainY, nb_classes=4)
-     testY = to_categorical(testY, nb_classes=4)
-
-     # Building convolutional network
-     network = input_data(shape=[None, 1000], name='input')
-     network = tflearn.embedding(network, input_dim=1000000, output_dim=128, validate_indices=False)
-     # Three parallel 1D convolutions with kernel sizes 3, 4 and 5
-     branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
-     branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
-     branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
-     network = merge([branch1, branch2, branch3], mode='concat', axis=1)
-     network = tf.expand_dims(network, 2)
-     network = global_max_pool(network)
-     network = dropout(network, 0.8)
-     network = fully_connected(network, 4, activation='softmax')
-     network = regression(network, optimizer='adam', learning_rate=0.001,
-                          loss='categorical_crossentropy', name='target')
-     # Training
-     model = tflearn.DNN(network, tensorboard_verbose=0)
-     model.fit(trainX, trainY,
-               n_epoch=5, shuffle=True, validation_set=(testX, testY),
-               show_metric=True, batch_size=100, run_id="malware")
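One thing worth checking in the 1D network is the input type. An embedding layer looks up rows by integer index, whereas TF-IDF features are real values between 0 and 1. Assuming the layer truncates its input to integers before the lookup (an assumption about the library's behaviour, not something stated in this article), almost every feature would map to index 0. A pure-Python sketch of that effect, with invented values:

```python
# Invented TF-IDF row; L2-normalized weights lie in [0, 1]
tfidf_row = [0.0, 0.12, 0.57, 0.0, 0.81, 0.03]

# Truncation to integer indices, as an index-based lookup would require
indices = [int(v) for v in tfidf_row]

# Number of distinct embedding rows this sample would select
distinct = len(set(indices))
print(indices, distinct)
```

If that assumption holds, different samples would look nearly identical to the convolution layers, which is one possible explanation for a network that does not learn; feeding integer token ids to the embedding would be the usual alternative.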
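Results on the 1D text features are shown below. Judging from the output, the CNN performs far worse than XGBoost and SVM.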
- xgboost
-               precision    recall  f1-score   support
-
-            0       0.98      0.94      0.96       113
-            1       0.97      0.95      0.96       803
-            2       0.96      0.90      0.93       205
-            3       0.94      0.98      0.96       784
-
-     accuracy                           0.95      1905
-    macro avg       0.96      0.94      0.95      1905
- weighted avg       0.96      0.95      0.95      1905
-
- svm
-               precision    recall  f1-score   support
-
-            0       0.96      0.92      0.94       113
-            1       0.95      0.96      0.95       803
-            2       0.91      0.87      0.89       205
-            3       0.94      0.94      0.94       784
-
-     accuracy                           0.94      1905
-    macro avg       0.94      0.92      0.93      1905
- weighted avg       0.94      0.94      0.94      1905
- cnn
- | Adam | epoch: 001 | loss: 1.09685 - acc: 0.4432 | val_loss: 1.13283 - val_acc: 0.4089 -- iter: 2857/2857
- | Adam | epoch: 002 | loss: 1.09272 - acc: 0.4425 | val_loss: 1.12148 - val_acc: 0.4089 -- iter: 2857/2857
- | Adam | epoch: 003 | loss: 1.11942 - acc: 0.4117 | val_loss: 1.11967 - val_acc: 0.4089 -- iter: 2857/2857
- | Adam | epoch: 004 | loss: 1.12596 - acc: 0.4221 | val_loss: 1.12072 - val_acc: 0.4089 -- iter: 2857/2857
- | Adam | epoch: 005 | loss: 1.11561 - acc: 0.4272 | val_loss: 1.12084 - val_acc: 0.4089 -- iter: 2857/2857
The performance of the 2D CNN is shown below, and it does not look much better:
- | Adam | epoch: 001 | loss: 1.23541 - acc: 0.4109 | val_loss: 1.11576 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 002 | loss: 1.16763 - acc: 0.4203 | val_loss: 1.11529 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 003 | loss: 1.12465 - acc: 0.4194 | val_loss: 1.11524 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 004 | loss: 1.11964 - acc: 0.4281 | val_loss: 1.11697 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 005 | loss: 1.11276 - acc: 0.4242 | val_loss: 1.11429 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 006 | loss: 1.11595 - acc: 0.4346 | val_loss: 1.11510 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 007 | loss: 1.10915 - acc: 0.4170 | val_loss: 1.10926 - val_acc: 0.4247 -- iter: 2857/2857
- | Adam | epoch: 008 | loss: 1.11696 - acc: 0.4282 | val_loss: 1.10626 - val_acc: 0.4268 -- iter: 2857/2857
- | Adam | epoch: 009 | loss: 1.14538 - acc: 0.4108 | val_loss: 1.08093 - val_acc: 0.4268 -- iter: 2857/2857
- | Adam | epoch: 010 | loss: 1.09208 - acc: 0.4215 | val_loss: 1.08760 - val_acc: 0.4241 -- iter: 2857/2857
XGBoost, however, performs well. CNNs generally do well on images, but on this malware classification task their results are very poor.
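One way to put the CNN accuracy in context is to compare it with a majority-class baseline, that is, always predicting the most frequent family. The sketch below uses the per-class support values from the XGBoost report; the CNN runs used their own random splits, so the comparison is only approximate.

```python
# Per-class support copied from the XGBoost classification report
support = {0: 113, 1: 803, 2: 205, 3: 784}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_acc = support[majority_class] / total
print("majority class %d, baseline accuracy %.4f" % (majority_class, baseline_acc))
```

The baseline comes out at roughly 0.42, which is about where both CNNs' validation accuracy sits (0.41 to 0.43). This suggests the networks may be predicting a single class for nearly every sample rather than learning from the features.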