
《Web安全之深度学习实战》Notes: Chapter 10, User Behavior Analysis and Malicious Behavior Detection

        This chapter uses the SEA dataset to walk through a typical UBA application scenario: detecting malicious operations. We have in fact already met this dataset in 《Web安全之机器学习入门》.

        We refer collectively to the abnormal operations of malicious insiders and of ordinary employees as malicious operations. Detecting them calls for advanced techniques such as User Behavior Analytics (UBA), an emerging technology that provides data-protection and fraud-detection capabilities that were previously missed. Working alongside the systems a user touches day to day, UBA applies dedicated security-analytics algorithms that track not only the initial login but every subsequent action. UBA serves two main purposes: it establishes a baseline of the normal activity a user performs, and it quickly flags behavior that deviates from that baseline so a security analyst can investigate. Some anomalies may not look malicious at first glance, which is exactly why the analyst needs to dig further and decide whether the behavior is legitimate or malicious.
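As a toy illustration of the baseline idea only (this is not the book's code; the per-session numeric feature and the 3-sigma threshold are assumptions made for the sketch), a minimal "baseline plus deviation" check could look like:

    import numpy as np

    def flag_anomalies(history, current, k=3.0):
        """Flag values more than k standard deviations away from a user's baseline."""
        mu, sigma = np.mean(history), np.std(history)
        return [v for v in current if abs(v - mu) > k * sigma]

    # e.g. commands issued per session for one user
    baseline = [52, 48, 50, 47, 55, 51, 49]
    today = [50, 53, 240]                    # 240 stands out against the baseline
    print(flag_anomalies(baseline, today))   # -> [240]

The rest of the chapter replaces this toy numeric baseline with models learned from each user's command sequences.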

一、Dataset

        The SEA dataset contains behavior logs for more than 70 UNIX users, collected from the commands those users issued on UNIX systems. Each user contributes 15,000 commands. Fifty users are randomly selected as normal users, and commands from the remaining users are randomly inserted into their logs to simulate attacks launched by internal masqueraders.

        Each user's data is divided into 150 blocks of 100 consecutive commands. The first third of the blocks is used to train that user's normal-behavior model, and malicious test data is randomly inserted into the remaining two thirds. The insertion follows a statistical rule: any given test block contains malicious commands with probability 1%, and once a block is contaminated, the probability that the following block is also contaminated rises to 80% [2]. In other words, SEA treats consecutive blocks as a session and can only simulate attacks that span consecutive sessions. The dataset also contains very few black samples; while that is closer to reality, it makes a naive random train/test split risky, because there is a fair chance the training set would end up containing only white samples. The split in this chapter therefore has to be handled specially so that enough black samples land in the training set (a quick check of the split appears right after the loading code below).

The corresponding source code is shown below:

    import numpy as np

    cmdlines_file="../data/uba/MasqueradeDat/User7"
    labels_file="../data/uba/MasqueradeDat/label.txt"
    word2ver_bin="uba_word2vec.bin"
    max_features=300
    index = 80

    def get_cmdlines():
        # 15,000 commands for User7, reshaped into 150 blocks of 100 commands
        x=np.loadtxt(cmdlines_file,dtype=str)
        x=x.reshape((150,100))
        # label.txt holds one column per labeled user for the 100 test blocks;
        # column 6 (0-based) corresponds to User7
        y=np.loadtxt(labels_file, dtype=int,usecols=6)
        y=y.reshape((100, 1))
        # the first 50 blocks are guaranteed clean, so prepend 50 zero labels
        y_train=np.zeros([50,1],int)
        y=np.concatenate([y_train,y])
        y=y.reshape((150, ))
        return x,y
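To see how many black samples the fixed split at index=80 leaves on each side (the concern raised above), a quick check might look like this (a sketch, not the book's code):

    x, y = get_cmdlines()
    print(x.shape, y.shape)                          # (150, 100) (150,)
    print("black blocks in train:", int(y[:index].sum()))
    print("black blocks in test :", int(y[index:].sum()))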

        For more detail on how this dataset is processed, see the earlier notes:

《Web安全之机器学习入门》笔记:第五章 5.3 K近邻检测异常操作(一)_mooyuan的博客-CSDN博客

二、Feature Extraction

(一)Wordbag

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    def get_features_by_wordbag():
        global max_features
        global index
        x_arr,y=get_cmdlines()
        x=[]
        # join each 100-command block into one space-separated "document"
        for i,v in enumerate(x_arr):
            v=" ".join(v)
            x.append(v)
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1 )
        x=vectorizer.fit_transform(x)
        # fixed split: first 80 blocks for training, remaining 70 for testing
        x_train=x[0:index,]
        x_test=x[index:,]
        y_train=y[0:index,]
        y_test=y[index:,]
        transformer = TfidfTransformer(smooth_idf=False)
        transformer.fit(x)
        x_test = transformer.transform(x_test)
        x_train = transformer.transform(x_train)
        return x_train, x_test, y_train, y_test

Taking the first block as an example, load the dataset by calling:

    x_arr,y=get_cmdlines()

At this point the first element looks like:

    ['cpp' 'sh' 'xrdb' 'cpp' 'sh' 'xrdb' 'mkpts' 'test' 'stty' 'hostname'
     'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'env' 'echo' 'sh' 'userenv'
     'wait4wm' 'xhost' 'xsetroot' 'reaper' 'xmodmap' 'sh' '[' 'cat' 'stty'
     'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
     'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh'
     'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'launchef' 'launchef'
     'sh' '9term' 'sh' 'launchef' 'sh' 'launchef' 'hostname' '[' 'cat' 'stty'
     'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
     'more' 'sh' 'ex' 'sendmail' 'sendmail' 'sh' 'MediaMai' 'sendmail' 'sh'
     'rm' 'MediaMai' 'sh' 'rm' 'MediaMai' 'launchef' 'launchef']

Join each block into a single string:

    x=[]
    print(x_arr[0])
    print(np.array(x_arr).shape)
    for i,v in enumerate(x_arr):
        v=" ".join(v)
        x.append(v)

The first element then becomes:

    cpp sh xrdb cpp sh xrdb mkpts test stty hostname date echo [ find chmod tty echo env echo sh userenv wait4wm xhost xsetroot reaper xmodmap sh [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh launchef launchef sh 9term sh launchef sh launchef hostname [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh ex sendmail sendmail sh MediaMai sendmail sh rm MediaMai sh rm MediaMai launchef launchef

Next comes the bag-of-words step:

    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    x=vectorizer.fit_transform(x)

The result is a sparse count matrix; for the first block it prints as (row, term_index) count entries:

    (0, 13)     2
    (0, 90)     25
    (0, 129)    2
    (0, 67)     1
    (0, 103)    1
    (0, 95)     3
    (0, 46)     4
    (0, 14)     3
    (0, 25)     7
    (0, 9)      3
    (0, 109)    3
    (0, 28)     1
    (0, 114)    1
    (0, 117)    1
    (0, 123)    1
    (0, 132)    1
    (0, 82)     1
    (0, 127)    1
    (0, 8)      2
    (0, 51)     6
    (0, 1)      1
    (0, 30)     1
    (0, 89)     3
    (0, 64)     3
    (0, 84)     2

Then apply the TF-IDF transformation:

    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_train = transformer.transform(x_train)

After the transformation the counts are replaced by TF-IDF weights; the first row becomes:

    (0, 132)    0.07590139034306102
    (0, 129)    0.15180278068612205
    (0, 127)    0.08038431251734222
    (0, 123)    0.07590139034306102
    (0, 117)    0.08038431251734222
    (0, 114)    0.07590139034306102
    (0, 109)    0.15401560934616065
    (0, 103)    0.07805309706428058
    (0, 95)     0.15273504757432566
    (0, 90)     0.7130806021902251
    (0, 89)     0.160729204804717
    (0, 84)     0.06701351493784334
    (0, 82)     0.08038431251734222
    (0, 67)     0.07590139034306102
    (0, 64)     0.12397049936474483
    (0, 51)     0.29322010438631935
    (0, 46)     0.1367884925425544
    (0, 30)     0.04484290866252592
    (0, 28)     0.07590139034306102
    (0, 25)     0.35638177767342655
    (0, 14)     0.09652546185265033
    (0, 13)     0.149769219205087
    (0, 9)      0.15147374310180983
    (0, 8)      0.06933404617293713
    (0, 1)      0.16804767932167497

        At this point x_train and x_test have the shapes below. The vector length is 136 even though max_features=300. max_features is only an upper bound on the vocabulary size: this single user's 150 command blocks contain just 136 distinct terms after stop-word filtering, so the feature dimension ends up at 136.

    max_features=300
    x_train (80, 136)
    x_test (70, 136)
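You can confirm this from the fitted vectorizer itself. A quick check, assuming the step-by-step vectorizer snippet above was run at the top level:

    print(len(vectorizer.vocabulary_))          # 136 distinct commands survive the filtering
    print(sorted(vectorizer.vocabulary_)[:5])   # a peek at the learned terms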

(二)n-gram

    def get_features_by_ngram():
        global max_features
        global index
        x_arr,y=get_cmdlines()
        x=[]
        for i,v in enumerate(x_arr):
            v=" ".join(v)
            x.append(v)
        vectorizer = CountVectorizer(
                                     ngram_range=(2, 4),
                                     token_pattern=r'\b\w+\b',
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1 )
        x=vectorizer.fit_transform(x)
        x_train=x[0:index,]
        x_test=x[index:,]
        y_train=y[0:index,]
        y_test=y[index:,]
        transformer = TfidfTransformer(smooth_idf=False)
        transformer.fit(x)
        x_test = transformer.transform(x_test)
        x_train = transformer.transform(x_train)
        return x_train, x_test, y_train, y_test
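The only changes from the word-bag version are ngram_range=(2, 4) and a token_pattern that keeps single-character commands, so each feature is now a short command subsequence (for example the bigram 'sh more') rather than an individual command. A minimal usage sketch:

    x_train, x_test, y_train, y_test = get_features_by_ngram()
    print(x_train.shape, x_test.shape)   # still 80/70 blocks, at most max_features=300 n-gram columns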

(三)Word2Vec

    import os
    import multiprocessing
    import gensim
    from sklearn.preprocessing import scale

    def get_features_by_word2vec():
        global word2ver_bin
        global index
        global max_features
        x_all=[]
        x_arr,y=get_cmdlines()
        x=[]
        for i,v in enumerate(x_arr):
            v=" ".join(v)
            x.append(v)
        # use the command logs of User1..User29 as the word2vec training corpus
        for i in range(1,30):
            filename="../data/uba/MasqueradeDat/User%d" % i
            with open(filename) as f:
                x_all.append([w.strip('\n') for w in f.readlines()])
        cores=multiprocessing.cpu_count()
        if os.path.exists(word2ver_bin):
            print ("Find cache file %s" % word2ver_bin)
            model=gensim.models.Word2Vec.load(word2ver_bin)
        else:
            # note: this is the old gensim API; in gensim>=4.0 the parameters
            # are vector_size= and epochs= instead of size= and iter=
            model=gensim.models.Word2Vec(size=max_features, window=5, min_count=1, iter=60, workers=cores)
            model.build_vocab(x_all)
            model.train(x_all, total_examples=model.corpus_count, epochs=model.iter)
            model.save(word2ver_bin)
        # buildWordVector (defined elsewhere in the book's code) maps one
        # command block to a single max_features-dimensional vector
        x = np.concatenate([buildWordVector(model, z, max_features) for z in x])
        x = scale(x)
        x_train = x[0:index,]
        x_test = x[index:,]
        y_train = y[0:index,]
        y_test = y[index:,]
        return x_train, x_test, y_train, y_test
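buildWordVector is not included in this excerpt. A minimal sketch consistent with how it is called above (the book's actual implementation may differ): since x holds space-joined command strings here, the sketch splits each string back into commands and averages their word2vec vectors.

    import numpy as np

    def buildWordVector(model, text, size):
        """Average the word2vec vectors of all commands in one block (sketch)."""
        vec = np.zeros((1, size))
        count = 0.
        for word in text.split():            # x holds space-joined command strings
            try:
                vec += model.wv[word].reshape((1, size))
                count += 1.
            except KeyError:                 # command not in the word2vec vocabulary
                continue
        if count != 0:
            vec /= count
        return vec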

(四)Word Sequence (wordseq)

    import tflearn

    def get_features_by_wordseq():
        global max_features
        global index
        x_arr,y=get_cmdlines()
        x=[]
        for i,v in enumerate(x_arr):
            v=" ".join(v)
            x.append(v)
        # map each command to an integer ID and pad every block to max_features IDs
        vp=tflearn.data_utils.VocabularyProcessor(max_document_length=max_features,
                                                  min_frequency=0,
                                                  vocabulary=None,
                                                  tokenizer_fn=None)
        x=vp.fit_transform(x, unused_y=None)
        x = np.array(list(x))
        x_train = x[0:index, ]
        x_test = x[index:, ]
        y_train = y[0:index, ]
        y_test = y[index:, ]
        return x_train, x_test, y_train, y_test
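These padded index sequences, rather than the TF-IDF vectors, are the kind of input the CNN below expects (its input_data shape is [None, max_features]). A quick look at the output (a sketch; the exact IDs depend on the order in which VocabularyProcessor first sees each command):

    x_train, x_test, y_train, y_test = get_features_by_wordseq()
    print(x_train.shape)      # (80, 300): 100 command IDs per block, zero-padded to max_features
    print(x_train[0][:8])     # e.g. 'cpp sh xrdb cpp sh xrdb mkpts test' -> [1 2 3 1 2 3 4 5]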

三、Model Building

(一)NB

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import classification_report
    from sklearn import metrics

    def do_nb(x_train, x_test, y_train, y_test):
        gnb = GaussianNB()
        gnb.fit(x_train,y_train)
        y_pred=gnb.predict(x_test)
        print(classification_report(y_test, y_pred))
        print (metrics.confusion_matrix(y_test, y_pred))

Results:

    nb and wordbag
                  precision    recall  f1-score   support

               0       0.98      0.97      0.98        64
               1       0.71      0.83      0.77         6

        accuracy                           0.96        70
       macro avg       0.85      0.90      0.87        70
    weighted avg       0.96      0.96      0.96        70

    [[62  2]
     [ 1  5]]
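The reports in this section come from feeding the extracted features into these model functions. A plausible driver is sketched below; the .toarray() densification is an assumption on my part, since GaussianNB does not accept scipy sparse matrices, while XGBoost and MLPClassifier (defined in the following subsections) handle sparse input directly.

    x_train, x_test, y_train, y_test = get_features_by_wordbag()
    print("nb and wordbag")
    do_nb(x_train.toarray(), x_test.toarray(), y_train, y_test)
    print("xgboost and wordbag")
    do_xgboost(x_train, x_test, y_train, y_test)
    print("mlp and wordbag")
    do_mlp(x_train, x_test, y_train, y_test)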

(二)XGBoost

    import xgboost as xgb

    def do_xgboost(x_train, x_test, y_train, y_test):
        xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
        y_pred = xgb_model.predict(x_test)
        print(classification_report(y_test, y_pred))
        print (metrics.confusion_matrix(y_test, y_pred))

Results:

    xgboost and wordbag
                  precision    recall  f1-score   support

               0       0.96      1.00      0.98        64
               1       1.00      0.50      0.67         6

        accuracy                           0.96        70
       macro avg       0.98      0.75      0.82        70
    weighted avg       0.96      0.96      0.95        70

    [[64  0]
     [ 3  3]]

(三)MLP

Source:

    from sklearn.neural_network import MLPClassifier

    def do_mlp(x_train, x_test, y_train, y_test):
        global max_features
        # Building deep neural network
        clf = MLPClassifier(solver='lbfgs',
                            alpha=1e-5,
                            hidden_layer_sizes = (5, 2),
                            random_state = 1)
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        print(classification_report(y_test, y_pred))
        print (metrics.confusion_matrix(y_test, y_pred))

Results:

    mlp and wordbag
                  precision    recall  f1-score   support

               0       0.91      1.00      0.96        64
               1       0.00      0.00      0.00         6

        accuracy                           0.91        70
       macro avg       0.46      0.50      0.48        70
    weighted avg       0.84      0.91      0.87        70

    [[64  0]
     [ 6  0]]

(四)CNN

    import tensorflow as tf
    import tflearn
    from tflearn.data_utils import to_categorical, pad_sequences
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.conv import conv_1d, global_max_pool
    from tflearn.layers.merge_ops import merge
    from tflearn.layers.estimator import regression

    def do_cnn(trainX, testX, trainY, testY):
        global max_features
        y_test = testY
        #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
        #testX = pad_sequences(testX, maxlen=max_features, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
        # Building convolutional network: three parallel 1-D convolutions
        # with kernel sizes 2, 3 and 4 over the embedded command sequence
        network = input_data(shape=[None,max_features], name='input')
        network = tflearn.embedding(network, input_dim=1000, output_dim=128,validate_indices=False)
        branch1 = conv_1d(network, 128, 2, padding='valid', activation='relu', regularizer="L2")
        branch2 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
        branch3 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
        network = merge([branch1, branch2, branch3], mode='concat', axis=1)
        network = tf.expand_dims(network, 2)
        network = global_max_pool(network)
        network = dropout(network, 1)    # keep_prob=1, i.e. dropout is effectively disabled
        network = fully_connected(network, 2, activation='softmax')
        network = regression(network, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy', name='target')
        # Training
        model = tflearn.DNN(network, tensorboard_verbose=0)
        model.fit(trainX, trainY,
                  n_epoch=10, shuffle=True, validation_set=0,
                  show_metric=True, batch_size=10,run_id="uba")
        y_predict_list = model.predict(testX)
        y_predict = []
        for i in y_predict_list:
            if i[0] > 0.5:
                y_predict.append(0)
            else:
                y_predict.append(1)
        print(classification_report(y_test, y_predict))
        print (metrics.confusion_matrix(y_test, y_predict))
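The CNN takes the padded index sequences from get_features_by_wordseq (its input shape [None, max_features] matches those 300-long sequences), not the sparse TF-IDF matrices. A plausible call, offered as an assumption rather than the book's exact driver:

    print("cnn and wordseq")
    do_cnn(*get_features_by_wordseq())   # returns (x_train, x_test, y_train, y_test) in do_cnn's argument order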

(五)RNN

    def do_rnn_wordbag(trainX, testX, trainY, testY):
        y_test=testY
        #trainX = pad_sequences(trainX, maxlen=100, value=0.)
        #testX = pad_sequences(testX, maxlen=100, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
        # Network building: a single LSTM layer over length-100 command sequences
        net = tflearn.input_data([None, 100])
        net = tflearn.embedding(net, input_dim=1000, output_dim=128)
        net = tflearn.lstm(net, 128, dropout=0.1)
        net = tflearn.fully_connected(net, 2, activation='softmax')
        net = tflearn.regression(net, optimizer='adam', learning_rate=0.005,
                                 loss='categorical_crossentropy')
        # Training
        model = tflearn.DNN(net, tensorboard_verbose=0)
        model.fit(trainX, trainY, validation_set=0.1, show_metric=True,
                  batch_size=1,run_id="uba",n_epoch=10)
        y_predict_list = model.predict(testX)
        y_predict = []
        for i in y_predict_list:
            if i[0] >= 0.5:
                y_predict.append(0)
            else:
                y_predict.append(1)
        print(classification_report(y_test, y_predict))
        print (metrics.confusion_matrix(y_test, y_predict))
        # print (y_train)  # y_train is not defined in this function's scope
        print ("true")
        print (y_test)
        print ("pre")
        print (y_predict)

(六)Bi-RNN

    from tflearn.layers.recurrent import BasicLSTMCell

    def do_birnn_wordbag(trainX, testX, trainY, testY):
        y_test=testY
        #trainX = pad_sequences(trainX, maxlen=100, value=0.)
        #testX = pad_sequences(testX, maxlen=100, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
        # Network building: forward and backward LSTMs over the command sequence
        net = input_data(shape=[None, 100])
        net = tflearn.embedding(net, input_dim=10000, output_dim=128)
        net = tflearn.bidirectional_rnn(net, BasicLSTMCell(128), BasicLSTMCell(128))
        net = dropout(net, 0.5)
        net = fully_connected(net, 2, activation='softmax')
        net = regression(net, optimizer='adam', loss='categorical_crossentropy')
        # Training
        model = tflearn.DNN(net, tensorboard_verbose=0)
        model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
                  batch_size=1,run_id="uba",n_epoch=10)
        y_predict_list = model.predict(testX)
        y_predict = []
        for i in y_predict_list:
            if i[0] >= 0.5:
                y_predict.append(0)
            else:
                y_predict.append(1)
        print(classification_report(y_test, y_predict))
        print (metrics.confusion_matrix(y_test, y_predict))

四、Summary

        In truth this dataset is too small to carry much practical weight. Still, this section shows the complete pipeline of applying machine-learning algorithms to malicious-behavior detection: extracting features and building models for training and testing.
