Notes on 《Web安全之深度学习实战》 (Web Security: Deep Learning in Practice): Chapter 10, User Behavior Analysis and Malicious Behavior Detection

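We use the term malicious operations to cover both the actions of malicious insiders and abnormal operations by internal employees. Detecting them calls for advanced techniques such as User Behavior Analysis (UBA), an emerging technology that provides data protection and fraud detection capabilities that were previously missed. Integrated with the systems users work in every day, UBA applies specialized security analytics algorithms that look beyond the initial login and track everything a user does. UBA has two main functions: it helps establish a baseline of the normal activity each user performs, and it quickly identifies behavior that deviates from that baseline so that a security analyst can investigate. Some anomalies may not look malicious at first glance, but a security analyst still has to investigate further to determine whether the behavior is legitimate or malicious.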
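To make the two UBA functions concrete, here is a minimal sketch of a baseline plus a deviation check. It is an illustration only, not the algorithm of any UBA product: the names `build_baseline` and `anomaly_score`, the toy commands, and the 0.5 threshold are all assumptions made for this example.

```python
from collections import Counter

def build_baseline(history):
    """Count how often each command appears in a user's normal history."""
    return Counter(history)

def anomaly_score(baseline, block):
    """Fraction of commands in `block` that never occur in the baseline."""
    if not block:
        return 0.0
    unseen = sum(1 for cmd in block if cmd not in baseline)
    return unseen / len(block)

# Toy data (hypothetical, not taken from any dataset)
normal_history = ["ls", "cd", "vi", "ls", "make", "cd", "ls"]
baseline = build_baseline(normal_history)

usual_block = ["ls", "cd", "make", "vi"]
odd_block = ["nc", "wget", "ls", "chmod"]

THRESHOLD = 0.5  # hypothetical cut-off; would need tuning on real data
for name, block in [("usual", usual_block), ("odd", odd_block)]:
    score = anomaly_score(baseline, block)
    flag = "investigate" if score >= THRESHOLD else "ok"
    print(name, round(score, 2), flag)
```

A block that scores above the threshold is only a candidate for review; as noted above, an analyst still decides whether it is legitimate or malicious.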



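Each user's data is split into 150 blocks of 100 consecutive commands. The first third of the blocks (50) is used to train that user's model of normal behavior, and malicious test data was randomly inserted into the remaining two thirds (100 blocks). The malicious data in the SEA dataset follows a statistical pattern: any given test block contains malicious commands with probability 1%, and once a block contains malicious commands, the following block also contains them with probability 80% [2]. In other words, the SEA dataset treats consecutive blocks as one session and can only simulate attacks that are correlated across consecutive sessions. The dataset also contains few malicious (black) samples. That is close to reality, but it makes a random train/test split difficult: with a common splitting method there is a substantial chance that the training set contains only benign (white) samples. The sample split in this chapter therefore needs special handling to ensure the training set contains enough black samples.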
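One generic way to guarantee that scarce black samples land on both sides of a split is to split each class separately (a stratified split). The sketch below illustrates that idea with the standard library only. It is not the method used in the chapter's code, which cuts the block sequence at a fixed `index`; the function name `stratified_split`, the toy labels, and the 0.6 ratio are assumptions for this example.

```python
import random

def stratified_split(labels, train_fraction=0.5, seed=0):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        # Keep at least one sample of every class on the training side
        cut = max(1, int(round(len(idxs) * train_fraction)))
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 95 white (0) blocks and 5 black (1) blocks, hypothetical
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels, train_fraction=0.6)
print(sum(labels[i] for i in train_idx), "black samples in training set")
print(sum(labels[i] for i in test_idx), "black samples in test set")
```

Note that shuffling ignores the order of the blocks, so this kind of split discards the session structure described above; whether that is acceptable depends on the model being trained.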


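    import numpy as np

    cmdlines_file = "../data/uba/MasqueradeDat/User7"
    labels_file = "../data/uba/MasqueradeDat/label.txt"
    word2ver_bin = "uba_word2vec.bin"
    max_features = 300
    index = 80  # blocks [0, index) form the training set

    def get_cmdlines():
        # One command per line; reshape into 150 blocks of 100 commands
        x = np.loadtxt(cmdlines_file, dtype=str)
        x = x.reshape((150, 100))
        # Column 6 (0-based) holds the labels of this user's 100 test blocks
        y = np.loadtxt(labels_file, dtype=int, usecols=6)
        y = y.reshape((100, 1))
        # The first 50 blocks are clean training data, so label them 0
        y_train = np.zeros([50, 1], int)
        y = np.concatenate([y_train, y])
        y = y.reshape((150, ))
        return x, y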


    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    def get_features_by_wordbag():
        x_arr, y = get_cmdlines()
        x = []
        # Join each block of 100 commands into one space-separated document
        for v in x_arr:
            x.append(" ".join(v))
        vectorizer = CountVectorizer(
            decode_error='ignore',
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1)
        x = vectorizer.fit_transform(x)
        x_train = x[0:index, ]
        x_test = x[index:, ]
        y_train = y[0:index, ]
        y_test = y[index:, ]
        # Note: the IDF weights are fitted on all blocks, train and test
        transformer = TfidfTransformer(smooth_idf=False)
        transformer.fit(x)
        x_test = transformer.transform(x_test)
        x_train = transformer.transform(x_train)
        return x_train, x_test, y_train, y_test




    ['cpp' 'sh' 'xrdb' 'cpp' 'sh' 'xrdb' 'mkpts' 'test' 'stty' 'hostname'
     'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'env' 'echo' 'sh' 'userenv'
     'wait4wm' 'xhost' 'xsetroot' 'reaper' 'xmodmap' 'sh' '[' 'cat' 'stty'
     'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
     'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh'
     'more' 'sh' 'more' 'sh' 'more' 'sh' 'more' 'sh' 'launchef' 'launchef'
     'sh' '9term' 'sh' 'launchef' 'sh' 'launchef' 'hostname' '[' 'cat' 'stty'
     'hostname' 'date' 'echo' '[' 'find' 'chmod' 'tty' 'echo' 'sh' 'more' 'sh'
     'more' 'sh' 'ex' 'sendmail' 'sendmail' 'sh' 'MediaMai' 'sendmail' 'sh'
     'rm' 'MediaMai' 'sh' 'rm' 'MediaMai' 'launchef' 'launchef']


    x = []
    print(x_arr[0])
    print(np.array(x_arr).shape)
    for v in x_arr:
        x.append(" ".join(v))


cpp sh xrdb cpp sh xrdb mkpts test stty hostname date echo [ find chmod tty echo env echo sh userenv wait4wm xhost xsetroot reaper xmodmap sh [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh more sh launchef launchef sh 9term sh launchef sh launchef hostname [ cat stty hostname date echo [ find chmod tty echo sh more sh more sh ex sendmail sendmail sh MediaMai sendmail sh rm MediaMai sh rm MediaMai launchef launchef


    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,
        stop_words='english',
        max_df=1.0,
        min_df=1)
    x = vectorizer.fit_transform(x)


    (0, 13)     2
    (0, 90)     25
    (0, 129)    2
    (0, 67)     1
    (0, 103)    1
    (0, 95)     3
    (0, 46)     4
    (0, 14)     3
    (0, 25)     7
    (0, 9)      3
    (0, 109)    3
    (0, 28)     1
    (0, 114)    1
    (0, 117)    1
    (0, 123)    1
    (0, 132)    1
    (0, 82)     1
    (0, 127)    1
    (0, 8)      2
    (0, 51)     6
    (0, 1)      1
    (0, 30)     1
    (0, 89)     3
    (0, 64)     3
    (0, 84)     2


    transformer = TfidfTransformer(smooth_idf=False)
    transformer.fit(x)
    x_train = transformer.transform(x_train)


    (0, 132)    0.07590139034306102
    (0, 129)    0.15180278068612205
    (0, 127)    0.08038431251734222
    (0, 123)    0.07590139034306102
    (0, 117)    0.08038431251734222
    (0, 114)    0.07590139034306102
    (0, 109)    0.15401560934616065
    (0, 103)    0.07805309706428058
    (0, 95)     0.15273504757432566
    (0, 90)     0.7130806021902251
    (0, 89)     0.160729204804717
    (0, 84)     0.06701351493784334
    (0, 82)     0.08038431251734222
    (0, 67)     0.07590139034306102
    (0, 64)     0.12397049936474483
    (0, 51)     0.29322010438631935
    (0, 46)     0.1367884925425544
    (0, 30)     0.04484290866252592
    (0, 28)     0.07590139034306102
    (0, 25)     0.35638177767342655
    (0, 14)     0.09652546185265033
    (0, 13)     0.149769219205087
    (0, 9)      0.15147374310180983
    (0, 8)      0.06933404617293713
    (0, 1)      0.16804767932167497
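How raw counts turn into weights like these can be sketched by hand. As far as I understand scikit-learn's documentation, `TfidfTransformer(smooth_idf=False)` with its default settings uses idf = ln(n / df) + 1 and then scales each row to unit L2 length. The pure-Python sketch below follows that formula on a toy count matrix; the function name `tfidf_rows` and the numbers are made up for illustration, and terms that occur in no row are simply given weight 0 here.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with idf = ln(n / df) + 1, followed by L2 row normalisation."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term occurs
    df = [sum(1 for row in count_rows if row[t] > 0) for t in range(n_terms)]
    idf = [math.log(n_docs / d) + 1.0 if d else 0.0 for d in df]
    out = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        out.append([v / norm if norm else 0.0 for v in weighted])
    return out

# Toy count matrix: 3 blocks x 3 commands (hypothetical numbers)
counts = [
    [2, 0, 1],
    [1, 1, 0],
    [3, 0, 0],
]
for row in tfidf_rows(counts):
    print([round(v, 3) for v in row])
```

Because of the final normalisation, every row has length 1, which is why a command that dominates a block (such as the one at column 90 above) ends up with a weight close to 1.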


    max_features=300
    x_train (80, 136)
    x_test (70, 136)


    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    def get_features_by_ngram():
        x_arr, y = get_cmdlines()
        x = []
        for v in x_arr:
            x.append(" ".join(v))
        # Use sequences of 2 to 4 consecutive commands as features
        vectorizer = CountVectorizer(
            ngram_range=(2, 4),
            token_pattern=r'\b\w+\b',
            decode_error='ignore',
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1)
        x = vectorizer.fit_transform(x)
        x_train = x[0:index, ]
        x_test = x[index:, ]
        y_train = y[0:index, ]
        y_test = y[index:, ]
        transformer = TfidfTransformer(smooth_idf=False)
        transformer.fit(x)
        x_test = transformer.transform(x_test)
        x_train = transformer.transform(x_train)
        return x_train, x_test, y_train, y_test
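To see what `ngram_range=(2, 4)` means for a command stream, the following sketch enumerates the 2-, 3- and 4-grams of a short block by hand. It only illustrates the idea of n-grams; it does not reproduce `CountVectorizer`'s tokenisation or stop-word filtering, and the helper name `command_ngrams` is made up for this example.

```python
def command_ngrams(commands, n_min=2, n_max=4):
    """Return all contiguous n-grams (as space-joined strings) for n_min..n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(commands) - n + 1):
            grams.append(" ".join(commands[start:start + n]))
    return grams

block = ["cpp", "sh", "xrdb", "cpp", "sh"]
for gram in command_ngrams(block):
    print(gram)
```

The bigram `cpp sh` appears twice in this block, so its count would be 2, while each longer n-gram occurs once. N-grams keep some of the ordering information that a plain bag of words throws away.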


    import multiprocessing
    import os
    import gensim
    import numpy as np
    from sklearn.preprocessing import scale

    def get_features_by_word2vec():
        x_all = []
        x_arr, y = get_cmdlines()
        x = []
        for v in x_arr:
            x.append(" ".join(v))
        # Train the embedding on the command streams of users 1 to 29
        for i in range(1, 30):
            filename = "../data/uba/MasqueradeDat/User%d" % i
            with open(filename) as f:
                x_all.append([w.strip('\n') for w in f.readlines()])
        cores = multiprocessing.cpu_count()
        if os.path.exists(word2ver_bin):
            print("Find cache file %s" % word2ver_bin)
            model = gensim.models.Word2Vec.load(word2ver_bin)
        else:
            # gensim < 4.0 API; gensim 4.x renames size to vector_size and iter to epochs
            model = gensim.models.Word2Vec(size=max_features, window=5, min_count=1, iter=60, workers=cores)
            model.build_vocab(x_all)
            model.train(x_all, total_examples=model.corpus_count, epochs=model.iter)
            model.save(word2ver_bin)
        # buildWordVector is a helper from the chapter's code that is not shown in these notes
        x = np.concatenate([buildWordVector(model, z, max_features) for z in x])
        x = scale(x)
        x_train = x[0:index, ]
        x_test = x[index:, ]
        y_train = y[0:index, ]
        y_test = y[index:, ]
        return x_train, x_test, y_train, y_test
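The helper `buildWordVector` called above is not shown in these notes. A common way to turn per-command vectors into one fixed-length vector per block is to average them, and the sketch below illustrates that idea with a plain dict standing in for the trained model. This is an assumption about what such a helper might do, not the book's implementation; `average_vector` and the toy embeddings are hypothetical.

```python
def average_vector(vectors, tokens, size):
    """Average the vectors of the known tokens; all zeros if none are known."""
    total = [0.0] * size
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary commands
        total = [a + b for a, b in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [a / count for a in total]

# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {"ls": [1.0, 0.0], "cd": [0.0, 1.0], "vi": [1.0, 1.0]}
print(average_vector(toy_vectors, ["ls", "cd", "nc"], size=2))
```

Averaging discards command order, so two blocks with the same commands in a different sequence get the same vector.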


    import numpy as np
    import tflearn

    def get_features_by_wordseq():
        x_arr, y = get_cmdlines()
        x = []
        for v in x_arr:
            x.append(" ".join(v))
        # Map every command to an integer id; each block becomes a sequence
        # of ids of length max_features
        vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_features,
                                                    min_frequency=0,
                                                    vocabulary=None,
                                                    tokenizer_fn=None)
        x = vp.fit_transform(x, unused_y=None)
        x = np.array(list(x))
        x_train = x[0:index, ]
        x_test = x[index:, ]
        y_train = y[0:index, ]
        y_test = y[index:, ]
        return x_train, x_test, y_train, y_test
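The word-sequence encoding can be pictured as a lookup table from commands to integer ids plus padding to a fixed length. The sketch below shows that idea in plain Python. It is a simplified stand-in, not a reimplementation of tflearn's `VocabularyProcessor`: the function names, the choice of 0 for both padding and unknown tokens, and the toy documents are assumptions.

```python
def fit_vocabulary(documents):
    """Assign ids 1..N to tokens in order of first appearance; 0 is reserved for padding."""
    vocab = {}
    for doc in documents:
        for tok in doc.split():
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def encode(doc, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with 0."""
    ids = [vocab.get(tok, 0) for tok in doc.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

docs = ["cpp sh xrdb cpp", "sh more sh"]
vocab = fit_vocabulary(docs)
for d in docs:
    print(encode(d, vocab, max_len=6))
```

Unlike the bag-of-words features, this representation keeps the order of the commands, which is what the embedding, CNN and RNN layers later in these notes operate on.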



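    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.naive_bayes import GaussianNB

    def do_nb(x_train, x_test, y_train, y_test):
        # GaussianNB needs dense input, so densify sparse TF-IDF matrices first
        if hasattr(x_train, "toarray"):
            x_train, x_test = x_train.toarray(), x_test.toarray()
        gnb = GaussianNB()
        gnb.fit(x_train, y_train)
        y_pred = gnb.predict(x_test)
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))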


    nb and wordbag
                  precision    recall  f1-score   support

               0       0.98      0.97      0.98        64
               1       0.71      0.83      0.77         6

        accuracy                           0.96        70
       macro avg       0.85      0.90      0.87        70
    weighted avg       0.96      0.96      0.96        70

    [[62  2]
     [ 1  5]]
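The per-class numbers in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The short sketch below recomputes them for the matrix printed above; the helper name `class_metrics` is made up for this example.

```python
def class_metrics(cm, cls):
    """Precision, recall and F1 for class `cls` from a confusion matrix
    laid out as cm[true][predicted]."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column sum
    actual = sum(cm[cls])                     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

cm = [[62, 2],
      [1, 5]]
for cls in (0, 1):
    p, r, f = class_metrics(cm, cls)
    print(cls, round(p, 2), round(r, 2), round(f, 2))
```

For the black class (1) this gives precision 5/7, recall 5/6 and an F1 of about 0.77, matching the report. With only 6 black samples in the test set, a single misclassification moves recall by roughly 17 percentage points, so these figures should be read with that in mind.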


    import xgboost as xgb
    from sklearn.metrics import classification_report, confusion_matrix

    def do_xgboost(x_train, x_test, y_train, y_test):
        # Gradient-boosted trees with default hyperparameters
        xgb_model = xgb.XGBClassifier().fit(x_train, y_train)
        y_pred = xgb_model.predict(x_test)
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))


    xgboost and wordbag
                  precision    recall  f1-score   support

               0       0.96      1.00      0.98        64
               1       1.00      0.50      0.67         6

        accuracy                           0.96        70
       macro avg       0.98      0.75      0.82        70
    weighted avg       0.96      0.96      0.95        70

    [[64  0]
     [ 3  3]]



    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.neural_network import MLPClassifier

    def do_mlp(x_train, x_test, y_train, y_test):
        # Building deep neural network: two hidden layers with 5 and 2 units
        clf = MLPClassifier(solver='lbfgs',
                            alpha=1e-5,
                            hidden_layer_sizes=(5, 2),
                            random_state=1)
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))


    mlp and wordbag
                  precision    recall  f1-score   support

               0       0.91      1.00      0.96        64
               1       0.00      0.00      0.00         6

        accuracy                           0.91        70
       macro avg       0.46      0.50      0.48        70
    weighted avg       0.84      0.91      0.87        70

    [[64  0]
     [ 6  0]]


    import tensorflow as tf
    import tflearn
    from sklearn.metrics import classification_report, confusion_matrix
    from tflearn.data_utils import to_categorical
    from tflearn.layers.conv import conv_1d, global_max_pool
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.estimator import regression
    from tflearn.layers.merge_ops import merge

    def do_cnn(trainX, testX, trainY, testY):
        y_test = testY
        #trainX = pad_sequences(trainX, maxlen=max_features, value=0.)
        #testX = pad_sequences(testX, maxlen=max_features, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
        # Building convolutional network
        network = input_data(shape=[None, max_features], name='input')
        network = tflearn.embedding(network, input_dim=1000, output_dim=128, validate_indices=False)
        # Three parallel 1-D convolutions with kernel sizes 2, 3 and 4
        branch1 = conv_1d(network, 128, 2, padding='valid', activation='relu', regularizer="L2")
        branch2 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
        branch3 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
        network = merge([branch1, branch2, branch3], mode='concat', axis=1)
        network = tf.expand_dims(network, 2)
        network = global_max_pool(network)
        # A keep probability of 1 keeps every unit, so dropout is effectively off
        network = dropout(network, 1)
        network = fully_connected(network, 2, activation='softmax')
        network = regression(network, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy', name='target')
        # Training
        model = tflearn.DNN(network, tensorboard_verbose=0)
        model.fit(trainX, trainY,
                  n_epoch=10, shuffle=True, validation_set=0,
                  show_metric=True, batch_size=10, run_id="uba")
        # Column 0 of the softmax output is the probability of class 0
        y_predict_list = model.predict(testX)
        y_predict = []
        for i in y_predict_list:
            if i[0] > 0.5:
                y_predict.append(0)
            else:
                y_predict.append(1)
        print(classification_report(y_test, y_predict))
        print(confusion_matrix(y_test, y_predict))


    import tflearn
    from sklearn.metrics import classification_report, confusion_matrix
    from tflearn.data_utils import to_categorical

    def do_rnn_wordbag(trainX, testX, trainY, testY):
        y_test = testY
        #trainX = pad_sequences(trainX, maxlen=100, value=0.)
        #testX = pad_sequences(testX, maxlen=100, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
        # Network building; the input width is hard-coded to 100 columns
        net = tflearn.input_data([None, 100])
        net = tflearn.embedding(net, input_dim=1000, output_dim=128)
        net = tflearn.lstm(net, 128, dropout=0.1)
        net = tflearn.fully_connected(net, 2, activation='softmax')
        net = tflearn.regression(net, optimizer='adam', learning_rate=0.005,
                                 loss='categorical_crossentropy')
        # Training
        model = tflearn.DNN(net, tensorboard_verbose=0)
        model.fit(trainX, trainY, validation_set=0.1, show_metric=True,
                  batch_size=1, run_id="uba", n_epoch=10)
        # Column 0 of the softmax output is the probability of class 0
        y_predict_list = model.predict(testX)
        y_predict = []
        for i in y_predict_list:
            if i[0] >= 0.5:
                y_predict.append(0)
            else:
                y_predict.append(1)
        print(classification_report(y_test, y_predict))
        print(confusion_matrix(y_test, y_predict))
        # Debug output: true labels followed by predicted labels
        # (the original also printed y_train, which is not defined in this function)
        print("true")
        print(y_test)
        print("pre")
        print(y_predict)


    import tflearn
    from sklearn.metrics import classification_report, confusion_matrix
    from tflearn.data_utils import to_categorical
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.estimator import regression
    from tflearn.layers.recurrent import BasicLSTMCell

    def do_birnn_wordbag(trainX, testX, trainY, testY):
        y_test = testY
        #trainX = pad_sequences(trainX, maxlen=100, value=0.)
        #testX = pad_sequences(testX, maxlen=100, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
        # Network building; the input width is hard-coded to 100 columns
        net = input_data(shape=[None, 100])
        net = tflearn.embedding(net, input_dim=10000, output_dim=128)
        # One LSTM reads the sequence forwards, the other backwards
        net = tflearn.bidirectional_rnn(net, BasicLSTMCell(128), BasicLSTMCell(128))
        net = dropout(net, 0.5)
        net = fully_connected(net, 2, activation='softmax')
        net = regression(net, optimizer='adam', loss='categorical_crossentropy')
        # Training; note that the test set doubles as the validation set here
        model = tflearn.DNN(net, tensorboard_verbose=0)
        model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
                  batch_size=1, run_id="uba", n_epoch=10)
        # Column 0 of the softmax output is the probability of class 0
        y_predict_list = model.predict(testX)
        y_predict = []
        for i in y_predict_list:
            if i[0] >= 0.5:
                y_predict.append(0)
            else:
                y_predict.append(1)
        print(classification_report(y_test, y_predict))
        print(confusion_matrix(y_test, y_predict))



