Notes on 《Web安全之深度学习实战》 (Web Security: Deep Learning in Practice): Chapter 8, Spam SMS Detection (1)


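This chapter uses the SMS Spam Collection dataset as an example to introduce techniques for detecting spam SMS messages. It covers the feature extraction methods used for the task (the bag-of-words and TF-IDF models, the vocabulary model, and the Word2Vec and Doc2Vec models) as well as the classifiers and their validation results (naive Bayes, support vector machines, XGBoost, and MLP). The material is similar to the spam email detection in Chapter 6 and the negative review detection in Chapter 7; only the content being classified changes to spam SMS. All three are binary classification problems.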


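The test data comes from the SMS Spam Collection dataset, a classic dataset for spam SMS detection that consists entirely of real messages: 4831 normal messages and 747 spam messages. Download the dataset archive from the official site and extract it. Normal and spam messages are stored together in a single text file, SMSSpamCollection.txt, which serves as the test dataset.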
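
Each line of the file holds one message, with the label (`ham` for normal, anything else for spam) and the message text separated by a tab, which is the layout the loader in this section relies on. As a minimal sketch, the class balance can be checked with the standard library alone. The `count_labels` helper and the three sample lines are made up for illustration and are not part of the original code.

```python
from collections import Counter

def count_labels(lines):
    """Count how many messages carry each label in 'label<TAB>text' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _text = line.split("\t", 1)  # split on the first tab only
        counts[label] += 1
    return counts

# Made-up sample lines in the same layout as SMSSpamCollection.txt
sample = [
    "ham\tSee you at lunch\n",
    "spam\tWin a free prize now\n",
    "ham\tRunning late, sorry\n",
]
counts = count_labels(sample)
print(counts["ham"], counts["spam"])
```

Because the function only iterates over lines, an open file handle for the real dataset can be passed in the same way.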


    from sklearn.model_selection import train_test_split

    def load_all_files():
        x = []
        y = []
        datafile = "../data/sms/smsspamcollection/SMSSpamCollection.txt"
        with open(datafile, encoding='utf-8') as f:
            for line in f:
                line = line.strip('\n')
                # Each line is "<label>\t<text>"; split on the first tab only
                label, text = line.split('\t', 1)
                x.append(text)
                # Normal messages (ham) are labeled 0, spam messages 1
                if label == 'ham':
                    y.append(0)
                else:
                    y.append(1)
        # Hold out 40% of the samples as the test set
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
        return x_train, x_test, y_train, y_test
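
`train_test_split(x, y, test_size=0.4)` holds out roughly 40% of the samples for testing, and since no `random_state` is passed, the partition changes on every call. The sketch below illustrates the idea with the standard library only. `holdout_split` and its `seed` parameter are hypothetical names chosen for illustration, and the rounding is not guaranteed to match scikit-learn's.

```python
import random

def holdout_split(x, y, test_size=0.4, seed=None):
    """Shuffle paired samples and split them into train and test parts."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # a fixed seed gives a repeatable split
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Made-up data: ten messages with alternating labels
texts = ["msg %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]
x_tr, x_te, y_tr, y_te = holdout_split(texts, labels, test_size=0.4, seed=0)
print(len(x_tr), len(x_te))
```

In scikit-learn itself, passing `random_state` to `train_test_split` pins the split in the same way, which helps when results from different feature extraction methods are compared.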


    x_train, x_test, y_train, y_test = load_all_files()




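    import numpy as np
    import tflearn

    # Assumed module-level setting; 160 matches the length of the sample vectors
    max_document_length = 160

    def get_features_by_tf():
        global max_document_length
        x_train, x_test, y_train, y_test = load_all_files()
        # Map each word to an integer ID and pad every message to a fixed length
        vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                    min_frequency=0,
                                                    vocabulary=None,
                                                    tokenizer_fn=None)
        # Learn the vocabulary from the training set only
        x_train = vp.fit_transform(x_train, unused_y=None)
        x_train = np.array(list(x_train))
        # Encode the test set with the same vocabulary
        x_test = vp.transform(x_test)
        x_test = np.array(list(x_test))
        return x_train, x_test, y_train, y_test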



Take the following message as an example:

    And stop being an old man. You get to build snowman snow angels and snowball fights.


    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x_train = vp.fit_transform(x_train, unused_y=None)


    [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]



A second, shorter message is encoded the same way:

    What's ur pin?


    [17 18 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
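
The two vectors above follow from how a vocabulary model encodes text: each new word seen during fitting receives the next integer ID, each message becomes the list of its word IDs, and the list is padded with zeros up to `max_document_length`. The following is a minimal pure-Python sketch of that idea, not tflearn's implementation. The `SimpleVocabulary` class and its simplified tokenizer regex are assumptions made for illustration.

```python
import re

TOKEN_RE = re.compile(r"[\w'\-]+")  # simplified tokenizer (assumption)

class SimpleVocabulary:
    """Map words to integer IDs and pad each document to a fixed length."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        self.word_to_id = {}  # ID 0 is reserved for padding and unknown words

    def _encode(self, text, grow):
        ids = []
        for word in TOKEN_RE.findall(text)[: self.max_document_length]:
            if grow and word not in self.word_to_id:
                # Assign the next free ID to a word seen for the first time
                self.word_to_id[word] = len(self.word_to_id) + 1
            ids.append(self.word_to_id.get(word, 0))
        # Pad with zeros up to the fixed length
        return ids + [0] * (self.max_document_length - len(ids))

    def fit_transform(self, documents):
        return [self._encode(doc, grow=True) for doc in documents]

    def transform(self, documents):
        return [self._encode(doc, grow=False) for doc in documents]

vocab = SimpleVocabulary(max_document_length=160)
encoded = vocab.fit_transform([
    "And stop being an old man. You get to build snowman snow angels and snowball fights.",
    "What's ur pin?",
])
print(encoded[0][:18])
print(encoded[1][:5])
```

In this sketch, words that were not seen during fitting map to 0 in `transform`, so a test set is encoded purely with the vocabulary learned from the training set.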


    from sklearn.feature_extraction.text import CountVectorizer

    def get_features_by_wordbag():
        # max_features is a module-level setting that caps the vocabulary size
        global max_features
        x_train, x_test, y_train, y_test = load_all_files()
        vectorizer = CountVectorizer(
            decode_error='ignore',
            strip_accents='ascii',
            max_features=max_features,
            stop_words='english',
            max_df=1.0,
            min_df=1)
        print(vectorizer)
        # Learn the vocabulary from the training set and count term occurrences
        x_train = vectorizer.fit_transform(x_train)
        x_train = x_train.toarray()
        vocabulary = vectorizer.vocabulary_
        # Reuse the training vocabulary so that the test matrix has the same columns
        vectorizer = CountVectorizer(
            decode_error='ignore',
            strip_accents='ascii',
            vocabulary=vocabulary,
            stop_words='english',
            max_df=1.0,
            min_df=1)
        print(vectorizer)
        # The vocabulary is fixed, so fitting here does not add new terms
        x_test = vectorizer.fit_transform(x_test)
        x_test = x_test.toarray()
        return x_train, x_test, y_train, y_test
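
The bag-of-words function above learns the vocabulary from the training messages and reuses it for the test messages, so that both matrices share the same columns. A minimal standard-library sketch of that idea follows. `build_vocabulary`, `to_count_vectors`, the tiny stop-word set, and the sample messages are all assumptions for illustration, and `CountVectorizer` differs in details such as accent stripping and its full English stop-word list.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # words of two or more characters
STOP_WORDS = {"the", "to", "you", "and", "is"}  # tiny illustrative stop list

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]

def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent terms; columns are sorted by term."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    top_terms = [term for term, _ in totals.most_common(max_features)]
    return {term: col for col, term in enumerate(sorted(top_terms))}

def to_count_vectors(documents, vocabulary):
    """Turn each document into a list of term counts over a fixed vocabulary."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term in tokenize(doc):
            if term in vocabulary:  # terms outside the vocabulary are dropped
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

# Made-up messages for illustration
train = ["Win a free prize now", "Free entry, win a free prize", "See you at lunch"]
test = ["free lunch for you", "claim your prize"]
vocabulary = build_vocabulary(train, max_features=3)
x_train = to_count_vectors(train, vocabulary)
x_test = to_count_vectors(test, vocabulary)
print(sorted(vocabulary))
print(x_test)
```

Terms that appear only in the test messages, such as "lunch" here once the vocabulary is capped at three terms, contribute nothing to the test vectors, which mirrors what a fixed vocabulary does in the function above.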

