
NLP | LSTM + Attention Text Classification

Adding an attention mechanism to text classification

Contents

I. A Brief Introduction to the Attention Mechanism

II. LSTM + Attention Text Classification in Practice

1. Data loading and preprocessing

2. Encoding text sequences

3. Text classification with LSTM

4. Text classification with LSTM + Attention

III. Key Takeaways


        LSTM is a variant of the recurrent neural network (RNN) designed for modeling and predicting sequential and time-series data. The attention mechanism has likewise seen wide use in NLP and time-series work for years; this article shows how to add attention on top of an LSTM to improve model performance.

I. A Brief Introduction to the Attention Mechanism

        The attention mechanism processes input efficiently by focusing on the important information and ignoring the unimportant. Inside a neural network, it helps the model attend to the most relevant input features, which improves performance.

        In brief, for text tasks self-attention associates three vectors q, k, and v with every word (randomly initialized, then learned during training). Each word's q vector is dotted with every word's k vector, and the results are normalized into a weight vector w; the w-weighted sum of the v vectors gives z, that word's post-attention vector. Going one step further, you can initialize several sets of q/k/v matrices and concatenate the resulting z matrices (much like multiple convolution kernels in a CNN, each extracting different information), then multiply by a projection matrix to compress back to the original dimension and obtain the final embedding.
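        To make this concrete, here is a minimal single-head self-attention sketch in NumPy. It is illustrative only: the random matrices W_q, W_k, W_v stand in for learned projection weights, and the dimension names (seq_len, d_model, d_k) are assumptions rather than anything defined in this article. Multi-head attention would run several such projections in parallel, concatenate the z matrices, and project back to d_model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) word embeddings for one sentence
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project each word to its q/k/v vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # dot each q with every k, scaled
    weights = softmax(scores, axis=-1)       # normalize into attention weights w
    return weights @ V                       # weighted sum of v vectors -> z vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8                              # illustrative sizes
X = rng.normal(size=(seq_len, d_model))                       # dummy embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)  # (5, 8): one attended vector z per word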

        The full details are fairly involved; the BERT write-up below is recommended, as its self-attention section is thorough and clear.

https://blog.csdn.net/jiaowoshouzi/article/details/89073944

II. LSTM + Attention Text Classification in Practice

1. Data loading and preprocessing

import re
import os
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import gc
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers

# Load the data and keep only the two target sources
data = pd.read_excel('Inshorts Cleaned Data.xlsx')

def data_preprocess(data):
    df = data.drop(['Publish Date', 'Time ', 'Headline'], axis=1).copy()
    df.rename(columns={'Source ': 'Source'}, inplace=True)
    df = df[df.Source.isin(['YouTube', 'India Today'])].reset_index(drop=True)
    df['y'] = np.where(df.Source == 'YouTube', 1, 0)  # binary label: YouTube = 1, India Today = 0
    df = df.drop(['Source'], axis=1)
    return df

df = data.pipe(data_preprocess)
print(df.shape)
df.head()

# Load English stop words
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
stop_english = stopwords.words('english')
stop_spanish = stopwords.words('spanish')
stop_english

# Text preprocessing: expand contractions, lowercase, drop stop words, lemmatize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import nltk

def replace_abbreviation(text):
    rep_list = [
        ("it's", "it is"),
        ("i'm", "i am"),
        ("he's", "he is"),
        ("she's", "she is"),
        ("we're", "we are"),
        ("they're", "they are"),
        ("you're", "you are"),
        ("that's", "that is"),
        ("this's", "this is"),
        ("can't", "can not"),
        ("don't", "do not"),
        ("doesn't", "does not"),
        ("we've", "we have"),
        ("i've", "i have"),
        ("isn't", "is not"),
        ("won't", "will not"),
        ("hasn't", "has not"),
        ("wasn't", "was not"),
        ("weren't", "were not"),
        ("let's", "let us"),
        ("didn't", "did not"),
        ("hadn't", "had not"),
        ("what's", "what is"),
        ("couldn't", "could not"),
        ("you'll", "you will"),
        ("i'll", "i will"),
        ("you've", "you have")
    ]
    result = text.lower()
    for word_replace in rep_list:
        result = result.replace(word_replace[0], word_replace[1])
    # result = result.replace("'s", "")
    return result

def drop_char(text):
    result = text.lower()
    result = re.sub(r'[^\w\s]', ' ', result)  # strip punctuation and special characters
    result = re.sub(r'\s+', ' ', result)      # collapse runs of whitespace to a single space
    return result

def stemed_words(text, stop_words, lemma):
    word_list = [lemma.lemmatize(word, pos='v') for word in text.split() if word not in stop_words]
    result = " ".join(word_list)
    return result

def text_preprocess(text_seq):
    stop_words = stopwords.words("english")
    lemma = WordNetLemmatizer()
    result = []
    for text in text_seq:
        if pd.isnull(text):
            result.append(None)
            continue
        text = replace_abbreviation(text)
        text = drop_char(text)
        text = stemed_words(text, stop_words, lemma)
        result.append(text)
    return result

df['short'] = text_preprocess(df.Short)
df[['Short', 'short']]

# Train/test split: hold out 2,000 random rows as the test set
test_index = list(df.sample(2000).index)
df['label'] = np.where(df.index.isin(test_index), 'test', 'train')
df['label'].value_counts()

2. Encoding text sequences

        The 6,000 highest-frequency words (ranked by word frequency) form the dictionary used to encode each text as a sequence of integer ids.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_dict_fit(train_text_list, num_words):
    '''
    train_text_list: ['some thing today ', 'some thing today2']
    '''
    tok_params = {
        'num_words': num_words,  # vocabulary size: keep only the num_words most frequent words
        'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
        'lower': True,
        'split': ' ',
        'char_level': False,
        'oov_token': None,  # id assigned to out-of-vocabulary words
    }
    tok = Tokenizer(**tok_params)  # word-level tokenizer
    tok.fit_on_texts(train_text_list)
    return tok

def word_dict_apply_sequences(tok_model, text_list, len_vec):
    '''
    text_list: ['some thing today ', 'some thing today2']
    '''
    list_tok = tok_model.texts_to_sequences(text_list)  # map words to integer ids
    pad_params = {
        'sequences': list_tok,
        'maxlen': len_vec,    # length of each padded vector
        'padding': 'pre',     # 'pre' or 'post': pad at the front or the back
        'truncating': 'pre',  # 'pre' or 'post': truncate at the front or the back
        'value': 0,           # pad with zeros
    }
    seq_tok = pad_sequences(**pad_params)  # pad the encoded vectors; returns a 2D array
    return seq_tok

num_words, len_vec = 6000, 40
tok_model = word_dict_fit(df[df.label == 'train'].short, num_words)
tok_train = word_dict_apply_sequences(tok_model, df[df.label == 'train'].short, len_vec)
tok_test = word_dict_apply_sequences(tok_model, df[df.label == 'test'].short, len_vec)
tok_test


3. Text classification with LSTM

        The LSTM layer takes a 3D tensor of shape (batch_size, timesteps, input_dim), so its input can be a time series or the embeddings of a text sequence. With return_sequences set to False, it returns a 2D tensor of shape (batch_size, units).
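        As a quick sanity check of that shape contract (a throwaway sketch with dummy zero inputs, not part of the article's pipeline):

from tensorflow.keras import layers
import numpy as np

dummy = np.zeros((4, 40, 128), dtype='float32')              # (batch_size, timesteps, input_dim)
print(layers.LSTM(32)(dummy).shape)                          # (4, 32): only the last output
print(layers.LSTM(32, return_sequences=True)(dummy).shape)   # (4, 40, 32): the full sequence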

'''
Key LSTM layer parameters:
units: dimensionality of the output
activation: activation function
recurrent_activation: activation function for the recurrent step
use_bias: boolean, whether the layer uses a bias vector
dropout: float in [0, 1], fraction of input units to drop
recurrent_dropout: float in [0, 1], fraction of recurrent-state units to drop
return_sequences: if True, return the full output sequence (3D); if False, only the last output (2D)
'''
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

def init_lstm_model(max_features, embed_size):
    model = Sequential()
    model.add(Embedding(input_dim=max_features, output_dim=embed_size))
    model.add(Bidirectional(LSTM(units=32, activation='relu', recurrent_dropout=0.1)))
    model.add(Dropout(0.25, seed=1))
    model.add(Dense(64))
    model.add(Dropout(0.3, seed=1))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

def model_fit(model, x, y, test_x, test_y):
    return model.fit(x, y, batch_size=100, epochs=2, validation_data=(test_x, test_y))

embed_size = 128
lstm_model = init_lstm_model(num_words, embed_size)
model_train = model_fit(lstm_model, tok_train, np.array(df[df.label == 'train'].y),
                        tok_test, np.array(df[df.label == 'test'].y))
lstm_model.summary()

 

def ks_auc_value(y_value, y_pred):
    fpr, tpr, thresholds = roc_curve(list(y_value), list(y_pred))
    ks = max(tpr - fpr)
    auc = roc_auc_score(list(y_value), list(y_pred))
    return ks, auc

print('train_ks_auc', ks_auc_value(df[df.label == 'train'].y, lstm_model.predict(tok_train)))
print('test_ks_auc', ks_auc_value(df[df.label == 'test'].y, lstm_model.predict(tok_test)))
'''
train_ks_auc (0.7223217797649937, 0.922939132379851)
test_ks_auc (0.7046603930606234, 0.9140880065296716)
'''

4. Text classification with LSTM + Attention

        Now add an Attention layer after the LSTM layer to improve the results. (The model below also switches to frozen pretrained embeddings via an embedding_matrix; see the note after the code.)

from tensorflow.keras.models import Model

def init_lstm_attention_model(max_features, embed_size, embedding_matrix):
    # embedding_matrix: (max_features, embed_size) array of pretrained word vectors,
    # used as frozen weights for the Embedding layer (its construction is not shown here)
    input_ = layers.Input(shape=(40,))
    x = layers.Embedding(input_dim=max_features, output_dim=embed_size,
                         weights=[embedding_matrix], trainable=False)(input_)
    x = layers.Bidirectional(layers.LSTM(units=32, activation='relu',
                                         recurrent_dropout=0.1, return_sequences=True))(x)
    x = layers.Attention()([x, x])  # self-attention over the LSTM outputs: query = value
    x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64)(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(1, activation='sigmoid')(x)
    model = Model(inputs=input_, outputs=x)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

def model_fit(model, x, y, test_x, test_y):
    return model.fit(x, y, batch_size=100, epochs=5, validation_data=(test_x, test_y))

num_words, embed_size = 6000, 128
lstm_model2 = init_lstm_attention_model(num_words, embed_size, embedding_matrix)
model_train = model_fit(lstm_model2, tok_train, np.array(df[df.label == 'train'].y),
                        tok_test, np.array(df[df.label == 'test'].y))
print('train_ks_auc', ks_auc_value(df[df.label == 'train'].y, lstm_model2.predict(tok_train)))
print('test_ks_auc', ks_auc_value(df[df.label == 'test'].y, lstm_model2.predict(tok_test)))
'''
train_ks_auc (0.7126925954159541, 0.9199721561742299)
test_ks_auc (0.7239373279559567, 0.917086274086166)
'''

III. Key Takeaways

Save yourself ten years of detours

        Follow the WeChat official account Python风控模型与数据分析 and reply 文本分类5 to get the data and code for this article.

        More theory and code write-ups are shared there as well.
