
LSTM sentiment-analysis demo (PyTorch)


Contents

0. LSTM parameters in PyTorch

1. Data

2. Overall pipeline

  0. Config class for hyperparameters

  1. Data preprocessing

    (1) Deduplication and train/validation/test split

    (2) Tokenization; removing noise fields, whitespace and digits; lower-casing; stop-word filtering

  2. Building the vocab and the static word-embedding table

    (1) Building the vocab

    (2) Building the static word embeddings

    (3) Padding/truncating and mapping to ids

  3. DataSet and DataLoader

  4. Model definition

  5. Model instantiation

  6. Optimizer (layered learning rates), loss function, LR scheduler

  7. Training (train & eval) and saving the model state dict

  8. Test set evaluation

  9. Prediction without targets

3. Issues during training

  3.1 Overfitting

  3.2 Common questions

0. LSTM parameters in PyTorch

 Reference: PyTorch's LSTM parameters, inputs, and outputs
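As a quick shape reference, here is a minimal sketch (the sizes are illustrative, not taken from the project) of what nn.LSTM expects and returns with the default batch_first=False layout:

import torch
import torch.nn as nn

# input is (seq_len, batch, input_size) when batch_first=False (the default)
lstm = nn.LSTM(input_size=200, hidden_size=200, num_layers=2,
               dropout=0.1, bidirectional=True)
x = torch.randn(64, 32, 200)                 # (T, B, D)
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)  # (64, 32, 400): (T, B, num_directions * hidden_size)
print(h_n.shape)      # (4, 32, 200):  (num_layers * num_directions, B, hidden_size)
print(c_n.shape)      # (4, 32, 200)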

1. Data

           Dataset: online_shopping_10_cats

           Tencent 8M-word pretrained static word vectors: Tencent AI Lab Embedding Corpus for Chinese Words and Phrases

           How to load the Tencent pretrained vectors: see the "Tencent word vectors usage" guide
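A minimal loading sketch, assuming the Tencent file is in word2vec text format (the same gensim call that build_embedding below relies on); the path and limit here are illustrative:

from gensim.models import KeyedVectors

# limit caps how many vectors are read; the full file holds roughly 8,000,000 entries
w2v = KeyedVectors.load_word2vec_format('./data/Tencent_AILab_ChineseEmbedding.txt',
                                        binary=False, limit=500000)
print(w2v.vector_size)                    # 200
print(w2v.most_similar('手机', topn=3))   # sanity check on a common word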

2. Overall pipeline

  0. Config class for hyperparameters

class Config():
    model_name = 'lstm_attention'   # available models: "lstm_attention", "lstm"
    learning_rate = 0.0006          # learning rate
    max_seq = 64                    # maximum sequence length fed to the LSTM; not necessarily the real batch length, since the dataloader trims each batch to its own longest real sequence, so the sequence length may differ from batch to batch
    batch_size = 32                 # batch size
    epochs = 200                    # number of epochs
    embedding_dim = 200             # word embedding dimension
    layer_num = 2                   # number of LSTM layers
    num_classes = 2                 # number of label classes
    dropout = 0.1                   # dropout probability (keep ratio = 1 - dropout)
    bidirectional = True            # whether to use a bidirectional LSTM
    hidden_dim = 200                # LSTM hidden_size
    vocab_most_common = 55000       # keep the top-N words by frequency when building the vocab (the full vocab has 64221 words)
    pretrain_w2v_limit = 500000     # how many Tencent pretrained word vectors to load
    w2v_grad = True                 # whether the word embeddings are trainable
    focal_loss = False              # whether to use focal loss
    num_workers = 4                 # number of dataloader worker processes
    info_interval = 160             # print a log line every N training batches
    stop_word_path = './data/stopword.txt'                      # stop-word file
    pretrain_w2v = './data/Tencent_AILab_ChineseEmbedding.txt'  # Tencent 8M-word pretrained static vectors
    vocab_save_path = './word2vec/Vocab_MostCommon{}.txt'.format(vocab_most_common)  # vocab after filtering, sorted by frequency; filtering removes (1) stop words (2) low-frequency words
    embedding_path = './word2vec/Embedding_PretrianLimit{}.txt'.format(vocab_most_common, pretrain_w2v_limit)
    source_data = './data/online_shopping_10_cats.csv'
    train_data = './data/train.txt'
    val_data = './data/validation.txt'
    test_data = './data/test.txt'
    predict_data = './data/predict.txt'  # data to run prediction on
    checkpoint = './model/{}.ckpt'.format(model_name)

  1. Data preprocessing

    (1) Deduplication and train/validation/test split

import numpy as np
import pandas as pd


class CreateModelData():
    """
    Split one raw csv file into train / validation / test sets with a 7:2:1 ratio; each output line has the format: target \t text
    """
    def __init__(self):
        pass

    def load_csv_data(self, csv_data):
        """
        Load, deduplicate and shuffle
        """
        source_df = pd.read_csv(csv_data)
        # strip leading/trailing whitespace from the text column
        source_df.iloc[:, -1] = source_df.iloc[:, -1].str.strip()
        # drop any row with an empty field
        source_df = source_df.dropna(how='any')
        # shuffle
        index_shuffle = np.random.permutation(len(source_df))
        source_df = source_df.iloc[index_shuffle, :]
        return source_df

    def split_data_to_train_eval_test(self, dataframe):
        """
        For every cat (category) / label combination, split into train / eval / test with a 7:2:1 ratio
        """
        cats = dataframe.loc[:, 'cat'].unique()
        labels = dataframe.loc[:, 'label'].unique()
        train_df = pd.DataFrame(columns=dataframe.columns[-2:])
        val_df = pd.DataFrame(columns=dataframe.columns[-2:])
        test_df = pd.DataFrame(columns=dataframe.columns[-2:])
        for cat in cats:
            dataframe_cat = dataframe[dataframe.loc[:, 'cat'] == cat].loc[:, dataframe.columns[-2:]]
            for label in labels:
                dataframe_label = dataframe_cat[dataframe_cat.loc[:, 'label'] == label]
                size = dataframe_label.shape[0]
                train_end_idx = int(size * 0.7)
                val_end_idx = int(size * 0.9)
                train_df = pd.concat([train_df, dataframe_label.iloc[:train_end_idx, :]], axis=0)
                val_df = pd.concat([val_df, dataframe_label.iloc[train_end_idx:val_end_idx, :]], axis=0)
                test_df = pd.concat([test_df, dataframe_label.iloc[val_end_idx:, :]], axis=0)
        return train_df, val_df, test_df

    def save_csv(self, dataframe, path):
        """
        Save as a tab-separated csv file
        """
        dataframe.to_csv(path, sep='\t', header=None, index=None)

    def forward(self, source_data_path):
        """
        Entry point
        """
        source_df = self.load_csv_data(csv_data=source_data_path)
        # split 7:2:1 into train / val / test
        train_df, val_df, test_df = self.split_data_to_train_eval_test(dataframe=source_df)
        # save
        print("Source data: {} rows. After splitting, train: {} - eval: {} - test: {}, saved to: '{}' - '{}' - '{}'".format(
            len(source_df), len(train_df), len(val_df), len(test_df),
            './data/train.data', './data/val.data', './data/test.data'))
        self.save_csv(train_df, './data/train.data')
        self.save_csv(val_df, './data/val.data')
        self.save_csv(test_df, './data/test.data')

    (2) Tokenization; removing noise fields, whitespace and digits; lower-casing; stop-word filtering

  1. Tokenize with jieba
  2. Remove noise fields, whitespace and digits, normalize English letter case, and filter with the stop-word dictionary

import jieba
import pandas as pd


# data that carries labels
class DataProcessWithTarget():
    """
    ************* Preprocessing for the train / validation / test sets (files that carry a target) **************
    Steps:
    (1) jieba tokenization
    (2) remove stop words (low-frequency words are removed later when building the vocab), raw-data noise tokens, whitespace, digits, crawler tags; lower-case English text
    (3) save the tokenized result
    """
    def __init__(self):
        pass

    def load_csv(self, path):
        data_df = pd.read_csv(path, sep='\t', header=None)
        target = data_df.iloc[:, -2]
        data = data_df.iloc[:, -1]
        return data, target

    def load_stopword(self, path):
        """
        Load the stop-word list
        """
        stop_word = []
        with open(path, 'r', encoding='utf-8-sig') as f:
            for line in f:
                line = line.strip()
                if line:
                    stop_word.append(line)
        return stop_word

    def jieba_(self, text, stop_word):
        """
        jieba tokenization:
        (1) drop stop words
        (2) strip whitespace inside tokens and lower-case English tokens
        """
        words = jieba.lcut(text)
        words_list = []
        # per-token preprocessing
        for word in words:
            if word not in stop_word:
                # strip whitespace and lower-case English
                word = word.strip()
                word = word.lower()
                if word:
                    words_list.append(word)
        return words_list

    def save_file(self, target, data, path):
        if len(target) != len(data):
            raise Exception('Lengths do not match!')
        with open(path, 'w', encoding='utf-8') as w:
            for idx in range(len(data)):
                word_str = ' '.join(data[idx])
                w.write(str(target[idx]))
                w.write('\t')
                w.write(word_str)
                w.write('\n')

    def forward(self, source_path, stop_word_path, report_path):
        """
        Entry point
        return: tokenized data X and labels target
        """
        print('Preprocessing "{}" and saving the result to "{}", please wait...'.format(source_path, report_path))
        # load csv
        data, target = self.load_csv(path=source_path)
        # load stop words
        stop_word = self.load_stopword(stop_word_path)
        # tokenize; drop stop words, noise tokens, whitespace, digits, crawler tags; lower-case English
        data_list = []
        target_list = []
        for idx in range(len(target)):
            word_list = self.jieba_(data.iloc[idx], stop_word=stop_word)
            if word_list:
                data_list.append(word_list)
                target_list.append(target.iloc[idx])
            else:
                print('File "{}", line {}: empty after preprocessing, dropped'.format(source_path, idx + 1))
        # save
        self.save_file(target=target_list, data=data_list, path=report_path)
        return data_list, target_list

# unlabeled data used at prediction time; preprocessing must match the train/val/test pipeline exactly
class DataProcessNoTarget():
    """
    Preprocessing for predict data (once the model is online); must be identical to the training-time preprocessing
    return: the predict X array
    """
    def __init__(self):
        pass

    def load_data(self, path):
        text_list = []
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    text_list.append(line)
        return text_list

    def load_stopword(self, path):
        """
        Load the stop-word list
        """
        stop_word = []
        with open(path, 'r', encoding='utf-8-sig') as f:
            for line in f:
                line = line.strip()
                if line:
                    stop_word.append(line)
        return stop_word

    def jieba_(self, text, stop_word):
        """
        jieba tokenization:
        (1) drop stop words
        (2) strip whitespace inside tokens and lower-case English tokens
        """
        words = jieba.lcut(text)
        words_list = []
        # per-token preprocessing
        for word in words:
            if word not in stop_word:
                word = word.strip()
                word = word.lower()
                if word:
                    words_list.append(word)
        return words_list

    def data_2_id(self, vocab_2_id, max_seq, text):
        """
        Turn the tokenized text into model input X by mapping tokens to ids via the vocab.
        (1) truncate sequences longer than max_seq and left-pad shorter ones with <PAD>
        (2) since low-frequency words were dropped from the vocab, unknown tokens map to <UNK>
        return: X matrix, 2D numpy array
        """
        def padding(max_seq, X):
            """ Pad or truncate to the same length; padding goes in front of the real data """
            if len(X) < max_seq:
                while len(X) < max_seq:
                    X.insert(0, vocab_2_id['<PAD>'])
            else:
                X = X[:max_seq]
            return X

        X = []
        for line in text:
            # map to ids, falling back to the <UNK> token
            line = [vocab_2_id[word] if word in vocab_2_id else vocab_2_id["<UNK>"] for word in line]
            # pad or truncate to a fixed length, with padding in front of the real data
            line = padding(max_seq=max_seq, X=line)
            # collect X
            X.append(line)
        return np.array(X)

    def forward(self, source_path, stop_word_path, vocab_2_id, max_seq):
        """
        Entry point
        return: the predict data mapped to ids as a numpy matrix
        """
        print('Preprocessing "{}", please wait...'.format(source_path))
        # load the raw text
        data = self.load_data(path=source_path)
        # load stop words
        stop_word = self.load_stopword(stop_word_path)
        # tokenize; drop stop words, noise tokens, whitespace, digits, crawler tags; lower-case English
        data_list = []
        for idx in range(len(data)):
            word_list = self.jieba_(data[idx], stop_word=stop_word)
            if word_list:
                data_list.append(word_list)
            else:
                print('File "{}", line {}: empty after preprocessing, dropped'.format(source_path, idx + 1))
        # map and pad to ids
        data = self.data_2_id(vocab_2_id=vocab_2_id, max_seq=max_seq, text=data_list)
        return data

  2. Building the vocab and the static word-embedding table

    (1) Building the vocab

  1. Build the vocab dictionary from the train data and val data together, adding the <PAD> and <UNK> tokens (BEG/END are not needed here)
  2. Remove low-frequency words via most_common (only the highest-frequency words are kept)

import sys
import collections
from collections import Counter


def build_vocab(train_data, val_data, save_path, most_common=None):
    """
    Build the vocab from train data and val data together; add the <PAD> and <UNK> tokens and sort by frequency, high to low.
    (1) low-frequency words are dropped (only the top most_common words are kept)
    """
    vocab_dict = {}
    paths = [train_data, val_data]
    for _path in paths:
        with open(_path, 'r', encoding='utf-8-sig') as f:
            for line in f:
                line = line.strip()
                if line:
                    word_list = line.split()[1:]  # .split() splits on any whitespace
                    for word in word_list:
                        if word not in vocab_dict:
                            vocab_dict[word] = 1
                        else:
                            vocab_dict[word] = vocab_dict[word] + 1
    # keep the top most_common words
    if most_common is not None:
        ordered_vocab = Counter(vocab_dict).most_common(most_common)
    else:
        ordered_vocab = Counter(vocab_dict).most_common(sys.maxsize)
    # build the vocab-to-id dict and add the <PAD> and <UNK> tokens
    vocab_dict = collections.OrderedDict()
    vocab_dict["<PAD>"] = 0
    vocab_dict["<UNK>"] = 1
    for word, counts in ordered_vocab:
        if word not in vocab_dict:
            vocab_dict[word] = len(vocab_dict)
    # save vocab_2_id
    vocab_size = len(vocab_dict)
    with open(save_path, 'w', encoding='utf-8') as w:
        for idx, (k, v) in enumerate(vocab_dict.items()):
            w.write('{}\t{}'.format(k, v))
            if idx + 1 < vocab_size:
                w.write('\n')
    return vocab_dict

    (2) Building the static word embeddings

  1. Use the vocab dictionary and the Tencent 8M pretrained vectors to generate an embedding table with one row per vocab entry

from gensim.models import KeyedVectors


def build_embedding(vocab_2_id, pretrain_w2v, save_path):
    """
    Build a pretrained embedding table from the Tencent word vectors and save it as a numpy txt file
    """
    # load the Tencent vectors; limit caps how many vectors are read
    pretrain_w2v_model = KeyedVectors.load_word2vec_format(pretrain_w2v, binary=False, limit=config.pretrain_w2v_limit)
    # initialise the embedding table
    vocab_dim = len(vocab_2_id)
    embed_dim = pretrain_w2v_model.vector_size
    embedding_table = np.random.uniform(-1., 1., (vocab_dim, embed_dim))
    # copy the pretrained vectors into the table for words that have one
    for word, index in vocab_2_id.items():
        try:
            embedding_table[index] = pretrain_w2v_model[word]
        except KeyError:
            pass
    # save the embedding table
    np.savetxt(save_path, embedding_table)
    return embedding_table

    (3) Padding/truncating and mapping to ids

def data_2_id(vocab_2_id, max_seq, file_path):
    """
    Turn the tokenized text file into model input X and labels by mapping tokens to ids via the vocab.
    (1) truncate sequences longer than max_seq and left-pad shorter ones with <PAD>
    (2) since low-frequency words were dropped from the vocab, unknown tokens map to <UNK>
    return: X matrix (2D numpy array), Y vector (1D numpy array)
    """
    def padding(max_seq, X):
        """ Pad or truncate to the same length; padding goes in front of the real data """
        if len(X) < max_seq:
            while len(X) < max_seq:
                X.insert(0, vocab_2_id['<PAD>'])
        else:
            X = X[:max_seq]
        return X

    label = []
    X = []
    with open(file_path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            line = line.strip()
            if line:
                line_list = line.split()  # .split() splits on any whitespace
                # the label
                label.append(int(line_list[0]))  # labels must be cast to int
                # the tokens
                X_tmp = line_list[1:]
                # map to ids, falling back to the <UNK> token
                X_tmp = [vocab_2_id[word] if word in vocab_2_id else vocab_2_id["<UNK>"] for word in X_tmp]
                # pad or truncate to a fixed length, padding in front of the real data
                X_tmp = padding(max_seq=max_seq, X=X_tmp)
                # collect X
                X.append(X_tmp)
    return np.array(X), np.array(label)

  3. DataSet and DataLoader

  1. Map the data to ids and pad/truncate every sequence to a fixed length
  2. Build the DataSet and DataLoader

import collections
import numpy as np
import torch
from torch.utils.data import Dataset


class Data_Set(Dataset):
    """
    Build the dataset
    """
    def __init__(self, X, Label=None):
        """
        X: 2D numpy int64
        Label: 1D numpy int64
        """
        self.X = X
        self.Label = Label

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if self.Label is not None:
            X = torch.tensor(self.X[idx], dtype=torch.int64)  # torch's default integer type
            Label = torch.tensor(self.Label[idx], dtype=torch.int64)
            return X, Label
        # at predict time there is no label
        else:
            X = torch.tensor(self.X[idx], dtype=torch.int64)
            return X


def collate_fn(batch):
    """
    batch is a list.
    The collate_fn passed to DataLoader processes one batch:
    (1) convert the batch data to tensors
    (2) strip redundant PAD columns so the batch length equals the longest real sequence in the batch
    """
    def intercept(X):
        """
        X dim: [batch, T]
        Trim the tensor to the longest real length; this only works because <PAD> is id 0
        """
        max_seq = torch.max(torch.sum(X >= 1, dim=1))
        return X[:, -max_seq:]

    X_list = []
    label_list = []
    for item in batch:
        if isinstance(item, tuple):
            X, target_label = item  # X dim: [T]
            if not (torch.is_tensor(X) and torch.is_tensor(target_label)):
                X = torch.tensor(X)
                target_label = torch.tensor(target_label)
            X_list.append(X)
            label_list.append(target_label)
        # prediction data has no label
        else:
            X = item
            if not torch.is_tensor(X):
                X = torch.tensor(X)
            X_list.append(X)
    if label_list:
        X = torch.stack(X_list, dim=0)  # X dim: [batch, T]
        label = torch.stack(label_list, dim=0)
        return intercept(X), label
    else:
        X = torch.stack(X_list, dim=0)  # X dim: [batch, T]
        return intercept(X)


def get_vocab(file_path):
    """
    Load vocab_2_id
    """
    vocab_dict = collections.OrderedDict()
    with open(file_path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            line = line.strip()
            if line:
                key, value = line.split()
                vocab_dict[key] = int(value)
    return vocab_dict


def get_pretrain_embedding(file_path):
    """
    Load the pretrained Tencent embedding table
    """
    embedding = np.loadtxt(file_path)
    return embedding


def sort_eval(X, Y=None):
    """
    X: 2D
    Sort the validation/test X (and Y) by real sequence length, shortest first
    return: the sorted X, Y
    """
    if Y is not None:
        seq_len = np.sum(X > 0, axis=1)
        datas = list(zip(X, Y, seq_len))
        datas = sorted(datas, key=lambda i: i[-1])
        X, Y, _ = zip(*datas)
        return X, Y
    else:
        seq_len = np.sum(X > 0, axis=1)
        datas = list(zip(X, seq_len))
        datas = sorted(datas, key=lambda i: i[-1])
        X, _ = zip(*datas)
        return X


if __name__ == '__main__':
    pass

  4. Model definition

import torch
import torch.nn as nn
import torch.nn.functional as F


class LSTM_Model(nn.Module):
    def __init__(self,
                 vocab_size,
                 n_class,
                 embedding_dim,
                 hidden_dim,
                 num_layers,
                 dropout,
                 bidirectional,
                 embedding_weights=None,
                 train_w2v=True,
                 **kwargs):
        super(LSTM_Model, self).__init__()
        self.vocab_size = vocab_size
        self.n_class = n_class
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.bidirectional = bidirectional
        self.embedding_weights = embedding_weights
        self.train_w2v = train_w2v
        # embedding layer
        if self.embedding_weights is not None:
            self.embedding_weights = torch.tensor(self.embedding_weights,
                                                  dtype=torch.float32)  # torch rejects 64-bit numpy floats here, so convert to float32
            self.embedding = nn.Embedding.from_pretrained(self.embedding_weights)
            self.embedding.weight.requires_grad = self.train_w2v
        else:  # at predict time no pretrained embedding table needs to be passed in
            self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
            self.embedding.weight.requires_grad = self.train_w2v
            nn.init.uniform_(self.embedding.weight, -1., 1.)
        # LSTM
        self.lstm = nn.LSTM(input_size=self.embedding_dim,
                            hidden_size=self.hidden_dim,
                            num_layers=self.num_layers,
                            dropout=self.dropout,
                            bidirectional=self.bidirectional)
        # bidirectional
        if self.bidirectional:
            # FC (the first and last timestep outputs are concatenated)
            self.fc1 = nn.Linear(4 * self.hidden_dim, self.hidden_dim)
            self.fc2 = nn.Linear(self.hidden_dim, self.n_class)
        else:
            # FC
            self.fc1 = nn.Linear(self.hidden_dim, self.n_class)

    def forward(self, x):
        # 0. embedding
        embeddings = self.embedding(x)  # (B,T) --> (B,T,D)
        # 1. LSTM: by default nn.LSTM expects (seq, batch, dim), hence the permute
        outputs, states = self.lstm(embeddings.permute([1, 0, 2]))
        if self.bidirectional:
            input_tmp = torch.cat([outputs[0], outputs[-1]], dim=-1)
            outputs = F.relu(self.fc1(input_tmp))
            outputs = self.fc2(outputs)
        else:
            outputs = self.fc1(outputs[-1])
        return outputs


class LSTM_Attention(nn.Module):
    def __init__(self,
                 vocab_size,
                 n_class,
                 embedding_dim,
                 hidden_dim,
                 num_layers,
                 dropout,
                 bidirectional,
                 embedding_weights=None,
                 train_w2v=True,
                 **kwargs):
        super(LSTM_Attention, self).__init__()
        self.vocab_size = vocab_size
        self.n_class = n_class
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.bidirectional = bidirectional
        self.embedding_weights = embedding_weights
        self.train_w2v = train_w2v
        # embedding layer
        if self.embedding_weights is not None:
            self.embedding_weights = torch.tensor(self.embedding_weights, dtype=torch.float32)  # convert numpy float64 to float32
            self.embedding = nn.Embedding.from_pretrained(self.embedding_weights)
            self.embedding.weight.requires_grad = self.train_w2v
        else:  # at predict time no pretrained embedding table needs to be passed in
            self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
            self.embedding.weight.requires_grad = self.train_w2v
            nn.init.uniform_(self.embedding.weight, -1., 1.)
        # LSTM
        self.lstm = nn.LSTM(input_size=self.embedding_dim,
                            hidden_size=self.hidden_dim,
                            num_layers=self.num_layers,
                            dropout=self.dropout,
                            bidirectional=self.bidirectional)
        # bidirectional
        if self.bidirectional:
            # attention
            self.attention1 = nn.Linear(2 * self.hidden_dim, 2 * self.hidden_dim)
            self.attention2 = nn.Linear(2 * self.hidden_dim, 1)
            # FC
            self.fc1 = nn.Linear(2 * self.hidden_dim, self.hidden_dim)
            self.fc2 = nn.Linear(self.hidden_dim, self.n_class)
        else:
            # attention
            self.attention1 = nn.Linear(self.hidden_dim, self.hidden_dim)
            self.attention2 = nn.Linear(self.hidden_dim, 1)
            # FC
            self.fc1 = nn.Linear(self.hidden_dim, self.hidden_dim)
            self.fc2 = nn.Linear(self.hidden_dim, self.n_class)

    def forward(self, x):
        # 0. embedding
        embeddings = self.embedding(x)  # (B,T) --> (B,T,D)
        # 1. LSTM: by default nn.LSTM expects (seq, batch, dim), hence the permute
        outputs, states = self.lstm(embeddings.permute([1, 0, 2]))
        T, B, D = outputs.size()  # D = 2 * hidden_dim when bidirectional
        outputs = outputs.permute([1, 0, 2])
        # attention
        u = torch.tanh(self.attention1(outputs))
        v = self.attention2(u)
        att_scores = F.softmax(v, dim=1)
        encoding = torch.sum(torch.mul(outputs, att_scores), dim=1)
        # FC
        outputs = F.relu6(self.fc1(encoding))
        outputs = self.fc2(outputs)
        return outputs


if __name__ == '__main__':
    lstm_attention = LSTM_Attention(10000, 2, 200, 256, 2, 0.2, bidirectional=True, embedding_weights=None, train_w2v=True)
    print(lstm_attention)

  5. Model instantiation

# build the model
if Config.model_name == 'lstm_attention':
    model = LSTM_Attention(vocab_size=len(vocab_2_id),
                           n_class=Config.num_classes,
                           embedding_dim=Config.embedding_dim,
                           hidden_dim=Config.hidden_dim,
                           num_layers=Config.layer_num,
                           dropout=Config.dropout,
                           bidirectional=Config.bidirectional,
                           embedding_weights=embedding_table,
                           train_w2v=Config.w2v_grad
                           )
    # print(model.embedding.weight)
else:
    model = LSTM_Model(vocab_size=len(vocab_2_id),
                       n_class=Config.num_classes,
                       embedding_dim=Config.embedding_dim,
                       hidden_dim=Config.hidden_dim,
                       num_layers=Config.layer_num,
                       dropout=Config.dropout,
                       bidirectional=Config.bidirectional,
                       embedding_weights=embedding_table,
                       train_w2v=Config.w2v_grad
                       )
print('Model "{}" details:\n'.format(Config.model_name), model)
view_will_trained_params(model, model_name=Config.model_name)

  6. Optimizer (layered learning rates), loss function, LR scheduler

# optimizer with layered learning rates
# the embedding layer is initialised from the Tencent pretrained vectors, so it gets a much smaller learning rate, typically about 10x below the usual network learning rate
special_layers = nn.ModuleList([model.embedding])
# memory ids of the parameters in the special layers
special_layers_ids = list(map(lambda x: id(x), special_layers.parameters()))
# parameters of the remaining (basic) layers
basic_params = filter(lambda x: id(x) not in special_layers_ids, model.parameters())
optimizer = optim.Adam([{'params': filter(lambda p: p.requires_grad, basic_params)},
                        {'params': filter(lambda p: p.requires_grad, special_layers.parameters()), 'lr': 8e-5}],
                       lr=Config.learning_rate)

import math

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score


def view_will_trained_params(model, model_name):
    """
    ********** Show which layers of the model will be trained and which are frozen ************
    """
    train_params = []
    for name, param in model.named_parameters():
        if param.requires_grad == True:
            train_params.append((name, param.shape))
    print("\nLayers of {} that will be trained:\n".format(model_name), train_params, end='\n\n\n')


def get_device():
    dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    device = torch.device(dev)
    return device


def focal_loss(output, target, alpha=1.0, gamma=2.0, *args, **kwargs):
    """
    ********** Compute the loss from the forward output [batch, class] and the targets [batch,] ************
    1. focal loss is only used during training, not during validation
    2. per-sample losses are computed with reduction="none" and then averaged
    """
    assert np.ndim(output) == 2
    assert np.ndim(target) == 1
    assert len(output) == len(target)
    ce_loss = F.cross_entropy(input=output, target=target, reduction="none")  # must use "none" here; ce_loss dim: [B,]
    pt = torch.exp(-ce_loss)  # pt dim: [B,]
    # build the focal loss
    focalloss = (alpha * (torch.tensor(1.0) - pt) ** gamma * ce_loss).mean()
    return focalloss


def cross_entropy(output, target, *args, **kwargs):
    """
    Plain cross-entropy loss, averaged over the batch
    """
    assert np.ndim(output) == 2
    assert np.ndim(target) == 1
    assert len(output) == len(target)
    ce_loss = F.cross_entropy(input=output, target=target, reduction="mean")  # ce_loss is a scalar mean
    return ce_loss


class WarmupCosineLR():
    def __init__(self, optimizer, warmup_iter: int, lrs_min: list = [1e-5, ], T_max: int = 10):
        """
        ******************* Custom PyTorch LR schedule: warmup + cosine decay **************************
        See: https://blog.csdn.net/qq_36560894/article/details/114004799
        Args:
            optimizer (Optimizer): PyTorch optimizer
            warmup_iter: number of warmup epochs
            lrs_min: list of minimum learning rates, one per optimizer param group
            T_max: cosine half-period; must be larger than warmup_iter
        Feature:
            supports decaying several param groups (layered learning rates) at once
        """
        self.optimizer = optimizer
        self.warmup_iter = warmup_iter
        self.lrs_min = lrs_min
        self.T_max = T_max
        self.base_lrs = [i['lr'] for i in optimizer.param_groups]

    def get_lr(self):
        if self.iter < self.warmup_iter:
            return [i * self.iter * 1. / self.warmup_iter for i in self.base_lrs]
        else:
            return [self.lrs_min[idx] + 0.5 * (i - self.lrs_min[idx]) * (1.0 + math.cos((self.iter - self.warmup_iter) / (self.T_max - self.warmup_iter) * math.pi))
                    for idx, i in enumerate(self.base_lrs)]

    def step(self, iter: int):
        if iter == 0:
            iter = iter + 1
        self.iter = iter
        # learning rates for the current epoch
        decay_lrs = self.get_lr()
        # update the optimizer
        for param_group, lr in zip(self.optimizer.param_groups, decay_lrs):
            param_group['lr'] = lr


def get_score(target, predict):
    """
    Given the true labels target and the predicted labels predict, compute acc, recall, precision and F1
    """
    import warnings
    warnings.filterwarnings('ignore')
    assert np.ndim(target) == 1
    assert np.ndim(predict) == 1
    assert np.shape(target) == np.shape(predict)
    con_matrix = confusion_matrix(y_true=target, y_pred=predict)
    # accuracy
    acc = accuracy_score(y_true=target, y_pred=predict)
    # macro recall
    recall = recall_score(y_true=target, y_pred=predict, average='macro')
    # macro precision
    precision = precision_score(y_true=target, y_pred=predict, average='macro')
    # macro F1
    F1 = f1_score(y_true=target, y_pred=predict, average='macro')
    return (acc, recall, precision, F1), con_matrix


if __name__ == "__main__":
    # 0. plot the warmup + cosine LR curve
    import torch.optim as optim
    import matplotlib.pyplot as plt
    optimizer = optim.Adam(params=[torch.ones((3, 4), requires_grad=True)], lr=0.01)
    scheduler_ = WarmupCosineLR(optimizer,
                                warmup_iter=5,
                                lrs_min=[0.001, ],
                                T_max=50)
    lr = optimizer.param_groups[0]['lr']
    print(lr)
    y = []
    x = []
    for epoch in range(200):
        scheduler_.step(epoch + 1)
        print(optimizer.param_groups[0]['lr'])
        y.append(optimizer.param_groups[0]['lr'])
        x.append(epoch + 1)
    plt.plot(x, y)
    plt.show()
    # 1. compute scores
    y_t = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    y_p = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 2, 2, 1, 1, 1, 1, 0, 1, 1]
    print(get_score(y_t, y_p))

  7. Training (train & eval) and saving the model state dict

  How gradient clipping fits into the loop:
  1. compute the loss
  2. back-propagate the loss
  3. clip the gradients
  4. let the optimizer update the parameters

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()
# gradient clipping (use the in-place clip_grad_norm_; the old clip_grad_norm is deprecated)
torch.nn.utils.clip_grad_norm_(filter(lambda p: p.requires_grad, model.parameters()), args.clip)
optimizer.step()

from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
from __future__ import with_statement
from model import LSTM_Attention, LSTM_Model
from data_process import data_2_id
from loader_utils import get_vocab, get_pretrain_embedding, Data_Set, collate_fn, sort_eval
from model_utils import view_will_trained_params, focal_loss, cross_entropy, WarmupCosineLR, get_score
from create_config import Config
from torch.utils.data import DataLoader
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import copy
import os


def train_one_epoch(model, device, optimizer, loss_fun, metric_fun, train_loader, current_epoch, info_interval: int = None):
    """
    ********** Train the model for one epoch ************
    (See elsewhere for the differences between model.eval(), model.train(), torch.no_grad() and torch.set_grad_enabled(bool).)
    return:
        (1) batch_losses: list of per-batch mean losses
        (2) acc, recall, precision, F1 over the whole epoch
    """
    print('Training ... ')
    model.train()
    model.to(device)
    LRs = [i['lr'] for i in optimizer.param_groups]  # the optimizer's learning-rate groups for this epoch
    batch_losses = []
    batch_targets = []
    batch_predicts = []
    for idx, (input_x, target) in enumerate(train_loader):
        input_x, target = input_x.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(input_x)  # forward pass
        loss = loss_fun(output, target, alpha=1.0, gamma=2.0)
        loss.backward()  # back-propagate to compute gradients
        optimizer.step()  # update the parameters
        batch_losses.append(loss.item())
        # compute the scores
        pre = torch.argmax(output, dim=1)
        pre = pre.cpu().numpy().reshape(-1).tolist()
        target = target.cpu().numpy().reshape(-1).tolist()
        (acc, recall, precision, F1), con_matrix = metric_fun(target=target, predict=pre)
        batch_targets.extend(target)
        batch_predicts.extend(pre)
        if info_interval is not None:
            if idx % info_interval == 0:
                print("Epoch:{}\t[{}/{}\t\t{:.2f}%]\tLoss:{:.8f}\tScores: < acc:{:.3f}%\t"
                      "macro_recall:{:.3f}%\tmacro_precision:{:.3f}%\tmacro_F1:{:.3f}%\t >\t\tBatch input_x shape:{}".format(
                          current_epoch, idx * len(input_x),
                          len(train_loader.dataset), 100. * (idx / len(train_loader)), loss.item(),
                          100. * acc, 100. * recall, 100. * precision, 100. * F1, input_x.shape
                      ))
    # scores over the whole epoch
    (epoch_acc, epoch_recall, epoch_precision, epoch_F1), con_matrix = metric_fun(target=batch_targets, predict=batch_predicts)
    print("Epoch Info :\tLoss:{:.8f}\tScores: <\tacc:{:.3f}%\t "
          "macro_recall:{:.3f}%\t macro_precision:{:.3f}%\t macro_F1:{:.3f}%\t>\tLRs:{}".format(
              np.mean(batch_losses), 100. * epoch_acc, 100. * epoch_recall, 100. * epoch_precision, 100. * epoch_F1, LRs
          ))
    return batch_losses, [epoch_acc, epoch_recall, epoch_precision, epoch_F1]


def eval_one_epoch(model, device, loss_fun, metric_fun, eval_loader):
    """
    ********** Validate the model for one epoch ************
    return: batch_losses (per-batch mean losses) plus epoch acc, recall, precision, F1
    """
    print('Evaling ... ')
    model.eval()  # switches dropout/BN to eval mode; it does not disable gradient tracking, so wrapping in torch.no_grad() is still worthwhile to speed things up
    model.to(device)
    batch_losses = []
    batch_targets = []
    batch_predicts = []
    with torch.no_grad():
        for idx, (input_x, target) in enumerate(eval_loader):
            input_x, target = input_x.to(device), target.to(device)
            output = model(input_x)  # forward pass
            loss = loss_fun(output, target, alpha=1.0, gamma=2.0)
            batch_losses.append(loss.item())
            # compute the scores
            pre = torch.argmax(output, dim=1)
            pre = pre.cpu().numpy().reshape(-1).tolist()
            target = target.cpu().numpy().reshape(-1).tolist()
            (acc, recall, precision, F1), con_matrix = metric_fun(target=target, predict=pre)
            batch_targets.extend(target)
            batch_predicts.extend(pre)
    # scores over the whole epoch
    (epoch_acc, epoch_recall, epoch_precision, epoch_F1), con_matrix = metric_fun(target=batch_targets, predict=batch_predicts)
    print(
        "Epoch Info :\tLoss:{:.8f}\tScores: <\tacc:{:.3f}%\t "
        "macro_recall:{:.3f}%\t macro_precision:{:.3f}%\t macro_F1:{:.3f}%\t>".format(
            np.mean(batch_losses), 100. * epoch_acc, 100. * epoch_recall,
            100. * epoch_precision, 100. * epoch_F1
        ))
    return batch_losses, [epoch_acc, epoch_recall, epoch_precision, epoch_F1]


def train(model, device, optimizer, scheduler_fun, loss_fun, epochs, metric_fun, info_interval, checkpoint, train_loader, eval_loader):
    """
    ********** Full training loop ************
    return:
        (1) train_losses, eval_losses: 2D lists, (epoch, batch_num)
        (2) train_scores, eval_scores: 2D lists, (epoch, 4) with acc, recall, precision, F1
    """
    # load the best saved parameters if they exist (supports resuming training)
    best_scores = [-0.000001, -0.000001, -0.000001, -0.000001]  # initial acc, recall, precision, F1
    history_epoch, best_epoch = 0, 0  # epochs already trained, and the epoch of the best model
    best_params = copy.deepcopy(model.state_dict())  # state_dict is an OrderedDict (a reference), so it must be deep-copied
    best_optimizer = copy.deepcopy(optimizer.state_dict())
    LRs = [i['lr'] for i in optimizer.param_groups]
    if os.path.exists(checkpoint):
        """
        So that parameters trained on gpu or cpu can be loaded on either device, use map_location=lambda storage, loc: storage; see:
        https://blog.csdn.net/nospeakmoreact/article/details/89634039
        """
        if torch.cuda.is_available():
            ck_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage.cuda())  # load the parameters onto the gpu
        else:
            ck_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage)  # load the parameters onto the cpu
        best_scores = ck_dict['best_score']
        history_epoch, best_epoch = ck_dict['epochs'], ck_dict['best_epochs']
        model.load_state_dict(ck_dict['best_params'])
        # optimizer.load_state_dict(ck_dict['optimizer'])
        # if torch.cuda.is_available():
        #     """
        #     When reloading the optimizer state, move all tensors back onto cuda (they are saved on cpu by default); see:
        #     https://blog.csdn.net/weixin_41848012/article/details/105675735
        #     """
        #     for state in optimizer.state.values():
        #         for k, v in state.items():
        #             if torch.is_tensor(v):
        #                 state[k] = v.cuda()
        best_params = copy.deepcopy(model.state_dict())  # deep-copy again after loading
        # best_optimizer = copy.deepcopy(optimizer.state_dict())
        LRs = [i['lr'] for i in optimizer.param_groups]
        print('From "{}" load history model params:\n\tTrained Epochs:{}\n\t'
              'Best Model Epoch:{}\n\tPer-group LRs:{}\n\tBest Score:<\tacc:{:.3f}%\t'
              ' macro_recall:{:.3f}%\t macro_precision:{:.3f}%\t macro_F1:{:.3f}%\t>\n'.format(
                  checkpoint, history_epoch, best_epoch, LRs,
                  100. * best_scores[0], 100. * best_scores[1],
                  100. * best_scores[2], 100. * best_scores[3]))
        # print(best_params)
        # print(best_optimizer)
    # Train
    train_losses = []
    eval_losses = []
    train_scores = []
    eval_scores = []
    for epoch in range(1, epochs + 1):
        # set this epoch's learning rates; the historical epoch count is used so the warmup + cosine schedule continues correctly
        scheduler_fun.step(history_epoch + epoch)
        LRs = [i['lr'] for i in optimizer.param_groups]
        # train & eval
        train_batch_loss, train_score = train_one_epoch(model=model,
                                                        device=device,
                                                        optimizer=optimizer,
                                                        loss_fun=loss_fun,
                                                        metric_fun=metric_fun,
                                                        train_loader=train_loader,
                                                        current_epoch=history_epoch + epoch,
                                                        info_interval=info_interval)
        print()
        eval_batch_loss, eval_score = eval_one_epoch(model=model,
                                                     device=device,
                                                     loss_fun=loss_fun,
                                                     metric_fun=metric_fun,
                                                     eval_loader=eval_loader)
        train_losses.append(train_batch_loss)
        eval_losses.append(eval_batch_loss)
        train_scores.append(train_score)
        eval_scores.append(eval_score)
        # save the model whenever the validation F1 beats the best F1 so far
        if best_scores[3] < eval_score[3]:
            print('Previous best score: {:.3f}%, new score: {:.3f}%, optimizer LRs: {}; saving updated model parameters\n'.format(
                100. * best_scores[3], 100. * eval_score[3], LRs))
            best_scores = eval_score
            best_params = copy.deepcopy(model.state_dict())
            best_optimizer = copy.deepcopy(optimizer.state_dict())
            best_epoch = history_epoch + epoch
        else:
            print("Best epoch so far: {}, best validation score: {:.3f}%; the model did not improve\n".format(best_epoch, 100. * best_scores[3]))
        ck_dict = {
            "best_score": best_scores,
            "best_params": best_params,
            "optimizer": best_optimizer,
            'epochs': history_epoch + epoch,
            'best_epochs': best_epoch
        }
        torch.save(ck_dict, checkpoint)
    # after training, load the best parameters back into the model
    model.load_state_dict(best_params)
    return model, train_losses, eval_losses, train_scores, eval_scores


if __name__ == '__main__':
    dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    device = torch.device(dev)
    # load the data
    vocab_2_id = get_vocab(Config.vocab_save_path)  # vocab, 50002 entries
    embedding_table = get_pretrain_embedding(Config.embedding_path)  # (50002, 200); numpy defaults to float64, torch needs float32
    # DataSet / DataLoader
    X_train, target_train = data_2_id(vocab_2_id, Config.max_seq, Config.train_data)
    kwargs = {'num_workers': Config.num_workers, 'pin_memory': True} if torch.cuda.is_available() else {'num_workers': Config.num_workers}
    train_dataset = Data_Set(X_train, target_train)
    train_loader = DataLoader(dataset=train_dataset,
                              batch_size=Config.batch_size,
                              shuffle=True,
                              collate_fn=collate_fn,
                              **kwargs
                              )
    print('First batch from the dataloader:')
    print(next(iter(train_loader)), next(iter(train_loader))[0].shape)
    X_val, target_val = data_2_id(vocab_2_id, Config.max_seq, Config.val_data)
    # TODO to avoid padding mismatched lengths with too many PADs, sort the eval data by real length, shortest first
    X_val, target_val = sort_eval(X_val, target_val)
    val_dataset = Data_Set(X_val, target_val)
    val_loader = DataLoader(dataset=val_dataset,
                            batch_size=Config.batch_size,
                            shuffle=False,
                            collate_fn=collate_fn,
                            **kwargs
                            )
    # build the model
    if Config.model_name == 'lstm_attention':
        model = LSTM_Attention(vocab_size=len(vocab_2_id),
                               n_class=Config.num_classes,
                               embedding_dim=Config.embedding_dim,
                               hidden_dim=Config.hidden_dim,
                               num_layers=Config.layer_num,
                               dropout=Config.dropout,
                               bidirectional=Config.bidirectional,
                               embedding_weights=embedding_table,
                               train_w2v=Config.w2v_grad
                               )
        # print(model.embedding.weight)
    else:
        model = LSTM_Model(vocab_size=len(vocab_2_id),
                           n_class=Config.num_classes,
                           embedding_dim=Config.embedding_dim,
                           hidden_dim=Config.hidden_dim,
                           num_layers=Config.layer_num,
                           dropout=Config.dropout,
                           bidirectional=Config.bidirectional,
                           embedding_weights=embedding_table,
                           train_w2v=Config.w2v_grad
                           )
    print('Model "{}" details:\n'.format(Config.model_name), model)
    view_will_trained_params(model, model_name=Config.model_name)
    # optimizer, LR scheduler, loss function; layered learning rates
    special_layers = nn.ModuleList([model.embedding])
    # memory ids of the parameters in the special layers
    special_layers_ids = list(map(lambda x: id(x), special_layers.parameters()))
    # parameters of the remaining (basic) layers
    basic_params = filter(lambda x: id(x) not in special_layers_ids, model.parameters())
    optimizer = optim.Adam([{'params': filter(lambda p: p.requires_grad, basic_params)},
                            {'params': filter(lambda p: p.requires_grad, special_layers.parameters()), 'lr': 8e-5}],
                           lr=Config.learning_rate)
    scheduler_fun = WarmupCosineLR(optimizer, warmup_iter=4, lrs_min=[5e-5, 1e-6], T_max=40)
    # train
    if Config.focal_loss:
        loss_fun = focal_loss
    else:
        loss_fun = cross_entropy
    train(model=model,
          device=device,
          optimizer=optimizer,
          scheduler_fun=scheduler_fun,
          loss_fun=loss_fun,
          epochs=Config.epochs,
          metric_fun=get_score,
          info_interval=Config.info_interval,
          checkpoint=Config.checkpoint,
          train_loader=train_loader,
          eval_loader=val_loader)

  8. Test set evaluation

from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
from __future__ import with_statement
from model import LSTM_Attention, LSTM_Model
from data_process import data_2_id
from loader_utils import get_vocab, Data_Set, collate_fn, sort_eval
from model_utils import get_score
from create_config import Config
from torch.utils.data import DataLoader
import torch
import numpy as np
import os
import re


def eval_one_epoch(model, device, metric_fun, eval_loader):
    """
    ********** Evaluate the model for one epoch ************
    """
    print('Predict ... ')
    model.eval()  # switches dropout/BN to eval mode; it does not disable gradient tracking, so torch.no_grad() is still worthwhile to speed things up
    model.to(device)
    batch_targets = []
    batch_predicts = []
    error_samples = []
    with torch.no_grad():
        for idx, (input_x, target) in enumerate(eval_loader):
            input_x, target = input_x.to(device), target.to(device)
            output = model(input_x)  # forward pass
            # compute the scores and collect misclassified samples
            pre = torch.argmax(output, dim=1)
            error_x = input_x[target != pre]
            error_target = pre[target != pre]
            pre = pre.cpu().numpy().reshape(-1).tolist()
            target = target.cpu().numpy().reshape(-1).tolist()
            error_x = error_x.cpu().numpy().tolist()
            error_target = error_target.cpu().numpy().tolist()
            batch_targets.extend(target)
            batch_predicts.extend(pre)
            error_samples.append((error_target, error_x))
    # scores over the whole epoch
    (epoch_acc, epoch_recall, epoch_precision, epoch_F1), con_matrix = metric_fun(target=batch_targets,
                                                                                  predict=batch_predicts)
    print(
        "Epoch Info :\tScores: <\tacc:{:.3f}%\t macro_recall:{:.3f}%\t"
        " macro_precision:{:.3f}%\t macro_F1:{:.3f}%\t>".format(100. * epoch_acc, 100. * epoch_recall,
                                                                100. * epoch_precision, 100. * epoch_F1
                                                                ))
    return [epoch_acc, epoch_recall, epoch_precision, epoch_F1], con_matrix, error_samples


def predict(model, device, metric_fun, checkpoint, predict_loader):
    """
    ********** Test the model ************
    """
    # load the best saved parameters if they exist
    if os.path.exists(checkpoint):
        if torch.cuda.is_available():
            ck_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage.cuda())  # load onto gpu
        else:
            ck_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage)  # load onto cpu
        best_scores = ck_dict['best_score']
        history_epoch, best_epoch = ck_dict['epochs'], ck_dict['best_epochs']
        model.load_state_dict(ck_dict['best_params'])
        print(
            'From "{}" load history model params:\n\tTrained Epochs:{}\n\tBest Model Epoch:{}\n'
            '\tBest Score:<\tacc:{:.3f}%\t macro_recall:{:.3f}%\t macro_precision:{:.3f}%\t macro_F1:{:.3f}%\t>\n\t'.format(
                checkpoint, history_epoch, best_epoch, 100. * best_scores[0], 100. * best_scores[1], 100. * best_scores[2],
                100. * best_scores[3]))
        # predict
        eval_score, con_matrix, error_samples = eval_one_epoch(model=model,
                                                               device=device,
                                                               metric_fun=metric_fun,
                                                               eval_loader=predict_loader)
    else:
        print('Model not exists .... ')
        eval_score = None
        con_matrix = None
        error_samples = None
        exit()
    return eval_score, con_matrix, error_samples


if __name__ == '__main__':
    dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    device = torch.device(dev)
    # load the data; the pretrained embedding table is not needed here, loading the model parameters restores it
    vocab_2_id = get_vocab(Config.vocab_save_path)  # vocab, 50002 entries
    # DataSet / DataLoader
    X_test, target_test = data_2_id(vocab_2_id, Config.max_seq, Config.test_data)
    # TODO to avoid padding mismatched lengths with too many PADs, sort the eval data by real length, shortest first
    X_test, target_test = sort_eval(X_test, target_test)
    kwargs = {'num_workers': Config.num_workers, 'pin_memory': True} if torch.cuda.is_available() else {
        'num_workers': Config.num_workers}
    test_dataset = Data_Set(X_test, target_test)
    test_loader = DataLoader(dataset=test_dataset,
                             batch_size=Config.batch_size,
                             shuffle=False,
                             collate_fn=collate_fn,
                             **kwargs
                             )
    print('First batch from the dataloader:')
    print(next(iter(test_loader)), next(iter(test_loader))[0].shape)
    # build the model
    if Config.model_name == 'lstm_attention':
        model = LSTM_Attention(vocab_size=len(vocab_2_id),
                               n_class=Config.num_classes,
                               embedding_dim=Config.embedding_dim,
                               hidden_dim=Config.hidden_dim,
                               num_layers=Config.layer_num,
                               dropout=Config.dropout,
                               bidirectional=Config.bidirectional,
                               embedding_weights=None,  # restored from the checkpoint at predict time
                               train_w2v=Config.w2v_grad
                               )
        # print(model.embedding.weight)
    else:
        model = LSTM_Model(vocab_size=len(vocab_2_id),
                           n_class=Config.num_classes,
                           embedding_dim=Config.embedding_dim,
                           hidden_dim=Config.hidden_dim,
                           num_layers=Config.layer_num,
                           dropout=Config.dropout,
                           bidirectional=Config.bidirectional,
                           embedding_weights=None,
                           train_w2v=Config.w2v_grad
                           )
    print('Model "{}" details:\n'.format(Config.model_name), model)
    # predict
    _, con_matrix, error_samples = predict(model=model,
                                           device=device,
                                           metric_fun=get_score,
                                           checkpoint=Config.checkpoint,
                                           predict_loader=test_loader)
    print('Confusion matrix:\n', con_matrix)
    # save the misclassified test samples
    print('Saving the misclassified test samples to "{}"'.format('./data/test_error_sample.data'))
    error_target, error_x = zip(*error_samples)
    error_target_ = []
    error_x_ = []
    for i in range(len(error_target)):
        for j in range(len(error_target[i])):
            error_target_.append(error_target[i][j])
            error_x_.append(error_x[i][j])
    print(len(error_target_), len(error_x_))
    vocab_keys = list(vocab_2_id.keys())
    error_x_ = [np.array(vocab_keys)[np.array(i)].tolist() for i in error_x_]
    with open('./data/test_error_sample.data', 'w', encoding='utf-8') as w:
        for idx in range(len(error_target_)):
            word_str = ''.join(error_x_[idx])
            word_str = re.sub('<PAD>\s*', '', word_str)
            w.write(str(error_target_[idx]))
            w.write('\t')
            w.write(word_str)
            w.write('\n')

  9. Prediction without targets

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2021/8/20 23:06
# @Author :
# @Site :
# @File : predict.py
# @Software: PyCharm
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
from __future__ import with_statement
from model import LSTM_Attention, LSTM_Model
from data_process import data_2_id, DataProcessNoTarget
from loader_utils import get_vocab, Data_Set, collate_fn
from create_config import Config
from torch.utils.data import DataLoader
import torch
import torch.nn.functional as F
import os
import numpy as np


def eval_one_epoch(model, device, eval_loader):
    """
    ********** Run the model over the predict data ************
    """
    print('Predict ... ')
    model.eval()  # switches dropout/BN to eval mode; torch.no_grad() is still used to speed things up
    model.to(device)
    batch_predicts = []
    batch_probs = []
    with torch.no_grad():
        for idx, input_x in enumerate(eval_loader):
            input_x = input_x.to(device)
            output = model(input_x)  # forward pass
            output = F.softmax(output, dim=-1)
            # predicted class and its probability
            prob, pre = torch.max(output, dim=-1)
            prob = prob.cpu().numpy().reshape(-1).tolist()
            pre = pre.cpu().numpy().reshape(-1).tolist()
            batch_predicts.extend(pre)
            batch_probs.append(prob)
    return np.array(batch_predicts), np.array(batch_probs)


def predict(model, device, checkpoint, predict_loader):
    """
    ********** Model prediction ************
    """
    # load the best saved parameters if they exist
    if os.path.exists(checkpoint):
        if torch.cuda.is_available():
            ck_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage.cuda())  # load onto gpu
        else:
            ck_dict = torch.load(checkpoint, map_location=lambda storage, loc: storage)  # load onto cpu
        best_scores = ck_dict['best_score']
        history_epoch, best_epoch = ck_dict['epochs'], ck_dict['best_epochs']
        model.load_state_dict(ck_dict['best_params'])
        print(
            'From "{}" load history model params:\n\tTrained Epochs:{}\n\tBest Model Epoch:{}\n'
            '\tBest Score:<\tacc:{:.3f}%\t macro_recall:{:.3f}%\t macro_precision:{:.3f}%\t macro_F1:{:.3f}%\t>\n\t'.format(
                checkpoint, history_epoch, best_epoch, 100. * best_scores[0], 100. * best_scores[1],
                100. * best_scores[2],
                100. * best_scores[3]))
        # predict
        predict_array, probs_array = eval_one_epoch(model=model,
                                                    device=device,
                                                    eval_loader=predict_loader)
    else:
        print('Model not exists .... ')
        predict_array = None
        probs_array = None
        exit()
    return predict_array, probs_array


if __name__ == '__main__':
    dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    device = torch.device(dev)
    # load the data
    vocab_2_id = get_vocab(Config.vocab_save_path)  # vocab, 50002 entries
    # DataSet / DataLoader
    X_predict = DataProcessNoTarget().forward(Config.predict_data, Config.stop_word_path, vocab_2_id, Config.max_seq)
    kwargs = {'num_workers': Config.num_workers, 'pin_memory': True} if torch.cuda.is_available() else {
        'num_workers': Config.num_workers}
    predict_dataset = Data_Set(X_predict)
    predict_loader = DataLoader(dataset=predict_dataset,
                                batch_size=Config.batch_size,
                                shuffle=False,
                                collate_fn=collate_fn,
                                **kwargs
                                )
    print('First batch from the dataloader:')
    print(next(iter(predict_loader)), next(iter(predict_loader))[0].shape)
    # build the model
    if Config.model_name == 'lstm_attention':
        model = LSTM_Attention(vocab_size=len(vocab_2_id),
                               n_class=Config.num_classes,
                               embedding_dim=Config.embedding_dim,
                               hidden_dim=Config.hidden_dim,
                               num_layers=Config.layer_num,
                               dropout=Config.dropout,
                               bidirectional=Config.bidirectional,
                               embedding_weights=None,  # restored from the checkpoint at predict time
                               train_w2v=Config.w2v_grad
                               )
        # print(model.embedding.weight)
    else:
        model = LSTM_Model(vocab_size=len(vocab_2_id),
                           n_class=Config.num_classes,
                           embedding_dim=Config.embedding_dim,
                           hidden_dim=Config.hidden_dim,
                           num_layers=Config.layer_num,
                           dropout=Config.dropout,
                           bidirectional=Config.bidirectional,
                           embedding_weights=None,
                           train_w2v=Config.w2v_grad
                           )
    print('Model "{}" details:\n'.format(Config.model_name), model)
    # predict
    predict_array, probs_array = predict(model=model,
                                         device=device,
                                         checkpoint=Config.checkpoint,
                                         predict_loader=predict_loader)
    # label 0 = 讨厌 (negative), label 1 = 喜欢 (positive)
    print('Predict results:\nlabels: {}\nconfidence: {}'.format(np.array(['讨厌', '喜欢'])[predict_array], probs_array))

3. Issues during training

  3.1 Overfitting

  1. First rule out problems with the model itself:
     - After switching to a simpler model the overfitting persisted, so a bug in the model code was ruled out.
     - How a bidirectional LSTM should separate PAD outputs from real outputs (see the sketch after this list):
       forward direction: because PAD is placed in front of the text, the output at the last timestep is the forward-direction representation of the whole sentence;
       backward direction: the leading timesteps are all PAD, so their outputs should not be used; the sentence representation should be taken according to each sample's real length.
  2. Data problems:
     - Is the data imbalanced?
     - Some samples are noisy and should be cleaned with machine-learning methods such as OneClassSVM.
     - Is the raw data shuffled? Does the train_dataloader shuffle, and are val_loader/test_loader sorted by real length (to avoid inserting too many PAD tokens when lengths differ widely)?
     - The stop-word dictionary and frequency filtering may remove useful words; in sentiment analysis, particles such as 哈, 吗, 呵 also carry subjective tone and should not be removed.
  3. Training problems:
     - The collate_fn defined for the dataloader trims every batch to its longest real length, so the sequence length can differ from batch to batch.
     - When pretrained embeddings are used, apply layered learning rates; too large a learning rate makes the pretrained weights oscillate.
     - The learning rate, the loss function (e.g. focal loss) and the LR decay strategy all matter. Focal loss mainly addresses class imbalance and correctly labelled but hard-to-learn samples, but if the data contains a lot of dirty samples it can lower accuracy.
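The sketch below illustrates the point about the backward direction: it is not part of the original code, and it assumes front padding with <PAD> = 0 as in data_2_id above; the function name bilstm_sentence_repr is hypothetical. Unlike LSTM_Model, which concatenates outputs[0] and outputs[-1] (4 * hidden_dim), this returns a 2 * hidden_dim vector, so fc1 would need its input size adjusted.

import torch

def bilstm_sentence_repr(outputs, x_ids, hidden_dim):
    """
    outputs: (T, B, 2*hidden_dim) from a bidirectional LSTM
    x_ids:   (B, T) input ids, front-padded with <PAD> = 0
    Returns a (B, 2*hidden_dim) sentence representation that ignores PAD positions.
    """
    T = outputs.size(0)
    real_len = (x_ids != 0).sum(dim=1)      # (B,) real lengths
    first_real = T - real_len               # index of each sample's first real token
    # forward direction: the state after the last (real) token
    fwd = outputs[-1, :, :hidden_dim]
    # backward direction: the state at the first real token, i.e. after reading only real tokens
    batch_idx = torch.arange(outputs.size(1), device=outputs.device)
    bwd = outputs[first_real, batch_idx, hidden_dim:]
    return torch.cat([fwd, bwd], dim=-1)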

  3.2 Common questions

1. Why is the <PAD> token mapped to id 0 rather than <UNK>?

    Because each sample's real length within a batch is later computed by counting non-zero ids, so fixing <PAD> to id 0 makes that calculation trivial.
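For illustration only, this is the length count the answer refers to (the same trick used by collate_fn's intercept), with made-up ids:

import torch

X = torch.tensor([[0, 0, 0, 5, 9, 2],    # 3 real tokens, front-padded with <PAD> = 0
                  [0, 7, 4, 4, 1, 8]])   # 5 real tokens
real_len = (X != 0).sum(dim=1)           # tensor([3, 5]); trivial because <PAD> is id 0
print(real_len)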

2. Why is PAD placed in front of the sequence?

    With PAD in front, the last timestep of the forward pass is real data, so there is no need to index each sample's last real timestep by its true length.

3. The best validation/test acc, macro recall, macro precision and macro F1 plateau at about 92.5%; training further only overfits.

    Cause: the raw data contains some noisy samples. Saving the misclassified test samples shows that:
    (1) some raw labels are simply annotated incorrectly;
    (2) some reviews are emotionally ambiguous.
    Possible fix: run outlier detection on the data before splitting into train/val/test; classic machine-learning options include OneClassSVM and Isolation Forest.

          Reference: Isolation Forest
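As a hedged illustration of that idea (not part of the original pipeline), one could run sklearn's IsolationForest over the averaged pretrained word vectors of each training sentence and drop the most anomalous samples before splitting; the function name filter_outliers, the feature choice and the contamination value are assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

def filter_outliers(sentences, w2v, contamination=0.02):
    """sentences: list of token lists; w2v: gensim KeyedVectors."""
    feats = []
    for toks in sentences:
        vecs = [w2v[t] for t in toks if t in w2v]
        feats.append(np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size))
    feats = np.asarray(feats)
    iso = IsolationForest(contamination=contamination, random_state=0)
    keep = iso.fit_predict(feats) == 1   # 1 = inlier, -1 = outlier
    return [s for s, k in zip(sentences, keep) if k]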
