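BERT comes in two sizes, one with 12 Transformer layers and one with 24. The official release provides a 12-layer Chinese model, so the rest of this post is based on the 12-layer version.
In principle, the output of any Transformer layer can serve as a sentence vector. So which layer should we take? According to hanxiao's experiments, the second-to-last layer gives the best results: the last layer is too close to the pre-training objective, while the earlier layers may not yet have learned enough of the semantics.
Next, let's go through the code in detail.
Let's start with args.py and its key parameters. layer_indexes selects which layer's output is used as the sentence vector; -2 means the second-to-last layer. max_seq_len is the maximum sequence length. Input lengths vary, so a fixed maximum is needed to give every output the same shape: with a maximum of 20, shorter inputs are padded with 0 and longer inputs are truncated to their leading part. This value is usually set to the average sentence length in the corpus plus 2, where the extra 2 accounts for the two special tokens [CLS] (start of sequence) and [SEP] (end of sequence). Here, to make sentences easier to tell apart, the length of the longest sentence in the input list is used as max_seq_len instead.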
- # -*- coding: utf-8 -*-
- # @Time : 2021/1/21 14:55
- # @Author : hcy
- # @File : args.py
-
- # Configuration file
- import os
-
- root_path = os.path.dirname(__file__)
-
- model_dir = os.path.join(root_path, 'model/bert/chinese_L-12_H-768_A-12/')
- bert_config = os.path.join(model_dir, 'bert_config.json')
- bert_ckpt = os.path.join(model_dir, 'bert_model.ckpt')
- bert_vocab_file = os.path.join(model_dir, 'vocab.txt')
-
- output_dir = os.path.join(root_path, 'output/')
- data_dir = os.path.join(root_path, 'data/')
-
- num_train_epochs = 10
- batch_size = 128
- learning_rate = 0.00005
-
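- # Fraction of GPU memory to use
- gpu_memory_fraction = 0.8
-
- # Use the output of the second-to-last layer as the sentence vector by default
- layer_indexes = [-2]
-
- # Maximum sequence length; set at runtime to the length of the longest sentence in the input list
- # max_seq_len = 128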
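To make the padding and truncation rule concrete, here is a minimal sketch in plain Python. The helper name `pad_or_truncate` and the character-level toy tokens are assumptions made for this illustration and are not part of the project code. It places the padding after [SEP], as the official BERT feature-conversion code does.

```python
def pad_or_truncate(tokens, max_seq_len, pad_token="[PAD]"):
    """Return exactly max_seq_len + 2 tokens: [CLS] + text + [SEP] + padding.

    Hypothetical helper for illustration; max_seq_len counts text tokens only,
    so the two special tokens are added on top of it.
    """
    kept = tokens[:max_seq_len]                # truncate: keep the leading part
    padded = ["[CLS]"] + kept + ["[SEP]"]      # add the two special tokens
    padded += [pad_token] * (max_seq_len + 2 - len(padded))  # pad the tail
    return padded


short = pad_or_truncate(list("你好"), max_seq_len=4)
long = pad_or_truncate(list("今天天气不错"), max_seq_len=4)
print(short)  # ['[CLS]', '你', '好', '[SEP]', '[PAD]', '[PAD]']
print(long)   # ['[CLS]', '今', '天', '天', '气', '[SEP]']
```

Both results have the same length (6), which is what lets sentences of different lengths share one tensor shape.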
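Define three placeholders for the text's index, mask, and segment. input_ids holds each token's index in the vocabulary. input_mask marks whether a position holds real content: for example, with a maximum sequence length of 20 and 10 actual characters plus the [CLS] and [SEP] special tokens, 8 positions are empty, so those 8 positions are set to 0 and the rest to 1. segment_ids marks which sentence a token belongs to, 0 for the first sentence and 1 for the second; this project feeds only one sentence at a time, so every value is 0.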
- input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
- input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
- segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')
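As a rough illustration of what is fed into these three placeholders, the sketch below builds the three lists for one short sentence. The vocabulary `toy_vocab` and the helper `build_inputs` are invented for this example; real ids come from vocab.txt. Segment ids are all 0 for a single sentence, matching `word_segment_ids` in the full listing.

```python
# Toy vocabulary, assumed for illustration only; real ids come from vocab.txt.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "你": 3, "好": 4}


def build_inputs(chars, total_len):
    """Build input_ids, input_mask and segment_ids for a single sentence."""
    tokens = ["[CLS]"] + list(chars) + ["[SEP]"]
    n_pad = total_len - len(tokens)
    input_ids = [toy_vocab[t] for t in tokens] + [toy_vocab["[PAD]"]] * n_pad
    input_mask = [1] * len(tokens) + [0] * n_pad  # 1 = real token, 0 = padding
    segment_ids = [0] * total_len                 # single sentence: segment 0
    return input_ids, input_mask, segment_ids


ids, mask, segments = build_inputs("你好", total_len=6)
print(ids)       # [1, 3, 4, 2, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0]
```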
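With the three placeholders above as the input tensors, instantiate a model object, which is the pre-trained BERT model, and then initialize its weights from the checkpoint file.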
- input_tensors = [input_ids, input_mask, segment_ids]
-
- # Build the BERT graph
- model = modeling.BertModel(
-     config=bert_config,
-     is_training=False,
-     input_ids=input_ids,
-     input_mask=input_mask,
-     token_type_ids=segment_ids,
-     use_one_hot_embeddings=False
- )
-
- # Map graph variables to checkpoint variables and initialize them from the checkpoint
- tf_vars = tf.trainable_variables()
- (assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tf_vars, args.bert_ckpt)
- tf.train.init_from_checkpoint(args.bert_ckpt, assignment)
-
- # Outputs of the last layer and of the second-to-last layer
- encoder_last_layer = model.get_sequence_output()
- encoder_last2_layer = model.all_encoder_layers[-2]
-
- # Build the tokenizer from the vocabulary file
- token = tokenization.FullTokenizer(vocab_file=args.bert_vocab_file)
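Python's negative indexing is what makes `-2` mean "second-to-last" here. A small sketch, with a plain list of layer names standing in for `model.all_encoder_layers` (the names are an assumption for illustration, no model is built):

```python
num_layers = 12  # the Chinese BERT model used here has 12 Transformer layers
# Stand-in for model.all_encoder_layers: one entry per Transformer layer.
all_layers = ["layer_%d" % i for i in range(1, num_layers + 1)]

layer_indexes = [-2]
selected = [all_layers[i] for i in layer_indexes]
print(selected)        # ['layer_11']
print(all_layers[-1])  # layer_12, the layer that get_sequence_output() returns
```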
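Next, pick the layer specified by args.layer_indexes. last2[:, 0, :] takes that layer's output at position 0, the [CLS] token, and this is the sentence vector.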
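- # Outputs of the last layer and of the layer selected by args.layer_indexes
- encoder_last_layer = model.get_sequence_output()
- encoder_last2_layer = model.all_encoder_layers[args.layer_indexes[0]]
-
- with tf.Session() as sess:
-     # Running the initializers loads the weights from the checkpoint
-     sess.run(tf.global_variables_initializer())
-     last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
-     # last2 shape: (batch_size, seq_len, 768), here (1, max_seq_len + 2, 768)
-     text_embeddings = last2[:, 0, :]  # [CLS] vector, shape (1, 768)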
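The slicing is easier to follow with concrete shapes. The sketch below uses a random NumPy array as a stand-in for the layer output (an assumption, no BERT model is run), laid out as [batch_size, seq_len, hidden_size] like the encoder outputs of `BertModel`.

```python
import numpy as np

batch_size, seq_len, hidden_size = 1, 6, 768
rng = np.random.default_rng(0)
# Stand-in for sess.run(encoder_last2_layer, ...): [batch, seq_len, hidden]
last2 = rng.normal(size=(batch_size, seq_len, hidden_size))

cls_vectors = last2[:, 0, :]      # vector at position 0 ([CLS]) of every sample
sentence_vector = cls_vectors[0]  # 1-D vector for the single sample in the batch

print(cls_vectors.shape)      # (1, 768)
print(sentence_vector.shape)  # (768,)
```

Note that the slice keeps the batch dimension, so a function that expects a 1-D vector needs `cls_vectors[0]` or a flattened copy.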
extract_features.py
- # -*- coding: utf-8 -*-
- # @Time : 2021/1/21 15:19
- # @Author : hcy
- # @File : sentences_features.py
- import modeling
- import tokenization
- import numpy as np
- from scipy.spatial.distance import cosine
- import tensorflow as tf
- import args
-
-
- bert_config = modeling.BertConfig.from_json_file(args.bert_config)
- # graph
- input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
- input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
- segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')
-
-
- def get_data(word_ids, mask):
-     """Wrap one sample into the batch-of-one inputs the graph expects."""
-     # A single sentence belongs to segment 0 at every position
-     word_segment_ids = [[0] * len(word_ids)]
-     return [word_ids], [mask], word_segment_ids
-
- def read_input(sentences):
-     # sentences is a list of str, one input text per element;
-     # convert each text into a (token ids, mask) pair
-     tokenized = [token.tokenize(sentence) for sentence in sentences]
-     # Use the length of the longest tokenized sentence as max_seq_len
-     args.max_seq_len = max(len(tokens) for tokens in tokenized)
-     samples = []
-     for split_tokens in tokenized:
-         # Keep at most the first max_seq_len tokens
-         split_tokens = split_tokens[:args.max_seq_len]
-         # Add the [CLS] head and the [SEP] tail
-         tokens = ["[CLS]"] + split_tokens + ["[SEP]"]
-         # The mask is 1 for real tokens and 0 for padding
-         mask = [1] * len(tokens)
-         pad_len = args.max_seq_len + 2 - len(tokens)
-         tokens += ["[PAD]"] * pad_len
-         mask += [0] * pad_len
-         word_ids = token.convert_tokens_to_ids(tokens)
-         samples.append((word_ids, mask))
-     return samples
-
- # Build the BERT graph
- model = modeling.BertModel(
-     config=bert_config,
-     is_training=False,
-     input_ids=input_ids,
-     input_mask=input_mask,
-     token_type_ids=segment_ids,
-     use_one_hot_embeddings=False
- )
-
- # Map graph variables to checkpoint variables and initialize them from the checkpoint
- tf_vars = tf.trainable_variables()
- (assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tf_vars, args.bert_ckpt)
- tf.train.init_from_checkpoint(args.bert_ckpt, assignment)
-
- # Outputs of the last layer and of the layer selected by args.layer_indexes
- encoder_last_layer = model.get_sequence_output()
- encoder_last2_layer = model.all_encoder_layers[args.layer_indexes[0]]
-
- # Build the tokenizer from the vocabulary file
- token = tokenization.FullTokenizer(vocab_file=args.bert_vocab_file)
-
-
- def extract_features(sentences):
-     """Generate one sentence vector per input sentence."""
-     embedding_features = []
-     input_data = read_input(sentences)
-     # Create the session once; running the initializers loads the checkpoint weights
-     with tf.Session() as sess:
-         sess.run(tf.global_variables_initializer())
-         for word_ids, mask in input_data:
-             word_id, mask, segment = get_data(word_ids, mask)
-             feed_data = {input_ids: np.asarray(word_id), input_mask: np.asarray(mask), segment_ids: np.asarray(segment)}
-             last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
-             # last2 shape: (1, max_seq_len + 2, 768)
-             text_embeddings = last2[:, 0, :]  # [CLS] vector, shape (1, 768)
-             embedding_features.append(text_embeddings)
-     return embedding_features
-
-
- def similarity(sentences):
-     """Cosine similarity between each pair of consecutive sentences."""
-     similarities = []
-     last_feature = None
-     features = extract_features(sentences)
-     for feature in features:
-         if last_feature is not None:
-             # scipy's cosine() returns a distance and expects 1-D vectors
-             dis = cosine(feature.ravel(), last_feature.ravel())
-             similarities.append(1 - dis)
-         last_feature = feature
-     return np.array(similarities)
-
-
- if __name__ == '__main__':
-     # "The weather is nice today, good for going out." / "It is sunny today, a good day to go out and play."
-     sentences = ["今天天气不错,适合出行。",
-                  "今天是晴天,可以出去玩。"]
-     # sentences = ["打开刘红的个人相情愿","分享名片"]
-     sims = similarity(sentences)
-     print(sims)
-
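`scipy.spatial.distance.cosine` returns the cosine distance, which is why the listing converts it with `1 - dis`. As a sanity check, here is a minimal pure-Python sketch of the similarity itself; `cosine_similarity` is a name chosen for this example and is not part of the project code.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|), i.e. 1 minus the cosine distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(same_direction)
print(orthogonal)
```

Parallel vectors score close to 1 and perpendicular vectors score 0, so a higher value means the two sentence vectors point in more similar directions.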
Reference: 使用BERT生成句向量 (Generating Sentence Vectors with BERT)