We annotate each token of a sentence with an entity tag in the BIO format: B- marks the beginning of an entity, I- marks a token inside an entity, and O marks a token that belongs to no entity. For example:

```
John  lives in New   York
B-PER O     O  B-LOC I-LOC
```
Our data is split into two files, sentences.txt and labels.txt:

```
#sentences.txt
John lives in New York
Where is John ?
```

```
#labels.txt
B-PER O O B-LOC I-LOC
O O B-PER O
```
Suppose we run build_vocab.py to build the vocabulary under /data; it generates two files:
```
#words.txt
John
lives
in
...
```

```
#tags.txt
B-PER
B-LOC
...
```
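The script itself is not listed here; a minimal sketch of what a build_vocab.py-style script could do (the Counter-based approach and the exact paths are assumptions, not taken from the original code) looks like this:

```python
from collections import Counter

# collect every token that appears in the training sentences
counter = Counter()
with open('data/sentences.txt') as f:
    for line in f:
        counter.update(line.rstrip('\n').split(' '))

# write one word per line, adding the special PAD and UNK tokens
with open('data/words.txt', 'w') as f:
    for word in ['PAD', 'UNK'] + sorted(counter):
        f.write(word + '\n')
```

The same procedure, applied to labels.txt, produces tags.txt.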
In NLP applications, words are replaced by numbers (their indices). Suppose our vocabulary is {'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5}; then "Where is John ?" is represented as [3, 1, 2, 5].
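As a quick illustration with this toy vocabulary (a minimal sketch; the real vocabulary is read from words.txt as shown below):

```python
vocab = {'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5}

sentence = "Where is John ?"
indices = [vocab[token] for token in sentence.split(' ')]
print(indices)  # [3, 1, 2, 5]
```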
We read the words.txt vocabulary and assign an index to every word. words.txt contains two special tokens: UNK, which stands in for words that are not in the vocabulary, and PAD, which is used for padding sentences.
```python
# map each word in words.txt to an index
vocab = {}
with open(words_path) as f:
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i
```
We handle tags.txt in exactly the same way to build tag_map; see the sketch below.
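A minimal sketch, assuming the tag file path is stored in a hypothetical tags_path variable:

```python
# map each tag in tags.txt to an index
tag_map = {}
with open(tags_path) as f:
    for i, t in enumerate(f.read().splitlines()):
        tag_map[t] = i
```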
Next we read the text files and convert every sentence and label sequence to indices:
```python
train_sentences = []
train_labels = []

with open(train_sentences_file) as f:
    for sentence in f.read().splitlines():
        # replace each token by its index if it is in vocab,
        # else use the index of UNK
        s = [vocab[token] if token in vocab else vocab['UNK']
             for token in sentence.split(' ')]
        train_sentences.append(s)

with open(train_labels_file) as f:
    for sentence in f.read().splitlines():
        # replace each label by its index
        l = [tag_map[label] for label in sentence.split(' ')]
        train_labels.append(l)
```
Sentences in a batch usually have different lengths, so the shorter ones need to be padded with PAD.

Say we have a batch of sentences, batch_sentences, which is a Python list of lists, together with the corresponding batch_tags, which has one tag for each token in batch_sentences. We proceed in two steps:
1. First compute the length of the longest sentence in the batch and pad the shorter sentences with PAD.
2. Then initialize the batch with shape (num_sentences, batch_max_len); since the Embedding layer expects inputs of type long, convert the result to a LongTensor.
```python
import numpy as np
import torch
from torch.autograd import Variable

# compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_sentences])

# prepare a numpy array with the data, initializing the data with 'PAD'
# and all labels with -1; initializing labels to -1 differentiates tokens
# with tags from 'PAD' tokens
batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len))
batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))

# copy the data to the numpy array
for j in range(len(batch_sentences)):
    cur_len = len(batch_sentences[j])
    batch_data[j][:cur_len] = batch_sentences[j]
    batch_labels[j][:cur_len] = batch_tags[j]

# since all data are indices, we convert them to torch LongTensors
batch_data, batch_labels = torch.LongTensor(batch_data), torch.LongTensor(batch_labels)

# convert Tensors to Variables
batch_data, batch_labels = Variable(batch_data), Variable(batch_labels)
```
```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, params):
        super(Net, self).__init__()

        # maps each token to an embedding_dim vector
        self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim)

        # the LSTM takes the embedded sentence
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)

        # fc layer transforms the output to give the final output layer
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

    def forward(self, s):
        # apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)            # dim: batch_size x batch_max_len x embedding_dim

        # run the LSTM along the sentences of length batch_max_len
        s, _ = self.lstm(s)              # dim: batch_size x batch_max_len x lstm_hidden_dim

        # reshape the Variable so that each row contains one token
        s = s.view(-1, s.shape[2])       # dim: batch_size*batch_max_len x lstm_hidden_dim

        # apply the fully connected layer and obtain the output for each token
        s = self.fc(s)                   # dim: batch_size*batch_max_len x num_tags

        return F.log_softmax(s, dim=1)   # dim: batch_size*batch_max_len x num_tags
```
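As a quick sanity check, the model can be instantiated with a small hypothetical params object; the field names match those used by Net above, while the concrete values and the fake batch are made up for illustration:

```python
from types import SimpleNamespace

import torch
from torch.autograd import Variable

# hypothetical hyperparameters, only for illustration
params = SimpleNamespace(vocab_size=10000, embedding_dim=50,
                         lstm_hidden_dim=50, number_of_tags=9)
model = Net(params)

# a fake batch of 2 padded sentences, each 5 tokens long
fake_batch = Variable(torch.LongTensor([[2, 7, 4, 11, 0],
                                        [3, 1, 2, 5, 0]]))
outputs = model(fake_batch)
print(outputs.shape)  # (batch_size*batch_max_len) x num_tags = 10 x 9
```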
The main job of the loss function is to remove the influence of the PAD tokens:
```python
def loss_fn(outputs, labels):
    # reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)

    # mask out 'PAD' tokens
    mask = (labels >= 0).float()

    # the number of tokens is the sum of elements in mask
    num_tokens = int(torch.sum(mask).data[0])

    # pick the values corresponding to labels and multiply by mask
    outputs = outputs[range(outputs.shape[0]), labels]*mask

    # cross entropy loss for all non 'PAD' tokens
    return -torch.sum(outputs)/num_tokens
```
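Putting the pieces together, a single training step could look like the following sketch; the optimizer choice and learning rate are assumptions, not part of the original tutorial:

```python
import torch.optim as optim

# hypothetical optimizer and learning rate, only for illustration
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# one training step on a padded batch built as above
outputs = model(batch_data)            # (batch_size*batch_max_len) x num_tags
loss = loss_fn(outputs, batch_labels)  # masked negative log-likelihood

optimizer.zero_grad()
loss.backward()
optimizer.step()
```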
References:
https://cs230.stanford.edu/blog/namedentity/#goals-of-this-tutorial
https://github.com/cs230-stanford/cs230-code-examples