This article is a translation of the tutorials from the high-star GitHub project Flair:
https://github.com/flairNLP/flair
There are two types of objects: Sentence and Token. A Sentence holds a textual sentence and is essentially a list of Token objects.
from flair.data import Sentence
sentence = Sentence("The grass is green .", use_tokenizer=True)
print(sentence)
# Sentence: "The grass is green ." - 5 Tokens
print(sentence[3]) # Token: 4 green
for token in sentence:
    print(token)
Add a tag to Token
token = sentence[3]
token.add_tag("ner", "color")
tag = token.get_tag("ner")
print(tag.value)
print(tag.score)
Our color tag has a score of 1.0 since we manually added it. If a tag is predicted by our sequence labeler, the score value will indicate classifier confidence.
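For example, after running a pretrained tagger (a minimal sketch; it assumes the 'ner' model can be downloaded from the Flair model zoo), the predicted tags carry confidence scores below 1.0:

from flair.data import Sentence
from flair.models import SequenceTagger

# load a pretrained NER tagger and predict tags for a sentence
tagger = SequenceTagger.load('ner')
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)

# predicted tags carry the classifier confidence as their score
for token in sentence:
    tag = token.get_tag('ner')
    print(token.text, tag.value, tag.score)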
Add a label to Sentence
sentence = Sentence('France is the current world cup winner.')
sentence.add_labels(['sports', 'world cup'])
for label in sentence.labels:
    print(label)
from flair.models import SequenceTagger

# the model file is downloaded (or moved) to the cache dir (~/.flair)
tagger = SequenceTagger.load('ner')

# predict a tag for every token in the sentence
sentence = Sentence('George Washington went to Washington .')
tagger.predict(sentence)
# print(sentence.to_tagged_string())
for token in sentence:
    tag = token.get_tag("ner")
    print(tag)

# get span-level annotations for the sentence
for entity in sentence.get_spans('ner'):
    print(entity)

# print detailed information
print(sentence.to_dict(tag_type='ner'))

# for a longer text, first split it into sentences, then run NER on each sentence
text = "This is a sentence. This is another sentence. I love Berlin."

# use a library to split into sentences
from segtok.segmenter import split_single
sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

# predict tags for the list of sentences
tagger = SequenceTagger.load('ner')
tagger.predict(sentences)
# Using the mini_batch_size parameter of the .predict() method, you can set the size of mini batches passed to the tagger.
for token in sentences[2]:
    tag = token.get_tag("ner")
    print(tag)
# Classic Word Embeddings
from flair.embeddings import WordEmbeddings

# init embedding
# two files need to be downloaded, see https://github.com/zalandoresearch/flair/issues/651
glove_embedding = WordEmbeddings('glove')

# Flair Embeddings
# Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic
# information that goes beyond standard word embeddings. Key differences are: (1) they are trained
# without any explicit notion of words and thus fundamentally model words as sequences of characters,
# and (2) they are contextualized by their surrounding text, meaning that the same word will have
# different embeddings depending on its contextual use.
from flair.embeddings import FlairEmbeddings
flair_embedding_forward = FlairEmbeddings('news-forward')

# Stacked Embeddings
# Stacked embeddings are one of the most important concepts of this library. You can use them to
# combine different embeddings, for instance traditional word embeddings together with contextual
# string embeddings. Stacked embeddings let you mix and match; a combination of embeddings often
# gives the best results.
# For instance, let's combine classic GloVe embeddings with forward and backward Flair embeddings.
# This is a combination generally recommended to most users, especially for sequence labeling.
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# init standard GloVe embedding
glove_embedding = WordEmbeddings('glove')

# init Flair forward and backward embeddings
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# create a StackedEmbedding object that combines GloVe and forward/backward Flair embeddings
stacked_embeddings = StackedEmbeddings([
    glove_embedding,
    flair_embedding_forward,
    flair_embedding_backward,
])

# combine BERT and Flair embeddings
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('news-forward')
flair_backward_embedding = FlairEmbeddings('news-backward')

# init BERT embeddings (from the pytorch_transformers package)
bert_embedding = BertEmbeddings('bert-base-uncased')

from flair.embeddings import StackedEmbeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

# Document Embeddings
# Document embeddings are created from the embeddings of all words in the document. There are two methods:
# 1 Pooling
# The first method calculates a pooling operation over all word embeddings in a document. The default
# operation is 'mean', which gives us the mean of all word embeddings in the sentence. The resulting
# embedding is taken as the document embedding.
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings, Sentence

# initialize the word embeddings
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# initialize the document embeddings, mode = mean
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward],
                                             pooling='mean')

# if you only use simple word embeddings that are not task-trained, you should probably use a
# 'nonlinear' transformation instead:
# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')
# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')

# if on the other hand you use word embeddings that are task-trained (such as simple one-hot
# encoded embeddings), you are often better off doing no transformation at all. Do this by passing 'none':
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')

# create an example sentence
sentence = Sentence('The grass is green . And the sky is blue .')

# embed the sentence with our document embedding
document_embeddings.embed(sentence)

# now check out the embedded sentence
print(sentence.get_embedding())

# 2 RNN
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

glove_embedding = WordEmbeddings('glove')
document_embeddings = DocumentRNNEmbeddings([glove_embedding])

# USE WORD EMBEDDINGS
# each word is embedded with a concatenation of the three embeddings combined in stacked_embeddings above
sentence = Sentence('The grass is green .')
stacked_embeddings.embed(sentence)
for token in sentence:
    print(token)
    print(token.embedding)
The Corpus represents a dataset that you use to train a model. It consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation and testing split during model training.
# load an existing corpus
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))

# print the first Sentence in the training split
print(corpus.train[0])

# downsample the corpus
import flair.datasets
downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)
print("--- 1 Original ---")
print(corpus)
print("--- 2 Downsampled ---")
print(downsampled_corpus)
# --- 1 Original ---
# Corpus: 12543 train + 2002 dev + 2077 test sentences
# --- 2 Downsampled ---
# Corpus: 1255 train + 201 dev + 208 test sentences

# For many learning tasks you need to create a target dictionary. The Corpus lets you create your
# tag or label dictionary, depending on the task you want to learn.
# create tag dictionary for an NER task
corpus = flair.datasets.CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))

# print corpus statistics
stats = corpus.obtain_statistics()
print(stats)
In case you want to train on a sequence labeling dataset that is not in the above list, you can load it with the ColumnCorpus object. Most sequence labeling datasets in NLP use some sort of column format in which each line is a word and each column is one level of linguistic annotation. See for instance this sentence:
# 1 The first column is the word itself, the second a coarse PoS tag, and the third a BIO-annotated NER tag.
# 2 An empty line separates sentences.
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Sam N B-PER
Houston N I-PER
stayed V O
home N O
To read such a dataset, define the column structure as a dictionary and instantiate a ColumnCorpus.
Note: POS tags are not needed and in fact will be ignored by Flair if you provide them. The library goes directly from text to the tags you wish to predict and requires no extra features. So if you don't have POS tags, you only need to change the column_format to reflect this in the ColumnCorpus, and everything should be good to go!
from flair.data import Corpus
from flair.datasets import ColumnCorpus
# define columns
columns = {0: 'text', 1: 'pos', 2: 'ner'}
# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'
# init a corpus using column format, data folder and the names of the train, dev and test files
corpus = ColumnCorpus(data_folder, columns,
                      train_file='train.txt',
                      test_file='test.txt',
                      dev_file='dev.txt')
This gives you a Corpus object that contains the train, dev and test splits, each of which is a list of Sentence objects.
# access a sentence and check out annotations
print(corpus.train[0].to_tagged_string('ner'))
print(corpus.train[1].to_tagged_string('pos'))
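As the note above says, if your data has no POS column, only the column map needs to change (a minimal sketch; the file names are placeholders and the data folder is reused from above):

# hypothetical two-column data: word + NER tag only
columns = {0: 'text', 1: 'ner'}
corpus = ColumnCorpus(data_folder, columns,
                      train_file='train.txt',
                      test_file='test.txt',
                      dev_file='dev.txt')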
Reading a Text Classification Dataset
There are two ways to load your own text classification dataset:
----------------------------------------------------------------------------------
You can load a CSV format classification dataset using CSVClassificationCorpus
by passing in a column format (like in ColumnCorpus
above).
Note: You will need to save your split CSV data files in the data_folder path, with each file titled appropriately, i.e. train.csv, test.csv, dev.csv. This is because the corpus initializers automatically search for the train, dev and test splits in a folder.
from flair.data import Corpus
from flair.datasets import CSVClassificationCorpus
# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data'
# column format indicating which columns hold the text and label(s)
column_name_map = {4: "text", 1: "label_topic", 2: "label_subtopic"}
# load corpus containing training, test and dev data; if the CSV has a header, you can skip it
corpus: Corpus = CSVClassificationCorpus(data_folder,
                                         column_name_map,
                                         skip_header=True,
                                         delimiter='\t',    # tab-separated files
)
----------------------------------------------------------------------------------
You may format your data to the FastText format, in which each line in the file represents a text document.
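A minimal sketch of that format and of loading it (the labels, texts and file names below are made-up examples; each line starts with one or more __label__<name> prefixes followed by the document text, and the loader shown is assumed to read this format from train.txt, dev.txt and test.txt in the data folder):

__label__sports France won the world cup final .
__label__politics __label__europe The parliament passed the new budget .

from flair.data import Corpus
from flair.datasets import ClassificationCorpus

# assumes FastText-formatted train.txt, dev.txt and test.txt inside data_folder
data_folder = '/path/to/data'
corpus: Corpus = ClassificationCorpus(data_folder,
                                      train_file='train.txt',
                                      dev_file='dev.txt',
                                      test_file='test.txt')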
Training a Sequence Labeling Model
Here is example code for a small NER model trained on NCBI data, using stacked GloVe and Flair embeddings. We downsample the corpus to 10% for a quick test.
from flair.data import Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

columns = {0: 'text', 1: 'pos', 2: 'ner'}
data_folder = '/home/fyh/.flair/datasets/ncbi'
corpus = ColumnCorpus(data_folder, columns,
                      train_file='train.txt',
                      test_file='test.txt',
                      dev_file='dev.txt').downsample(0.1)

stats = corpus.obtain_statistics()
print(stats)

tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')
stacked_embeddings = StackedEmbeddings([
    glove_embedding,
    flair_embedding_forward,
    flair_embedding_backward,
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=stacked_embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_crf=True)

trainer = ModelTrainer(tagger, corpus)

# start training
trainer.train('resources/taggers/example_ncbi-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)
# the model files are written to the path above (relative to the working directory); use Ctrl+C to stop early
# Alternatively, train over the full data (no downsampling) for 150 epochs with the stacked
# FlairEmbeddings + GloVe setup; this gives the state-of-the-art accuracy reported in the paper.

# load the model you trained
model = SequenceTagger.load('resources/taggers/example_ncbi-ner/final-model.pt')

# create example sentence
sentence = Sentence('A common human skin tumour is caused by activating mutations in beta-catenin.')

# predict tags and print
model.predict(sentence)
print(sentence.to_tagged_string())

# Plotting Training Curves and Weights
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('loss.tsv')
plotter.plot_weights('weights.txt')
Training a Text Classification Model / Multi-Dataset Training (Link)
If you want to stop the training at some point and resume it later, you should train with the parameter checkpoint set to True. This will save the model plus training parameters after every epoch. Thus, you can load the model plus trainer at any later point and continue training exactly where you left off.
The example code below shows how to train, stop, and continue training of a SequenceTagger.
# 7. start training
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)

# 8. stop training at any point

# 9. continue training at a later point
from pathlib import Path

checkpoint = tagger.load_checkpoint(Path('resources/taggers/example-ner/checkpoint.pt'))
trainer = ModelTrainer.load_from_checkpoint(checkpoint, corpus)
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              checkpoint=True)
The main parameter you need to set is embeddings_storage_mode in the train() method of the ModelTrainer. It can have one of three values:
'none': If you set embeddings_storage_mode='none', embeddings do not get stored in memory. Instead they are generated on-the-fly in each training mini-batch. The main advantage is that this keeps your memory requirements low.
'cpu': If you set embeddings_storage_mode='cpu', embeddings will get stored in regular memory.

During inference: this slows down inference when used with a GPU, as embeddings need to be moved from GPU memory to regular memory. The only reason to use this option during inference is if you want to use not only the predictions but also the embeddings after prediction.
'gpu': If you set embeddings_storage_mode='gpu', embeddings will get stored in CUDA memory. This is often the fastest option, since it eliminates the need to shuffle tensors from CPU to CUDA memory over and over again. Of course, CUDA memory is often limited, so large datasets will not fit into it. However, if the dataset fits into CUDA memory, this option is the fastest one.
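For example (a minimal sketch reusing the tagger, corpus and path from the training example above):

from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)

# keep memory usage low by recomputing embeddings in every mini-batch
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              embeddings_storage_mode='none')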
Hyperparameter Tuning
# first, you need to define the search space of parameters
from hyperopt import hp
from flair.hyperparameter.param_selection import SearchSpace, Parameter
# define your search space
search_space = SearchSpace()
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=[
    [WordEmbeddings('en')],
    [FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')]
])
search_space.add(Parameter.HIDDEN_SIZE, hp.choice, options=[32, 64, 128])
search_space.add(Parameter.RNN_LAYERS, hp.choice, options=[1, 2])
search_space.add(Parameter.DROPOUT, hp.uniform, low=0.0, high=0.5)
search_space.add(Parameter.LEARNING_RATE, hp.choice, options=[0.05, 0.1, 0.15, 0.2])
search_space.add(Parameter.MINI_BATCH_SIZE, hp.choice, options=[8, 16, 32])
Attention: You should always add your embeddings to the search space (as shown above). If you don't want to test different kinds of embeddings, simply pass just one embedding option to the search space; it will then be used in every test run. Here is an example:
search_space.add(Parameter.EMBEDDINGS, hp.choice, options=[
    [FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')]
])
In the last step you have to create the actual parameter selector. Depending on the task, you need to define either a TextClassifierParamSelector or a SequenceTaggerParamSelector and start the optimization.
When starting the optimization you can:
define the maximum number of evaluation runs hyperopt should perform (max_evals)
set the number of epochs each evaluation run performs (max_epochs)
specify the number of training runs per evaluation run (training_runs)
If you specify more than one training run, one evaluation run will be executed the specified number of times. The final evaluation score will be the average over all those runs.
from flair.hyperparameter.param_selection import TextClassifierParamSelector, OptimizationValue
# create the parameter selector
param_selector = TextClassifierParamSelector(
    corpus,
    False,
    'resources/results',
    'lstm',
    max_epochs=50,
    training_runs=3,
    optimization_value=OptimizationValue.DEV_SCORE
)
# start the optimization
param_selector.optimize(search_space, max_evals=100)
The parameter settings and the evaluation scores will be written to param_selection.txt in the result directory. While selecting the best parameter combination, no model is stored to disk.
Finding the Best Learning Rate (Link)
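A minimal sketch of how this could look (it assumes the find_learning_rate() method of ModelTrainer and the plot_learning_rate() method of Plotter available in this Flair version; the paths are placeholders):

from flair.trainers import ModelTrainer
from flair.visual.training_curves import Plotter

trainer = ModelTrainer(tagger, corpus)

# sweep the learning rate over a range of values and record the loss for each
learning_rate_tsv = trainer.find_learning_rate('resources/taggers/example-ner',
                                               'learning_rate.tsv')

# plot loss vs. learning rate; pick a rate just before the loss starts to explode
plotter = Plotter()
plotter.plot_learning_rate(learning_rate_tsv)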
Custom Optimizers
You can now use any of PyTorch's optimizers for training when initializing a ModelTrainer. To give the optimizer any extra options, just specify them as shown with the weight_decay example:
from torch.optim.adam import Adam
trainer = ModelTrainer(tagger, corpus,
                       optimizer=Adam)
trainer.train(
    "resources/taggers/example",
    weight_decay=1e-4
)
Flair Embeddings are the secret sauce in Flair, allowing us to achieve state-of-the-art accuracies across a range of NLP tasks. This tutorial shows you how to train your own Flair embeddings, which may come in handy if you want to apply Flair to new languages or domains.
Preparing a Text Corpus
To train your own model, you first need to identify a suitably large corpus. In our experiments, we used corpora that have about 1 billion words.
You need to split your corpus into train, validation and test portions. Our trainer class assumes that there is a folder for the corpus in which there is a test.txt and a valid.txt with test and validation data. Importantly, there is also a folder called train that contains the training data in splits. For instance, the billion word corpus is split into 100 parts. The splits are necessary if all the data does not fit into memory, in which case the trainer randomly iterates through all splits.
corpus/
corpus/train/
corpus/train/train_split_1
corpus/train/train_split_2
corpus/train/...
corpus/train/train_split_X
corpus/test.txt
corpus/valid.txt
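A minimal sketch of how such a structure could be prepared from one large raw text file (pure Python; the file name, the split sizes and the number of splits are arbitrary choices, and for a truly huge corpus you would stream the file instead of reading it all at once):

import os

raw_file = 'raw_corpus.txt'      # hypothetical large text file, one sentence per line
out_dir = 'corpus'
n_splits = 100

os.makedirs(os.path.join(out_dir, 'train'), exist_ok=True)

with open(raw_file, encoding='utf-8') as f:
    lines = f.readlines()

# hold out small validation and test portions, use the rest for training
test, valid, train = lines[:10000], lines[10000:20000], lines[20000:]

with open(os.path.join(out_dir, 'test.txt'), 'w', encoding='utf-8') as f:
    f.writelines(test)
with open(os.path.join(out_dir, 'valid.txt'), 'w', encoding='utf-8') as f:
    f.writelines(valid)

# write the training data as train_split_1 ... train_split_N
chunk = len(train) // n_splits + 1
for i in range(n_splits):
    part = train[i * chunk:(i + 1) * chunk]
    with open(os.path.join(out_dir, 'train', f'train_split_{i + 1}'), 'w', encoding='utf-8') as f:
        f.writelines(part)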
Training the Language Model
Once you have this folder structure, simply point the LanguageModelTrainer class to it to start training a model:
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# are you training a forward or backward LM?
is_forward_lm = True

# load the default character dictionary
dictionary: Dictionary = Dictionary.load('chars')

# get your corpus, process forward and at the character level
corpus = TextCorpus('/path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=128,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10)
The parameters in this script are very small. We got good results with a hidden size of 1024 or 2048, a sequence length of 250, and a mini-batch size of 100.
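A sketch of what those larger settings would look like with the same classes (assuming the dictionary and corpus from above and enough GPU memory):

# larger model and training settings, as suggested above
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=1024,   # or 2048
                               nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=10)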
Using the LM as Embeddings
Just load the model into the FlairEmbeddings class and use it as you would any other embedding in Flair:
sentence = Sentence('I love Berlin')
# init embeddings from your trained LM
char_lm_embeddings = FlairEmbeddings('resources/taggers/language_model/best-lm.pt')
# embed sentence
char_lm_embeddings.embed(sentence)
Fine-Tuning an Existing LM
Sometimes it makes sense to fine-tune an existing language model instead of training from scratch, for instance if you have a general LM for English and would like to fine-tune it for a specific domain. To fine-tune a LanguageModel, you only need to load an existing LanguageModel instead of instantiating a new one.
from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# instantiate an existing LM, such as one from the FlairEmbeddings
language_model = FlairEmbeddings('news-forward').lm

# are you fine-tuning a forward or backward LM?
is_forward_lm = language_model.is_forward_lm

# get the dictionary from the existing language model
dictionary: Dictionary = language_model.dictionary

# get your corpus, process forward and at the character level
corpus = TextCorpus('path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# use the model trainer to fine-tune this model on your corpus
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=20,
              patience=10,
              checkpoint=True)
Note that when you fine-tune, you must use the same character dictionary as before and keep the same direction (forward/backward).
Fine-Tuning the language model on a specific domain
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load a previously saved language model
model = LanguageModel.load_language_model('your/saved/model.pt')

# make sure to use the same dictionary from the saved model
dictionary = model.dictionary

# load your new domain corpus, keeping the direction of the saved model
corpus = TextCorpus('path/to/your/corpus', dictionary, model.is_forward_lm, character_level=True)

# pass corpus and pre-trained language model to the trainer
trainer = LanguageModelTrainer(model, corpus)

# train with your favorite parameters
trainer.train('resources/taggers/language_model', learning_rate=5)