import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased')

# encode the sentence and add a batch dimension (shape [1, sequence_length])
input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cuting",
                     add_special_tokens=True)).unsqueeze(0)
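For context (not in the original post): the unsqueeze(0) call only adds the batch dimension the model expects. A minimal sketch of the forward pass, assuming a transformers 2.x-era version where the model returns a tuple (note that bert-base-uncased has no trained token-classification head, so the scores here are from a randomly initialized head and only the shape is meaningful):

with torch.no_grad():
    outputs = model(input_ids)   # input_ids has shape [1, sequence_length]
scores = outputs[0]              # per-token classification scores
print(scores.shape)              # [1, sequence_length, num_labels]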
We will explore and walk through these options one by one below.
sentence = "Hello, my son is cuting."
input_ids_method1 = torch.tensor(
tokenizer.encode(sentence, add_special_tokens=True)) # Batch size 1
# tensor([ 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102])
input_token2 = tokenizer.tokenize(sentence)
# ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
input_ids_method2 = tokenizer.convert_tokens_to_ids(input_token2)
# tensor([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012])
# 并没有开头和结尾的标记:[cls]、[sep]
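If you want method 2 to produce the same result as method 1, the tokenizer's own helper can wrap the ids in the special tokens; a small sketch (build_inputs_with_special_tokens is part of the transformers tokenizer API, and the ids below assume bert-base-uncased, where [CLS] is 101 and [SEP] is 102):

input_ids_method2_full = tokenizer.build_inputs_with_special_tokens(input_ids_method2)
# [101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102]  -- same ids as method 1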
def encode(
    self,
    text: str,                      # the sentence to convert
    text_pair: Optional[str] = None,
    add_special_tokens: bool = True,
    max_length: Optional[int] = None,
    stride: int = 0,
    truncation_strategy: str = "longest_first",
    pad_to_max_length: bool = False,
    return_tensors: Optional[str] = None,
    **kwargs
):
    """
    Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.

    Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.

    Args:
        text (:obj:`str` or :obj:`List[str]`):
            The first sequence to be encoded. This can be a string, a list of strings
            (tokenized string using the `tokenize` method) or a list of integers
            (tokenized string ids using the `convert_tokens_to_ids` method)
        text_pair (:obj:`str` or :obj:`List[str]`, `optional`, defaults to :obj:`None`):
            Optional second sequence to be encoded. This can be a string, a list of strings
            (tokenized string using the `tokenize` method) or a list of integers
            (tokenized string ids using the `convert_tokens_to_ids` method)
        add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
            If set to ``True``, the sequences will be encoded with the special tokens
            relative to their model.
        max_length (:obj:`int`, `optional`, defaults to :obj:`None`):
            If set to a number, will limit the total sequence returned so that it has a
            maximum length. If there are overflowing tokens, those will be added to the
            returned dictionary
        stride (:obj:`int`, `optional`, defaults to ``0``):
            If set to a number along with max_length, the overflowing tokens returned will
            contain some tokens from the main sequence returned. The value of this argument
            defines the number of additional tokens.
        truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):
            String selected in the following options:

            - 'longest_first' (default): Iteratively reduce the inputs sequence until the
              input is under max_length, starting from the longest one at each token
              (when there is a pair of input sequences)
            - 'only_first': Only truncate the first sequence
            - 'only_second': Only truncate the second sequence
            - 'do_not_truncate': Does not truncate (raise an error if the input sequence
              is longer than max_length)
        pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If set to True, the returned sequences will be padded according to the model's
            padding side and padding index, up to their max length. If no max length is
            specified, the padding is done up to the model's max length. The tokenizer
            padding sides are handled by the class attribute `padding_side` which can be
            set to the following strings:

            - 'left': pads on the left of the sequences
            - 'right': pads on the right of the sequences

            Defaults to False: no padding.
        return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):
            Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`
            or PyTorch :obj:`torch.Tensor` instead of a list of python integers.
        **kwargs: passed to the `self.tokenize()` method
    """
    encoded_inputs = self.encode_plus(
        text,
        text_pair=text_pair,
        max_length=max_length,
        add_special_tokens=add_special_tokens,
        stride=stride,
        truncation_strategy=truncation_strategy,
        pad_to_max_length=pad_to_max_length,
        return_tensors=return_tensors,
        **kwargs,
    )

    return encoded_inputs["input_ids"]
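Since encode() simply returns encoded_inputs["input_ids"], calling encode_plus directly also exposes the attention mask and segment ids; a quick sketch under the same (older) transformers API as the docstring above:

enc = tokenizer.encode_plus(sentence, add_special_tokens=True)
print(enc["input_ids"])       # same ids as tokenizer.encode(sentence)
print(enc["token_type_ids"])  # segment ids, all 0 for a single sentence
print(enc["attention_mask"])  # 1 for every real (non-padding) token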
add_special_tokens: bool = True
Wraps the sentence in the special tokens the model expects ([CLS] and [SEP] for BERT); enabled by default.
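A quick comparison (the ids are the ones shown in the example above):

tokenizer.encode(sentence, add_special_tokens=True)
# [101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102]
tokenizer.encode(sentence, add_special_tokens=False)
# [7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012]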
max_length
Sets the maximum length. If you don't set it, the model's own maximum length applies (512 for BERT), and a sentence longer than that triggers the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (5904 > 512). Running this sequence through the model will result in indexing errors
In that case you either have to cut the sentence yourself, or set this parameter to the maximum length you want; the function will then keep only max_length - 2 content tokens (leaving room for [CLS] and [SEP]) and convert them to ids.
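For example, a sketch using the parameter names from the docstring above (newer transformers versions use truncation=True instead of the implicit truncation_strategy):

short_ids = tokenizer.encode(sentence, add_special_tokens=True, max_length=6)
# roughly [101, 7592, 1010, 2026, 2365, 102]:
# [CLS] + the first max_length - 2 = 4 content tokens + [SEP]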
pad_to_max_length: bool = False
Whether to pad up to the maximum length; disabled by default. You can set tokenizer.padding_side = 'left' to insert the padding on the left instead of the default right side.
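A sketch of left padding, again with the older pad_to_max_length argument from the docstring above:

tokenizer.padding_side = 'left'   # default is 'right'
padded_ids = tokenizer.encode(sentence,
                              add_special_tokens=True,
                              max_length=12,
                              pad_to_max_length=True)
# roughly [0, 0, 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012, 102]
# 0 is the id of [PAD] for bert-base-uncased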
truncation_strategy: str = "longest_first"
The truncation strategy; there are four options (see the sketch after this list):
- 'longest_first' (default): iteratively removes tokens from the longer of the two sequences until the input fits max_length
- 'only_first': only truncate the first sequence
- 'only_second': only truncate the second sequence
- 'do_not_truncate': do not truncate (raise an error if the input is longer than max_length)
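A sketch of how the strategy matters for sentence pairs (same older API as above; the sentences are made up for illustration):

pair_ids = tokenizer.encode("the first sentence which is fairly long",
                            text_pair="the second sentence",
                            add_special_tokens=True,
                            max_length=12,
                            truncation_strategy="only_second")
# only the second sentence loses tokens here; with 'only_first' the first one would,
# and with 'do_not_truncate' an over-long pair raises an error instead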
return_tensors: Optional[str] = None
The type of the returned data. Defaults to None (a plain Python list of ids); set it to 'tf' for a TensorFlow tensor or 'pt' for a PyTorch tensor.
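For example, returning a PyTorch tensor directly:

pt_ids = tokenizer.encode(sentence, add_special_tokens=True, return_tensors='pt')
# torch.Tensor of shape [1, sequence_length]; the batch dimension is already there,
# so the manual torch.tensor(...).unsqueeze(0) from the first example is not needed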