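DataCollator: If you do not specify one, a default DataCollator is used, and its job is to convert the inputs into tensors. The most common reason to specify one manually is that the data has not been padded yet and needs dynamic padding. In other words, if padding was already done in data_process and there are no special processing requirements, you probably do not need to specify a collator.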
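To make "dynamic padding" concrete, here is a minimal pure-Python sketch of what a padding collator does with one batch. The function name `pad_batch` and the default `pad_token_id=0` are assumptions for illustration only; this is not the library's implementation, which also converts the result to tensors.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        # Real tokens get attention 1, padding positions get 0.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Two examples of different lengths, as they might look after tokenization
# without padding (the token ids are made up).
batch = pad_batch([
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
])
```

Because the target length is computed per batch, a batch of short sentences stays short, whereas padding everything to a fixed maximum length in data_process pads every example to the same length regardless of its batch.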
class DataCollatorForSeq2Seq: Data collator that will dynamically pad the inputs received, as well as the labels. (It treats the input and the output separately.)
class DataCollatorWithPadding: Data collator that will dynamically pad the inputs received.
class DataCollatorForTokenClassification: Data collator that will dynamically pad the inputs received, as well as the labels.
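The collators that pad "the labels as well" can be pictured with the following sketch. It assumes labels are padded with -100 so that the loss skips those positions; the helper name `pad_with_labels` and the constant `IGNORE_INDEX` are hypothetical, and the real classes have more options than shown here.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumed label padding value, ignored by the loss


def pad_with_labels(features: List[Dict[str, List[int]]],
                    pad_token_id: int = 0,
                    label_pad_id: int = IGNORE_INDEX) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels independently, each to its own batch maximum."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "labels": []}
    for f in features:
        ids, labels = f["input_ids"], f["labels"]
        batch["input_ids"].append(ids + [pad_token_id] * (max_in - len(ids)))
        # Labels use a different filler than the inputs.
        batch["labels"].append(labels + [label_pad_id] * (max_lab - len(labels)))
    return batch


# Seq2seq-style features: source and target lengths differ (ids are made up).
seq2seq_batch = pad_with_labels([
    {"input_ids": [11, 12, 13, 14], "labels": [21, 22]},
    {"input_ids": [11, 12], "labels": [21, 22, 23]},
])
```

For token classification the labels have the same length as the inputs, so both end up padded to the same length; for seq2seq the two lengths are independent, which is why the input and the output are handled separately.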
class DataCollatorForLanguageModeling: Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.
Args:
① mlm (`bool`, *optional*, defaults to `True`): Whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
② mlm_probability (`float`, *optional*, defaults to 0.15): The probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.
Tip: For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
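The two labeling modes described above can be sketched in plain Python. Both function names are hypothetical, and the masking is deliberately simplified: every selected token is replaced by the mask token, which may differ from the replacement strategy the library actually applies. The `special_tokens_mask` argument plays the role mentioned in the tip: positions marked 1 are never selected for masking.

```python
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # label value the loss is expected to skip


def causal_lm_labels(input_ids: List[int], pad_token_id: int) -> List[int]:
    """mlm=False case: labels copy the inputs, padding becomes -100."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]


def mask_tokens(input_ids: List[int],
                special_tokens_mask: List[int],
                mask_token_id: int,
                mlm_probability: float = 0.15,
                rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """mlm=True case (simplified): mask each non-special token with the given probability."""
    rng = rng or random.Random()
    masked: List[int] = []
    labels: List[int] = []
    for tok, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and rng.random() < mlm_probability:
            masked.append(mask_token_id)  # the model sees the mask token
            labels.append(tok)            # and must predict the original token
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)   # non-masked positions are ignored
    return masked, labels


# Made-up ids: 101 and 102 stand in for special tokens, 103 for the mask token.
ids = [101, 7592, 2088, 102]
special = [1, 0, 0, 1]
masked_ids, mlm_labels = mask_tokens(ids, special, mask_token_id=103, mlm_probability=1.0)
clm_labels = causal_lm_labels([5, 6, 0, 0], pad_token_id=0)
```

Setting `mlm_probability=1.0` here only makes the result deterministic for inspection; in training the default of 0.15 means roughly 15% of the non-special tokens are selected per example.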