
When You Need a DataCollator, and Some Common DataCollators


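DataCollator: if you do not specify one, a default DataCollator is used, and its job is simply to convert the inputs into tensors. The most common reason to specify one by hand is that the data has not been padded yet and needs dynamic padding, that is, padding each batch to the length of its longest sequence. In other words, if padding was already done in data_process and you have no special processing requirements, you probably do not need a custom collator at all.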
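To make "dynamic padding" concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function name `pad_batch`, the `pad_token_id=0` default, and the example token ids are all assumptions made for illustration.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in THIS batch (hypothetical helper)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        # Right-pad the ids and mark the padded positions with 0 in the mask.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Two un-padded examples of different lengths (made-up token ids).
features = [{"input_ids": [101, 7, 8, 102]}, {"input_ids": [101, 9, 102]}]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because the target length is computed per batch, short batches stay short, which is the main advantage over padding everything to a fixed maximum length during preprocessing.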

class DataCollatorForSeq2Seq: Data collator that will dynamically pad the inputs received, as well as the labels. (It distinguishes between the inputs and the outputs.)
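The sketch below illustrates what padding "the inputs as well as the labels" can look like for a sequence-to-sequence batch. It is a simplified pure-Python illustration under stated assumptions, not the library code: `pad_seq2seq_batch` and the token ids are hypothetical, and labels are padded with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import Dict, List

LABEL_PAD_ID = -100  # ignored by PyTorch's cross-entropy loss by default


def pad_seq2seq_batch(features: List[Dict[str, List[int]]],
                      pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels separately, each to its own batch maximum (hypothetical helper)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_out = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids, labels = f["input_ids"], f["labels"]
        batch["input_ids"].append(ids + [pad_token_id] * (max_in - len(ids)))
        batch["attention_mask"].append([1] * len(ids) + [0] * (max_in - len(ids)))
        # Labels get -100 instead of the pad token so padding does not contribute to the loss.
        batch["labels"].append(labels + [LABEL_PAD_ID] * (max_out - len(labels)))
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [11, 12]},
    {"input_ids": [8, 9], "labels": [13, 14, 15, 16]},
]
batch = pad_seq2seq_batch(features)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[11, 12, -100, -100], [13, 14, 15, 16]]
```

Note that the input length (3) and the label length (4) are independent here, which is the sense in which inputs and outputs are treated separately.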

class DataCollatorWithPadding: Data collator that will dynamically pad the inputs received.

class DataCollatorForTokenClassification: Data collator that will dynamically pad the inputs received, as well as the labels.

class DataCollatorForLanguageModeling: Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

  args:
  ① mlm (`bool`, *optional*, defaults to `True`): Whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
  ② mlm_probability (`float`, *optional*, defaults to 0.15): The probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.

Tip:  For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
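The label rules described for `mlm` and the role of `special_tokens_mask` can be sketched in plain Python as follows. This is a deliberately simplified illustration, not the library's algorithm: every selected token is replaced by the mask token here, whereas the real collator also sometimes substitutes a random token or leaves the token unchanged. The helper `make_lm_labels`, the ids `mask_token_id=103` and `pad_token_id=0`, and the sample sequence are assumptions.

```python
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # label value that the loss skips


def make_lm_labels(input_ids: List[int],
                   special_tokens_mask: List[int],
                   mlm: bool = True,
                   mlm_probability: float = 0.15,
                   mask_token_id: int = 103,   # assumed id, for illustration only
                   pad_token_id: int = 0,      # assumed id, for illustration only
                   rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) for one sequence (hypothetical helper)."""
    rng = rng or random.Random()
    if not mlm:
        # mlm=False: labels copy the inputs, with padding positions set to -100.
        labels = [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        return list(input_ids), labels
    inputs, labels = [], []
    for tok, is_special in zip(input_ids, special_tokens_mask):
        # Special tokens (e.g. CLS/SEP/PAD) are never selected for masking.
        if not is_special and rng.random() < mlm_probability:
            inputs.append(mask_token_id)  # hide the token from the model
            labels.append(tok)            # and ask the model to predict it
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)   # non-masked positions are ignored
    return inputs, labels


ids = [101, 7, 8, 9, 102, 0]
special = [1, 0, 0, 0, 1, 1]  # 1 marks special tokens, as in special_tokens_mask

clm_inputs, clm_labels = make_lm_labels(ids, special, mlm=False)
print(clm_labels)  # [101, 7, 8, 9, 102, -100]

mlm_inputs, mlm_labels = make_lm_labels(
    ids, special, mlm=True, mlm_probability=0.5, rng=random.Random(0)
)
print(mlm_inputs, mlm_labels)  # which positions are masked depends on the seed
```

Passing the special-token mask in with each example, rather than recomputing it inside the collator, is what the tip above is recommending for performance.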
