
[linux] SentencePiece tokenizer training fails on over-long text: malformed training data


Traceback (most recent call last):
  File "/xxxtext_generation_train/preprocess/token_preprocess/train_tokenizer.py", line 170, in <module>
    spm.SentencePieceTrainer.train(
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 989, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 973, in _Train
    model_proto = SentencePieceTrainer._TrainFromMap4(new_kwargs,
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 939, in _TrainFromMap4
    return _sentencepiece.SentencePieceTrainer__TrainFromMap4(args, iter)
RuntimeError: Internal: src/trainer_interface.cc(428) [!sentences_.empty()]  
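The `[!sentences_.empty()]` check in `trainer_interface.cc` fires when the trainer ends up with zero usable sentences. Besides an empty or wrongly encoded input file, a common cause with very long text is SentencePiece's `max_sentence_length` filter (4192 bytes by default): lines longer than that are silently skipped, so a corpus of over-long lines can be filtered down to nothing. A minimal preflight check along these lines (the `usable_lines` helper is hypothetical, not part of SentencePiece) can confirm this before training:

```python
# Hypothetical preflight check: count the lines SentencePiece could actually use.
# SentencePiece's default max_sentence_length is 4192 bytes; longer lines are
# skipped during training, which can leave the trainer with zero sentences and
# trigger the [!sentences_.empty()] error above.
MAX_SENTENCE_LENGTH = 4192

def usable_lines(path, max_bytes=MAX_SENTENCE_LENGTH):
    """Return (kept, skipped) line counts for a one-sentence-per-line corpus."""
    kept = skipped = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:          # blank lines contribute no sentences
                continue
            # The length limit is in bytes, not characters, so CJK text
            # hits it roughly three times faster than ASCII.
            if len(line.encode("utf-8")) > max_bytes:
                skipped += 1
            else:
                kept += 1
    return kept, skipped
```

If `kept` comes back as 0, either split the long lines into shorter sentences or raise the limit by passing `max_sentence_length` to `SentencePieceTrainer.train`.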

Check the training data format. Do not inspect samples with a bare `print`: put each sample into a dict and serialize it with `json.dumps(d, ensure_ascii=False)`, so that stray newlines, control characters, and encoding problems become visible as explicit escapes.
