
[nlp] Token indices sequence length is longer than the specified maximum sequence length for this model

Problem:

Token indices sequence length is longer than the specified maximum sequence length for this model (170835 > 32768). Running this sequence through the model will result in indexing errors

Solution:

Pass truncation=True when calling the tokenizer, so that over-long inputs are cut down to the model's maximum sequence length:

sentence_ids = Encoder.tokenizer(sentence, truncation=True)['input_ids']
def encode(self, json_line):
    try:  # json.loads may raise an exception on malformed input
        data = json.loads(json_line)
        ids = {}
        tokens = 0
        # print("json_keys", self.args.json_keys)
        for key in self.args.json_keys:
            text = data[key]
            doc_ids = []
            for sentence in Encoder.splitter.tokenize(text):
                if self.args.tokenizer_type == "QwenTokenizer":
                    # truncation=True caps each sentence at the model's max length
                    sentence_ids = Encoder.tokenizer(sentence, truncation=True)['input_ids']
                else:
                    sentence_ids = Encoder.tokenizer.tokenize(sentence)
                tokens += len(sentence_ids)
                if len(sentence_ids) > 0:
                    doc_ids.append(sentence_ids)
            if len(doc_ids) > 0 and self.args.append_eod:
                doc_ids[-1].append(Encoder.tokenizer.eos_token_id)  # qwen1.5: eos_token_id, qwen: eod_id, llama: eod
            ids[key] = doc_ids
        return ids, len(json_line), tokens
    except Exception:
        print("error in token_raw_data_for_dsw_qwen.py, please check lines 81 to 100")
        return {}, 0, 0
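Conceptually, truncation=True just caps each encoded sequence at the tokenizer's maximum length, so no token-id list longer than the limit ever reaches the model. A minimal stdlib-only sketch of that behavior (the whitespace "tokenizer", the vocab dict, and the max length of 8 are illustrative assumptions, not the real Qwen tokenizer):

```python
def encode_with_truncation(sentence, vocab, max_length=8):
    """Map whitespace-split words to ids, then truncate, mimicking truncation=True."""
    ids = [vocab.setdefault(word, len(vocab)) for word in sentence.split()]
    return ids[:max_length]  # drop everything past the model's maximum length

vocab = {}
long_sentence = " ".join(f"w{i}" for i in range(20))  # 20 "tokens", over the limit
ids = encode_with_truncation(long_sentence, vocab)
print(len(ids))  # never longer than max_length
```

Without the truncation step, the returned list would keep all 20 ids, which is exactly the situation the warning above describes (170835 > 32768).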
