
NLP tokenization: building a vocabulary
The snippet below builds a character-level vocabulary from the THUCNews training file, keeping at most `MAX_VOCAB_SIZE` tokens that occur at least `min_freq` times, and mapping each token to an integer id. The original listing never returned the vocabulary and never used the `<UNK>`/`<PAD>` entries it defined; both are fixed here.

```python
MAX_VOCAB_SIZE = 10000
UNK, PAD = '<UNK>', '<PAD>'


def build_vocab(file_name, tokenize, max_size, min_freq):
    vocab_dic = {}
    with open(file_name, 'r', encoding='utf-8') as f:
        for line in f:
            lin = line.strip()
            if not lin:
                continue
            # each line is "text<TAB>label"; keep only the text
            content = lin.split('\t')[0]
            for word in tokenize(content):
                vocab_dic[word] = vocab_dic.get(word, 0) + 1
    # keep tokens with frequency >= min_freq, most frequent first, capped at max_size
    vocab_list = sorted([item for item in vocab_dic.items() if item[1] >= min_freq],
                        key=lambda x: x[1], reverse=True)[:max_size]
    vocab_dic = {word_count[0]: idx for idx, word_count in enumerate(vocab_list)}
    # reserve ids for the unknown and padding tokens
    vocab_dic.update({UNK: len(vocab_dic), PAD: len(vocab_dic) + 1})
    return vocab_dic


file_name = '../text/THUCNews/data/train.txt'
# iterating over a string yields single characters, i.e. character-level tokenization
tokenize = lambda x: x.strip(' ')
vocab = build_vocab(file_name, tokenize, max_size=MAX_VOCAB_SIZE, min_freq=1)
print(vocab)
```
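To see the whole pipeline end to end without the THUCNews file on disk, here is a minimal sketch that applies the same counting-and-sorting logic to a few in-memory lines and then encodes a sentence into ids. The helper names `build_vocab_from_lines` and `encode` are illustrative, not part of the original post.

```python
MAX_VOCAB_SIZE = 10000
UNK, PAD = '<UNK>', '<PAD>'


def build_vocab_from_lines(lines, tokenize, max_size, min_freq):
    # same logic as build_vocab, but reading from a list instead of a file
    counts = {}
    for line in lines:
        content = line.strip().split('\t')[0]
        if not content:
            continue
        for word in tokenize(content):
            counts[word] = counts.get(word, 0) + 1
    vocab_list = sorted([item for item in counts.items() if item[1] >= min_freq],
                        key=lambda x: x[1], reverse=True)[:max_size]
    vocab = {word: idx for idx, (word, _) in enumerate(vocab_list)}
    vocab.update({UNK: len(vocab), PAD: len(vocab) + 1})
    return vocab


def encode(text, vocab, tokenize):
    # tokens unseen at vocab-building time fall back to the <UNK> id
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


tokenize = lambda x: list(x)  # character-level tokenization
lines = ['体育新闻\t0', '财经新闻\t1']
vocab = build_vocab_from_lines(lines, tokenize, MAX_VOCAB_SIZE, min_freq=1)
ids = encode('体育', vocab, tokenize)
```

Because Python's `sorted` is stable and dicts preserve insertion order, equally frequent characters keep the order in which they were first seen, so the id assignment is deterministic for a fixed input file.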
