
Preparing the NLTK Package and Corpus (building an NLTK Chinese corpus, ConditionalFreqDist)


    import pandas as pd

    # Read the novel one line per row into a single 'txt' column (the file is GBK-encoded)
    raw = pd.read_table('../data/金庸-射雕英雄传txt精校版.txt', names=['txt'], encoding="GBK")
    print(len(raw))  # number of lines read
    raw

    # Precompute helper columns used for chapter detection
    def m_head(tmpstr):
        return tmpstr[:1]  # first character of the line

    def m_mid(tmpstr):
        return tmpstr.find("回 ")  # position of "回 ", or -1 if absent

    raw['head'] = raw.txt.apply(m_head)
    raw['mid'] = raw.txt.apply(m_mid)
    raw['len'] = raw.txt.apply(len)
    raw.head(50)

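    # Chapter detection: assign a chapter number to every line
    chapnum = 0
    for i in range(len(raw)):
        # A chapter title starts with "第", contains "回 ", and is shorter than 30 characters
        if raw['head'][i] == "第" and raw['mid'][i] > 0 and raw['len'][i] < 30:
            chapnum += 1
        # After the last chapter, the first appendix heading resets the counter to 0
        if chapnum >= 40 and raw['txt'][i] == "附录一:成吉思汗家族":
            chapnum = 0
        raw.loc[i, 'chap'] = chapnum
    raw.head(50)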
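The title test in the loop above can also be written without an explicit loop. The following is only a minimal sketch under stated assumptions: it runs on a tiny made-up DataFrame (the chapter titles are placeholders, not lines from the novel), and it deliberately leaves out the appendix reset, which would still need separate handling.

```python
import pandas as pd

# Toy stand-in for the novel; every line here is hypothetical.
toy = pd.DataFrame({"txt": [
    "front matter",
    "第一回 标题甲",
    "paragraph a",
    "paragraph b",
    "第二回 标题乙",
    "paragraph c",
]})

# Same three conditions as the loop: the line starts with "第",
# contains "回 " somewhere after position 0, and has fewer than 30 characters.
is_title = (
    toy["txt"].str.startswith("第")
    & (toy["txt"].str.find("回 ") > 0)
    & (toy["txt"].str.len() < 30)
)

# A running count of title lines gives each row its chapter number;
# rows before the first title stay at 0.
toy["chap"] = is_title.cumsum()
print(toy["chap"].tolist())
```

On a long text this avoids one `raw.loc[i, 'chap']` assignment per row, at the cost of handling the appendix reset in a separate step.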

    # Drop the temporary helper columns
    del raw['head']
    del raw['mid']
    del raw['len']
    raw.head(50)

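    # Merge all lines of each chapter into one string per chapter
    rawgrp = raw.groupby('chap')
    chapter = rawgrp.agg(sum)  # with string-only columns, sum concatenates the strings
    chapter = chapter[chapter.index != 0]  # drop chapter 0 (front matter and appendices)
    chapter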
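To make the concatenation step easier to check, here is a small self-contained sketch on made-up rows that already carry a `chap` column. It assumes an explicit `"".join` is an acceptable substitute for `sum`; this is a swapped-in variant that spells out the string joining, not the approach used above.

```python
import pandas as pd

# Hypothetical rows; the 'chap' values are what chapter detection would produce.
toy = pd.DataFrame({
    "txt": ["preface", "第一回 标题甲", "aaa", "第二回 标题乙", "bbb"],
    "chap": [0, 1, 1, 2, 2],
})

# Join each chapter's lines explicitly instead of relying on sum().
chapter = toy.groupby("chap")["txt"].agg("".join).to_frame()

# Drop chapter 0, which holds everything outside the numbered chapters.
chapter = chapter[chapter.index != 0]
print(chapter)
```

Each remaining row holds the full text of one chapter, indexed by chapter number.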
