
Chinese Word Segmentation in Spark

1. Import the required segmentation packages

  import org.ansj.domain.Term
  import org.ansj.recognition.impl.StopRecognition
  import org.ansj.splitWord.analysis.ToAnalysis
  // Scala collection imports used by getWords in section 3
  import scala.collection.mutable
  import scala.collection.mutable.ArrayBuffer
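
These classes come from the Ansj segmentation library (ansj_seg). If the project does not yet depend on it, an sbt dependency along these lines should pull it in; the version shown is illustrative, so check Maven Central for the latest release:

  libraryDependencies += "org.ansj" % "ansj_seg" % "5.1.6"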

2. Stop-word filtering

  def filter(stopWords: Array[String]): StopRecognition = {
    val filter = new StopRecognition
    filter.insertStopNatures("w")      // drop punctuation (POS tag "w")
    filter.insertStopNatures("m")      // drop numerals (POS tag "m")
    filter.insertStopNatures("null")   // drop terms with no recognized nature
    filter.insertStopNatures("<br />") // drop <br /> tags
    filter.insertStopRegexes("^[a-zA-Z]{1,}") // drop pure English tokens
    filter.insertStopRegexes("^[0-9]+")       // drop pure numeric tokens
    filter.insertStopRegexes("[^a-zA-Z0-9\\u4e00-\\u9fa5]+") // drop tokens that are neither alphanumeric nor Chinese
    filter.insertStopRegexes("\t")            // drop tab characters
    for (word <- stopWords) {
      filter.insertStopWords(word)            // add user-supplied stop words
    }
    filter
  }
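
A quick usage sketch (the stop words here are illustrative placeholders, not from the original post):

  // Build a StopRecognition from a small example stop-word list
  val stopWords = Array("的", "了", "是")
  val stopFilter = filter(stopWords)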

3. Word segmentation

  def getWords(text: String, filter: StopRecognition): ArrayBuffer[String] = {
    val words = new mutable.ArrayBuffer[String]()
    // Segment the text with Ansj's ToAnalysis, applying the stop-word filter
    val terms: java.util.List[Term] = ToAnalysis.parse(text).recognition(filter).getTerms
    for (i <- 0 until terms.size()) {
      val word = terms.get(i).getName
      // MIN_WORD_LENGTH is assumed to be a constant defined elsewhere in the class
      if (word.length >= MIN_WORD_LENGTH) {
        words += word
      }
    }
    words
  }
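
Putting the pieces together on Spark: the sketch below reads a corpus, segments each line with getWords, and counts word frequencies. The input paths and the SparkSession setup are assumptions for illustration, not part of the original post; the filter is rebuilt per partition in case StopRecognition is not serializable.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("ChineseSegmentation").getOrCreate()
  val sc = spark.sparkContext

  // Hypothetical input locations; replace with real paths
  val stopWords = sc.textFile("hdfs:///path/to/stopwords.txt").collect()
  val stopWordsBc = sc.broadcast(stopWords)

  val wordCounts = sc.textFile("hdfs:///path/to/corpus.txt")
    .mapPartitions { lines =>
      // Build the filter on the executor side rather than shipping it in the closure
      val localFilter = filter(stopWordsBc.value)
      lines.flatMap(line => getWords(line, localFilter))
    }
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  wordCounts.take(10).foreach(println)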

Reposted from: https://blog.51cto.com/9283734/2349452
