
Custom named entities with OpenNLP

opennlp 小明

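This article looks at how to define custom named entities with OpenNLP: annotating the training data, training a model, and applying the model.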

maven

    <dependency>
        <groupId>org.apache.opennlp</groupId>
        <artifactId>opennlp-tools</artifactId>
        <version>1.8.4</version>
    </dependency>

Practice

Training the model

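    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import opennlp.tools.namefind.BioCodec;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.InputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    // train the name finder (the stream and train calls can throw IOException)
    String typedEntities = "<START:organization> NATO <END>\n" +
        "<START:location> United States <END>\n" +
        "<START:organization> NATO Parliamentary Assembly <END>\n" +
        "<START:location> Edinburgh <END>\n" +
        "<START:location> Britain <END>\n" +
        "<START:person> Anders Fogh Rasmussen <END>\n" +
        "<START:location> U . S . <END>\n" +
        "<START:person> Barack Obama <END>\n" +
        "<START:location> Afghanistan <END>\n" +
        "<START:person> Rasmussen <END>\n" +
        "<START:location> Afghanistan <END>\n" +
        "<START:date> 2010 <END>";
    // MockInputStreamFactory lives in OpenNLP's test sources, not in opennlp-tools,
    // so an in-memory InputStreamFactory is used here instead
    InputStreamFactory in =
        () -> new ByteArrayInputStream(typedEntities.getBytes(StandardCharsets.UTF_8));
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
        new PlainTextByLineStream(in, "UTF-8"));
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
    params.put(TrainingParameters.ITERATIONS_PARAM, 70);
    params.put(TrainingParameters.CUTOFF_PARAM, 1);
    TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
        params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));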

OpenNLP uses <START> and <END> tags to annotate entities in the training text. To give an entity a type, add it after START with a colon, for example <START:person>.
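To make the markup concrete, here is a minimal sketch in plain Java of how one annotated line maps to typed token spans. It is not OpenNLP's own parser: the class `AnnotationSketch`, the `TypedSpan` type, and the `"default"` placeholder for untyped spans are assumptions made for illustration. The sample line embeds the tags in a full whitespace-tokenized sentence, which is the shape real training data usually takes, whereas the training string in this article contains only the entities themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tiny reader for the <START:type> ... <END> markup.
// Not OpenNLP code; all names here are hypothetical.
class AnnotationSketch {

    // A typed token span; start is inclusive, end is exclusive.
    static final class TypedSpan {
        final int start;
        final int end;
        final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    // Appends the plain tokens to `tokens` and returns the annotated spans.
    static List<TypedSpan> parse(String line, List<String> tokens) {
        List<TypedSpan> spans = new ArrayList<>();
        int spanStart = -1;
        String spanType = null;
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // blank input line
            }
            if (part.startsWith("<START") && part.endsWith(">")) {
                // "<START>" carries no type; "<START:person>" carries "person".
                int colon = part.indexOf(':');
                spanType = colon < 0
                        ? "default" // placeholder type chosen for this sketch
                        : part.substring(colon + 1, part.length() - 1);
                spanStart = tokens.size();
            } else if (part.equals("<END>")) {
                if (spanStart < 0) {
                    throw new IllegalArgumentException("<END> without <START>: " + line);
                }
                spans.add(new TypedSpan(spanStart, tokens.size(), spanType));
                spanStart = -1;
            } else {
                tokens.add(part); // ordinary token, inside or outside an entity
            }
        }
        if (spanStart >= 0) {
            throw new IllegalArgumentException("<START> without <END>: " + line);
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        String line = "<START:person> Barack Obama <END> visited "
                + "<START:location> Afghanistan <END> in <START:date> 2010 <END> .";
        for (TypedSpan span : parse(line, tokens)) {
            System.out.println(span + " -> "
                    + String.join(" ", tokens.subList(span.start, span.end)));
        }
    }
}
```

The tags must be separated from the tokens by whitespace, as in the training string above; the sketch relies on that when it splits the line.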

Parameters

  • ALGORITHM_PARAM

Selects the training algorithm; the example uses MAXENT (maximum entropy). "On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well."

  • ITERATIONS_PARAM

The number of training iterations. With the command-line trainer, this option is ignored if -params is used.

  • CUTOFF_PARAM

The minimum number of times a feature must be seen in the training data for it to be kept.
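The effect of a cutoff can be sketched in plain Java. This is not OpenNLP's implementation: the class `CutoffSketch` and the feature strings such as `w=nato` are made up for illustration, and the only assumption carried over is the rule stated above, that a feature is kept when it has been seen at least `cutoff` times.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: what a feature cutoff does.
// Not OpenNLP code; names and feature strings are hypothetical.
class CutoffSketch {

    // Returns the features that occur at least `cutoff` times across all events.
    static Set<String> keptFeatures(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>(); // sorted for stable printing
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> events = Arrays.asList(
                Arrays.asList("w=nato", "cap=all"),
                Arrays.asList("w=obama", "cap=initial"),
                Arrays.asList("w=rasmussen", "cap=initial"));
        System.out.println(keptFeatures(events, 1)); // every feature survives
        System.out.println(keptFeatures(events, 2)); // only cap=initial survives
    }
}
```

With only a dozen training lines, as in this article, most features occur once, so a cutoff above 1 would discard nearly all of them. That is presumably why the example sets CUTOFF_PARAM to 1.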

Using the model

Once the model above has been trained, it can be used to find entities in text.

    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    import java.util.stream.Stream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    NameFinderME nameFinder = new NameFinderME(nameFinderModel);
    // now test whether it can detect the entities in a sample sentence
    String[] sentence = "NATO United States Barack Obama".split("\\s+");
    Span[] names = nameFinder.find(sentence);
    Stream.of(names)
        .forEach(span -> {
            // spans hold token indices: getStart() is inclusive, getEnd() is exclusive
            String named = IntStream.range(span.getStart(), span.getEnd())
                .mapToObj(i -> sentence[i])
                .collect(Collectors.joining(" "));
            System.out.println("find type: " + span.getType() + ",name: " + named);
        });

The output is:

    find type: organization,name: NATO
    find type: location,name: United States
    find type: person,name: Barack Obama

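Summary

OpenNLP's annotation format for custom named entities leaves room for customization: developers can define the entity types specific to their own domain, which improves recognition accuracy for those entities.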

