Preface
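This article looks at how to use OpenNLP with custom named entities: annotating the training data, training a model, and applying that model.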
Maven
- <dependency>
- <groupId>org.apache.opennlp</groupId>
- <artifactId>opennlp-tools</artifactId>
- <version>1.8.4</version>
- </dependency>
Practice
Training the model
- // train the name finder
- String typedEntities = "<START:organization> NATO <END>\n" +
- "<START:location> United States <END>\n" +
- "<START:organization> NATO Parliamentary Assembly <END>\n" +
- "<START:location> Edinburgh <END>\n" +
- "<START:location> Britain <END>\n" +
- "<START:person> Anders Fogh Rasmussen <END>\n" +
- "<START:location> U . S . <END>\n" +
- "<START:person> Barack Obama <END>\n" +
- "<START:location> Afghanistan <END>\n" +
- "<START:person> Rasmussen <END>\n" +
- "<START:location> Afghanistan <END>\n" +
- "<START:date> 2010 <END>";
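- // MockInputStreamFactory is a helper from OpenNLP's own test sources and is not shipped in opennlp-tools,
- // so an InputStreamFactory lambda over the in-memory string is used instead
- // (needs java.io.ByteArrayInputStream and java.nio.charset.StandardCharsets).
- ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
- new PlainTextByLineStream(
- () -> new ByteArrayInputStream(typedEntities.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));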
-
- TrainingParameters params = new TrainingParameters();
- params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
- params.put(TrainingParameters.ITERATIONS_PARAM, 70);
- params.put(TrainingParameters.CUTOFF_PARAM, 1);
-
- TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
- params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
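OpenNLP uses <START> and <END> to mark an entity in the training text. For a typed named entity, put the type after START, separated by a colon, for example <START:person>.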
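To make the markup concrete, here is a minimal sketch in plain Java that reads one annotated line and recovers the plain tokens plus the typed entity spans. It is only an illustration of the format, not OpenNLP's own parser. The class name AnnotationFormatSketch, the Entity type, and the "default" label for untyped spans are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a tiny reader for the <START:type> ... <END> markup. */
class AnnotationFormatSketch {

    /** A typed entity covering token indices [start, end) of the plain tokens. */
    static final class Entity {
        final String type;
        final int start;
        final int end;

        Entity(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /**
     * Splits the line on whitespace, appends the ordinary tokens to tokensOut
     * and returns the entities described by the markers.
     */
    static List<Entity> parse(String line, List<String> tokensOut) {
        List<Entity> entities = new ArrayList<>();
        String type = null;
        int start = -1;
        for (String tok : line.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // blank input line
            }
            if (tok.startsWith("<START") && tok.endsWith(">")) {
                // "<START>" carries no type; "<START:person>" names it after the colon.
                // The "default" label is an assumption of this sketch.
                int colon = tok.indexOf(':');
                type = colon < 0 ? "default" : tok.substring(colon + 1, tok.length() - 1);
                start = tokensOut.size();
            } else if (tok.equals("<END>")) {
                entities.add(new Entity(type, start, tokensOut.size()));
                type = null;
                start = -1;
            } else {
                tokensOut.add(tok);
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<Entity> found = parse(
            "<START:person> Barack Obama <END> visited <START:location> Afghanistan <END> .",
            tokens);
        System.out.println(tokens);
        System.out.println(found);
    }
}
```

Note that the markers must be separated from the surrounding tokens by whitespace, which is also how the training string above is written.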
Parameters
- ALGORITHM_PARAM
The training algorithm; MAXENT selects maximum entropy. On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well.
- ITERATIONS_PARAM
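Number of training iterations. The command-line help adds that this option is ignored if -params is used; in the code above the value is set directly on TrainingParameters.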
- CUTOFF_PARAM
Minimal number of times a feature must be seen in the training data for it to be kept.
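The effect of the cutoff can be sketched in a few lines of plain Java. This is only the idea (count each feature, keep those seen at least cutoff times), not OpenNLP's actual indexing code. The class name CutoffSketch and the feature strings are made up for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: frequency cutoff over observed feature strings. */
class CutoffSketch {

    /** Returns the features that occur at least cutoff times. */
    static Set<String> keepFrequent(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical feature strings, not the ones OpenNLP generates.
        List<String> features = Arrays.asList(
            "w=NATO", "w=NATO", "w=Obama", "prev=the", "prev=the", "prev=the");
        System.out.println(keepFrequent(features, 1));
        System.out.println(keepFrequent(features, 2));
    }
}
```

With only a dozen training lines almost every feature is seen very few times, which is presumably why the example sets the cutoff to 1: a higher value would discard most of them.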
Using the model
Once the model has been trained as above, it can be used to find entities in tokenized text:
- NameFinderME nameFinder = new NameFinderME(nameFinderModel);
-
- // now test if it can detect the sample sentences
-
- String[] sentence = "NATO United States Barack Obama".split("\\s+");
-
- Span[] names = nameFinder.find(sentence);
-
- Stream.of(names)
- .forEach(span -> {
- String named = IntStream.range(span.getStart(),span.getEnd())
- .mapToObj(i -> sentence[i])
- .collect(Collectors.joining(" "));
- System.out.println("find type: "+ span.getType()+",name: " + named);
- });
The output is as follows:
- find type: organization,name: NATO
- find type: location,name: United States
- find type: person,name: Barack Obama
Summary
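OpenNLP's annotation format for custom named entities leaves room for customization: developers can define the named entities that matter in their own domain and train on them, which improves recognition accuracy for those entities.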