Official site: https://www.hanlp.com/index.html
GitHub: https://github.com/hankcs/HanLP/tree/v1.7.8
HanLP is an open-source third-party package offering Chinese/English word segmentation, custom dictionaries, part-of-speech tagging, keyword extraction, sentiment analysis, and other NLP features. Its standout qualities are quick onboarding and simple configuration; in my experience you can get started with zero prior background.
If your project involves data cleaning, data analysis, or sentiment analysis, consider using this package directly for the data processing.
Below is how to use it in a Java Spring project.
For the detailed manual, see the official site and the GitHub documentation.
If the Maven repository you use already hosts the package, just declare the dependency. Otherwise, download the complete HanLP package from GitHub first and upload it to your Maven repository.
```xml
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.8</version>
</dependency>
```
After adding the dependency, refresh Maven and run install. The available versions are listed on the official site; here we use portable-1.7.8.
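A quick way to confirm the dependency resolved is to call `HanLP.segment` directly; the portable jar bundles a core dictionary, so this sketch should work without any extra data files (the sample sentence is arbitrary):

```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class HanlpSmokeTest
{
    public static void main(String[] args)
    {
        // Segments using the core dictionary bundled in the portable jar.
        List<Term> termList = HanLP.segment("你好,欢迎使用HanLP汉语处理包!");
        System.out.println(termList);
    }
}
```

If this prints a list of terms with part-of-speech tags, the package is wired up correctly.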
Most projects can simply download the official data package and use it as the base data set, then extend it with custom or domain-specific dictionaries:
https://github.com/hankcs/HanLP/archive/refs/tags/v1.7.8.zip
If you pulled in the dependency, the jar already contains a data directory and the basic dictionaries; but if you need custom dictionaries, you must add a data directory of your own to your project.
It can live anywhere, though the resources directory is recommended.
Inside data, the dictionary folder holds the base dictionaries; custom dictionaries you add later are best kept there too, so everything is managed in one place.
hanlp.properties is the configuration file where custom dictionaries are declared; it is best placed in the resources directory as well.
This file specifies the path of each customizable resource. From the source code we can see:
```java
try
{
    p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ?
            loader.getResourceAsStream("hanlp.properties") :
            new FileInputStream(Predefine.HANLP_PROPERTIES_PATH), "UTF-8"));
}
catch (Exception e)
{
    String HANLP_ROOT = System.getProperty("HANLP_ROOT");
    if (HANLP_ROOT == null) HANLP_ROOT = System.getenv("HANLP_ROOT");
    if (HANLP_ROOT != null)
    {
        HANLP_ROOT = HANLP_ROOT.trim();
        p = new Properties();
        p.setProperty("root", HANLP_ROOT);
        logger.info("使用环境变量 HANLP_ROOT=" + HANLP_ROOT);
    }
    else throw e;
}
String root = p.getProperty("root", "").replaceAll("\\\\", "/");
if (root.length() > 0 && !root.endsWith("/")) root += "/";
CoreDictionaryPath = root + p.getProperty("CoreDictionaryPath", CoreDictionaryPath);
CoreDictionaryTransformMatrixDictionaryPath = root + p.getProperty("CoreDictionaryTransformMatrixDictionaryPath", CoreDictionaryTransformMatrixDictionaryPath);
BiGramDictionaryPath = root + p.getProperty("BiGramDictionaryPath", BiGramDictionaryPath);
CoreStopWordDictionaryPath = root + p.getProperty("CoreStopWordDictionaryPath", CoreStopWordDictionaryPath);
CoreSynonymDictionaryDictionaryPath = root + p.getProperty("CoreSynonymDictionaryDictionaryPath", CoreSynonymDictionaryDictionaryPath);
PersonDictionaryPath = root + p.getProperty("PersonDictionaryPath", PersonDictionaryPath);
PersonDictionaryTrPath = root + p.getProperty("PersonDictionaryTrPath", PersonDictionaryTrPath);
String[] pathArray = p.getProperty("CustomDictionaryPath", "data/dictionary/custom/CustomDictionary.txt").split(";");
String prePath = root;
for (int i = 0; i < pathArray.length; ++i)
{
    if (pathArray[i].startsWith(" "))
    {
        pathArray[i] = prePath + pathArray[i].trim();
    }
    else
    {
        pathArray[i] = root + pathArray[i];
        int lastSplash = pathArray[i].lastIndexOf('/');
        if (lastSplash != -1)
        {
            prePath = pathArray[i].substring(0, lastSplash + 1);
        }
    }
}
CustomDictionaryPath = pathArray;
tcDictionaryRoot = root + p.getProperty("tcDictionaryRoot", tcDictionaryRoot);
if (!tcDictionaryRoot.endsWith("/")) tcDictionaryRoot += '/';
PinyinDictionaryPath = root + p.getProperty("PinyinDictionaryPath", PinyinDictionaryPath);
TranslatedPersonDictionaryPath = root + p.getProperty("TranslatedPersonDictionaryPath", TranslatedPersonDictionaryPath);
JapanesePersonDictionaryPath = root + p.getProperty("JapanesePersonDictionaryPath", JapanesePersonDictionaryPath);
PlaceDictionaryPath = root + p.getProperty("PlaceDictionaryPath", PlaceDictionaryPath);
PlaceDictionaryTrPath = root + p.getProperty("PlaceDictionaryTrPath", PlaceDictionaryTrPath);
OrganizationDictionaryPath = root + p.getProperty("OrganizationDictionaryPath", OrganizationDictionaryPath);
OrganizationDictionaryTrPath = root + p.getProperty("OrganizationDictionaryTrPath", OrganizationDictionaryTrPath);
CharTypePath = root + p.getProperty("CharTypePath", CharTypePath);
CharTablePath = root + p.getProperty("CharTablePath", CharTablePath);
PartOfSpeechTagDictionary = root + p.getProperty("PartOfSpeechTagDictionary", PartOfSpeechTagDictionary);
WordNatureModelPath = root + p.getProperty("WordNatureModelPath", WordNatureModelPath);
MaxEntModelPath = root + p.getProperty("MaxEntModelPath", MaxEntModelPath);
NNParserModelPath = root + p.getProperty("NNParserModelPath", NNParserModelPath);
PerceptronParserModelPath = root + p.getProperty("PerceptronParserModelPath", PerceptronParserModelPath);
CRFSegmentModelPath = root + p.getProperty("CRFSegmentModelPath", CRFSegmentModelPath);
HMMSegmentModelPath = root + p.getProperty("HMMSegmentModelPath", HMMSegmentModelPath);
CRFCWSModelPath = root + p.getProperty("CRFCWSModelPath", CRFCWSModelPath);
CRFPOSModelPath = root + p.getProperty("CRFPOSModelPath", CRFPOSModelPath);
CRFNERModelPath = root + p.getProperty("CRFNERModelPath", CRFNERModelPath);
PerceptronCWSModelPath = root + p.getProperty("PerceptronCWSModelPath", PerceptronCWSModelPath);
PerceptronPOSModelPath = root + p.getProperty("PerceptronPOSModelPath", PerceptronPOSModelPath);
PerceptronNERModelPath = root + p.getProperty("PerceptronNERModelPath", PerceptronNERModelPath);
ShowTermNature = "true".equals(p.getProperty("ShowTermNature", "true"));
Normalization = "true".equals(p.getProperty("Normalization", "false"));
String ioAdapterClassName = p.getProperty("IOAdapter");
```
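Based on the loading code above, a minimal hanlp.properties might look like the sketch below. The paths are illustrative assumptions (they presume the data directory sits under the resources folder, and `my_custom.txt` is a hypothetical custom dictionary file):

```properties
# Root directory; all other paths are resolved relative to it.
root=src/main/resources/

# Multiple custom dictionaries are separated by semicolons.
# Per the parsing loop above, an entry that starts with a space
# reuses the directory of the previous entry.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; my_custom.txt

# Whether to show part-of-speech tags in segmentation output.
ShowTermNature=true
```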
As the code shows, the items we can customize in this configuration file include the root path, CustomDictionaryPath, and the other dictionary and model paths read above.
I have not explored the remaining options; feel free to investigate them yourself.
Segmentation example (from the official documentation):
```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public static void main(String[] args)
{
    String[] testCase = new String[]{
        "商品和服务",
        "当下雨天地面积水分外严重",
        "结婚的和尚未结婚的确实在干扰分词啊",
        "买水果然后来世博园最后去世博会",
        "中国的首都是北京",
        "欢迎新老师生前来就餐",
        "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
        "随着页游兴起到现在的页游繁盛,依赖于存档进行逻辑判断的设计减少了,但这块也不能完全忽略掉。",
    };
    for (String sentence : testCase)
    {
        List<Term> termList = HanLP.segment(sentence);
        System.out.println(termList);
    }
}
```
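Besides the dictionary files declared in hanlp.properties, words can also be injected at runtime through the CustomDictionary class. The sketch below follows the official custom-dictionary demo; the sample words are arbitrary:

```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.dictionary.CustomDictionary;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class CustomDictDemo
{
    public static void main(String[] args)
    {
        String sentence = "攻城狮逆袭单身狗";
        // Before: "攻城狮" is likely split into single characters.
        System.out.println(HanLP.segment(sentence));

        // insert() overwrites any existing entry; the second argument is
        // "part-of-speech frequency".
        CustomDictionary.insert("攻城狮", "nz 1024");
        // add() only takes effect if the word is not already present.
        CustomDictionary.add("单身狗");

        // After: the injected words come out as whole terms.
        System.out.println(HanLP.segment(sentence));
    }
}
```

Note that runtime insertions are held in memory only; to persist them across restarts, put the words into a custom dictionary file referenced by CustomDictionaryPath instead.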