java对字符串进行分词并根据每个词出现的次数计算权重

作者：繁依Fanyi0 | 2024-04-08 01:18:34

踩

对字符串进行分词

如果要对字符串进行分词的操作,我们可以借助很多成熟的工具包,例如Stanford, CoreNLP, Jieba, IK Analyzer, HanLP, lucene等诸多工具此处我是用的是lucene,原本我是想用IK分词器的,但是不知道为什么maven我始终下载不下来

lucene介绍

Lucene是一个基于Java语言开发的开源全文搜索引擎库。也就是说，它能够提供全文搜索、近似搜索、词法分析、查询解析以及索引等多种文本处理功能。Lucene 作为一个 Java开发包，强大、高效、精确地完成各种搜索任务，深受Java开发者的青睐。

Lucene采用简单明了的接口，易于操作，任何一名开发工程师都可以快速上手。通过使用Lucene，开发人员可以轻松地实现文本搜索、语言处理，提高其软件产品的质量和价值。它与 Solr和 ElasticSearch等搜索引擎的API兼容性良好，常用于企业级应用中，例如商务搜索引擎、文档管理系统、知识管理系统、邮件服务器、文件格式转换器等多种场景。

引入依赖


        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>8.10.0</version>
        </dependency>

测试代码


public class Test {
 
    public static void main(String[] args) throws IOException {
        String text = "这句话是有回音的," +
                "是有回音的," +
                "有回音的," +
                "的";
        List<String> list = tokenizeString(text);
        // 遍历List，并统计每个字符串的出现次数
        Map<String, Integer> statistics = new HashMap<>(list.size());
        for (String str : list) {
            statistics.put(str, statistics.getOrDefault(str, 0) + 1);
        }
 
        for (Map.Entry<String, Integer> entry : statistics.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
 
    public static List<String> tokenizeString(String str) throws IOException {
        //中文分词器
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("text", new StringReader(str));
 
        List<String> tokens = new ArrayList<>();
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
 
        while (tokenStream.incrementToken()) {
            String term = charTermAttribute.toString();
            tokens.add(term);
        }
 
        analyzer.close();
 
        return tokens;
    }
}

测试结果

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/繁依Fanyi0/article/detail/381955