
Downloading WordNet offline for NLTK to compute the METEOR metric, usable with Chinese text-generation datasets

Author: 菜鸟追梦旅行 | 2024-05-26 16:48:25

The METEOR scorer in pycocoeval calls out to an external Java program, which is a hassle to set up, and it does not support Chinese!

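The nltk library alone is enough to compute the METEOR text-generation metric.
However, NLTK's METEOR scorer needs the WordNet corpus, which is a separate download, and that holds for Chinese text too.
Our cluster has no external network access, so the corpus has to be downloaded offline.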

**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('wordnet')
  
  For more information see: https://www.nltk.org/data.html
  Attempted to load corpora/wordnet.zip/wordnet/
  Searched in:
    - '/Users/xq/nltk_data'
    - '/Users/xq/.conda/envs/pycharm_env/nltk_data'
    - '/Users/xq/.conda/envs/pycharm_env/share/nltk_data'
    - '/Users/xq/.conda/envs/pycharm_env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
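
Before downloading anything, it can help to check whether one of the searched directories already holds the corpus. The sketch below uses only the standard library and only approximates NLTK's lookup: it assumes that either an unpacked folder or a zip archive under corpora/ counts as installed, which is what the "Attempted to load corpora/wordnet.zip/wordnet/" line suggests. The helper name find_corpus is hypothetical, and the home directory entry stands in for /Users/xq/nltk_data.

```python
from pathlib import Path


def find_corpus(search_dirs, name="wordnet", category="corpora"):
    """Return the first existing corpus path (folder or zip), or None."""
    for base in search_dirs:
        candidates = (
            Path(base) / category / name,            # unpacked folder
            Path(base) / category / f"{name}.zip",   # zip archive
        )
        for candidate in candidates:
            if candidate.exists():
                return candidate
    return None


# Directories taken from the "Searched in" list of the error message.
SEARCH_DIRS = [
    str(Path.home() / "nltk_data"),
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]

# Prints the path of an existing copy, or None if nothing is installed yet.
print(find_corpus(SEARCH_DIRS))
```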


Go to http://www.nltk.org/nltk_data/, search for the package you need, click download, and save the zip archive.
Direct link: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip
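
On a machine that does have internet access, the same download can be scripted before the archive is copied to the cluster. This is a minimal sketch using only the standard library. The helper names package_url and fetch_package are hypothetical, and it is an assumption that other packages follow the same URL pattern as the wordnet link.

```python
import urllib.request
from pathlib import Path

# Prefix of the wordnet link, up to the package category.
BASE_URL = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages"


def package_url(name, category="corpora"):
    """Build the zip URL for a package, following the wordnet link's pattern."""
    return f"{BASE_URL}/{category}/{name}.zip"


def fetch_package(name, dest_dir=".", category="corpora"):
    """Download <name>.zip into dest_dir and return the local path."""
    dest = Path(dest_dir) / f"{name}.zip"
    urllib.request.urlretrieve(package_url(name, category), dest)
    return dest
```

Calling fetch_package("wordnet") on the connected machine saves wordnet.zip to the current directory. The file then has to be transferred to the cluster by whatever means is available there.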

 Searched in:
    - '/Users/xq/nltk_data'
    - '/Users/xq/.conda/envs/pycharm_env/nltk_data'
    - '/Users/xq/.conda/envs/pycharm_env/share/nltk_data'
    - '/Users/xq/.conda/envs/pycharm_env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'

Create an nltk_data folder at one of the paths listed above.
The loader looks for corpora/wordnet.zip below that folder, so unzip the archive into /usr/local/share/nltk_data/corpora with:

unzip wordnet.zip -d /usr/local/share/nltk_data/corpora

No changes to the existing Python code are needed. Just run it again.
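
If /usr/local/share is not writable on the cluster, the archive can go into the nltk_data folder in the home directory instead, which is the first entry in the searched list (/Users/xq/nltk_data in the error). The following is a sketch using only the standard library. The helper name install_corpus is hypothetical, and it assumes the zip contains a top-level wordnet/ folder, as the "corpora/wordnet.zip/wordnet/" path in the error suggests.

```python
import zipfile
from pathlib import Path


def install_corpus(zip_path, data_dir=None, category="corpora"):
    """Unpack a corpus zip into <data_dir>/<category>/ and return that folder."""
    if data_dir is None:
        data_dir = Path.home() / "nltk_data"   # per-user location, no root needed
    target = Path(data_dir) / category
    target.mkdir(parents=True, exist_ok=True)  # create nltk_data/corpora if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

Running install_corpus("wordnet.zip") should leave the files under ~/nltk_data/corpora/wordnet, mirroring what the unzip command does for the system-wide directory.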

Example (for Chinese, remember to tokenize the text first, here with the jieba library)

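For a Chinese dataset, the Meteor class has to be rewritten around nltk.translate.meteor_score.single_meteor_score, which also avoids the TypeError raised when a verbose argument is passed. The Chinese text must also be preprocessed correctly, that is, segmented into words. The adjusted Meteor class for Chinese datasets: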

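import jieba
from nltk.translate.meteor_score import single_meteor_score


class MyMeteor:
    def __init__(self):
        # No command-line arguments or external Java process are needed.
        pass

    def compute_score(self, gts, res):
        """
        Compute METEOR scores for a Chinese dataset.
        :param gts: ground-truth captions; maps image ID to a list of reference strings.
        :param res: generated captions; maps image ID to a list holding one caption string.
        :return: the average score and the list of per-image scores.
        """
        assert gts.keys() == res.keys()
        imgIds = list(gts.keys())
        scores = []

        for i in imgIds:
            hypothesis = res[i][0]  # generated caption
            references = gts[i]     # list of reference captions

            # Segment the hypothesis into words.
            hypothesis_tokens = list(jieba.cut(hypothesis))
            # Score the hypothesis against every reference and keep the best score.
            # Recent NLTK versions expect token lists for both arguments, so the
            # reference is segmented as well instead of being passed as a joined string.
            img_scores = [
                single_meteor_score(list(jieba.cut(ref)), hypothesis_tokens)
                for ref in references
            ]
            max_score = max(img_scores)
            scores.append(max_score)

        # Average over all images.
        average_score = sum(scores) / len(scores) if scores else 0

        return average_score, scores

    def method(self):
        return "METEOR"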
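
To see why segmentation matters, the sketch below compares token overlap with and without it. It is not METEOR: unigram_f1 is a hypothetical, simplified stand-in (the harmonic mean of unigram precision and recall), and character-level splitting is used as a dependency-free substitute for jieba.cut, so the numbers only illustrate the effect.

```python
from collections import Counter


def unigram_f1(reference_tokens, hypothesis_tokens):
    """Harmonic mean of unigram precision and recall (a simplification, not METEOR)."""
    ref_counts = Counter(reference_tokens)
    hyp_counts = Counter(hypothesis_tokens)
    # Multiset intersection: a token matches at most as often as it occurs in both.
    matches = sum((ref_counts & hyp_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis_tokens)
    recall = matches / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "一只猫坐在垫子上"
hypothesis = "一只猫躺在垫子上"

# Chinese has no spaces, so whitespace splitting yields one token per sentence
# and two sentences only match when they are identical.
unsegmented = unigram_f1(reference.split(), hypothesis.split())

# Character-level tokens stand in for jieba.cut in this illustration.
segmented = unigram_f1(list(reference), list(hypothesis))

print(unsegmented, segmented)  # 0.0 0.875
```

Without segmentation the two near-identical sentences share no tokens at all, which is why the class above runs jieba.cut on both the hypothesis and every reference.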

