
How Large AI Models Evaluate Code Generation Ability: a Walkthrough of human-eval


The repository https://github.com/open-compass/human-eval is used to evaluate code generation ability.

1 open-compass's humaneval.py

humaneval.py is the OpenCompass evaluation code for the code-generation benchmark.
Code path: opencompass/opencompass/datasets/humaneval.py

# Imports needed by the evaluator (the original module contains further imports
# for the dataset class, omitted here)
import os.path as osp
import tempfile
from typing import List

from opencompass.openicl.icl_evaluator import BaseEvaluator


class HumanEvaluator(BaseEvaluator):
    """Evaluator for HumanEval or EvalPlus."""
    def __init__(self,
                 k: List[int] = [1, 10, 100],
                 metric: str = 'HumanEval') -> None:
        self.metric = metric
        assert self.metric in ['HumanEval', 'EvalPlus']
        if self.metric == 'HumanEval':
            try:
                from human_eval.data import HUMAN_EVAL, write_jsonl
                from human_eval.evaluation import \
                    evaluate_functional_correctness
                self.write_jsonl = write_jsonl
                self.HUMAN_EVAL = HUMAN_EVAL
                self.eval = evaluate_functional_correctness
            except ImportError:
                raise ImportError(
                    'Please install human_eval use following steps:\n'
                    'git clone git@github.com:open-compass/human-eval.git\n'
                    'cd human-eval && pip install -e .')
        else:
            try:
                from evalplus.data import write_jsonl
                from evalplus.evaluate import evaluate
                self.write_jsonl = write_jsonl
                self.eval = evaluate
            except ImportError:
                raise ImportError(
                    'Please install evalplus use following steps:\n'
                    'git clone --recurse-submodules git@github.com:open-compass/human-eval.git\n'  # noqa
                    'cd human-eval\n'
                    'pip install -e .\n'
                    'pip install -e evalplus\n')
        self.k = k
        super().__init__()

    def score(self, predictions, references, test_set):
        prompts = [item['prompt'] for item in test_set]
        humaneval_preds = []
        if self.metric == 'HumanEval':
            # create json file in human_eval format
            for preds, refer in zip(predictions, references):
                # suits for two case
                # 1. use repeated dataset
                # 2. use `num_return_sequences` to generate multiple responses
                if not isinstance(preds, list):
                    preds = [preds]
                for pred in preds:
                    humaneval_preds.append({
                        'task_id': refer,
                        'completion': pred
                    })
            with tempfile.TemporaryDirectory() as tmp_dir:
                out_dir = osp.join(tmp_dir, 'human_eval.json')
                self.write_jsonl(out_dir, humaneval_preds)
                score = self.eval(out_dir,
                                  self.k,
                                  n_workers=4,
                                  timeout=3.0,
                                  problem_file=self.HUMAN_EVAL)
                return {f'humaneval_{k}': score[k] * 100 for k in score}
        else:
            # EvalPlus scores full solutions (prompt + completion), written as jsonl
            for preds, refer, prompt in zip(predictions, references, prompts):
                if not isinstance(preds, list):
                    preds = [preds]
                for pred in preds:
                    humaneval_preds.append({
                        'task_id': refer,
                        'solution': prompt + pred
                    })
            with tempfile.TemporaryDirectory() as tmp_dir:
                out_dir = osp.join(tmp_dir, 'human_eval.jsonl')
                self.write_jsonl(out_dir, humaneval_preds)
                flags = dict(dataset='humaneval',
                             samples=out_dir,
                             base_only=None,
                             parallel=None,
                             i_just_wanna_run=None,
                             test_details=0.2,
                             min_time_limit=0.2,
                             gt_time_limit_factor=4.0,
                             mini=None)
                score = self.eval(flags)
                return {f'humaneval_plus_{k}': score[k] * 100 for k in score}

The HumanEvaluator class evaluates solutions to programming tasks against two kinds of evaluation datasets: HumanEval and EvalPlus. It inherits from BaseEvaluator, performs the necessary configuration at initialization, and provides a method to process and score the model's predictions.
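
As a rough usage sketch (not taken from the OpenCompass documentation; it only exercises the import logic in __init__ and assumes the corresponding packages are installed as described in the error messages above):

from opencompass.datasets.humaneval import HumanEvaluator

# Plain HumanEval: functional-correctness tests, reported as pass@1/10/100
evaluator = HumanEvaluator(k=[1, 10, 100], metric='HumanEval')

# EvalPlus: same interface, evaluated against the extended EvalPlus test suite
plus_evaluator = HumanEvaluator(k=[1], metric='EvalPlus')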

1.1 The initializer __init__

  • k: defaults to [1, 10, 100]; these are the k values used for the pass@k metric, i.e. accuracy under different sampling budgets (a formula sketch follows this list).
  • metric: which dataset to evaluate against; defaults to 'HumanEval', with 'EvalPlus' as the alternative.
    In the initializer, the relevant modules and functions are loaded dynamically according to metric: for 'HumanEval' the required components are imported from the human_eval package; for 'EvalPlus' they come from the evalplus package.
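
The k values correspond to the pass@k metric from the HumanEval paper: if n samples are generated for a task and c of them pass all unit tests, the unbiased estimate is pass@k = 1 - C(n-c, k) / C(n, k). Below is a self-contained sketch of that formula, written here purely for illustration; the human_eval package computes the same quantity internally when score is called.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated for one task
    c: number of those samples that pass all unit tests
    k: sampling budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples, 3 correct -> pass@1 = 0.30, pass@10 = 1.0
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 10))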

1.2 The score method

This method takes three arguments: predictions, references, and test_set, which hold the model's generated code, the reference task identifiers, and the test dataset respectively. A snapshot of the key variables during a call looks like this:

predictions contains the code snippets generated by the model, for example:
[' """ Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n """\n for i in range(len(numbers)):\n for j in range(i+1, len(numbers)):\n if abs(numbers[i] - numbers[j]) < threshold:\n return True\n return False', ' """ ...
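
To make the data flow concrete, here is a hypothetical mini-call built around that same task (the task id, prompt and completion are illustrative, not copied from a real run); it mirrors what score does internally in the 'HumanEval' branch before the jsonl file is handed to evaluate_functional_correctness:

# Hypothetical inputs with the structure score() expects
predictions = [
    '    for i in range(len(numbers)):\n'
    '        for j in range(i + 1, len(numbers)):\n'
    '            if abs(numbers[i] - numbers[j]) < threshold:\n'
    '                return True\n'
    '    return False\n'
]
references = ['HumanEval/0']   # task ids; score() writes each one as 'task_id'
test_set = [{'prompt': 'def has_close_elements(numbers, threshold):\n'}]

# What the HumanEval branch accumulates before writing the temporary jsonl file
humaneval_preds = [
    {'task_id': ref, 'completion': pred}
    for pred, ref in zip(predictions, references)
]
print(humaneval_preds[0]['task_id'])  # -> HumanEval/0

The dictionary returned by score then has the form {'humaneval_1': ..., 'humaneval_10': ..., 'humaneval_100': ...}, with each pass@k value scaled to a percentage, as seen in the return statement of the HumanEval branch.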
