This article is not recommended for readers who intend to work seriously in this field. It is only a brief overview of large language model (LLM) evaluation and a short showcase of current results; in the next few days the author expects to write follow-up posts on specific evaluation metrics. The article is mainly based on the arXiv paper A Survey on Evaluation of Large Language Models, which is recommended reading for anyone planning to go deeper. If you work on LLM-related topics, only want a quick look at current evaluation work, and find the English original hard going, then this article should be a reasonable fit.
Why do we need to evaluate large language models, and why is so much research devoted to this?
Evaluation of large language models can be summed up in three words: What, Where, and How.
The current standard approaches to basic AI evaluation can be grouped into the following categories:
However, as training scales keep growing, some of these traditional evaluation schemes may no longer evaluate deep learning models, and large models in particular, effectively. As a result, evaluation on static validation sets, such as GLUE, has become the standard approach for deep learning evaluation.
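To make this concrete, below is a minimal sketch (not taken from the survey) of what static-benchmark evaluation looks like in practice: run the model over a fixed, labeled development set and report one aggregate score such as accuracy. The `model_predict` function and the toy examples are hypothetical placeholders, not a real benchmark.

```python
# Minimal sketch of static-benchmark evaluation (GLUE-style classification).
# `model_predict` and the tiny dev set below are hypothetical placeholders.

def model_predict(text: str) -> str:
    """Stand-in for a real model; here it always predicts 'positive'."""
    return "positive"

dev_set = [
    {"text": "A delightful film.", "label": "positive"},
    {"text": "A tedious mess.", "label": "negative"},
]

correct = sum(model_predict(ex["text"]) == ex["label"] for ex in dev_set)
accuracy = correct / len(dev_set)
print(f"accuracy = {accuracy:.2f}")  # 0.50 on this toy set
```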
The purpose of this part is to show where evaluation of large models should focus. The survey cited at the beginning presents this work very systematically; here we only list the main points and attach a small number of existing results.
At present, the vast majority of this work concentrates on natural language processing; overall, it can be summarized along four dimensions: understanding, reasoning, generation, and performance.
Natural language understanding
Reasoning
How it differs from NLI:
NLI is about determining whether a given "hypothesis" logically follows from a "premise".
Reasoning, by contrast, can be described in terms of the following four modules:
Natural language generation
Multilingual tasks
Factuality
Basic test results: GPT-4 and BingChat are currently only about 15% away from being fully accurate.
Current methods for evaluating factual consistency lack a unified comparison framework, and the resulting scores and binary labels are of limited reference value.
Some interesting current findings on factuality evaluation include:
Robustness
Bias
Ethics
Trustworthiness
The author has only skimmed this part; interested readers are encouraged to read the original survey, or, depending on their current needs, to look at the corresponding benchmarks listed below.
Benchmark | Focus | Domain | Evaluation Criteria |
---|---|---|---|
SOCKET | Social knowledge | Specific downstream task | Social language understanding |
MME | Multimodal LLMs | Multi-modal task | Ability of perception and cognition |
Xiezhi | Comprehensive domain knowledge | General language task | Overall performance across multiple benchmarks |
Choice-75 | Script learning | Specific downstream task | Overall performance of LLMs |
CUAD | Legal contract review | Specific downstream task | Legal contract understanding |
TRUSTGPT | Ethics | Specific downstream task | Toxicity, bias, and value-alignment |
MMLU | Text models | General language task | Multitask accuracy |
MATH | Mathematical problem | Specific downstream task | Mathematical ability |
APPS | Coding challenge competence | Specific downstream task | Code generation ability |
CELLO | Complex instructions | Specific downstream task | Four designated evaluation criteria |
C-Eval | Chinese evaluation | General language task | 52 Exams in a Chinese context |
EmotionBench | Empathy ability | Specific downstream task | Emotional changes |
OpenLLM | Chatbots | General language task | Leaderboard rankings |
DynaBench | Dynamic evaluation | General language task | NLI, QA, sentiment, and hate speech |
Chatbot Arena | Chat assistants | General language task | Crowdsourcing and Elo rating system (see the sketch after this table) |
AlpacaEval | Automated evaluation | General language task | Metrics, robustness, and diversity |
CMMLU | Chinese multi-tasking | Specific downstream task | Multi-task language understanding capabilities |
HELM | Holistic evaluation | General language task | Multi-metric |
API-Bank | Tool utilization | Specific downstream task | API call retrieval and planning |
M3KE | Multi-task | Specific downstream task | Multi-task accuracy |
MMBench | Large vision-language models | Multi-modal task | Multifaceted capabilities of VLMs |
SEED-Bench | Multimodal Large Language Models | Multi-modal task | Generative understanding of MLLMs |
UHGEval | Hallucination of Chinese LLMs | Specific downstream task | Form, metric, and granularity |
ARB | Advanced reasoning ability | Specific downstream task | Multidomain advanced reasoning ability |
BIG-bench | Capabilities and limitations of LMs | General language task | Model performance and calibration |
MultiMedQA | Medical QA | Specific downstream task | Accuracy and human evaluation |
CValues | Safety and responsibility | Specific downstream task | Alignment ability of LLMs |
LVLM-eHub | LVLMs | Multi-modal task | Multimodal capabilities of LVLMs |
ToolBench | Software tools | Specific downstream task | Execution success rate |
FRESHQA | Dynamic QA | Specific downstream task | Correctness and hallucination |
CMB | Chinese comprehensive medicine | Specific downstream task | Expert evaluation and automatic evaluation |
PandaLM | Instruction tuning | General language task | Winrate judged by PandaLM |
Dialogue CoT | In-depth dialogue | Specific downstream task | Helpfulness and acceptability of LLMs |
BOSS | OOD robustness in NLP | General language task | OOD robustness |
MM-Vet | Complicated multi-modal tasks | Multi-modal task | Integrated vision-language capabilities |
LAMM | Multi-modal point clouds | Multi-modal task | Task-specific metrics |
GLUE-X | OOD robustness for NLP tasks | General language task | OOD robustness |
KoLA | Knowledge-oriented evaluation | General language task | Self-contrast metrics |
AGIEval | Human-centered foundational models | General language task | General |
PromptBench | Adversarial prompt resilience | General language task | Adversarial robustness |
MT-Bench | Multi-turn conversation | General language task | Winrate judged by GPT-4 |
M3Exam | Multilingual, multimodal, and multilevel | Specific downstream task | Task-specific metrics |
GAOKAO-Bench | Chinese Gaokao examination | Specific downstream task | Accuracy and scoring rate |
SafetyBench | Safety | Specific downstream task | Safety abilities of LLMs |
LLMEval2 | LLM evaluator | General language task | Accuracy, macro-F1, and kappa correlation coefficient |
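The Chatbot Arena entry above ranks chat assistants by aggregating crowdsourced pairwise votes with an Elo rating system. Below is a minimal sketch of an Elo update; the K-factor of 32 and the initial ratings of 1000 are illustrative assumptions, not Chatbot Arena's actual settings.

```python
# Minimal Elo update for pairwise model comparisons (Chatbot Arena-style).
# K = 32 and the 1000-point starting ratings are illustrative assumptions only.

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# One crowdsourced vote in which model_a was preferred:
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model_a gains 16 points, model_b loses 16
```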
Evaluation methods can be roughly divided into automatic evaluation and human evaluation; their differences and the work each covers are shown in the table below.
Evaluation aspect | Automatic evaluation | Human evaluation |
---|---|---|
Accuracy | Exact match, Quasi-exact match, F1 score, ROUGE score (see the sketch after this table) | Mainly checks factual consistency and accuracy |
Calibrations | Expected calibration error, Area under the curve | None |
Fairness | Demographic parity difference, Equalized odds difference | None |
Robustness | Attack success rate, Performance drop rate | None |
Relevance | None | As the name suggests |
Fluency | None | As the name suggests |
Transparency | None | How transparent the decision process is, i.e., why the model produced this particular response |
Safety | None | Checks the potential harmfulness of the generated text |
Human alignment | None | Checks the degree of alignment with human values, preferences, and expectations |
Number of evaluators | None | Adequate representation, Statistical significance |
Evaluator’s expertise level | None | Relevant domain expertise, Task familiarity, Methodological training |
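As a minimal sketch (not from the survey) of the automatic accuracy metrics listed above, the snippet below shows how exact match and a SQuAD-style token-level F1 are commonly computed; the text normalization is deliberately simplified.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the reference exactly."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-overlap F1; normalization is deliberately simplified."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))           # 1.0
print(token_f1("the city of Paris", "Paris"))  # 0.4
```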