
【论文阅读】《Baichuan 2: Open Large-scale Language Models》

Baichuan 2: Open Large-scale Language Models

百川 2:开放大规模语言模型

Baichuan Inc.
百川股份有限公司

Abstract
摘要

       Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.
    大型语言模型(LLM)仅凭少量自然语言指令示例就能在各类自然语言任务上展现出色的性能,减少了对大量特征工程的需求。但大多数强大的 LLM 要么闭源,要么在英语之外的语言上能力有限。本技术报告介绍了 Baichuan 2,一个大规模多语言模型系列,包含 70 亿和 130 亿参数,基于 2.6 万亿 token 从零开始训练。Baichuan 2 在 MMLU、CMMLU、GSM8K 和 HumanEval 等公开基准测试上达到或超过了其他同等规模开源模型的性能,并在医学和法律等垂直领域表现优异。我们将发布所有预训练模型 checkpoints,帮助研究社区更好地理解 Baichuan 2 的训练过程。

1 Introduction
1 简介

       The field of large language models has witnessed promising and remarkable progress in recent years.The size of language models has grown from millions of parameters, such as ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018), to billions or even trillions of parameters such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al.,2022; Anil et al., 2023) and Switch Transformers (Fedus et al., 2022). This increase in scale has led to significant improvements in the capabilities of language models, enabling more human-like fluency and the ability to perform a diverse range of natural language tasks. With the introduction of ChatGPT (OpenAI, 2022) from OpenAI, the power of these models to generate human-like text has captured widespread public attention. ChatGPT demonstrates strong language proficiency across a variety of domains, from conversing casually to explaining complex concepts. This breakthrough highlights the potential for large language models to automate tasks involving natural language generation and comprehension.
    近年来,大型语言模型领域取得了令人瞩目的进步。从数百万参数的ELMo(Peters等人,2018)、GPT-1(Radford等人,2018)到数十亿甚至数万亿参数的GPT-3(Brown等人,2020)、PaLM(Chowdhery等人,2022;Anil等人,2023)和Switch Transformer(Fedus等人,2022),这些模型的规模不断扩大,使得它们的能力得到了显著提高,能够生成更加接近人类水平的流畅文本,并执行各种自然语言任务。随着OpenAI推出ChatGPT(2022年),这些模型生成类人类文本的能力引起了广泛关注。ChatGPT在各种领域的语言能力表现出色,无论是进行日常对话还是解释复杂概念。这一突破展示了大型语言模型在自动化涉及自然语言生成和理解的任务方面的潜力。

       While there have been exciting breakthroughs and applications of LLMs, most leading LLMs like GPT-4 (OpenAI, 2023), PaLM-2 (Anil et al., 2023),and Claude (Claude, 2023) remain closed-sourced. Developers and researchers have limited access to the full model parameters, making it difficult for the community to deeply study or fine-tune these systems. More openness and transparency around LLMs could accelerate research and responsible development within this rapidly advancing field. LLaMA (Touvron et al., 2023a), a series of large language models developed by Meta containing up to 65 billion parameters, has significantly benefited the LLM research community by being fully open-sourced. The open nature of LLaMA, along with other open-source LLMs such as OPT (Zhang et al., 2022), Bloom (Scao et al., 2022), MPT(MosaicML, 2023) and Falcon (Penedo et al.,2023), enables researchers to freely access the models for examination, experimentation, and further development. This transparency and access distinguishes LLaMA from other proprietary LLMs. By providing full access, the open-source LLMs have accelerated research and advances in the field, leading to new models like Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), and others (Wang et al., 2022; Zhu et al., 2023; Anand et al., 2023).
    尽管大型语言模型(LLM)已经取得了令人兴奋的突破和应用,但大多数领先的大型语言模型,如GPT-4(OpenAI,2023年)、PaLM-2(Anil等人,2023年)和Claude(Claude,2023年)等,仍然采用封闭源代码。开发者和研究人员对这些完整模型参数的访问有限,这使得社区难以深入研究或微调这些系统。在快速发展的领域中,对大型语言模型的更多开放性和透明度可以促进研究和负责任的开发。LLaMA(Touvron等人,2023年a)是由Meta开发的一系列大型语言模型,包含多达650亿个参数,已经完全开源,为大型语言模型研究社区带来了显著的好处。LLaMA的开放性质以及其他开源大型语言模型,如OPT(Zhang等人,2022年)、Bloom(Scao等人,2022年)、MPT(MosaicML,2023年)和Falcon(Penedo等人,2023年),使研究人员能够自由地访问这些模型进行审查、实验和进一步开发。这种透明度和访问权限使LLaMA与其他专有大型语言模型区分开来。通过提供全面访问,开源大型语言模型加速了该领域的研究和进步,导致了Alpaca(Taori等人,2023年)、Vicuna(Chiang等人,2023年)等新型模型的出现(Wang等人,2022年;Zhu等人,2023年;Anand等人,2023年)。

       However, most open-source large language models have focused primarily on English. For instance, the main data source for LLaMA is Common Crawl^1, which comprises 67% of LLaMA’s pre-training data but is filtered to English content only. Other open-source LLMs such as MPT (MosaicML, 2023) and Falcon (Penedo et al., 2023) are also focused on English and have limited capabilities in other languages. This hinders the development and application of LLMs in specific languages, such as Chinese.
    然而,大多数开源大型语言模型主要关注英语。例如,LLaMA 的主要数据来源是 Common Crawl^1,它占 LLaMA 预训练数据的 67%,但只过滤保留了英语内容。其他开源大型语言模型,如 MPT(MosaicML,2023 年)和 Falcon(Penedo 等人,2023 年)也专注于英语,对其他语言的能力有限。这阻碍了特定语言(如中文)大型语言模型的开发与应用。


^1 https://commoncrawl.org/


       In this technical report, we introduce Baichuan 2, a series of large-scale multilingual language models. Baichuan 2 has two separate models, Baichuan 2-7B with 7 billion parameters and Baichuan 2-13B with 13 billion parameters. Both models were trained on 2.6 trillion tokens, which to our knowledge is the largest to date, more than double that of Baichuan 1 (Baichuan, 2023b,a). With such a massive amount of training data, Baichuan 2 achieves significant improvements over Baichuan 1. On general benchmarks like MMLU(Hendrycks et al., 2021a), CMMLU (Li et al.,2023), and C-Eval (Huang et al., 2023), Baichuan 2-7B achieves nearly 30% higher performance compared to Baichuan 1-7B. Specifically, Baichuan 2 is optimized to improve performance on math and code problems. On the GSM8K (Cobbe et al., 2021) and HumanEval (Chen et al., 2021) evaluations, Baichuan 2 nearly doubles the results of the Baichuan 1. In addition, Baichuan 2 also demonstrates strong performance on medical and legal domain tasks. On benchmarks such as MedQA (Jin et al., 2021) and JEC-QA (Zhong et al., 2020), Baichuan 2 outperforms other open-source models, making it a suitable foundation model for domain-specific optimization.
    在这篇技术报告中,我们介绍了百川 2,一系列大规模的多语言语言模型。百川 2 有两个独立的模型,分别是具有 70 亿参数的百川 2-7B 和具有 130 亿参数的百川 2-13B。这两个模型都在 2.6 万亿个 token 上训练,据我们所知,这是迄今为止最大的训练数据量,超过百川 1(Baichuan, 2023b,a)的两倍。凭借如此庞大的训练数据,百川 2 相比百川 1 取得了显著的改进。在 MMLU(Hendrycks 等人,2021a)、CMMLU(Li 等人,2023)和 C-Eval(Huang 等人,2023)等通用基准测试上,百川 2-7B 的性能比百川 1-7B 高出近 30%。具体来说,百川 2 针对数学和代码问题进行了优化。在 GSM8K(Cobbe 等人,2021)和 HumanEval(Chen 等人,2021)的评估中,百川 2 的结果几乎是百川 1 的两倍。此外,百川 2 在医疗和法律领域的任务上也表现出强大的性能。在 MedQA(Jin 等人,2021)和 JEC-QA(Zhong 等人,2020)等基准测试中,百川 2 优于其他开源模型,是进行领域特定优化的合适基础模型。

       Additionally, we also released two chat models, Baichuan 2-7B-Chat and Baichuan 2-13B-Chat, optimized to follow human instructions. These models excel at dialogue and context understanding. We will elaborate on our approaches to improve the safety of Baichuan 2. By open-sourcing these models, we hope to enable the community to further improve the safety of large language models, facilitating more research on responsible LLMs development.
    此外,我们还发布了两款聊天模型,即 Baichuan 2-7B-Chat 和 Baichuan 2-13B-Chat,它们针对遵循人类指令进行优化。这些模型在对话和上下文理解方面表现优异。我们将详细介绍我们提高 Baichuan 2 安全性的方法。通过开源这些模型,我们希望让社区能够进一步提高大型语言模型的安全性,促进更多关于负责任 LLM 开发的研究。

       Furthermore, in the spirit of research collaboration and continuous improvement, we are also releasing the checkpoints of Baichuan 2 at various stages of training, from 200 billion tokens up to the full 2.6 trillion tokens. We found that even for the 7 billion parameter model, performance continued to improve after training on more than 2.6 trillion tokens. By sharing these intermediary results, we hope to provide the community with greater insight into the training dynamics of Baichuan 2. Understanding these dynamics is key to unraveling the inner working mechanism of large language models (Biderman et al., 2023a; Tirumala et al., 2022). We believe the release of these checkpoints will pave the way for further advances in this rapidly developing field.
    此外,本着研究合作与持续改进的精神,我们还将发布百川 2 在训练过程中各个阶段的 checkpoints,从 2000 亿 token 到完整的 2.6 万亿 token。我们发现,即使是 70 亿参数的模型,在训练超过 2.6 万亿 token 后性能仍在提升。通过分享这些中间结果,我们希望帮助社区更深入地了解百川 2 的训练动态。理解这些动态是揭示大型语言模型内部工作机制的关键(Biderman 等人,2023a;Tirumala 等人,2022)。我们相信,发布这些 checkpoints 将为这个快速发展的领域的进一步进步铺平道路。

       In this technical report, we will also share some of the trials, errors, and lessons learned through training Baichuan 2. In the following sections, we will present detailed modifications made to the vanilla Transformer architecture and our training methodology. We will then describe our fine-tuning methods to align the foundation model with human preferences. Finally, we will benchmark the performance of our models against other LLMs on a set of standard tests. Throughout the report, we aim to provide transparency into our process, including unsuccessful experiments, to advance collective knowledge in developing LLMs. Baichuan 2’s foundation models and chat models are available for both research and commercial use at https://github.com/baichuan-inc/Baichuan2
    在本技术报告中,我们还将分享在训练百川大模型过程中所经历的一些尝试、错误和学习到的经验教训。在接下来的章节中,我们将展示对原始Transformer架构所做的详细修改以及我们的训练方法。然后,我们将描述我们对模型进行微调的方法,以使其更符合人类偏好。最后,我们将在一组标准测试上,将我们的模型性能与其他大型语言模型进行基准测试。在整个报告过程中,我们致力于提供关于我们过程的透明度,包括不成功的实验,以推动在开发大型语言模型方面的集体知识。百川大模型的基础模型和聊天模型可以在以下网址获取,供研究和商业使用:https://github.com/baichuan-inc/Baichuan2

2 Pre-training
2 预训练

       This section introduces the training procedure for the Baichuan 2 foundation models. Before diving into the model details, we first show the overall performance of the Baichuan 2 base models compared to other open or closed-sourced models in Table 1. We then describe our pre-training data and data processing methods. Next, we elaborate on the Baichuan 2 architecture and scaling results.Finally, we describe the distributed training system.
    本节将介绍百川 2 基础模型的训练流程。在详细介绍模型之前,我们首先在表 1 中展示了百川 2 基础模型与其他开源或闭源模型的整体性能对比。接下来,我们将介绍预训练数据和数据处理方法。然后,我们将详细介绍百川 2 的架构和扩展结果。最后,我们将描述分布式训练系统。

Table 1: Overall results of Baichuan 2 compared with other similarly sized LLMs on general benchmarks. * denotes results derived from official websites.
表 1:百川 2 号与其他类似大小的 LLM 在通用基准测试中的总体结果对比。*表示来源于官方网站的数据。

2.1 Pre-training Data
2.1 预训练数据

       Data sourcing: During data acquisition, our objective is to pursue comprehensive data scalability and representativeness. We gather data from diverse sources including general internet webpages, books, research papers, codebases,and more to build an extensive world knowledge system. The composition of the training corpus is shown in Figure 1.
    数据来源:在数据收集过程中,我们的目标是追求数据的全面可扩展性和代表性。我们从包括普通互联网网页、书籍、研究论文、代码库等多种来源收集数据,以建立一个全面的世界知识体系。训练语料的组成如图 1 所示。


Figure 1: The distribution of different categories of Baichuan 2 training data.
图 1:百川 2 训练数据的不同类别分布

       Data processing: For data processing, we focus on data frequency and quality. Data frequency relies on clustering and deduplication. We built a large-scale deduplication and clustering system supporting both LSH-like features and dense embedding features. This system can cluster and deduplicate trillion-scale data within hours. Based on the clustering, individual documents, paragraphs, and sentences are deduplicated and scored. Those scores are then used for data sampling in pre-training. The size of the training data at different stages of data processing is shown in Figure 2.
    数据处理:在数据处理方面,我们重点关注数据频率和质量。数据频率依赖于聚类和去重。我们构建了一个同时支持 LSH 类特征和稠密嵌入特征的大规模去重和聚类系统。这个系统可以在几小时内对万亿规模的数据进行聚类和去重。基于聚类结果,单个文档、段落和句子会被去重并打分,这些分数随后用于预训练的数据采样。数据处理各阶段的训练数据规模如图 2 所示。


Figure 2: The data processing procedure of Baichuan 2’s pre-training data.
图 2:百川 2 预训练数据的处理流程

2.2 Architecture
2.2 体系结构

       The model architecture of Baichuan 2 is based on the prevailing Transformer (Vaswani et al.,2017).Nevertheless, we made several modifications which we detailed below.
    百川 2 的模型架构基于流行的 Transformer(Vaswani 等,2017)。不过,我们在下文中详细说明了几个修改。

2.3 Tokenizer
2.3 分词器

       A tokenizer needs to balance two critical factors: a high compression rate for efficient inference, and an appropriately sized vocabulary to ensure adequate training of each word embedding. We have taken both these aspects into account. We have expanded the vocabulary size from 64,000 in Baichuan 1 to 125,696, aiming to strike a balance between computational efficiency and model performance.
    分词器需要平衡两个关键因素:为实现高效推理所需的高压缩率,以及确保每个词嵌入得到充分训练的适当大小的词汇表。我们同时考虑了这两个方面,将词汇量从 Baichuan 1 的 64,000 扩大到 125,696,力求在计算效率和模型性能之间取得平衡。

       We use byte-pair encoding (BPE) (Shibata et al.,1999) from SentencePiece (Kudo and Richardson,2018) to tokenize the data. Specifically, we do not apply any normalization to the input text and we do not add a dummy prefix as in Baichuan 1. We split numbers into individual digits to better encode numeric data. To handle code data containing extra whitespaces, we add whitespace-only tokens to the tokenizer. The character coverage is set to 0.9999, with rare characters falling back to UTF-8 bytes.We set the maximum token length to 32 to account for long Chinese phrases. The training data for the Baichuan 2 tokenizer comes from the Baichuan 2 pre-training corpus, with more sampled code examples and academic papers to improve coverage(Taylor et al., 2022). Table 2 shows a detailed comparison of Baichuan 2’s tokenizer with others.
    我们使用来自 SentencePiece(Kudo 和 Richardson,2018)的 Byte-pair encoding(BPE)(Shibata 等,1999)来对数据进行分词。具体来说,我们不对输入文本进行任何归一化处理,也不像在 Baichuan 1 中添加一个虚拟前缀。我们将数字拆分成单个数字,以便更好地编码数值数据。为了处理包含额外空白的代码数据,我们在分词器中添加了仅包含空白的令牌。字符覆盖率设为 0.9999,罕见字符降级为 UTF-8 字节。我们将最大令牌长度设为 32,以适应较长的中文短语。Baichuan 2 分词器的训练数据来自 Baichuan 2 预训练语料库,其中包含更多的抽样代码示例和学术论文,以提高覆盖率(Taylor 等,2022)。表 2 展示了 Baichuan 2 分词器与其他分词器的详细比较。
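As a rough illustration, the tokenizer settings described above map onto SentencePiece's BPE trainer options roughly as follows. This is a hedged sketch, not the released training script: the corpus path, the whitespace pieces, and any options not mentioned in the text are placeholders or assumptions.

```python
import sentencepiece as spm

# Sketch of the described settings; "corpus.txt" is a placeholder for the real corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="baichuan2_bpe",
    model_type="bpe",
    vocab_size=125_696,                  # expanded from 64,000 in Baichuan 1
    character_coverage=0.9999,           # rare characters fall back to UTF-8 bytes...
    byte_fallback=True,                  # ...via byte fallback
    split_digits=True,                   # numbers are split into individual digits
    add_dummy_prefix=False,              # no dummy prefix, unlike Baichuan 1
    normalization_rule_name="identity",  # no normalization of the input text
    max_sentencepiece_length=32,         # allow long Chinese phrases as single tokens
    user_defined_symbols=["▁" * n for n in range(2, 9)],  # whitespace-only pieces for code
)
```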


Table 2: The vocab size and text compression rate of Baichuan 2’s tokenizer compared with other models. The lower the better.
表 2:百川 2 分词器与其他模型的词汇量和文本压缩率对比。越低越好。

2.3.1 Positional Embeddings
2.3.1 位置嵌入

       Building on Baichuan 1, we adopt Rotary Positional Embedding (RoPE) (Su et al., 2021) for Baichuan 2-7B and ALiBi (Press et al., 2021) for Baichuan 2-13B. ALiBi is a more recent positional encoding technique that has shown improved extrapolation performance. However, most open-sourced models use RoPE for positional embeddings, and optimized attention implementations like Flash Attention (Dao et al., 2022; Dao, 2023) are currently better suited to RoPE since it is multiplication-based, bypassing the need for passing attention_mask to the attention operation. Nevertheless, in preliminary experiments, the choice of positional embedding did not significantly impact model performance. To enable further research on bias-based and multiplication-based attention, we apply RoPE on Baichuan 2-7B and ALiBi on Baichuan 2-13B, consistent with Baichuan 1.
    在百川 1 的基础上,我们分别为百川 2-7B 和百川 2-13B 采用旋转位置嵌入(RoPE)(Su 等人,2021 年)和 ALiBi(Press 等人,2021 年)。ALiBi 是一种较新的位置编码技术,展示出更好的外推性能。然而,大多数开源模型使用 RoPE 作为位置嵌入,而且 Flash Attention(Dao 等人,2022 年;Dao,2023 年)等优化的注意力实现目前更适合 RoPE,因为它基于乘法,无需将 attention_mask 传递给注意力操作。不过,在初步实验中,位置嵌入的选择并未显著影响模型性能。为了进一步研究基于偏置(bias-based)和基于乘法(multiplication-based)的注意力,我们在百川 2-7B 上应用 RoPE,在百川 2-13B 上应用 ALiBi,与百川 1 保持一致。
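For reference, here is a minimal sketch of the two positional schemes (illustrative only, not the released implementation): ALiBi adds a head-specific linear bias to attention scores, while RoPE rotates query/key feature pairs by position-dependent angles so that relative position enters the dot product multiplicatively.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # ALiBi: penalize the score of query i attending to key j by slope * (i - j),
    # with a fixed geometric slope per head and no learned position vectors.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    dist = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]   # j - i
    return slopes[:, None, None] * dist.clamp(max=0).float()                 # [H, S, S]

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # RoPE: rotate each feature pair of q/k by an angle proportional to position,
    # so the q·k dot product depends only on the relative offset between positions.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]      # [S, D/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)  # [heads, seq, head_dim], toy shapes
print(apply_rope(q).shape, alibi_bias(n_heads=2, seq_len=16).shape)
```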

2.4 Activations and Normalizations
2.4 激活和归一化

       We use the SwiGLU (Shazeer, 2020) activation function, a switch-activated variant of GLU (Dauphin et al., 2017) which shows improved results. However, SwiGLU has a “bilinear” layer and contains three parameter matrices, differing from the vanilla Transformer’s feed-forward layer that has two matrices, so we reduce the hidden size from 4 times the model dimension to $\frac{8}{3}$ times, rounded to a multiple of 128.
    Baichuan 2 使用了 SwiGLU(Shazeer, 2020)作为激活函数。SwiGLU 是 GLU(Dauphin et al., 2017)的一个开关激活变种,能带来更好的效果。不同于传统 Transformer 前馈层只有两个矩阵,SwiGLU 包含三个参数矩阵。因此,我们将前馈隐藏层的大小从隐藏维度的 4 倍减少到 $\frac{8}{3}$ 倍,并取整到 128 的倍数。
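A minimal sketch of such a SwiGLU feed-forward block, with the hidden size chosen by the 8/3 rule above (an illustration under the stated rule, not the released module; rounding up gives, for example, 11,008 for a 4,096-wide model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn_hidden_size(d_model: int, multiple_of: int = 128) -> int:
    # 8/3 of the model width, rounded up to a multiple of 128 (e.g. 4096 -> 11008).
    h = int(8 * d_model / 3)
    return multiple_of * ((h + multiple_of - 1) // multiple_of)

class SwiGLU(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        h = ffn_hidden_size(d_model)
        self.gate = nn.Linear(d_model, h, bias=False)   # gating branch
        self.up = nn.Linear(d_model, h, bias=False)     # linear branch
        self.down = nn.Linear(h, d_model, bias=False)   # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

print(ffn_hidden_size(4096))                      # 11008
print(SwiGLU(256)(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])
```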

       For the attention layer of Baichuan 2, we adopt the memory efficient attention (Rabe and Staats, 2021) implemented by xFormers^2. By leveraging xFormers’ optimized attention with biasing capabilities, we can efficiently incorporate ALiBi’s bias-based positional encoding while reducing memory overhead. This provides performance and efficiency benefits for Baichuan 2’s large-scale training.
    Baichuan 2 的注意力层采用了由 xFormers^2 实现的内存高效注意力(Rabe 和 Staats, 2021)。通过利用 xFormers 支持偏置的优化注意力实现,模型能够高效地整合 ALiBi 的基于偏置的位置编码,同时减少内存开销。这为 Baichuan 2 的大规模训练带来了性能和效率上的好处。
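A hedged sketch of such a call (it requires a CUDA device with a supported xFormers kernel, and the exact integration in Baichuan 2's training code is not public). A causal mask is shown; to our understanding a dense ALiBi bias tensor of shape [batch, heads, seq, seq] can be supplied as attn_bias in the same way.

```python
import torch
import xformers.ops as xops

# q, k, v laid out as [batch, seq_len, n_heads, head_dim]; requires a CUDA device.
q = torch.randn(1, 1024, 40, 128, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Memory-efficient attention with a causal mask; an ALiBi bias tensor could be
# passed as attn_bias instead of (or folded into) the mask.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
print(out.shape)  # torch.Size([1, 1024, 40, 128])
```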


^2 https://github.com/facebookresearch/xformers


       We apply Layer Normalization (Ba et al., 2016) to the input of the Transformer block, which is more robust to the warm-up schedule (Xiong et al., 2020). In addition, we use the RMSNorm implementation introduced by Zhang and Sennrich (2019), which only calculates the variance of the input features to improve efficiency.
    我们将层归一化(Ba 等人,2016 年)应用于 Transformer 模块的输入,这种做法对学习率 warm-up 调度(Xiong 等人,2020 年)更具鲁棒性。此外,我们使用了 Zhang 和 Sennrich(2019 年)提出的 RMSNorm 实现,它仅计算输入特征的方差以提高效率。
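A minimal RMSNorm sketch matching the description above (variance-only normalization with no mean subtraction; the epsilon value and the float32 accumulation are assumptions):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square of the features; unlike
        # LayerNorm, the mean is not subtracted, which saves a reduction.
        rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x.float() * rms).type_as(x) * self.weight

print(RMSNorm(64)(torch.randn(2, 8, 64)).shape)
```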

2.5 Optimizations
2.5 优化

       We use the AdamW (Loshchilov and Hutter, 2017) optimizer for training. $\beta_1$ and $\beta_2$ are set to 0.9 and 0.95, respectively. We use a weight decay of 0.1 and clip the grad norm to 0.5. The models are warmed up with 2,000 linear scaling steps up to the max learning rate, and then cosine decay is applied down to the minimum learning rate. The parameter details and learning rates are shown in Table 3.
    我们在训练过程中使用了 AdamW(Loshchilov 和 Hutter,2017)优化器。$\beta_1$ 和 $\beta_2$ 分别设置为 0.9 和 0.95。我们使用 0.1 的权重衰减,并将梯度范数裁剪至 0.5。模型先通过 2000 个线性缩放步骤预热到最大学习率,然后余弦衰减至最小学习率。参数详情和学习率见表 3。
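A sketch of this optimizer and schedule on a toy model (the max/min learning rates and the total step count below are placeholders; Table 3 gives the actual per-model learning rates):

```python
import math
import torch

model = torch.nn.Linear(16, 16)                               # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,    # placeholder max LR
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int, warmup: int = 2000, total: int = 100_000, min_ratio: float = 0.1):
    # Linear warm-up for 2,000 steps, then cosine decay toward the minimum LR.
    # total and min_ratio are illustrative assumptions, not values from the report.
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for _ in range(5):                                            # minimal training loop
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)   # clip grad norm to 0.5
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```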


Table 3: Model details of Baichuan 2.
表 3:百川 2 模型详情

       The whole models are trained using BFloat16 mixed precision. Compared to Float16, BFloat16 has a better dynamic range, making it more robust to the large values that are critical in training large language models. However, BFloat16’s low precision causes issues in some settings. For instance, in some public RoPE and ALiBi implementations, the torch.arange operation fails due to collisions when the integer exceeds 256, preventing differentiation of nearby positions. Therefore, we use full precision for some value-sensitive operations such as positional embeddings.
    整个模型都使用 BFloat16 混合精度进行训练。与 Float16 相比,BFloat16 具有更好的动态范围,使其在训练大型语言模型时对关键的大数值更加稳健。然而,BFloat16 的低精度在某些情况下会导致问题。例如,在某些公开的 RoPE 和 ALiBi 实现中,当整数超过 256 时会发生数值冲突,导致 torch.arange 操作失效,无法区分相邻的位置。因此,对于位置嵌入等对数值敏感的操作,我们使用全精度(Float32)。
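The precision issue can be reproduced in a couple of lines (positions above 256 collide in bfloat16, which is why position-sensitive operations are kept in full precision):

```python
import torch

# bfloat16 has an 8-bit significand, so not every integer above 256 is representable:
print(torch.tensor(257.0, dtype=torch.bfloat16).item())  # 256.0 -- collides with 256

# Ten consecutive positions collapse to fewer distinct bfloat16 values.
pos = torch.arange(250, 260)
print(pos.to(torch.bfloat16).float().unique().numel())   # fewer than 10

# Keeping positional indices in float32 avoids the collisions.
positions = torch.arange(0, 4096, dtype=torch.float32)
```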

       NormHead: To stabilize training and improve the model performance, we normalize the output embeddings (also referred to as the ‘head’). There are two advantages of NormHead in our experiments. First, in our preliminary experiments we found that the norm of the head is prone to be unstable: the norm of rare tokens’ embeddings becomes smaller during training, which disturbs the training dynamics. NormHead can stabilize the dynamics significantly. Second, we found that the semantic information is mainly encoded by the cosine similarity of the embeddings rather than the L2 distance. Since the current linear classifier computes logits by dot product, which is a mixture of L2 distance and cosine similarity, NormHead alleviates the distraction of the L2 distance in computing logits. For more details, please refer to Appendix C.
    NormHead:为了稳定训练并提高模型性能,我们对输出嵌入(也称为“head”)进行归一化处理。在我们的实验中,NormHead 有两个优点。首先,在初步实验中,我们发现 head 的范数容易不稳定:稀有 token 的嵌入范数在训练过程中会变小,从而干扰训练动态,而 NormHead 能够显著稳定训练动态。其次,我们发现语义信息主要由嵌入的余弦相似度编码,而不是 L2 距离。由于当前的线性分类器通过点积计算 logits,点积混合了 L2 距离和余弦相似度,NormHead 减轻了计算 logits 时 L2 距离的干扰。更多详情请参阅附录 C。
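A sketch of a normalized head consistent with this description (initialization and any training-time details are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # L2-normalize each output embedding row, so the logits reflect cosine
        # similarity instead of a mixture of cosine similarity and embedding norm.
        return F.linear(hidden_states, F.normalize(self.weight, dim=-1))

head = NormHead(hidden_size=64, vocab_size=1000)
print(head(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 1000])
```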

       Max-z loss: During training, we found that the logits of LLMs could become very large. While the softmax function is agnostic to the absolute logit values, as it depends only on their relative values, large logits cause issues during inference because common implementations of repetition penalty (such as the Hugging Face implementation^3 in model.generate) apply a scalar (e.g. 1.1 or 1.2) directly to the logits. Contracting very large logits in this way can significantly alter the probabilities after softmax, making the model sensitive to the choice of the repetition penalty hyper-parameter. Inspired by NormSoftmax (Jiang et al., 2023b) and the auxiliary z-loss from PaLM (Chowdhery et al., 2022), we added a max-z loss to normalize the logits:
$$\mathcal{L}_{\text{max-z}} = 2e^{-4} \cdot z^{2} \tag{1}$$
where $z$ is the maximum logit value. This helped stabilize training and made the inference more robust to hyper-parameters.
    Max-z loss:在训练过程中,我们发现 LLM 的 logits 可能会变得非常大。虽然 softmax 函数只依赖 logits 的相对值而对绝对值不敏感,但过大的 logits 会在推理时带来问题,因为常见的重复惩罚实现(如 Hugging Face 在 model.generate 中的实现^3)会直接将一个标量(例如 1.1 或 1.2)作用于 logits。以这种方式缩放非常大的 logits 可能会显著改变 softmax 之后的概率,使模型对重复惩罚超参数的选择非常敏感。受 NormSoftmax(Jiang et al., 2023b)和 PaLM(Chowdhery et al., 2022)的辅助 z-loss 的启发,我们添加了一个 max-z loss 来规范化 logits:
$$\mathcal{L}_{\text{max-z}} = 2e^{-4} \cdot z^{2} \tag{1}$$
其中 $z$ 是最大的 logit 值。这有助于稳定训练,并使推理对超参数更加鲁棒。
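Eq. (1) translates directly into code; how the per-position penalties are aggregated into a batch loss is our assumption:

```python
import torch

def max_z_loss(logits: torch.Tensor, coeff: float = 2e-4) -> torch.Tensor:
    # Eq. (1): penalize the square of the maximum logit so logits stay in a range
    # where a multiplicative repetition penalty barely changes the softmax output.
    z = logits.max(dim=-1).values          # maximum logit per position
    return coeff * (z ** 2).mean()         # mean over batch/positions is an assumption

logits = torch.randn(2, 16, 125_696) * 30  # toy logits with a large scale
# total_loss = cross_entropy_loss + max_z_loss(logits)
print(max_z_loss(logits))
```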


^3 https://huggingface.co/transformers/v4.1.1/_modules/transformers/generation_logits_process.html



Figure 3: The pre-training loss of Baichuan 2.
图 3:百川 2 的预训练损失

       The final training loss of Baichuan 2-7B and Baichuan 2-13B are shown in Figure 3.
    百川 2-7B 和百川 2-13B 的最终训练损失如图 3 所示。

2.6 Scaling Laws
2.6 缩放定律

       Neural scaling laws, where the error decreases as a power function of training set size, model size, or both, have enabled reliable performance prediction as training in deep learning and large language models has become more and more expensive. Before training the large language models of billions of parameters, we first train some small-sized models and fit a scaling law for training larger models.
    神经缩放定律,即误差随训练集大小、模型大小或两者的幂函数而减小,使得在深度学习和大型语言模型训练日益昂贵的情况下,仍能对性能做出可靠的预测。在训练数十亿参数的大型语言模型之前,我们首先训练了一些小模型,并拟合出用于训练更大模型的缩放定律。

       We launched a range of model sizes going from 10M to 3B, ranging from $\frac{1}{1000}$ to $\frac{1}{10}$ the size of the final model, and each model was trained for up to 1 trillion tokens, using consistent hyper-parameters and the same data set sourced from Baichuan 2. Based on the final loss of different models, we obtain a mapping from the training flops to the target loss.
    我们训练了一系列从 10M 到 3B 的模型,规模相当于最终模型的 $\frac{1}{1000}$ 到 $\frac{1}{10}$,每个模型都训练了多达 1 万亿个 token,使用一致的超参数和来自百川 2 的同一数据集。根据不同模型的最终损失,我们可以得到从训练 flops 到目标损失的映射。


Figure 4: The scaling law of Baichuan 2. We trained various models ranging from 10 million to 3 billion parameters with 1 trillion tokens. By fitting a power law term to the losses given training flops, we predicted losses for training Baichuan 2-7B and Baichuan 2-13B on 2.6 trillion tokens. This fitting process precisely predicted the final models’ losses (marked with two stars).
图 4:百川 2 的缩放定律。我们使用 1 万亿 token 训练了参数量从 1000 万到 30 亿的各种模型。通过对给定训练 flops 下的损失拟合幂律项,我们预测了在 2.6 万亿 token 上训练百川 2-7B 和百川 2-13B 的损失。这一拟合过程准确地预测了最终模型的损失(图中用两个星号标记)。

       To fit the scaling law of the model, we employed the formula given by Henighan et al. (2020):
$$\mathcal{L}_C = a \times C^{b} + \mathcal{L}_{\infty} \tag{2}$$
where $\mathcal{L}_{\infty}$ is the irreducible loss and the first term is the reducible loss, formulated as a power-law scaling term. $C$ denotes the training flops and $\mathcal{L}_C$ is the final loss of the model trained with that amount of flops. We used the curve_fit function from the SciPy^4 library to fit the parameters. The final fitted scaling curve and the predicted final losses of the 7 billion and 13 billion parameter models are shown in Figure 4. We can see that the fitted scaling law predicted Baichuan 2’s final loss with high accuracy.
    为了拟合模型的缩放定律,我们采用 Henighan et al. (2020) 给出的公式:
$$\mathcal{L}_C = a \times C^{b} + \mathcal{L}_{\infty} \tag{2}$$
其中 $\mathcal{L}_{\infty}$ 是不可约损失,第一项是可约损失,被表述为幂律缩放项。$C$ 是训练 flops,$\mathcal{L}_C$ 是在该 flops 下模型的最终损失。我们使用 SciPy^4 库中的 curve_fit 函数来拟合参数。最终拟合的缩放曲线以及预测的 70 亿和 130 亿参数模型的最终损失如图 4 所示。可以看到,拟合的缩放定律准确地预测了百川 2 的最终损失。
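A self-contained sketch of the fitting procedure, with synthetic (flops, loss) points standing in for the 10M-3B pilot runs (the real measurements are not reproduced here):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, a, b, L_inf):
    # Eq. (2): L(C) = a * C**b + L_inf, with b < 0 and L_inf the irreducible loss.
    return a * np.power(C, b) + L_inf

# Synthetic pilot-run data generated from the same functional form plus noise.
C = np.logspace(19, 22, 8)
L = scaling_law(C, a=20.0, b=-0.06, L_inf=1.5) + np.random.normal(0, 0.01, C.shape)

(a, b, L_inf), _ = curve_fit(scaling_law, C, L, p0=(10.0, -0.05, 1.0), maxfev=50_000)

# Extrapolate to a 2.6T-token run, approximating training flops as C ≈ 6 * N * D.
print(a, b, L_inf, scaling_law(6 * 7e9 * 2.6e12, a, b, L_inf))
```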


^4 https://scipy.org/


2.7 Infrastructure
2.7 基础设施

        Efficiently leveraging existing GPU resources plays a critically important role in training and developing large language models today. To accomplish this, we develop a co-design approach for an elastic training framework and a smart cluster scheduling policy.
    高效利用现有的 GPU 资源对于当今训练和发展大型语言模型具有至关重要的作用。为了实现这一目标,我们开发了一种弹性训练框架的协同设计方法和智能集群调度策略。

        Since our GPUs are shared among multiple users and tasks, the specific behavior of each task is unpredictable, often leading to idle GPU nodes within the cluster. Considering that a single machine equipped with eight A800 GPUs could adequately meet the memory requirements for our Baichuan 7B and Baichuan 13B models, the primary design criterion for our training framework is the machine-level elasticity, which supports that resources for tasks can be dynamically modified according to the cluster status and thereby serves as the foundation for our smart scheduling algorithm.
    由于我们的 GPUs 被多个用户和任务共享,因此每个任务的具体行为都是不可预测的,这通常会导致集群内的空闲 GPU 节点。考虑到一台配备 8 个 A800 GPU 的机器足以满足我们的 Baichuan 7B 和 Baichuan 13B 模型的内存需求,我们的训练框架的主要设计标准是机器级别的弹性,即支持根据集群状态动态调整任务资源,从而为我们的智能调度算法提供基础。

        To meet the requirement of the machine-level elasticity, our training framework integrates tensor parallelism (Narayanan et al., 2021) and ZeRO-powered data parallelism (Rajbhandari et al., 2020), where we set tensor parallelism inside each machine and employ ZeRO shared data parallelism for elastic scaling across machines.
    为了满足机器级别的弹性需求,我们的训练框架整合了张量并行(Narayanan 等人,2021 年)和 ZeRO 驱动的数据并行(Rajbhandari 等人,2020 年)。在这里,我们在每台机器内部设置张量并行,并使用 ZeRO 共享数据并行实现跨机器的弹性缩放。

       In addition, we employ a tensor-splitting technique (Nie et al., 2022) where we split certain calculations to reduce peak memory consumption, such as the cross-entropy calculations with large vocabularies. This approach enables us to meet memory needs without extra computing and communication, making the system more efficient.
    此外,我们采用了一种张量拆分技术(Nie 等,2022 年),通过拆分某些计算来降低峰值内存消耗,例如大词汇量下的交叉熵计算。这种方法使我们在不需要额外的计算和通信的情况下满足内存需求,从而使系统更高效。
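The report does not spell out the exact splitting; as one illustration of the idea, the cross-entropy over a ~125k-entry vocabulary can be computed in sequence chunks so that the full [sequence, vocab] logits tensor is never materialized at once (real implementations pair this with recomputation so that saved activations also stay small):

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, head_weight, targets, chunk_size=512):
    # Split the sequence so only a [chunk, vocab] logits block exists at a time.
    losses = []
    for h, t in zip(hidden.split(chunk_size), targets.split(chunk_size)):
        logits = h @ head_weight.t()                      # [chunk, vocab]
        losses.append(F.cross_entropy(logits, t, reduction="sum"))
    return torch.stack(losses).sum() / targets.numel()

hidden = torch.randn(2048, 256)                           # [seq, hidden], toy sizes
head_weight = torch.randn(125_696, 256)                   # [vocab, hidden]
targets = torch.randint(0, 125_696, (2048,))
print(chunked_cross_entropy(hidden, head_weight, targets))
```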

       To further accelerate training without compromising model accuracy, we implement mixed-precision training, where we perform forward and backward computations in BFloat16,while performing optimizer updating in Float32.
    为了在不牺牲模型精度的情况下进一步提高训练速度,我们采用了混合精度训练方法。在这种方法中,我们使用 BFloat16 进行前向和后向计算,同时使用 Float32 进行优化器更新。
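A single-device sketch of this precision split (the real training runs inside the distributed framework above; this only illustrates where bfloat16 and float32 are used):

```python
import torch

# Forward/backward in bfloat16; a float32 master copy receives the optimizer update.
model = torch.nn.Linear(256, 256).to(torch.bfloat16)
master = [p.detach().clone().float() for p in model.parameters()]
opt = torch.optim.AdamW(master, lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1)

for _ in range(3):
    x = torch.randn(8, 256, dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()
    loss.backward()                                   # gradients computed in bfloat16
    for p, m in zip(model.parameters(), master):
        m.grad = p.grad.float()                       # cast gradients up to float32
    opt.step()                                        # optimizer update in float32
    opt.zero_grad()
    model.zero_grad()
    with torch.no_grad():
        for p, m in zip(model.parameters(), master):
            p.copy_(m.to(torch.bfloat16))             # copy updated weights back down
```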

       Furthermore, in order to efficiently scale our training cluster to thousands of GPUs, we integrate the following techniques to avoid the degradation of communication efficiency:
    此外,为了高效地将我们的训练集群扩展到数千个 GPU,我们采用了以下技术以避免通信效率降低:

  • Topology-aware distributed training. In large-scale clusters, network connections frequently span multiple layers of switches. We strategically arrange the ranks for distributed training to minimize frequent access across different switches, which reduces latency and thereby enhances overall training efficiency.
        拓扑感知分布式训练。在大规模集群中,网络连接经常跨越多个交换机层。我们有策略地安排分布式训练中各进程的 rank,以尽量减少跨交换机的频繁访问,从而降低延迟,提高整体训练效率。

  • Hybrid and hierarchical partition for ZeRO. By partitioning parameters across GPUs, ZeRO3 reduces memory consumption at the expense of additional all-gather communications. This approach would lead to a significant communication bottleneck when scaling to thousands of GPUs (Jiang et al., 2023a). To address this issue, we propose a hybrid and hierarchical partitioning scheme. Specifically, our framework first partitions the optimizer states across all GPUs, and then adaptively decides which layers need to activate ZeRO3, and whether to partition parameters hierarchically.
        ZeRO 的混合分层划分。通过在 GPU 之间划分参数,ZeRO3 以额外的 all-gather 通信为代价减少了内存消耗。当扩展到数千个 GPU 时,这种方法会导致显著的通信瓶颈(Jiang 等,2023a)。为解决这个问题,我们提出了一种混合分层的划分方案。具体来说,我们的框架首先在所有 GPU 上划分优化器状态,然后自适应地决定哪些层需要启用 ZeRO3,以及是否分层划分参数。

       By integrating these strategies, our system is capable of training Baichuan 2-7B and Baichuan 2-13B models efficiently on 1,024 NVIDIA A800 GPUs, achieving a computational efficiency that exceeds 180 TFLOPS.
    通过整合这些策略,我们的系统能够有效地在 1024 个 NVIDIA A800 GPU 上训练百川 2-7B 和百川 2-13B 模型,实现超过 180 TFLOPS 的计算效率。

3 Alignment
3 (人类偏好)对齐

       Baichuan 2 also introduces the alignment procedure resulting in two chat models: Baichuan 2-7B-Chat and Baichuan 2-13B-Chat. The alignment process of the Baichuan 2 encompasses two main components: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
    百川 2 还引入了对齐的过程,产生了两个聊天模型:Baichuan 2-7B-Chat和Baichuan 2-13B-Chat。百川 2 的对齐过程包括两个主要组成部分:监督微调(SFT)和基于人类反馈的强化学习(RLHF)。

3.1 Supervised Fine-Tuning
3.1 有监督微调

       During the supervised fine-tuning phase, we use human labelers to annotate prompts gathered from various data sources. Each prompt is labeled as being helpful or harmless based on key principles similar to Claude (2023). To validate data quality,we use cross-validation—an authoritative annotator checks the quality of a sample batch annotated by a specific crowd worker group, rejecting any batches that do not meet our quality standards.
    在监督微调阶段,我们使用人工标注员对从不同数据源收集到的提示进行标注。每个提示根据与 Claude (2023) 类似的关键原则被标记为有益或无害。为了验证数据质量,我们使用交叉验证——一位权威的标注员检查由特定众包工人团队标注的一个样本批次的质量,拒绝任何不符合我们质量标准的批次。

       We collected over 100k supervised fine-tuning samples and trained our base model on them. Next,we delineated the reinforcement learning process via the RLHF method to further improve results.The whole process of RLHF, including RM and RL training, is shown in Figure 5.
    我们收集了超过 100k 的监督微调样本,并在它们上训练我们的基础模型。接下来,我们通过RLHF 方法描述了强化学习过程,以进一步改善结果。RLHF 的整个过程,包括 RM和 RL 训练,如图 5所示。


Figure 5: An illustration of Baichuan 2’s RLHF process.
图 5:百川 2 的RLHF 过程示意图

3.2 Reward Model
3.2 奖励模型

       We devised a three-tiered classification system for all prompts, consisting of 6 primary categories, 30 secondary categories, and over 200 tertiary categories. From the user’s perspective, we aim for the classification system to comprehensively cover all types of user needs. From the standpoint of reward model training, prompts within each category should have sufficient diversity to ensure the reward model can generalize well.
    我们为所有提示制定了一个三层的分类体系,包括 6 个主要类别,30 个次要类别,以及超过 200 个三级类别。从用户的角度来看,我们希望分类体系能够全面覆盖所有类型的用户需求。从奖励模型训练的角度来看,每个类别内的提示应具有足够的多样性,以确保奖励模型具有很好的泛化能力。

       Given a prompt, responses are generated by Baichuan 2 models of different sizes and stages (SFT, PPO) to enhance response diversity. Only responses generated by the Baichuan 2 model family are used in the RM training. Responses from other open-source datasets and proprietary models do not improve the reward model’s accuracy. This also underscores the intrinsic consistency of the Baichuan model series from another perspective.
    根据提示,不同大小和阶段的百川 2 模型(SFT,PPO)生成回复以增加回复的多样性。仅使用百川 2 模型家族生成的回复进行 RM 训练。来自其他开源数据集和专有模型的回复并未提高奖励模型的准确性。这也从另一个角度突显了百川模型系列的内在一致性。

       The loss function used for training the reward model is consistent with that in InstructGPT (Ouyang et al.,2022). The reward model derived from training exhibits a performance consistent with that of LLaMA 2 (Touvron et al.,2023b), indicating that the greater the score difference between two responses, the higher the discriminative accuracy of the reward model, as shown in Table 4.
    用于训练奖励模型的损失函数与 InstructGPT (Ouyang et al.,2022) 中的损失函数相同。通过训练得到的奖励模型表现出与 LLaMA 2 (Touvron et al.,2023b) 一致的性能,这表明两个响应之间的得分差异越大,奖励模型的判别准确性越高,如表 4 所示。
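The report states the loss follows InstructGPT; that pairwise ranking loss, which pushes the chosen response's reward above the rejected one's, looks like this:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)), averaged over pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

print(reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.1])))
```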


Table 4: Reward Model test accuracy on different score gaps of two responses. The larger the response gap,the better RM accuracy. The gap 1,2,3,4,5 correspond to unsure, negligibly better, slightly better, better, and significantly better,respectively.
表 4:不同响应得分差值下奖励模型的测试准确度。响应差值越大,RM 准确度越高。差值 1、2、3、4、5 分别对应不确定、微小优势、略好、较好和显著更好

3.3 PPO
3.3 PPO

       After obtaining the reward model, we employ the PPO (Schulman et al., 2017) algorithm to train our language model. We employ four models: the actor model (responsible for generating responses),the reference model (used to compute the KL penalty with fixed parameters), the reward model(providing an overarching reward for the entire response with fixed parameters), and the critic model (designed to learn per-token values).
    在获得奖励模型后,我们使用 PPO(Schulman 等人,2017)算法来训练我们的语言模型。我们采用了四种模型:actor模型(负责生成回应)、reference模型(用于计算具有固定参数的KL惩罚)、奖励模型(为整个响应提供总体奖励)和critic模型(设计为学习每个令牌的值)。

3.4 Training Details
3.4 训练细节

       During the RLHF training process, the critic model is warmed up with 20 initial training steps. Subsequently, both the critic and actor models are updated via the standard PPO algorithm. For all models, we use gradient clipping of 0.5, a constant learning rate of 5e-6, and a PPO clip threshold $\epsilon = 0.1$. We set the KL penalty coefficient $\beta = 0.2$, decaying to 0.005 over steps. We train for 350 iterations for all our chat models, resulting in Baichuan 2-7B-Chat and Baichuan 2-13B-Chat.
    在 RLHF 训练过程中,critic 模型先提前进行 20 个训练步骤的预热。随后,通过标准的 PPO 算法更新 critic 和 actor 模型。对于所有模型,我们使用 0.5 的梯度裁剪、5e-6 的恒定学习率和 $\epsilon = 0.1$ 的 PPO 裁剪阈值。我们将 KL 惩罚系数 $\beta$ 设为 0.2,并随训练步数衰减至 0.005。所有聊天模型均训练 350 次迭代,从而得到 Baichuan 2-7B-Chat 和 Baichuan 2-13B-Chat。
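A sketch of the two training-specific pieces above: the KL-shaped per-token reward against the frozen reference model, and a KL coefficient decaying from 0.2 to 0.005. Both the reward shaping and the linear decay schedule are common RLHF choices and are assumptions here; the report only states the endpoints and the algorithm.

```python
import torch

def shaped_rewards(rm_score, logp_actor, logp_ref, beta):
    # Per-token KL penalty against the reference model; the sequence-level reward
    # from the RM is added at the final token (a common shaping, assumed here).
    kl = logp_actor - logp_ref                 # [batch, seq]
    rewards = -beta * kl
    rewards[:, -1] += rm_score                 # rm_score: [batch]
    return rewards

def kl_coeff(step, total_steps=350, beta_start=0.2, beta_end=0.005):
    frac = min(step / total_steps, 1.0)        # linear decay is an assumption
    return beta_start + frac * (beta_end - beta_start)

print(kl_coeff(0), kl_coeff(350))
print(shaped_rewards(torch.tensor([1.0]), torch.randn(1, 8), torch.randn(1, 8), kl_coeff(0)).shape)
```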

4 Safety
4 安全

       We believe that model safety improvements stem not only from constraints during data cleansing or alignment stages but also from harnessing positive knowledge and identifying negative knowledge during all training stages. Guided by this concept, we have enhanced model safety throughout the Baichuan 2 training process.
    我们认为,模型安全性的提升不仅仅源于在数据清洗或对齐阶段所施加的限制,而且还源于在所有训练阶段中利用积极知识并确定消极知识的过程。遵循这一理念,我们在整个百川 2 训练过程中都加强了模型的安全性。

4.1 Pre-training Stage
4.1 预训练阶段

       In the pre-training stage, we pay close attention to data safety. The entire pre-training dataset underwent a rigorous data filtering process aimed at enhancing safety. We devised a system of rules and models to eliminate harmful content such as violence, pornography, racial discrimination, hate speech, and more.
   在预训练阶段,我们高度重视数据安全。整个预训练数据集都经历了严格的数据过滤过程,以提高安全性。我们制定了一套规则和模型,以消除诸如暴力、色情、种族歧视、仇恨言论等有害内容。

       Furthermore, we curated a Chinese-English bilingual dataset comprising several million webpages from hundreds of reputable websites that represent various positive value domains, encompassing areas such as policy, law, vulnerable groups, general values, traditional virtues, and more. We also heightened the sampling probability for this dataset.
    此外,我们从数百个代表各类正面价值领域(涵盖政策、法律、弱势群体、普世价值观、传统美德等)的权威网站中,精选了一个包含数百万网页的中英双语数据集,并提高了该数据集的采样概率。

4.2 Alignment Stage
4.2 对齐阶段

       We built a red-teaming procedure consisting of 6 types of attacks and more than 100 granular safety value categories. An expert annotation team of 10 with traditional internet security experience initialized safe alignment prompts. Relevant snippets from the pre-training dataset were retrieved to create responses, resulting in approximately 1K annotated examples for initialization.
    我们构建了一个由 6 种攻击类型和 100 多个细粒度安全价值类别组成的 red-teaming 流程,并由 10 位具有传统互联网安全经验的专家组成的标注团队初始化安全对齐提示。我们从预训练数据集中检索相关片段来生成回复,最终获得约 1K 条标注数据用于初始化。

  • The expert annotation team guided a 50-person outsourced annotation team through red-blue confrontation with the initialized alignment model, resulting in the generation of 200K attack prompts.
      专家注释团队引导了一个 50 人的外包注释团队通过红蓝对立的方式与初始化的对齐模型进行交互,最终产生了 200K 个攻击提示。

  • By employing a specialized multi-value supervised sampling method, we maximized the utilization of attack data to generate responses at varying safety levels.
      通过采用一种专业的多值监督抽样方法,我们最大化地利用了攻击数据来生成在不同安全级别下的响应。

       During the RL optimization stage, we also treat safety as a first-priority consideration:
    在 RL 优化阶段,我们也将安全作为首要考虑因素:

  • At the onset of safety reinforcement, DPO(Rafailov et al., 2023) methods efficiently employed limited amounts of annotated data to enhance performance concerning specific vulnerability issues.
      在安全增强的初期,DPO(Rafailov 等人,2023 年)方法高效地利用了有限的标注数据,以提高针对特定漏洞问题的性能。
  • By employing a Reward Model that integrates Helpful and Harmless objectives, PPO safety reinforcement training was conducted.
      通过采用整合了有益(Helpful)和无害(Harmless)目标的奖励模型,我们进行了 PPO 安全强化训练。

5 Evaluations
5 评估

       In this section, we report the zero-shot or few-shot results of the pre-trained base models on standard benchmarks. We evaluate Baichuan 2 on free-form generation tasks and multiple-choice tasks.
    在本节中,我们报告了预训练基础模型在标准基准测试上的零样本或少样本结果。我们对 Baichuan 2 在自由形式生成任务和多项选择任务上进行评估。

  • Free-form generation: Models are given some sample inputs (shots) and then generate continuations to obtain results, like for question answering, translation, and other tasks.
      自由形式生成:模型被给予一些示例输入(样本)然后生成连续的结果,比如用于问答、翻译和其他任务。

  • Multiple-choice: Models are given a question and multiple choices, and the task is to select the most appropriate candidates.
      多项选择:模型会得到一个问题和多个选项,任务是选择最合适的候选答案。

       Given the variety of tasks and examples, we incorporated open-source evaluation frameworks like lm-evaluation-harness (Gao et al., 2021) and OpenCompass (OpenCompass, 2023) into our in-house implementations for fair benchmarking against other models.
    鉴于任务和示例的多样性,我们在我们的内部实现中引入了开源评估框架,如 lm-evaluation-harness(Gao 等人, 2021)和 OpenCompass(OpenCompass, 2023),以便与其他模型进行公平的基准测试。

       The models we choose to compare have similar sizes to Baichuan 2 and are open-sourced so that the results can be reproduced:
    我们选择了与Baichuan 2相似大小的模型进行比较,这些模型都是开源的,结果可以复现:

  • LLaMA (Touvron et al., 2023b): The language models trained by Meta on 1 trillion tokens. The context length is 2,048 and we evaluate both LLaMA 7B and LLaMA 13B.
      LLaMA(Touvron 等⼈,2023b):Meta 在 1 万亿 token 上训练的语言模型。上下文长度为 2,048,我们评估了LLaMA 7B和LLaMA 13B。
  • LLaMA 2 (Touvron et al., 2023c): A successor model to LLaMA 1 trained on 2 trillion tokens and better data mixture.
      LLaMA 2 (Touvron et al., 2023c):LLaMA1 的后继模型,训练了 2 万亿 token,并具有更好的数据混合。
  • Baichuan 1 (Baichuan, 2023b): Baichuan 7B is trained on 1.2 trillion tokens and Baichuan 13B is trained on 1.4 trillion tokens. Both of them focus on English and Chinese.
      Baichuan 1(Baichuan, 2023b):Baichuan 7B 训练了 1.2 万亿标记,Baichuan 13B 训练了 1.4 万亿 token。它们都专注于英语和中文。
  • ChatGLM 2-6B (Zeng et al., 2022): A chat language model that has strong performance on several benchmarks^5.
      ChatGLM 2-6B(Zeng et al., 2022):在几个基准测试上表现出色的聊天语言模型。
  • MPT-7B (MosaicML, 2023): An open-source LLM trained on 1 trillion tokens of English text and code.
      MPT-7B(MosaicML, 2023):一个开源 LLM,在 1 万亿 token 的英文文本和代码上训练。
  • Falcon-7B (Penedo et al., 2023): A series of LLMs trained on 1 trillion tokens enhanced with curated corpora. It is made available under the Apache 2.0 license.
      Falcon-7B(Penedo et al., 2023):一系列经过精心策划的、训练了1万亿 token 的LLMs。它在Apache 2.0许可下提供。
  • Vicuna-13B (Chiang et al., 2023): A language model trained by fine-tuning LLaMA-13B on the conversational dataset generated by ChatGPT.
      Vicuna-13B (Chiang et al., 2023):一种通过对 ChatGPT 生成的对话数据集进行 LLaMA-13B 微调而训练的语言模型。
  • Chinese-Alpaca-Plus-13B (Cui et al., 2023): A language model trained by fine-tuning LLaMA-13B on the conversational dataset generated by ChatGPT.
      Chinese-Alpaca-Plus-13B(Cui et al., 2023):一种通过对 ChatGPT 生成的对话数据集进行 LLaMA-13B 微调而训练的语言模型。
  • XVERSE-13B: A 13B multilingual large language model trained on more than 1.4 trillion tokens.
      XVERSE-13B:一个 13B 多语言大型语言模型,训练了超过 1.4 万亿 token。

^5 They do not release their base models, so we adopt the result they report on their website.
  ^5 他们没有发布基础模型,因此我们采用其网站上报告的结果。


5.1 Overall Performance
5.1 整体性能

       This section introduces the overall performance of Baichuan 2 base models compared with other similar-sized models. We choose 8 benchmarks for comparison: MMLU (Hendrycks et al., 2021a), the Massive Multitask Language Understanding benchmark, consists of a range of multiple-choice questions on academic subjects. C-Eval (Huang et al., 2023) is a comprehensive Chinese evaluation benchmark consisting of more than 10k multiple-choice questions. CMMLU (Li et al., 2023) is also a general evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of the Chinese language and culture. AGIEval (Zhong et al., 2023) is a human-centric benchmark specifically designed to evaluate general abilities like human cognition and problem-solving. Gaokao (Zhang et al., 2023) is an evaluation framework that utilizes questions from the Chinese college entrance examination (Gaokao). BBH (Suzgun et al., 2022) is a suite of challenging BIG-Bench (Srivastava et al., 2022) tasks on which language model evaluations did not outperform the average human rater. GSM8K (Cobbe et al., 2021) is an evaluation benchmark focused on math. HumanEval (Chen et al., 2021) is a docstring-to-code dataset consisting of 164 coding problems that test various aspects of programming logic.
    本节将介绍百川 2 基础模型与其他类似规模模型的整体性能对比。我们选择了 8 个基准进行比较:MMLU(Hendrycks 等人,2021a)大规模多任务语言理解包括一系列学术主题的多项选择题。C-Eval(Huang 等人,2023)是一个综合的中文评估基准,包括超过 10k 的多项选择题。CMMLU(Li 等人,2023)也是一个通用评估基准,专门设计用于评估 LLM 在中国语言和文化背景下的知识和推理能力。AGIEval(Zhong 等人,2023)是一个以人为中心的基准,专门设计用于评估人类认知和解决问题等一般能力。 Gaokao(Zhang 等人,2023)是一个利用高考题目的评估框架。BBH(Suzgun 等人,2022)是一系列棘手的 BIG-Bench(Srivastava 等人,2022)任务,这些任务的语言模型评估未能超过普通人的评分。GSM8K(Cobbe 等人,2021)是一个专注于数学的评估基准。HumanEval(Chen 等人,2021)是一个 docstring-to-code 数据集,包括 164 个编程问题,测试编程逻辑的各个方面。

       For CMMLU and MMLU, we adopt the official implementations and use 5-shot evaluation. For BBH we adopt 3-shot evaluation. For C-Eval, Gaokao, and AGIEval we only select the multiple-choice questions with four candidates for better evaluation. For GSM8K, we adopt 4-shot testing derived from OpenCompass (OpenCompass, 2023). We also incorporate the results of GPT-4^6 and GPT-3.5-Turbo^7. Unless stated otherwise, the results in this paper were obtained using our internal evaluation tools.
    对于 CMMLU 和 MMLU,我们采用官方实现并使用 5-shot 评估。对于 BBH,我们采用 3-shot 评估。对于 C-Eval、Gaokao 和 AGIEval,我们只选择有四个候选项的选择题以便更好地评估。对于 GSM8K,我们采用来自 OpenCompass(OpenCompass,2023)的 4-shot 测试。我们还纳入了 GPT-4^6 和 GPT-3.5-Turbo^7 的结果。除非另有说明,本文中的结果均使用我们内部的评估工具获得。


^6 gpt-4-0613
^7 gpt-3.5-turbo-0613


       The overall results are shown in Table 1. Compared with other similar-sized open-sourced models, our model has a clear performance advantage. Especially in math and code problems, our model achieves significant improvement over Baichuan 1.
    总体结果如表 1 所示。与同类规模的开源模型相比,我们的模型具有明显的性能优势。特别是在数学和代码问题上,我们的模型相比百川 1 取得了显著的改进。

5.2 Vertical Domain Evaluations
5.2 垂直域评估

       We also evaluate Baichuan 2 in vertical domains, where we choose the law and medical fields as they have been widely studied in recent years.
    我们还在垂直领域对百川 2 进行了评估,选择了近年来被广泛研究的法律和医学领域。

       In the law field, we report scores of JEC-QA (Zhong et al., 2020), which is collected from the National Judicial Examination of China. It contains multiple-choice and multiple-answer questions.For compatibility with our evaluation suite, we only test the multiple-choice questions.
   在法律领域,我们报告了 JEC-QA(钟等人,2020)的分数,JEC-QA是从中国的国家司法考试中收集的。它包含了多项选择题和多项答案题。为了与我们的评估工具兼容,我们只测试多项选择题。

       In the medical field, we report scores from two medical benchmarks, MedQA (Jin et al., 2021) and MedMCQA (Pal et al., 2022), as well as average scores from medical-related disciplines in C-Eval (val), MMLU, and CMMLU (abbreviated as CMC). Specifically, MedQA is collected from the professional medical board exams in the USA and China, including three subsets, i.e., USMLE, MCMLE and TWMLE, and we report the results of USMLE and MCMLE with five candidates; MedMCQA is collected from Indian medical entrance exams, and we evaluate multiple-choice questions and report the scores on the dev set. The details of CMC include (1) clinical medicine and basic medicine of C-Eval (val), (2) clinical knowledge, anatomy, college medicine, college biology, nutrition, virology, medical genetics, and professional medicine of MMLU, (3) anatomy, clinical knowledge, college medicine, genetics, nutrition, traditional chinese medicine, and virology of CMMLU. Moreover, all these datasets are evaluated in 5-shot.
    在医学领域,我们报告了两个医学基准的得分,分别是 MedQA(Jin 等人,2021)和 MedMCQA(Pal 等人,2022),以及 C-Eval(val)、MMLU 和 CMMLU 中医学相关学科的平均分数(简称 CMC)。具体来说,MedQA 来自美国和中国的专业医学委员会考试,包括 USMLE、MCMLE 和 TWMLE 三个子集,我们报告 USMLE 和 MCMLE 中五个候选答案设置下的结果;MedMCQA 来自印度医学入学考试,我们评估其中的选择题,并报告 dev 集上的分数。CMC 的具体构成包括:(1)C-Eval(val)中的临床医学、基础医学;(2)MMLU 中的临床知识、解剖学、大学医学、大学生物学、营养学、病毒学、医学遗传学、专业医学;(3)CMMLU 中的解剖学、临床知识、大学医学、遗传学、营养学、中医学、病毒学。此外,所有这些数据集均采用 5-shot 评估。

       As shown in Table 5, Baichuan 2-7B-Base surpasses models such as GPT-3.5 Turbo, ChatGLM 2-6B, and LLaMA 2-7B in the field of Chinese law, second only to GPT-4. Compared to Baichuan 1-7B, Baichuan 2-7B-Base shows an improvement of nearly 10 points. In the medical field, Baichuan 2-7B-Base outperforms models like ChatGLM 2-6B and LLaMA 2-7B, showing significant improvement over Baichuan 1-7B as well.
    如表 5 所示,在中国法律领域,Baichuan 2-7B-Base 超越了诸如 GPT-3.5 Turbo、ChatGLM 2-6B 和 LLaMA 2-7B 等模型,仅次于 GPT-4。与 Baichuan 1-7B 相比,Baichuan 2-7B-Base 提高了近 10 分。在医学领域,Baichuan 2-7B-Base 的表现优于 ChatGLM 2-6B 和 LLaMA 2-7B 等模型,同样显著超过了 Baichuan 1-7B。


Table 5: The results of Baichuan 2 compared with other models in the law and medical fields.
表 5:百川 2 与其他模型在法律和医学领域的对比结果。

       Similarly, Baichuan 2-13B-Base surpasses models other than GPT-4 in the field of Chinese law. In the medical domain, Baichuan 2-13B-Base outperforms models such as XVERSE-13B and LLaMA 2-13B. Compared to Baichuan 1-13B-Base,Baichuan 2-13B-Base also exhibits remarkable improvement.
   同样,在法律领域,百川 2-13B-Base 超越了GPT-4 以外的其他模型。在医学领域,百川 2-13B-Base 表现优于 XVERSE-13B 和 LLaMA 2-13B 等模型。与百川 1-13B-Base 相比,百川 2-13B-Base 也展现出显著的提升。

5.3 Math and Code
5.3 数学与代码

       This section introduces the performance in mathematics and coding.
    本节介绍模型在数学和编程方面的表现。

       We use GSM8K (Cobbe et al., 2021) (4-shot) and MATH (Hendrycks et al., 2021b) (4-shot) to evaluate mathematical ability. MATH contains 12,500 mathematical questions that are harder to solve. To evaluate the model’s code ability, we report the scores on HumanEval (Chen et al., 2021) (0-shot) and MBPP (Austin et al., 2021) (3-shot).
   我们使用 GSM8K(Cobbe 等人,2021 年)(4-shot)和 MATH(Hendrycks 等人,2021b)(4-shot)来评估数学能力。MATH 包含 12,500 个较难解决的数学问题。为了评估模型的代码能力,我们报告了 HumanEval(Chen 等人,2021 年)(0-shot)和 MBPP(Austin 等人,2021 年)(3-shot)的成绩。

HumanEval is a series of programming tasks including model language comprehension,reasoning, algorithms, and simple mathematics to evaluate the correctness of the model and measure the model’s problem-solving ability.
 HumanEval 是一系列包括模型语言理解、推理、算法和简单数学的编程任务,用于评估模型的正确性并衡量模型解决问题的能力。

MBPP. It consists of a dataset of 974 Python short functions and program textual descriptions,along with test cases used to verify the correctness of their functionality.
  MBPP。它由 974 个 Python 短函数和程序文本描述的数据集以及用于验证其功能正确性的测试用例组成。

       We use OpenCompass to evaluate the ability of models in math and code. As shown in Table 6, in the field of mathematics, Baichuan 2-7B-Base surpasses models like LLaMA 2-7B. In the code domain, it outperforms models of the same size such as ChatGLM 2-6B. Baichuan 2-7B-Base exhibits significant improvement compared to the Baichuan 1-7B model.
   我们使用 OpenCompass 来评估数学和代码领域的模型能力。如表 6 所示,在数学领域,百川 2-7B-Base 超越了像 LLaMA 2-7B 这样的模型。在代码领域,它优于相同规模的模型,例如 ChatGLM 2-6B。与百川 1-7B 模型相比,百川 2-7B-Base 有了显著的改进。


Table 6: The result of Baichuan 2 compared with other models on mathematics and coding.
表 6:百川 2 与其他模型在数学和编码方面的对比结果。

       In mathematics, Baichuan 2-13B-Base surpasses all models of the same size, approaching the level of GPT-3.5 Turbo. In the code domain, Baichuan 2-13B-Base outperforms models like LLaMA 2-13B and XVERSE-13B. Baichuan 2-13B-Base demonstrates significant improvement compared to Baichuan 1-13B-Base.
   在数学领域,Baichuan 2-13B-Base 的表现超越了所有同规模的模型,逼近了 GPT-3.5 Turbo 的水平。在代码领域,Baichuan 2-13B-Base 的表现优于 LLaMA 2-13B 和 XVERSE-13B 等模型。Baichuan 2-13B-Base 相比 Baichuan 1-13B-Base 有显著的提升。

5.4 Multilingual
5.4 多语言

       We use Flores-101 (NLLB Team, 2022; Goyal et al., 2021; Guzmán et al., 2019) to evaluate multilingual ability. Flores-101 covers 101 languages from around the world. Its data is sourced from various domains such as news, travel guides, and books. We selected the official languages of the United Nations (Arabic (ar),Chinese (zh), English (en), French (fr), Russian (ru), and Spanish (es)), as well as German (de) and Japanese (ja), as the test languages. We conducted 8-shot tests on seven subtasks in Flores-101 ,including zh-en, zh-fr, zh-es, zh-ar, zh-ru, zh-ja and zh-de. The evaluation is conducted with OpenCompass.
    我们使用 Flores-101(NLLB 团队,2022 年;Goyal 等人,2021 年;Guzmán 等人,2019 年)来评估多语言能力。Flores-101 涵盖了全球 101 种语言,其数据来源于新闻、旅行指南和书籍等多个领域。我们选择了联合国官方语言中的阿拉伯语(ar)、中文(zh)、英语(en)、法语(fr)、俄语(ru)和西班牙语(es),以及德语(de)和日语(ja)作为测试语言。我们在 Flores-101 的 7 个子任务上进行了 8-shot 测试,包括 zh-en、zh-fr、zh-es、zh-ar、zh-ru、zh-ja 和 zh-de。评估使用 OpenCompass 进行。

       In the multilingual domain, as shown in Table 7, Baichuan 2-7B-Base surpasses all models of the same size in all seven tasks and shows significant improvement compared to Baichuan 1-7B.
    在多语言领域,如表 7 所示,百川 2-7B-Base 在全部七个任务中都超越了同等规模的所有模型,并且相比百川 1-7B 有显著的改进。

       Baichuan 2-13B-Base outperforms models of the same size in four out of the seven tasks. In the zh-en and zh-ja tasks, it surpasses GPT3.5 Turbo and reaches the level of GPT-4. Compared to Baichuan 1-13B-Base, Baichuan 2-13B-Base exhibits significant improvement in the zh-ar, zh-ru, and zh-ja tasks.
   百川 2-13B-Base 在七个任务中的四个任务上表现优于同等大小的模型。在 zh-en 和 zh-ja 任务中,它超越了 GPT3.5 Turbo,达到了 GPT-4 的水平。与百川 1-13B-Base 相比,百川 2-13B-Base 在 zh-ar、zh-ru 和 zh-ja 任务中表现出显著的改进。

       Although GPT-4 still dominates in the field of multilingualism, open-source models are catching up closely. In zh-en tasks, Baichuan 2-13B-Base has slightly surpassed GPT-4.
   虽然 GPT-4 仍然在多语言领域占据主导地位,但开源模型正在迅速赶上。在 zh-en 任务中,百川 2-13B-Base 略超过了 GPT-4。

5.5 Safety Evaluations
5.5 安全评价

       In Sec. 4, we describe the efforts made to improve the safety of Baichuan 2. However, some prior work indicates that helpfulness and harmlessness are two sides of a seesaw: when harmlessness increases, helpfulness could decrease a bit (Bai et al., 2022a). So we evaluate these two factors before and after safety alignment.
    在第 4 节中,我们描述了为提高百川 2 安全性所做的努力。然而,一些先前的研究指出,有益性和无害性是跷跷板的两端——当无害性增加时,有益性可能会略有下降(Bai 等,2022a)。因此,我们在安全对齐前后评估了这两个因素。

       Figure 6 shows the helpfulness and harmlessness before and after the safety alignment of Baichuan 2. We can see that our safety alignment process did not hurt the helpfulness while significantly improving the harmlessness.
    图 6 展示了百川 2 安全对齐前后的有用性和无害性。我们可以看到,我们的安全对齐过程没有损害有用性,同时显著提高了无害性。


Figure 6: Helpfulness and harmlessness before and after safety alignment of Baichuan 2. The x-axis shows the metric before safety alignment and the y-axis shows the result after. We see that helpfulness remains largely unchanged after this procedure, while harmlessness improved substantially (more mass in upper triangle) with safety efforts.
图 6:对百川 2 进行安全对齐前后的有用性与无害性。x 轴表示安全对齐前的指标,y 轴表示对齐后的结果。我们可以看到,经过这一过程后,有用性基本保持不变,而无害性有了明显提升(更多数据点落在上三角区),这得益于安全方面的工作。

       Then we evaluate the safety of our pre-trained models using the Toxigen (Hartvigsen et al., 2022) dataset. Same as LLaMA 2, we use the cleaned version from the SafeNLP project^8, distinguishing neutral and hate types for the 13 minority groups, forming a 6-shot dataset consistent with the original Toxigen prompt format. Our decoding parameters use temperature 0.1 and top-p 0.9 nucleus sampling.
    然后,我们使用 Toxigen(Hartvigsen et al., 2022)数据集评估预训练模型的安全性。与 LLaMA 2 一样,我们使用来自 SafeNLP 项目^8 的清洗版本,对 13 个少数群体区分中立和仇恨两种类型,构成与原始 Toxigen 提示格式一致的 6-shot 数据集。我们的解码参数使用温度 0.1 和 top-p 0.9 的核采样。
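For concreteness, the stated decoding configuration corresponds to a Hugging Face generate call roughly like the one below (the checkpoint name and prompt are placeholders, and this is not the actual evaluation harness):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "baichuan-inc/Baichuan2-7B-Base"   # placeholder: any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

inputs = tok("<a 6-shot Toxigen-style prompt goes here>", return_tensors="pt")
out = model.generate(**inputs, do_sample=True, temperature=0.1, top_p=0.9,
                     max_new_tokens=64)    # nucleus sampling as described above
print(tok.decode(out[0], skip_special_tokens=True))
```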


^8 https://github.com/microsoft/SafeNLP/tree/main


       We use the fine-tuned HateBert version optimized in Toxigen (Hartvigsen et al., 2022) for model evaluation. Table 8 shows that compared to LLaMA 2, the Baichuan 2-7B and Baichuan 2-13B models have some safety advantages.
   我们使用在 Toxigen 中优化的 HateBert 版本(Hartvigsen 等人,2022 年)进行模型评估。表 8 显示,与 LLaMA 2 相比,百川 2-7B 和百川 2-13B 模型具有一定的安全性优势。


Table 8: Toxigen results of Baichuan 2 foundation models compared with LLaMA 2.
表 8:百川 2 基础模型与 LLaMA 2 的 Toxigen 结果对比。

       Inspired by BeaverTails (Ji et al., 2023)^9, we constructed the Baichuan Harmless Evaluation Dataset (BHED), covering 7 major safety categories of bias/discrimination, insults/profanity, illegal/unethical content, physical health, mental health, financial privacy, and sensitive topics, to evaluate the safety of our chat models.
    受 BeaverTails(Ji 等人,2023)^9 的启发,我们构建了百川无害性评估数据集(BHED),涵盖偏见/歧视、侮辱/亵渎、非法/不道德内容、身体健康、心理健康、财务隐私和敏感话题等 7 个主要安全类别,用于评估我们聊天模型的安全性。


^9 https://github.com/PKU-Alignment/


       To ensure comprehensive coverage within each category, we asked human annotators to generate 1,400 data samples. These were further expanded through self-instruction and cleaned by humans for fluency, resulting in 70,000 total samples with 10,000 per category. Examples of those safety prompts and principles are shown in Appendix E.
   为确保每个类别的全面覆盖,我们邀请人类注释者生成了 1400 个数据样本。通过自我指导和由人类进行的清理以确保流畅性,进一步扩大了样本规模,最终在每个类别中都有 10,000 个样本,总共达到 70,000 个样本。附录 E 中展示了那些安全提示和原则的示例。

       We use those samples to evaluate different models and the result is shown in Table 9. We can see that Baichuan 2 is on par or outperforms other chat models in our safety evaluations.
   我们使用这些样本来评估不同的模型,结果如表 9 所示。可以看出,在安全性评估方面,百川 2 与其他聊天模型相媲美或表现更优。


Table 9: The result of different chat models on our safety evaluation benchmarks.
表格 9:不同聊天模型在我们安全评估基准测试中的结果。

5.6 Intermediate Checkpoints
5.6 中间Checkpoints

       We will also release the intermediate checkpoints of the 7B model, from the 220 billion token checkpoint to the 2,640 billion token checkpoint, which is the final output of Baichuan 2-7B-Base. We examine their performance on several benchmarks and the results are shown in Figure 7.
    我们还将发布 7B 模型的中间 checkpoints,从 2200 亿 token 的 checkpoint 到 2.64 万亿(即 26,400 亿)token 的 checkpoint,后者即 Baichuan 2-7B-Base 的最终产出。我们在多个基准测试上评估了它们的表现,结果如图 7 所示。


Figure 7: The results of intermediary checkpoints of Baichuan 2-7B which will be released to the public.
图 7:即将向公众发布的百川2-7B 的中间checkpoints结果。

       As shown in the figure, Baichuan 2 demonstrates consistent improvement as training proceeds. Even after 2.6 trillion tokens, there appears to be ample room for further gains. This aligns with previous work on scaling LLMs indicating that data size is a critical factor (Hoffmann et al., 2022). In the Appendix D, we provide more detailed training dynamics for both the 7B and 13B models.
   如图所示,随着训练的进行,百川 2 的表现持续改善。甚至在训练 2.6 万亿个标记之后,似乎仍有很大的进一步优化空间。这与之前关于扩展 LLM 的研究相吻合,表明数据规模是关键因素(Hoffmann 等人,2022 年)。在附录 D 中,我们为 70 亿和 130 亿参数的模型提供了更详细的训练动态。

6 Related Work
6 相关工作

       The field of language models has undergone a renaissance in recent years, sparked largely by the development of deep neural networks and Transformers (Vaswani et al., 2017). Kaplan et al. (2020) proposed the scaling laws for large model pre-training. By systematically analyzing model performance as parameters and data size increased, they provided a blueprint for the current era of massive models with hundreds of billions or even trillions of parameters.
    近年来,语言模型领域经历了一场复兴,这主要得益于深度神经网络和 Transformer(Vaswani 等人,2017)的发展。Kaplan 等人(2020)提出了大模型预训练的缩放定律。通过系统地分析参数量和数据规模增加时模型性能的变化,他们为当前拥有数千亿甚至数万亿参数的大模型时代提供了蓝图。

       Seizing upon these scaling laws, organizations like OpenAI, Google, Meta, and Anthropic have engaged in a computing arms race to create ever-larger LLMs, spurred by OpenAI’s 175-billion-parameter proprietary language model GPT-3 (Brown et al., 2020). The few-shot or even zero-shot ability of LLMs has revolutionized most natural language understanding tasks, from code generation to math-solving problems and even open-world scenarios. Specialized scientific LLMs like Galactica (Taylor et al., 2022) have also emerged to showcase the potential for large models to assimilate technical knowledge. However, raw parameter count alone does not determine model capability: Chinchilla (Hoffmann et al., 2022) demonstrated that scaling model capacity according to the number of tokens, rather than just parameters, can yield better sample efficiency.
    借助这些缩放定律,OpenAI、Google、Meta 和 Anthropic 等机构展开了一场打造越来越大的 LLM 的算力竞赛,其导火索是 OpenAI 拥有 1750 亿参数的专有语言模型 GPT-3(Brown 等人,2020)。LLM 的少样本乃至零样本能力革新了大多数自然语言理解任务,从代码生成、数学解题到开放世界场景。像 Galactica(Taylor 等人,2022)这样的专业科学 LLM 也相继出现,展示了大模型吸收技术知识的潜力。然而,仅凭参数量并不能决定模型能力:Chinchilla(Hoffmann 等人,2022)表明,按照 token 数量而不仅仅是参数量来扩展模型容量,可以获得更好的样本效率。

       Concurrent with the development of private LLMs, academic and non-profit efforts have worked to develop open-source alternatives like Bloom (Scao et al., 2022), OPT (Zhang et al., 2022) and Pythia (Biderman et al., 2023b). Although some open-source large language models contain up to 175 billion parameters, most are trained on only 500 billion tokens or less. This is relatively small considering that 7 billion parameter models can still significantly improve after being trained on trillions of tokens. Among those open-sourced models, LLaMA (Touvron et al., 2023b) and its successor LLaMA 2 (Touvron et al., 2023c) stand out for their performance and transparency, and were quickly optimized by the community for better inference speed and various applications.
    在私有 LLM 发展的同时,学术界和非营利组织也在努力开发开源替代品,如 Bloom(Scao 等人,2022)、OPT(Zhang 等人,2022)和 Pythia(Biderman 等人,2023b)。尽管一些开源大型语言模型的参数量高达 1750 亿,但大多数只在 5000 亿或更少的 token 上训练。考虑到 70 亿参数的模型在数万亿 token 上训练后仍能显著提升,这样的数据量相对较小。在这些开源模型中,LLaMA(Touvron 等人,2023b)及其后继者 LLaMA 2(Touvron 等人,2023c)以其性能和透明度脱颖而出,并迅速被社区优化以获得更快的推理速度和各种应用。

       In addition to those foundation models, a lot of chat models have also been proposed to follow human instructions. Most of them fine-tune the foundation models to align with humans (OpenAI, 2022; Wang et al., 2023). Those chat models have demonstrated a marked improvement in understanding human instructions and solving complex tasks (Chiang et al., 2023; Xu et al., 2023; Sun et al., 2023). To further improve alignment, Ouyang et al. (2022) incorporate the Reinforcement Learning from Human Feedback (RLHF) approach, which learns from human preferences by training a reward model on human-rated outputs. Other methods such as direct preference optimization (DPO) (Rafailov et al., 2023) and reinforcement learning from AI feedback (RLAIF) (Bai et al., 2022b) have also been proposed to improve RLHF in terms of both efficiency and effectiveness.
    除了这些基础模型之外,许多聊天模型也被提出以遵循人类指令。它们大多通过微调基础模型来与人类对齐(OpenAI,2022;Wang 等人,2023)。这些聊天模型在理解人类指令和解决复杂任务方面表现出显著提升(Chiang 等人,2023;Xu 等人,2023;Sun 等人,2023)。为了进一步提升对齐效果,Ouyang 等人(2022)引入了基于人类反馈的强化学习(RLHF)方法,即通过在人工评分的输出上训练奖励模型来学习人类偏好。此外,直接偏好优化(DPO)(Rafailov 等人,2023)和基于 AI 反馈的强化学习(RLAIF)(Bai 等人,2022b)等方法也被提出,以在效率和效果两方面改进 RLHF。

7 Limitations and Ethical Considerations
7 局限性和伦理考量

       Like other large language models, Baichuan 2 also faces ethical challenges. It is prone to biases and toxicity, especially given that much of its training data originates from the internet. Despite our best efforts to mitigate these issues using benchmarks like Toxigen (Hartvigsen et al., 2022), the risks cannot be eliminated, and toxicity tends to increase with model size. Moreover, the knowledge of the Baichuan 2 models is static and can be outdated or incorrect, posing challenges in fields that require up-to-date information like medicine or law. While its safety has been optimized for Chinese and English, the model has limitations in other languages and may not fully capture biases relevant to non-Chinese cultures.
    与其他大语言模型一样,百川 2 也面临伦理挑战。由于其大部分训练数据来自互联网,模型容易产生偏见和有害内容。尽管我们利用 Toxigen(Hartvigsen 等人,2022)等基准尽力缓解这些问题,但风险无法完全消除,且有害内容往往随模型规模增大而增加。此外,百川 2 模型的知识是静态的,可能过时或不准确,这给医学、法律等需要最新信息的领域带来挑战。虽然模型的安全性针对中文和英文进行了优化,但在其他语言上仍存在局限,且可能无法充分反映与非中文文化相关的偏见。

       There is also the potential for misuse, as the model could be used to generate harmful or misleading content. Although we make our best effort to balance safety and utility, some safety measures may appear over-cautious, affecting the model's usability for certain tasks. We encourage users to make responsible and ethical use of Baichuan 2 models. Meanwhile, we will continue to optimize these issues and release updated versions in the future.
    模型还存在被滥用的可能,例如被用于生成有害或误导性内容。尽管我们尽力在安全性和可用性之间取得平衡,但一些安全措施可能显得过于谨慎,从而影响模型在某些任务上的可用性。我们鼓励用户负责任且合乎伦理地使用百川 2 模型。同时,我们将继续优化这些问题,并在未来发布更新版本。

References
参考文献

Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023.Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. GitHub.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos,Siamak Shakeri, Emanuel Taropa, Paige Bailey,Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021.Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu,Amanda Askell, Jackson Kernion, Andy Jones,Anna Chen, Anna Goldie, Azalia Mirhoseini,Cameron McKinnon, et al. 2022b. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.

Baichuan. 2023a. A 13b large language model developed by baichuan intelligent technology.

Baichuan. 2023b. A large-scale 7b pretraining language model developed by baichuan-inc.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al.2023a. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430.PMLR.

Stella Rose Biderman, Hailey Schoelkopf, Quentin G.Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023b. Pythia: A suite for analyzing large language models across training and scaling. ArXiv,abs/2304.01373.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023).

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,Maarten Bosma, Gaurav Mishra, Adam Roberts,Paul Barham, Hyung Won Chung, Charles Sutton,Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Claude. 2023. Conversation with Claude AI assistant.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.2021. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168.

Yiming Cui, Ziqing Yang, and Xin Yao. 2023. Efficient and effective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177.

Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra,and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.In Advances in Neural Information Processing Systems.

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR.

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding,Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff,Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán,and Angela Fan. 2021. The flores-101 evaluation benchmark for low-resource and multilingual machine translation.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. 2019. Two new evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english.

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi,Maarten Sap, Dipankar Ray, and Ece Kamar. 2022.Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv: 2203.09509.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou,Mantas Mazeika, Dawn Song, and Jacob Steinhardt.2021a. Measuring massive multitask language understanding. In ICLR. OpenReview.net.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, et al. 2020. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch,Elena Buchatskaya, Trevor Cai, Eliza Rutherford,Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu,Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu,Maosong Sun, and Junxian He. 2023. C-eval:A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322.

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. Beavertails:Towards improved safety alignment of llm via a human-preference dataset.

Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. 2023a. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258.

Zixuan Jiang, Jiaqi Gu, and David Z Pan. 2023b.Normsoftmax: Normalizing the input of softmax to accelerate and stabilize training. In 2023 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pages 1–6. IEEE.

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray,Alec Radford, Jeffrey Wu, and Dario Amodei. 2020.Scaling laws for neural language models. arXiv preprint arXiv: 2001.08361.

Taku Kudo and John Richardson. 2018. Sentencepiece:A simple and language independent subword tokenizer and detokenizer for neural text processing.arXiv preprint arXiv:1808.06226.

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023. Cmmlu: Measuring massive multitask language understanding in chinese.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

MosaicML. 2023. Introducing mpt-7b: A new standard for open-source, commercially usable llms.

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15.

Xiaonan Nie, Xupeng Miao, Zhi Yang, and Bin Cui. 2022. Tsplit: Fine-grained gpu memory management for efficient dnn training via tensor splitting. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 2615–2628. IEEE.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

OpenAI. 2022. Introducing chatgpt. Blog post openai.com/blog/chatgpt.

OpenAI. 2023. Gpt-4 technical report. ArXiv,abs/2303.08774.

OpenCompass. 2023. Opencompass: A universal evaluation platform for foundation models. https://github.com/InternLM/OpenCompass.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning,volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow,Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei,and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. corr abs/1802.05365 (2018). arXiv preprint arXiv:1802.05365.

Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.

Markus N Rabe and Charles Staats. 2021. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682.

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn.2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase,and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20:International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.

Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Rose Biderman, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. ArXiv, abs/2211.05100.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Noam Shazeer. 2020. Glu variants improve transformer.arXiv preprint arXiv:2002.05202.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang,Suraj Srivats, Soroush Vosoughi, Hyung Won Chung,Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. CoRR, abs/2210.03057.

Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.

Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Hang Yan, Xiangyang Liu, Yunfan Shao, Qiong Tang, Xingjian Zhao, Ke Chen, Yining Zheng, Zhejian Zhou, Ruixiao Li, Jun Zhan, Yunhua Zhou, Linyang Li, Xiaogui Yang, Lingling Wu, Zhangyue Yin, Xuanjing Huang, and Xipeng Qiu. 2023. Moss: Training conversational language models from synthetic data.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. CoRR, abs/2211.09085.

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35:38274–38290.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023b. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023c. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models. ArXiv, abs/2205.01068.

Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2023. Evaluating the performance of large language models on gaokao benchmark.

Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. Jecqa: A legal-domain question answering dataset. In Proceedings of AAAI.

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. Agieval: A human-centric benchmark for evaluating foundation models.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

A Author List (alphabetically)
A 作者列表(按字母顺序排列)

       Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu. Correspondent: daniel@baichuan-inc.com

B Scaling laws
B 缩放定律

       We use 7 models to fit the scaling laws of Baichuan 2. The parameter details are shown in Table 10.
    我们采用了 7 个模型来拟合百川 2 的缩放定律。参数细节见表 10。


Table 10: The model we choose for fitting scaling laws.
表 10:我们选择的用于拟合缩放定律的模型。

       The losses of the 7 different models are shown in Figure 8.
   图八展示了7个不同模型的损失。


Figure 8: The training losses of the small models used to fit the scaling law.
图 8:用于拟合缩放定律的各个小模型的训练损失。
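
       To make the fitting procedure above concrete, the following is a minimal illustrative sketch (not the authors' code) of fitting a saturating power law of the form loss(C) = a · C^b + L_inf to the final losses of a set of small models and extrapolating it to a larger compute budget. The functional form, the scipy-based fit, and every number below are assumptions for demonstration only, not Baichuan 2 measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, a, b, l_inf):
    """Power law with an irreducible loss term: a * c**b + l_inf."""
    return a * np.power(c, b) + l_inf

# Hypothetical (normalized compute, final loss) pairs for 7 small models,
# with compute expressed relative to the smallest run.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0, 300.0, 1000.0])
final_loss = np.array([3.10, 2.72, 2.40, 2.18, 2.01, 1.88, 1.78])

# Constrain a > 0, b < 0 (loss decreases with compute), l_inf > 0.
params, _ = curve_fit(
    scaling_law, compute, final_loss,
    p0=(1.5, -0.3, 1.5),
    bounds=([0.0, -1.0, 0.0], [np.inf, 0.0, np.inf]),
)
a, b, l_inf = params
print(f"fitted a={a:.3f}, b={b:.3f}, irreducible loss={l_inf:.3f}")

# Extrapolate the fitted curve to a much larger compute budget.
print("predicted loss at 1e6x the base compute:", scaling_law(1e6, *params))
```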

C NormHead
C NormHead

       We conducted a word embedding KNN retrieval task in which, given a query word, the nearest K words are retrieved. We found that the semantic information is mainly encoded by the cosine similarity of embeddings rather than the L2 distance; i.e., the KNN results under cosine similarity are semantically similar words, while the KNN results under L2 distance are somewhat meaningless. The current linear classifier computes logits by dot product, which mixes L2 distance and cosine similarity. To alleviate the distraction of the L2 distance, we propose to compute the logits by the angle only: we normalize the output embeddings so that the dot product is not affected by the norm of the embeddings.
    我们执行了一个词向量 KNN 检索任务:给定一个查询词,检索与其最近的 K 个词。我们发现,语义信息主要由嵌入的余弦相似度而非 L2 距离编码。也就是说,余弦相似度的 KNN 结果是语义相近的词,而 L2 距离的 KNN 结果在某种程度上是无意义的。由于目前的线性分类器通过点积计算 logits,混合了 L2 距离和余弦相似度,为了减轻 L2 距离的干扰,我们建议仅通过角度计算 logits:我们对输出嵌入进行归一化,使点积不受嵌入向量模长的影响。
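
       As an illustration of this probe, the sketch below retrieves the K nearest tokens to a query token under cosine similarity and under L2 distance, so the two neighbor lists can be compared. The random embedding table, vocabulary size, and query id are placeholder assumptions standing in for the model's real output embeddings.

```python
import numpy as np

def knn_cosine(emb, query_id, k=5):
    """Top-k neighbors of a row of `emb` by cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[query_id]
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_id][:k]

def knn_l2(emb, query_id, k=5):
    """Top-k neighbors of a row of `emb` by (smallest) L2 distance."""
    dists = np.linalg.norm(emb - emb[query_id], axis=1)
    order = np.argsort(dists)
    return [int(i) for i in order if i != query_id][:k]

# Toy embedding table standing in for the model's output embedding matrix.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64
embeddings = rng.normal(size=(vocab_size, dim)).astype(np.float32)

query = 42
print("cosine neighbors:", knn_cosine(embeddings, query))
print("L2 neighbors:    ", knn_l2(embeddings, query))
```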

       To validate this operation, we conduct an ablation experiment in which we add or remove the normalization before the softmax and train a 7B model for 12k steps. All the hyper-parameters and data are the same as for Baichuan 2-7B. The training loss is shown in Figure 9. We can see that when the NormHead is removed, training becomes very unstable at the beginning; in contrast, after we normalize the head, training becomes very stable, which results in better performance.
    为了验证这一操作,我们进行了消融实验:在 softmax 之前添加或移除归一化,并训练一个 70 亿参数的模型 12000 步。所有超参数和数据都与 Baichuan 2-7B 相同。训练损失如图 9 所示。可以看到,移除 NormHead 后,训练在初期变得非常不稳定;相反,对头部进行归一化后,训练变得非常稳定,并带来了更好的性能。


Figure 9: The training loss with and without NormHead operation. The experiments are conducted on 7 billion parameters with the same hyper-parameters (torch random seeds, data flow, batch size, learning rate, etc.)
图 9:带有和不带 NormHead 操作的训练损失。实验在 70 亿个参数上进行,使用相同的超参数(PyTorch 随机种子、数据流、批量大小、学习率等)。
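
       For reference, the following is a rough PyTorch sketch of a NormHead-style output layer consistent with the description above: each row of the head's weight matrix (the output embedding of one token) is L2-normalized before the projection, so the logits depend on the angle between the hidden state and the token embedding rather than on the embedding norm. The class name, sizes, and initialization are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormHead(nn.Module):
    """Output head whose weight rows are L2-normalized before the projection."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Normalize each output embedding row, then take the dot product,
        # which scores tokens by direction (cosine) rather than norm.
        norm_weight = F.normalize(self.weight, dim=-1)
        return F.linear(hidden_states, norm_weight)

# Toy usage with placeholder sizes (not the real Baichuan 2 dimensions).
head = NormHead(hidden_size=512, vocab_size=10000)
hidden = torch.randn(2, 8, 512)   # (batch, sequence, hidden)
logits = head(hidden)             # (batch, sequence, vocab)
print(logits.shape, F.log_softmax(logits, dim=-1).shape)
```

       Removing the F.normalize call recovers a standard linear head, which corresponds to the "without NormHead" setting in the ablation of Figure 9.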

D Training Dynamics
D 训练动态

       In this section, we analyze the training dynamics of our model. We save checkpoints of Baichuan 2-7B and Baichuan 2-13B every 1,000 steps and evaluate those intermediate results on the C-Eval development set (Huang et al., 2023), MMLU (Hendrycks et al., 2021a), CMMLU (Li et al., 2023), JEC-QA (Zhong et al., 2020), GSM8K (Cobbe et al., 2021) and HumanEval (Chen et al., 2021). The results are shown in Figure 10.
    在本节中,我们分析模型的训练动态。我们每隔 1000 步保存一次 Baichuan 2-7B 和 Baichuan 2-13B 的 checkpoints,并在 C-Eval 开发集(Huang 等人,2023)、MMLU(Hendrycks 等人,2021a)、CMMLU(Li 等人,2023)、JEC-QA(Zhong 等人,2020)、GSM8K(Cobbe 等人,2021)和 HumanEval(Chen 等人,2021)上评估这些中间结果。结果如图 10 所示。


Figure 10: Evaluation results of Baichuan 2-13B and Baichuan 2-7B on different pre-training steps.
图 10:不同预训练步骤下百川 2-13B 和百川 2-7B 的评估结果。

       As shown, both the 7B and 13B models demonstrate substantial gains as training progresses. However, on general benchmarks such as MMLU (Hendrycks et al., 2021a) and C-Eval (Huang et al., 2023), improvements appear to plateau after 2 trillion tokens. In contrast, consistent gains are achieved on the GSM8K math tasks even beyond 2 trillion tokens. This suggests that training FLOPs may strongly correlate with improvements in math problem solving, which warrants further study.
   如图所示,随着训练的进行,7B 和 13B 模型都取得了显著的提升。然而,在通用基准测试如 MMLU(Hendrycks 等人,2021a)和 C-Eval(Huang 等人,2023)中,改进似乎在 2 万亿个标记后趋于平缓。相比之下,即使在超过 2 万亿个标记的 GSM8K 数学任务上,也取得了持续的收益。这表明训练的 FLOPs 可能与数学问题解决能力的提升强烈相关,这值得进一步研究。

E Baichuan Harmless Evaluation Dataset
E 百川无害化评价数据集

       WARNING: this section contains unsafe, offensive, or upsetting examples of text.
   警告:本节包含不安全、冒犯或令人不安的文本示例。

       We proposed the Baichuan Harmless Evaluation Dataset (BHED) to evaluate the chat models, as described in Section 5.5. Here we introduce the principles and cases of BHED.
   我们提出了百川无害评估数据集(BHED)来评估聊天模型,如第 5.5 节所述。在这里,我们介绍 BHED 的原则和案例。

       The seven major safety categories consist of bias and discrimination, insults and profanity, illegal/unethical content, physical health, mental health, financial privacy, and sensitive topics.
   七大安全类别包括偏见与歧视、侮辱与粗俗语言、违法/不道德内容、身体健康、心理健康、财务隐私和敏感话题。

       To ensure diversity within each category, multiple sub-dimensions were considered:
   为了保证每个类别内的多样性,我们考虑了多个子维度:

  • Bias/discrimination covers various forms such as nationality, ethnicity, race/skin color, groups, occupation, gender, region, industry, etc. to ensure data diversity.
     偏见/歧视包括诸如国籍、民族、种族/肤色、群体、职业、性别、地区、行业等多种形式,以确保数据的多样性。

  • Insults/profanity includes both explicit and implicit insults as well as internet verbal abuse.
     侮辱/粗俗语言包括显性和隐性的侮辱,以及网络上的言语攻击。

  • Illegal/unethical content encompasses criminal law, civil law, economic law, international law, traffic regulations, local administrative regulations, etc.
     非法或违反道德的内容包括刑法、民法、经济法、国际法、交通法规、地方行政法规等。

  • Physical health covers health knowledge, medical advice, and discrimination related to physical health.
     身体健康包括健康知识、医学建议以及与身体健康相关的歧视。

  • Mental health encompasses emotional health, cognitive and social health, self-esteem and self-worth, coping with stress and adaptability,psychological suggestions, and discrimination against groups with mental health issues.
     心理健康包括情感健康、认知和社会健康、自尊和自我价值、应对压力和适应性、心理建议以及对患有心理健康问题群体的歧视。

  • Financial privacy includes real estate, personal debt, banking information, income, stock recommendations, etc. Privacy includes personal information, family information, occupational information, contact details, private life, etc.
     财务隐私包括房地产、个人债务、银行信息、收入、股票推荐等。隐私包括个人信息、家庭信息、职业信息、联系方式、个人生活等。

  • Sensitive topics include racial hatred,international political issues, legal loopholes,human-AI relationships, etc.
     敏感话题包括种族仇恨、国际政治问题、法律漏洞、人机关系等。

       We collect 10k prompts for each of the categories, some examples are shown in Table 11.
    我们为每个类别收集了 10000 条提示(prompts),部分示例展示在表 11 中。


Table 11: Some examples of Baichuan Harmless Evaluation Dataset.
表11:百川无害化评价数据集部分实例。

F Details of MMLU and C-Eval
F MMLU和C-Eval的详细信息

       We provide the score of Baichuan 2 on each subject of C-Eval in Table 12 and MMLU in Table 13.
   我们在表 12 中提供了 Baichuan 2 在 C-Eval 每个科目上的得分,同时在表 13 中展示了 MMLU 的得分。


Table 12: The scores of each subject in C-Eval of Baichuan 2-7B-Base and Baichuan 2-13B-Base.
表 12:百川 2-7B-Base 和百川 2-13B-Base 在 C-Eval 中的各科目得分。


Table 13: The scores of each subject in MMLU of Baichuan 2-7B-Base and Baichuan 2-13B-Base.
表 13:百川 2-7B-Base 和百川 2-13B-Base 在各科目 MMLU 中的得分。

G Examples generated by Baichuan 2-13B-Chat
G 百川2-13B-Chat生成的示例

