
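A Detailed Roundup of Large Models, from Theory to Production: Essential Resources to Go from Zero to Mastery. Bookmark This One and You're Set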

Author: 我家自动化 | 2024-07-29 14:51:32


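In the vast firmament of artificial intelligence, large models shine like the North Star, pointing the way at the frontier of the technology. They represent the latest breakthroughs in deep learning and have become a key force driving intelligent transformation across industries. In this article I have collected the resources you need to take large models from theoretical study to practical deployment, and I share them here~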

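Fundamentals

Mathematics

The beauty of mathematics, made accessible through animated visualizations (geometry, calculus, probability, linear algebra, and more): https://space.bilibili.com/88461692/

Machine Learning

Andrew Ng's introductory machine learning course: https://www.coursera.org/learn/machine-learning
scikit-learn official site: https://scikit-learn.org/stable/index.html
Machine learning whiteboard series: https://www.yuque.com/bystander-wg876/yc5f72
Machine learning in practice: https://github.com/apachecn/AiLearning
PumpkinBook (南瓜书): https://datawhalechina.github.io/pumpkin-book/
Visualizing the machine learning process: https://developers.google.cn/machine-learning/crash-course/feature-crosses/playground-exercises?hl=zh-cn
Machine learning dataset repository: https://archive.ics.uci.edu/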

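Deep Learning

Learn AI with Li Mu (跟李沐学AI): https://space.bilibili.com/1567748478?spm_id_from=333.337.0.0
Hung-yi Lee (National Taiwan University), Machine Learning: https://speech.ee.ntu.edu.tw/~hylee/ml/2023-spring.php
Deep learning for complete beginners: https://www.zybuluo.com/hanbingtao/note/433855
500 Questions on Deep Learning: https://github.com/scutan90/DeepLearning-500-questions
Andrew Ng's deep learning course notes and resources: http://www.ai-start.com/dl2017/
A Concise Handbook of TensorFlow 2 (简单粗暴TensorFlow 2): https://tf.wiki/zh_hans/
Visualizing the convolution process: https://poloclub.github.io/cnn-explainer/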

Natural Language Processing (NLP)

Stanford NLP: https://web.stanford.edu/class/cs224n/
Oxford NLP: https://github.com/oxford-cs-deepnlp-2017/lectures
A project tracking the current state of the art in NLP: https://github.com/yuquanle/NLP-progress
Chinese NLP resources: https://github.com/crownpku/awesome-chinese-nlp

Reinforcement Learning

EasyRL (the "Mushroom Book"):
https://datawhalechina.github.io/easy-rl/#/
Hands-on Reinforcement Learning:
https://github.com/boyu-ai/Hands-on-RL/tree/main
Reinforcement learning frameworks
OpenRL：https://github.com/OpenRL-Lab/openrl
RLAssistant(RLA)：https://github.com/xionghuichen/RLAssistant
PARL：https://github.com/PaddlePaddle/PARL
…

LLM Training

Pre-training

BackBones：
https://github.com/FreedomIntelligence/LLMZoo
Transformer
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
Transformer principles explained in detail: https://www.cnblogs.com/justLittleStar/p/17322172.html
Transformer model: PyTorch code walkthrough and hands-on training: https://www.cnblogs.com/justLittleStar/p/17786071.html
BERT
BERT principles explained: https://www.cnblogs.com/justLittleStar/p/17322240.html
BERT visualization: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
GPT
GPT principles explained: https://www.cnblogs.com/justLittleStar/p/17322259.html
The Illustrated GPT-2: https://jalammar.github.io/illustrated-gpt2/
GPT inference in 60 lines of code: https://www.cnblogs.com/justLittleStar/p/17925108.html
T5:
https://huggingface.co/google/flan-t5-xxl
ChatGLM:
https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/glm3.md
Baichuan:
https://gitee.com/mindspore/mindformers/blob/dev/research/baichuan2/baichuan2.md
Qwen:
https://zhuanlan.zhihu.com/p/690868924
https://zhuanlan.zhihu.com/p/702491999
https://huggingface.co/Qwen/Qwen-7B
Fine-tuning the Qwen2 large model:
Deploying Qwen on a phone:
LLaMA
https://github.com/meta-llama/llama
LLaMA2 training and inference, end to end: https://blog.csdn.net/qq_27149279/article/details/131981984

Supervised Fine-Tuning (SFT)

Training

All-in-one training tools
Firefly: https://github.com/yangjianxin1/Firefly
LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory
Fine-tuning frameworks
Unsloth: https://github.com/yangjianxin1/unsloth
PEFT: https://github.com/huggingface/peft
Distributed AI frameworks
https://github.com/microsoft/Megatron-DeepSpeed
[Megatron-DeepSpeed] A detailed walkthrough of the mpu tensor-parallel utility code: https://blog.csdn.net/bqw18744018044/article/details/131741282
Megatron source code, heavily illustrated: distributed overview and model partitioning: https://zhuanlan.zhihu.com/p/678208105
https://github.com/microsoft/DeepSpeed
DeepSpeed: AllReduce and ZeRO-DP: https://zhuanlan.zhihu.com/p/610587671
https://github.com/NVIDIA/Megatron-LM
GPT-2 pre-training, evaluation, and inference from scratch with Megatron-LM: https://juejin.cn/post/7259682893648724029
Distributed training principles, plus using mixed precision, DDP, DeepSpeed, and Megatron-LM: https://zhuanlan.zhihu.com/p/647389318
https://pytorch.org/tutorials/beginner/dist_overview.html
PyTorch
Megatron-LM
DeepSpeed
Megatron-DeepSpeed

LLM Fine-tuning

Full-parameter fine-tuning
Pipeline parallelism in practice for ChatGLM-6B with the DeepSpeed framework:
Large-model training with DeepSpeed
https://zhuanlan.zhihu.com/p/636488690
https://blog.csdn.net/zwqjoy/article/details/130732601
https://zhuanlan.zhihu.com/p/688873027
DeepSpeed
Parameter-efficient fine-tuning
https://zhuanlan.zhihu.com/p/676998456
https://www.cnblogs.com/ting1/p/18217395
AdaLoRA illustrated: https://zhuanlan.zhihu.com/p/657130029
LoRA illustrated: https://zhuanlan.zhihu.com/p/646831196
https://lightning.ai/pages/community/tutorial/lora-llm/
https://martinlwx.github.io/zh-cn/lora-finetuning/
https://blog.csdn.net/qq_45038038/article/details/135324609
https://zhuanlan.zhihu.com/p/693737958
Parameter-efficient fine-tuning of large models with BitFit: https://blog.csdn.net/DeepLn_HPC/article/details/138122100
Quick-start guide to fine-tuning ChatGLM2-6B with P-Tuning v2: https://zhuanlan.zhihu.com/p/645892136
P-Tuning v2
BitFit
Prefix Tuning
Prompt Tuning
Adapter Tuning
LoRA
AdaLoRA
QLoRA
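
For readers who want a feel for what the LoRA entry above refers to before opening the links, here is a minimal sketch in plain Python. It is not the PEFT API. It assumes the common formulation in which a frozen weight W gets a low-rank update scaled by alpha / r, so the output is W x + (alpha / r) * B (A x). The names matvec, lora_forward, and merge_lora and the toy matrices are made up for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Frozen base output plus the scaled low-rank update B (A x)."""
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge_lora(W, A, B, alpha):
    """Fold the adapter into W so inference needs no extra matmul."""
    r = len(A)
    return [
        [W[i][j] + (alpha / r) * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


# Toy shapes (assumed): d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen pretrained weight
A = [[0.1, 0.2, 0.3]]                    # trainable, r x d_in
B_zero = [[0.0], [0.0]]                  # B starts at zero, so training starts from the base model
B_trained = [[1.0], [2.0]]               # pretend values after some training
x = [1.0, 1.0, 1.0]

y_start = lora_forward(W, A, B_zero, x, alpha=2.0)
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0)
y_merged = matvec(merge_lora(W, A, B_trained, alpha=2.0), x)

full_params = len(W) * len(W[0])              # d_out * d_in
lora_params = len(A) * (len(W) + len(W[0]))   # r * (d_out + d_in)
```

Only A and B would be trained in this setup, which is why the trainable parameter count scales with r * (d_out + d_in) rather than d_out * d_in.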

Parallelism in Distributed Training

Data parallelism
https://zhuanlan.zhihu.com/p/618865052
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/
https://zhuanlan.zhihu.com/p/617133971
https://insujang.github.io/2022-06-11/parallelism-in-distributed-deep-learning/
DP (Data Parallel)
DDP (Distributed Data Parallel)
Zero Redundancy Optimizer (ZeRO)
Model parallelism:
https://zhuanlan.zhihu.com/p/613196255
https://zhuanlan.zhihu.com/p/622212228
Sorting out the training techniques behind 100-billion-parameter models: pipeline parallelism, tensor parallelism, and 3D parallelism: https://zhuanlan.zhihu.com/p/617087561
https://huggingface.co/docs/transformers/v4.18.0/en/parallelism
Tensor parallelism (TP)
Pipeline parallelism (PP)
MoE parallelism / expert parallelism
https://blog.csdn.net/qq_46207024/article/details/129665922
https://blog.csdn.net/qq_27590277/article/details/136360290
https://www.paddlepaddle.org.cn/documentation/docs/en/guides/06_distributed_training/distributed_overview.html
https://blog.csdn.net/lovechris00/article/details/138734349
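MindSpore distributed parallel training, basic examples:
Optimizer-related parallelism
https://blog.csdn.net/GarryWang1248/article/details/135340120
PyTorch distributed optimizers: https://www.cnblogs.com/rossiXYZ/p/15664335.html
Heterogeneous system parallelism
https://blog.csdn.net/GarryWang1248/article/details/135340120
Multi-dimensional hybrid parallelism (operator parallelism, pipeline parallelism, MoE parallelism…)
https://blog.csdn.net/qq_51175703/article/details/136932579
Automatic parallelism: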
https://zhuanlan.zhihu.com/p/662517647

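LLM Training Optimization Techniques

LLM training acceleration techniques in one article: https://zhuanlan.zhihu.com/p/649967866
I/O optimization: FlashAttention V1, V2
Operator optimization: NVIDIA CUDA operators
Communication optimization: ZeRO++, 1-bit Adam, All-reduce Bucket, Overlap communication
GPU memory optimization: https://zhuanlan.zhihu.com/p/648924115
Mixed-precision training:
https://zhuanlan.zhihu.com/p/650549120
Recomputation (activation checkpointing)
Analyzing the parameter count, compute, intermediate activations, and KV cache of transformer models: https://zhuanlan.zhihu.com/p/624740065
Gradient accumulation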
https://zhuanlan.zhihu.com/p/698787661
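
Gradient accumulation, the last item in the list above, is easy to check numerically. The sketch below is a toy in plain Python, not any framework's training loop. It assumes a one-parameter model y ≈ w * x with mean squared error, and equal-sized micro-batches. The names mse_grad and accumulated_grad and the data are made up for illustration.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for y ≈ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches.

    With equal-sized micro-batches this reproduces the full-batch gradient,
    while only one micro-batch has to be held in memory at a time.
    """
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    accum = 0.0
    for mb in micro_batches:
        accum += mse_grad(w, mb) / len(micro_batches)
    return accum  # the optimizer step would happen once, after this loop


# Toy data (assumed): 8 samples, split into 4 micro-batches of 2.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5

full = mse_grad(w, data)
accum = accumulated_grad(w, data, micro_batch_size=2)
```

The two gradients agree up to floating-point rounding, which is the point of the technique: a large effective batch size without the memory cost of a large batch.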

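LLM Compression

Quantization

References

Model quantization theory and hands-on code (LLM-QAT/GPTQ/BitNet 1.58Bits/OneBit): https://zhuanlan.zhihu.com/p/686161543
https://aistudio.baidu.com/projectdetail/3875525

Quantization targets

Weights
Activations
KV cache
Gradients

Quantization schemes

By whether the quantization levels are evenly spaced over the original data range: linear quantization and non-linear quantization
By how widely the quantization parameters s (scale) and z (zero point) are shared, i.e. the quantization granularity: per-tensor (layer-wise) quantization, per-token and per-channel quantization, and per-group quantization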
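
To make the role of the quantization parameters s and z concrete, here is a minimal sketch of asymmetric linear quantization in plain Python, assuming the common convention q = round(x / s) + z and x ≈ s * (q - z). It compares sharing one (s, z) pair across a whole tensor with computing a pair per group. The function names, the 4-bit setting, and the toy weights are assumptions for illustration, not taken from any library.

```python
def quant_params(values, num_bits=8):
    """Return (s, z) for asymmetric linear quantization to unsigned integers."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # keep 0.0 inside the representable range
    hi = max(max(values), 0.0)
    s = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero input
    z = int(round(-lo / s))
    return s, z


def quantize(values, s, z, num_bits=8):
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, int(round(v / s)) + z)) for v in values]


def dequantize(q, s, z):
    return [s * (qi - z) for qi in q]


def roundtrip_per_tensor(values, num_bits=8):
    """One (s, z) pair shared by every value."""
    s, z = quant_params(values, num_bits)
    return dequantize(quantize(values, s, z, num_bits), s, z)


def roundtrip_per_group(values, group_size, num_bits=8):
    """A separate (s, z) pair for each consecutive group of values."""
    out = []
    for i in range(0, len(values), group_size):
        out.extend(roundtrip_per_tensor(values[i:i + group_size], num_bits))
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy weights (assumed): four small values followed by four large ones.
weights = [0.01, -0.02, 0.03, -0.01, 5.0, -4.0, 3.5, -2.5]
rt_tensor = roundtrip_per_tensor(weights, num_bits=4)
rt_group = roundtrip_per_group(weights, group_size=4, num_bits=4)

# Error on the small weights only: they suffer most from a shared scale.
small_err_tensor = max_abs_error(weights[:4], rt_tensor[:4])
small_err_group = max_abs_error(weights[:4], rt_group[:4])
```

With a single shared scale the large weights dictate the step size and the small weights collapse toward zero, whereas per-group parameters let the small group keep its own finer step. The price is storing more (s, z) pairs.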

Quantization categories

By the stage at which quantization is applied to compress the model, model quantization can be divided into:

Quantization-Aware Training (QAT)
QLoRA (Quantized LoRA) explained: https://zhuanlan.zhihu.com/p/666234324
QLoRA and GPTQ: an overview of model quantization: https://zhuanlan.zhihu.com/p/646210009
https://github.com/facebookresearch/LLM-QAT
LLM-QAT:
QLoRA
PEQA
Post-Training Quantization (PTQ)
SmoothQuant:
RPTQ
ZeroQuant-FP: https://zhuanlan.zhihu.com/p/683813769
https://arxiv.org/abs/2211.10438
https://zhuanlan.zhihu.com/p/627436535
https://zhuanlan.zhihu.com/p/646210009
GPTQ-for-LLaMa quantization analysis and optimization: https://zhuanlan.zhihu.com/p/625701227
LUT-GEMM:
LLM.int8():
ZeroQuant:
GPTQ:
AWQ:
INT4/INT8
https://blog.csdn.net/weixin_42764932/article/details/131230429
https://zhuanlan.zhihu.com/p/627436535
Large-model quantization techniques, the ZeroQuant series: https://zhuanlan.zhihu.com/p/683813769
https://zhuanlan.zhihu.com/p/690673432
https://arxiv.org/abs/2210.17323
4-bit LLM quantization with GPTQ:
https://arxiv.org/abs/2306.00978
Model quantization (INT8/INT4) techniques for large language models: https://zhuanlan.zhihu.com/p/627436535
Weight-only quantization
Full quantization (weights and activations)

Pruning

Model compression and acceleration in deep learning (a long-form introduction): https://blog.csdn.net/weixin_54338498/article/details/127588261
A long-form survey of deep neural network pruning: https://zhuanlan.zhihu.com/p/692858636
Model compression: pruning algorithms explained: https://zhuanlan.zhihu.com/p/622519997

Knowledge Distillation

Knowledge distillation, a start-to-finish study guide: https://zhuanlan.zhihu.com/p/696383649

Low-Rank Factorization

https://blog.csdn.net/qq_51175703/article/details/138320834
https://zhuanlan.zhihu.com/p/628232317

LLM Compilation

Compiler frameworks

MLIR
INT8 quantized LLM deployment with TPU-MLIR: https://zhuanlan.zhihu.com/p/654828412
XLA: https://github.com/openxla/xla
TVM: https://tvm.hyper.ai/docs

LLM Inference

Model inference deployment and serving options

Server-side deployment
Edge-device deployment
Cloud deployment
Web deployment

Model inference serving tools

Wrapping an AI model in a web framework to serve it
https://www.tornadoweb.org/en/stable/
https://dormousehole.readthedocs.io/en/latest/
https://blog.csdn.net/chinesehuazhou2/article/details/114297858
Sanic
Flask
Tornado
Using the Serving component that ships with a deep learning framework
https://zhuanlan.zhihu.com/p/616740782
TensorFlow Serving: https://github.com/tensorflow/serving
TorchServe: https://pytorch.org/serve/
MindSpore Serving: https://gitee.com/mindspore/serving
Unified inference serving tools that support multiple frameworks
https://www.hiascend.com/software/mindie
https://developer.nvidia.cn/triton-inference-server
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_triton/
Triton Inference Server
MindIE-Service:

Inference Acceleration Frameworks

TensorRT-LLM:
https://developer.nvidia.com/zh-cn/blog/tune-and-deploy-lora-llms-with-nvidia-tensorrt-llm/
https://blog.csdn.net/kunhe0512/article/details/138286905?spm=1001.2014.3001.5502
https://github.com/NVIDIA/TensorRT-LLM/tree/v0.5.0
Using and deploying Llama3 with TensorRT-LLM and Triton Inference Server:
Tuning and deploying LoRA LLMs with NVIDIA TensorRT-LLM:
vLLM:
https://zhuanlan.zhihu.com/p/691045737
https://github.com/vllm-project/vllm
vLLM source code analysis:
llama.cpp:
https://blog.csdn.net/weixin_51717597/article/details/134343802
https://github.com/ggerganov/llama.cpp/tree/master/examples/main
Quantizing and deploying Llama2 models with llama.cpp:
HuggingFace TGI:
https://github.com/huggingface/text-generation-inference
FasterTransformer:
https://zhuanlan.zhihu.com/p/626008090
https://github.com/NVIDIA/FasterTransformer
A brief analysis of FasterTransformer:
https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md
DeepSpeed
https://zhuanlan.zhihu.com/p/629644249
How DeepSpeed accelerates large-model inference through system optimizations:
DeepSpeed-MII:
https://github.com/microsoft/DeepSpeed-MII
LMDeploy:
https://github.com/InternLM/Tutorial/blob/7c2a385cd772ed93965927599b0159c52068da85/lmdeploy/lmdeploy.md
https://blog.csdn.net/weixin_61573157/article/details/137782082
https://github.com/InternLM/lmdeploy
https://lmdeploy.readthedocs.io/zh-cn/latest/index.html
https://blog.csdn.net/weixin_42475060/article/details/135386145
Quantized deployment of LLMs and VLMs with LMDeploy in practice:
LMDeploy quantization and deployment:
MindFormers:
https://gitee.com/mindspore/mindformers
MindIE:
https://www.hiascend.com/software/mindie
MindSpore Lite:
https://www.mindspore.cn/lite

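Inference Optimization

KV Cache
https://zhuanlan.zhihu.com/p/662498827
https://zhuanlan.zhihu.com/p/685853516
https://zhuanlan.zhihu.com/p/679249229
KV Cache for large-model inference optimization, illustrated:
100x inference speedup for large models, the KV cache chapter:
Large-model inference acceleration: learning KV Cache through diagrams:
FlashAttention
https://zhuanlan.zhihu.com/p/642412124
FlashAttention V1, from hardware to computation logic: https://zhuanlan.zhihu.com/p/669926191
Illustrated series on accelerating large-model computation:
Inference optimization for LLMs:
MQA/GQA
https://blog.csdn.net/qq128252/article/details/138704958
MHA, MQA, and GQA attention mechanisms explained: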
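
The KV cache and MQA/GQA entries above are closely linked: sharing key/value heads across query heads shrinks the cache. The back-of-the-envelope sketch below shows the arithmetic in plain Python. The model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, fp16) is a hypothetical configuration chosen for round numbers, and the function names are made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the cached keys and values for one forward context.

    The leading factor 2 counts one K tensor and one V tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


# Hypothetical model shape, fp16 cache (2 bytes per value).
LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(LAYERS, num_kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
mqa = kv_cache_bytes(LAYERS, num_kv_heads=1, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

MIB = 2 ** 20
sizes_mib = {"MHA": mha // MIB, "GQA": gqa // MIB, "MQA": mqa // MIB}

# With 32 query heads and 8 KV heads, query heads 0-3 share KV head 0, and so on.
gqa_mapping = [kv_head_for_query_head(h, Q_HEADS, 8) for h in range(Q_HEADS)]
```

Under these assumptions MHA needs 2048 MiB per sequence, GQA with 8 KV heads needs 512 MiB, and MQA needs 64 MiB, which is why the cache scales with the number of KV heads rather than the number of query heads.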

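LLM Deployment Environment

Cluster

Communication

Communication hardware

NVLink
https://mp.weixin.qq.com/s/itIi3FvUiMsGhMR2ou5Syw
How multiple GPUs interconnect and communicate, in one article:
https://www.nvidia.com/en-us/data-center/nvlink/
NVMe SSD
https://zhuanlan.zhihu.com/p/672098336
NVMe SSDs in AI cluster infrastructure, explained:
InfiniBand
https://zhuanlan.zhihu.com/p/673903240

Communication software (NCCL, HCCL, …)

https://pytorch.org/tutorials/intermediate/dist_tuto.html#collective-communication
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
"Mix it into your Lantern Festival rice balls. One word: delicious!" | Hard-core distributed training techniques: communication primitives: https://blog.csdn.net/Kenji_Shinji/article/details/125292757

Communication network monitoring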

nvbandwidth：https://github.com/NVIDIA/nvbandwidth
DCGM：https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html

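Platform

Kubernetes: https://www.seldon.io/deploying-machine-learning-models-on-kubernetes

AI Chips

NVIDIA GPU
Google TPU
Huawei Ascend NPU
Baidu Kunlunxin
Cambricon MLU (思元)
Alibaba T-Head Hanguang

GPU Memory

GPU memory usage during deep learning training, analysis and optimization: https://saikr.com/a/533227
An in-depth look at the GPU memory footprint of large language models: https://blog.csdn.net/qq_43592352/article/details/137055671
Mixed-precision training and GPU memory analysis: https://baiqw.blog.csdn.net/article/details/131030255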
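
As a rough companion to the memory-analysis articles above, the sketch below estimates the memory taken by model states alone under mixed-precision training with Adam. It uses the accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). Activations, temporary buffers, and fragmentation are deliberately left out, so real usage will be higher. The function name and the 7.5B-parameter, 64-GPU example are assumptions for illustration.

```python
def model_state_bytes(num_params, zero_stage=0, num_gpus=1):
    """Per-GPU bytes for weights, gradients, and Adam states.

    zero_stage 0: everything replicated on every GPU.
    zero_stage 1: optimizer states partitioned across GPUs.
    zero_stage 2: gradients partitioned as well.
    zero_stage 3: weights partitioned as well.
    """
    weights = 2 * num_params      # fp16 parameters
    grads = 2 * num_params        # fp16 gradients
    optimizer = 12 * num_params   # fp32 master weights + momentum + variance
    if zero_stage >= 1:
        optimizer /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return weights + grads + optimizer


# Hypothetical setup: 7.5B parameters on 64 GPUs.
PARAMS, GPUS = 7_500_000_000, 64
per_gpu_gb = {
    stage: model_state_bytes(PARAMS, stage, GPUS) / 1e9
    for stage in (0, 1, 2, 3)
}
```

Under these assumptions the per-GPU figure drops from 120 GB with no partitioning to about 31.4 GB, 16.6 GB, and 1.9 GB for ZeRO stages 1, 2, and 3.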

LLM Application Development

Development frameworks

LangChain:
https://zhuanlan.zhihu.com/p/665503140
https://github.com/langchain-ai/langchain
Theory and practice: a detailed look at LangChain, the hottest LLM application framework:
How LangChain Agents work: https://blog.csdn.net/2301_78285120/article/details/135303183
Hugging Face:
https://github.com/huggingface