Download the model via Git
Make sure git-lfs is installed correctly:
git lfs install
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3.1-8B-Instruct.git
https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Extract the archive and place it in the same directory as the model files.
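The extraction step can also be done from Python with the standard-library tarfile module. The sketch below builds a tiny stand-in archive so it runs offline; against the real download you would point the same extract call at aclImdb_v1.tar.gz instead.

```python
import os
import tarfile
import tempfile

# Build a small stand-in for aclImdb_v1.tar.gz so the sketch runs offline.
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, "aclImdb", "train"), exist_ok=True)
with open(os.path.join(workdir, "aclImdb", "train", "sample.txt"), "w") as f:
    f.write("a positive review")
archive = os.path.join(workdir, "demo.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(os.path.join(workdir, "aclImdb"), arcname="aclImdb")

# The actual extraction step: equivalent to `tar -xzf aclImdb_v1.tar.gz`.
dest = tempfile.mkdtemp()
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(dest)
print(sorted(os.listdir(os.path.join(dest, "aclImdb"))))  # ['train']
```

The real archive unpacks to an `aclImdb/` directory containing `train/` and `test/` splits.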
pip install trl
pip install bitsandbytes
pip install accelerate
pip install peft
# accelerate config
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]:no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:3
Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]: yes
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
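The answers above are written to a YAML file (by default `~/.cache/huggingface/accelerate/default_config.yaml`). Roughly, the session above should produce something like the following — the exact set of keys varies with the accelerate version, so treat this as a sketch rather than the canonical output:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: true
gpu_ids: '3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

If a run fails with a confusing launcher error, inspecting (or deleting and regenerating) this file is often quicker than re-answering the prompts from memory.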
trl sft \
--model_name_or_path LLM-Research/Meta-Llama-3___1-8B-Instruct \
--dataset_name aclImdb_v1 \
--dataset_text_field text \
--load_in_4bit \
--use_peft \
--max_seq_length 512 \
--learning_rate 0.001 \
--per_device_train_batch_size 2 \
--output_dir ./sft-imdb-llama3-8b \
--logging_steps 10
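For a rough sense of scale: the aclImdb train split has 25,000 examples, and with a per-device batch size of 2 on a single GPU (and no gradient accumulation), the step count works out as below. The epoch count of 3 is an assumption here — it is the `transformers` `TrainingArguments` default, which applies when the command does not override it.

```python
# Rough step-count arithmetic for the run above (single GPU,
# no gradient accumulation).
train_examples = 25_000           # aclImdb train split size
per_device_train_batch_size = 2   # from the command above
num_train_epochs = 3              # transformers default (assumption)

steps_per_epoch = train_examples // per_device_train_batch_size
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 12500 37500
```

At the reported ~61 hours, that would be on the order of 6 seconds per optimizer step, which is in a plausible range for 4-bit QLoRA on an 8B model at a 512-token sequence length.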
This fine-tuning run used a single A100 GPU and took about 61 hours; treat that duration as a rough reference.
Reference: https://blog.csdn.net/zhujiahui622/article/details/138308088
1. Warnings that can be ignored
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
The first warning can be resolved by installing the corresponding package:
apt install libaio-dev
2. Fine-tuning error
W0730 06:27:45.490000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963586 closing signal SIGTERM
W0730 06:27:45.493000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963587 closing signal SIGTERM
W0730 06:27:45.495000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 3963589 closing signal SIGTERM
E0730 06:27:45.713000 139907808057152 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 3963588) of binary: /home/test/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/test/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/test/anaconda3/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/scripts/sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-30_06:27:45
  host      : node20
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 3963588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[06:27:46] TRL - SFT failed on ! See the logs above for further details.  cli.py:67
Traceback (most recent call last):
  File "/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/cli.py", line 58, in main
    subprocess.run(
  File "/home/test/anaconda3/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['accelerate', 'launch', '/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/scripts/sft.py', '--model_name_or_path', 'LLM-Research/Meta-Llama-3___1-8B-Instruct', '--dataset_name', '/imdb', '--dataset_text_field', 'text', '--load_in_4bit', '--use_peft', '--max_seq_length', '512', '--learning_rate', '0.001', '--per_device_train_batch_size', '2', '--output_dir', './sft-imdb-llama3-8b', '--logging_steps', '10']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/test/anaconda3/bin/trl", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/test/anaconda3/lib/python3.11/site-packages/trl/commands/cli.py", line 68, in main
    raise ValueError("TRL CLI failed! Check the traceback above..") from exc
ValueError: TRL CLI failed! Check the traceback above..
After regenerating the configuration file with `accelerate config` and retrying, the run succeeded.
If you are unsure what each option does, do not change its value; the settings shown above are known to work.
https://github.com/hiyouga/LLaMA-Factory
Fine-Tuning with LLaMA Board GUI
llamafactory-cli webui
Select the model, the dataset, and the fine-tuning parameters, then start training.
References:
https://www.cnblogs.com/hlgnet/articles/18148788
https://blog.csdn.net/u010438035/article/details/140326826?spm=1001.2014.3001.5502