LoRA training uses the scripts from https://github.com/Akegarasu/lora-scripts.
LoRA training needs paired text-image data, so a training set has to be prepared accordingly.
1. Training data preparation
Use deepbooru or BLIP to generate captions for the training images; for architecture subjects BLIP is recommended.
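Before launching training it helps to confirm that every image has a matching caption file. A minimal sketch (standard library only; the base path is the train_data_dir from the args dump in section 5, and the 20_arch subfolder is assumed for illustration, following note 6.a):

```python
from pathlib import Path

# Dataset folder from the args in section 5; "20_arch" follows the folder
# naming convention from note 6.a (assumed here for illustration).
data_dir = Path("/home/sniss/local_disk/lora-scripts/data/20_arch")

image_exts = {".png", ".jpg", ".jpeg", ".webp"}
images = [p for p in data_dir.iterdir() if p.suffix.lower() in image_exts]

# Each image needs a same-named .txt caption produced by deepbooru/BLIP.
missing = [p.name for p in images if not p.with_suffix(".txt").exists()]
print(f"{len(images)} images, {len(missing)} missing captions")
for name in missing:
    print("no caption for:", name)
```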
2. LoRA environment on Linux
CUDA 10.1, P40 GPU, Python 3.7
accelerate==0.15.0 (apparently only usable inside the virtual environment; in train.sh, replace accelerate launch --num_cpu_threads_per_process=8 with python; with this change, accelerate multi-GPU training is broken)
albumentations==0.2.0
scikit-image==0.14 (a higher version raises errors)
numpy==1.17
(there is an skimage version issue here that will raise errors)
safetensors==0.3.0
voluptuous==0.12.1
huggingface-hub==0.12.0
transformers==4.20.0
tokenizers==0.11.6
opencv-python==4.0.0.21
einops==0.3.0
ftfy==6.0
pytorch-lightning==1.2.8
xformers==0.0.9 (compatible with torch==1.8.1)
diffusers==0.10.0
pyre-extensions==0.3.0
regex==2021.4.4
Upgrade glibc
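A small sketch to confirm that the pins above are what is actually installed in the virtual environment (pkg_resources is used because importlib.metadata is not available on Python 3.7; the pin list here is a subset of the one above):

```python
import pkg_resources

# Pins taken from the list above; extend as needed.
pins = {
    "accelerate": "0.15.0",
    "albumentations": "0.2.0",
    "scikit-image": "0.14",   # higher versions raise errors here
    "numpy": "1.17",
    "safetensors": "0.3.0",
    "transformers": "4.20.0",
    "tokenizers": "0.11.6",
    "diffusers": "0.10.0",
    "pytorch-lightning": "1.2.8",
    "xformers": "0.0.9",
}

for name, want in pins.items():
    try:
        got = pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        got = "NOT INSTALLED"
    flag = "" if got.startswith(want) else "  <-- mismatch"
    print(f"{name:20s} want {want:10s} got {got}{flag}")
```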
3. Training with sh train.sh
The OpenAI CLIP weights need to be configured:
in library/train_util.py, around line 1900, the load_tokenizer function calls tokenizer = CLIPTokenizer.from_pretrained(),
which loads OpenAI's clip-vit-large-patch14 weights.
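If the machine cannot reach huggingface.co, one workaround (a sketch; the local path is hypothetical) is to download the openai/clip-vit-large-patch14 tokenizer files, including the tokenizer_config.json mentioned in note 6.c, and point from_pretrained at the local directory:

```python
from transformers import CLIPTokenizer

# Local copy of openai/clip-vit-large-patch14; must contain vocab.json,
# merges.txt, special_tokens_map.json and tokenizer_config.json
# (a missing tokenizer_config.json is what triggers the error in note 6.c).
local_clip = "/home/sniss/local_disk/clip-vit-large-patch14"  # hypothetical path

tokenizer = CLIPTokenizer.from_pretrained(local_clip)
print(tokenizer("a photo of a modern building"))
```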
4. Core code walkthrough of lora-scripts (a condensed sketch of the training step follows this call chain)
- train_network.py->train->
- train_util.load_tokenizer->
- BlueprintGenerator->
- config_util.generate_dreambooth_subsets_config_by_subdirs->
- blueprint_generator.generate->
- config_util.generate_dataset_group_by_blueprint 加载数据->
- train_util.prepare_accelerator->
- train_util.prepare_dtype->
- train_util.load_target_model 加载sd模型->
- train_util.replace_unet_modules->
- vae.to(accelerator.device)->
- vae.requires_grad_(False)->
- vae.eval()->
- train_dataset_group.cache_latents->
- network_module(LoRANetwork)->
- network.apply_to(text_encoder,unet,train_text_encoder,train_unet)->
- network.prepare_optimizer_params->
- train_util.get_optimizer->
- train_dataloader=torch.utils.data.DataLoader(train_dataset_group)->
- lr_scheduler=train_util.get_scheduler_fix->
- unet,text_encoder,network,optimizer,train_dataloader,lr_scheduler=accelerator.prepare(unet,text_encoder,network,optimizer,train_dataloader,lr_scheduler)->
- unet.requires_grad_(False)->
- unet.to(accelerator.device)->
- text_encoder.requires_grad_(False)->
- text_encoder.to(accelerator.device)->
- unet.eval()->
- text_encoder.eval()->
- network.prepare_grad_etc(text_encoder,unet)->
- dataset=train_dataset_group.dataset[0]->
- noise_scheduler=DDPMScheduler(beta_start=0.00085,beta_end=0.012,beta_schedule='scaled_linear',num_train_timesteps=1000,clip_sample=False)->
- accelerator.init_trackers('network_train')->
- network.on_epoch_start(text_encoder,unet)->
- latents=batch['latents'].to(accelerator.device)->
- latents=latents*0.18215->
- encoder_hidden_states=train_util.get_hidden_states->
- noise=torch.randn_like(latents,device=latents.device)->
- timesteps=torch.randint(0,noise_scheduler.config.num_train_timesteps,(b_size,),device=latents.device)->
- noisy_latents=noise_scheduler.add_noise(latents,noise,timesteps) [1,4,64,64]->
- noise_pred=unet(noisy_latents,timesteps,encoder_hidden_states).sample [1,4,64,64]->
- target=noise_scheduler.get_velocity(latents,noise,timesteps) (only when v_parameterization; otherwise target=noise) [1,4,64,64]->
- loss=torch.nn.functional.mse_loss(noise_pred.float(),target.float(),reduction='none') [1,4,64,64]->
- loss=loss.mean([1,2,3])->
- loss_weights=batch['loss_weights']->
- loss=loss*loss_weights->
- accelerator.backward(loss)->
- params_to_clip=network.get_trainable_params()->
- accelerator.clip_grad_norm_(params_to_clip,args.max_grad_norm)->
- optimizer.step()->
- lr_scheduler.step()->
- optimizer.zero_grad()->
- train_util.sample_images(accelerator,args,None,global_step,accelerator.device,vae,tokenizer,text_encoder,unet)
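A condensed sketch of the inner training step from the call chain above (shapes follow the [1,4,64,64] example; this approximates what train_network.py does and is not a drop-in replacement):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

# Same scheduler configuration as in the call chain above.
noise_scheduler = DDPMScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear",
    num_train_timesteps=1000, clip_sample=False,
)

def training_step(batch, unet, text_encoder, device):
    # Latents were pre-computed by cache_latents; 0.18215 is the SD scaling factor.
    latents = batch["latents"].to(device) * 0.18215              # [1, 4, 64, 64]

    # Text conditioning; train_util.get_hidden_states additionally handles
    # clip_skip and max_token_length, which is skipped in this sketch.
    encoder_hidden_states = text_encoder(batch["input_ids"].to(device))[0]

    # Sample noise and a random timestep, then noise the latents.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Predict the noise; with v_parameterization the target would instead be
    # noise_scheduler.get_velocity(latents, noise, timesteps).
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="none")
    loss = loss.mean([1, 2, 3]) * batch["loss_weights"].to(device)
    return loss.mean()
```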
5. Input arguments
- args = Namespace(
- bucket_no_upscale=False,
- bucket_reso_steps=64,
- cache_latents=True,
- caption_dropout_every_n_epochs=0,
- caption_dropout_rate=0.0,
- caption_extension='.txt',
- caption_extention=None,
- caption_tag_dropout_rate=0.0,
- clip_skip=2,
- color_aug=False,
- dataset_config=None,
- dataset_repeats=1,
- debug_dataset=False,
- enable_bucket=True,
- face_crop_aug_range=None,
- flip_aug=False,
- full_fp16=False,
- gradient_accumulation_steps=1,
- gradient_checkpointing=False,
- in_json=None,
- keep_tokens=0,
- learning_rate=0.0001,
- log_prefix=None,
- logging_dir='./logs',
- lowram=False,
- lr_scheduler='cosine_with_restarts',
- lr_scheduler_num_cycles=1,
- lr_scheduler_power=1,
- lr_warmup_steps=0,
- max_bucket_reso=1024,
- max_data_loader_n_workers=8,
- max_grad_norm=1.0,
- max_token_length=225,
- max_train_epochs=10,
- max_train_steps=1600,
- mem_eff_attn=False,
- min_bucket_reso=256,
- mixed_precision='fp16',
- network_alpha=32.0,
- network_args=None,
- network_dim=32,
- network_module='networks.lora',
- network_train_text_encoder_only=False,
- network_train_unet_only=False,
- network_weights=None,
- no_metadata=False,
- noise_offset=0.0,
- optimizer_args=None,
- optimizer_type='',
- output_dir='./output',
- output_name='/home/sniss/local_disk/lora-scripts/output',
- persistent_data_loader_workers=False,
- pretrained_model_name_or_path='/home/sniss/local_disk/stable-diffusion-webui_23-02-17/models/Stable-diffusion/sd-v1.5.ckpt',
- prior_loss_weight=1.0,
- random_crop=False,
- reg_data_dir=None,
- resolution=(512, 512),
- resume=None,
- sample_every_n_epochs=None,
- sample_every_n_steps=None,
- sample_prompts=None,
- sample_sampler='ddim',
- save_every_n_epochs=2,
- save_last_n_epochs=None,
- save_last_n_epochs_state=None,
- save_model_as='ckpt',
- save_n_epoch_ratio=None,
- save_precision='fp16',
- save_state=False,
- seed=1337,
- shuffle_caption=True,
- text_encoder_lr=1e-05,
- tokenizer_cache_dir=None,
- train_batch_size=1,
- train_data_dir='/home/sniss/local_disk/lora-scripts/data',
- training_comment=None,
- unet_lr=0.0001,
- use_8bit_adam=False,
- use_lion_optimizer=False,
- v2=False,
- v_parameterization=False,
- vae=None,
- xformers=True)
The parameters are set in several places:
1. the train.sh input script
2. the main section of train_network.py
3. around line 1536 of train_util.py:
add_sd_models_args / add_optimizer_args / add_training_args / add_dataset_args
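Since accelerate had to be replaced with a plain python launch (section 2), the command that train.sh ends up running looks roughly like the following, rebuilt here from the Namespace in section 5 (paths are the ones shown there; treat this as a sketch, not the exact script):

```python
import subprocess

# Approximate training invocation, reconstructed from the args dump above.
cmd = [
    "python", "./sd-scripts/train_network.py",
    "--pretrained_model_name_or_path",
    "/home/sniss/local_disk/stable-diffusion-webui_23-02-17/models/Stable-diffusion/sd-v1.5.ckpt",
    "--train_data_dir", "/home/sniss/local_disk/lora-scripts/data",
    "--output_dir", "./output",
    "--logging_dir", "./logs",
    "--resolution", "512,512",
    "--network_module", "networks.lora",
    "--network_dim", "32",
    "--network_alpha", "32",
    "--learning_rate", "1e-4",
    "--unet_lr", "1e-4",
    "--text_encoder_lr", "1e-5",
    "--lr_scheduler", "cosine_with_restarts",
    "--max_train_epochs", "10",
    "--train_batch_size", "1",
    "--mixed_precision", "fp16",
    "--save_precision", "fp16",
    "--save_model_as", "ckpt",
    "--save_every_n_epochs", "2",
    "--max_token_length", "225",
    "--clip_skip", "2",
    "--seed", "1337",
    "--cache_latents",
    "--enable_bucket",
    "--shuffle_caption",
    "--xformers",
]
subprocess.run(cmd, check=True)
```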
6. Notes
a. The dataset folder name starts with a number, e.g. 20_arch; this number is the per-image repeat count per epoch, which is why it interacts with the epoch count (see the sketch after these notes).
b. In train_network
c. max_token_length can be 75, 150 or 225; using 225 raises an error? RuntimeError: The size of tensor a (227) must match the size of tensor b (77) at non-singleton dimension 1. This is a big pitfall: the root cause is that the tokenizer_config.json file was missing when CLIP was initialized.
d. bash: accelerate: command not found; not resolved.
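Regarding note a: a small sketch of how the 20_ prefix of the folder name turns into optimizer steps (the image count is hypothetical; epochs and batch size come from the args in section 5):

```python
# The leading number of the dataset folder ("20" in "20_arch") is the
# per-image repeat count per epoch in kohya-style datasets.
num_images = 50          # hypothetical number of images in 20_arch
repeats = 20             # from the "20_" prefix of the folder name
epochs = 10              # max_train_epochs in section 5
batch_size = 1           # train_batch_size in section 5

steps_per_epoch = num_images * repeats // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 1000 steps/epoch, 10000 steps total
```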
Multi-GPU training
- python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr=localhost --master_port=22222 --use_env "./sd-scripts/train_network.py" \
-
- import torch.distributed as dist
- dist.init_process_group(backend='gloo', init_method='env://')
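With --use_env, torch.distributed.launch exports RANK, WORLD_SIZE and LOCAL_RANK as environment variables instead of passing --local_rank, which is what init_method='env://' relies on. A minimal sketch of what each worker process does at startup (gloo backend as above; assumes the launcher has set those variables):

```python
import os
import torch
import torch.distributed as dist

# RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT are set by
# torch.distributed.launch when started with --use_env.
dist.init_process_group(backend="gloo", init_method="env://")

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} on cuda:{local_rank}")
```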