
Training Stable Diffusion with Hugging Face diffusers: a LoRA training log

Training LoRA with diffusers

Contents

1. Theory

        What is the goal of a diffusion model?

        How does a diffusion model work?

        What does the forward process do?

        What does the reverse process do?

2. Environment setup

3. LoRA training

4. Inference

5. Source code

        noise_scheduler:

        tokenizer:

        textModel:

        vae:

        unet:

        The per-step training process:


Code: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image

Paper: Denoising Diffusion Probabilistic Models, https://arxiv.org/abs/2006.11239

1. Theory

I found a good introductory video:

https://www.bilibili.com/video/BV1tz4y1h7q1/?spm_id_from=333.337.search-card.all.click&vd_source=3aec03706e264c240796359c1c4d7ddc

In short: a diffusion model has two parts. The forward process adds noise according to the timestep: at each step, noise is sampled from a Gaussian and added to the image, and this sampled noise is the ground truth. The reverse process is denoising: a neural network (a U-Net) is trained to predict the noise so that it can be removed.

What is the goal of a diffusion model?

To learn how to generate images starting from pure noise.

How does a diffusion model work?

Train a U-Net that takes a series of noised images and learns to predict the added noise (noisy image - predicted noise = generated image).

What does the forward process do?

It gradually adds noise to a real image until only pure noise is left.

For every image in the training set, it can generate a whole series of noised images at different noise levels.

At training time, these pairs of (noised image at some level, the noise used to create it) are the actual training samples.

What does the reverse process do?

Once the model is trained, it is used to sample, i.e. to generate images.

Forward process:

For x_2, substitute the expression for x_1 into x_2 to eliminate x_1; repeating this step by step finally gives the relation between x_t and x_0:
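Written out in standard DDPM notation (equation (4) of the paper), this closed-form relation is:

    x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
    \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \quad \epsilon \sim \mathcal{N}(0, I)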

Adding up several Gaussian noises (their means and variances simply add) is equivalent to adding a single Gaussian noise, so in the end x_t depends only on the timestep t, the schedule parameter ᾱ_t, one noise sample ε, and x_0.

Reverse process:

In the reverse (sampling) process, the pipeline first produces a noise image with the same shape as the input image. At each timestep, the noisy image is passed to the model, which predicts the noise residual; the scheduler then uses the predicted residual to compute a slightly less noisy image. Repeating this until the preset maximum number of timesteps is reached yields a high-quality generated image. The derivation uses Bayes' rule:

        Goal: given the noisy x_t, solve for x_{t-1}, i.e. p(x_{t-1} | x_t), and continue all the way down to x_0 (inferring the earlier state from the observed result).

By Bayes' rule this equals p(x_t | x_{t-1}), the probability of x_t given x_{t-1}, times p(x_{t-1}) / p(x_t).

Pushing the recursion down to x_0 and substituting the Gaussian density into each term gives a closed-form posterior.
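Written out, this posterior is Gaussian (equations (6) and (7) of the DDPM paper), and it is exactly what the scheduler's step() function below computes:

    q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\big)

    \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0
        + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t,
    \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t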

2. Environment setup

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
cd examples/text_to_image
pip install -r requirements.txt

3. LoRA training

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora" \
  --validation_prompt="cute dragon creature" --report_to="wandb"

Error: ValueError: Attempting to unscale FP16 gradients.

Reference: https://github.com/ymcui/Chinese-LLaMA-Alpaca/issues/310

Workaround: remove --mixed_precision="fp16" from the command.

A more reasonable explanation: the mixed-precision GradScaler cannot unscale gradients of parameters that are stored in fp16, so the trainable (LoRA) parameters should be kept in fp32, while only the frozen base weights stay in fp16.
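An alternative fix along those lines, as a minimal sketch (it assumes `unet` already has its LoRA adapters attached and the base weights frozen; this is not the exact patch from the issue):

import torch

unet.to(dtype=torch.float16)                        # frozen base weights can stay in fp16
for param in unet.parameters():
    if param.requires_grad:                         # only the LoRA adapter parameters are trainable
        param.data = param.data.to(torch.float32)   # keep trainable params in fp32 so GradScaler can unscale their gradients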

Training then runs successfully.

4. Inference

I only trained for 100 steps, just enough to get the whole pipeline running; the learning itself is the point here.

from diffusers import StableDiffusionPipeline
import torch

model_path = "./sd-pokemon-model-lora"  # the --output_dir used for training above
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe.unet.load_attn_procs(model_path)   # load the LoRA attention weights into the UNet
pipe.to("cuda")

prompt = "A pokemon with green head and white legs."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("pokemon.png")

5. Source code

noise_scheduler:

In diffusion, the scheduler is a sampler: it turns a noisy image back into the original image, i.e. it implements the reverse diffusion. Note: we call the denoising procedure "sampling", and the method that performs it a "sampler". The scheduler implementations can be found in the diffusers library under diffusers/src/diffusers/schedulers/.

https://zhuanlan.zhihu.com/p/674001640

A scheduler defines how to iteratively add noise to an image, or how to update a sample based on the model's output.

The different ways of adding noise correspond to the different algorithms for training a diffusion model by noising images.

For inference, the scheduler defines how to update the sample based on the pretrained model's output.

A scheduler is usually specified by a noise schedule and an update rule that together solve the underlying differential equation.

# Excerpt from diffusers/src/diffusers/schedulers/scheduling_ddpm.py (methods of DDPMScheduler)

def add_noise(
    self,
    original_samples: torch.FloatTensor,
    noise: torch.FloatTensor,
    timesteps: torch.IntTensor,
) -> torch.FloatTensor:
    self.alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device)
    alphas_cumprod = self.alphas_cumprod.to(dtype=original_samples.dtype)
    timesteps = timesteps.to(original_samples.device)

    sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5  # take the cumulative alpha_bar for timestep t
    sqrt_alpha_prod = sqrt_alpha_prod.flatten()
    while len(sqrt_alpha_prod.shape) < len(original_samples.shape):
        sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1)

    sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5
    sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten()
    while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape):
        sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1)

    # add noise: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
    return noisy_samples


def step(
    self,
    model_output: torch.FloatTensor,
    timestep: int,
    sample: torch.FloatTensor,
    generator=None,
    return_dict: bool = True,
) -> Union[DDPMSchedulerOutput, Tuple]:
    """
    One denoising step.
    Intuitively: image at t minus the noise predicted for time t gives the image at t-1.
    Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
    process from the learned model outputs (most often the predicted noise).
    """
    t = timestep
    prev_t = self.previous_timestep(t)

    if model_output.shape[1] == sample.shape[1] * 2 and self.variance_type in ["learned", "learned_range"]:
        model_output, predicted_variance = torch.split(model_output, sample.shape[1], dim=1)
    else:
        predicted_variance = None

    # 1. compute alphas, betas
    alpha_prod_t = self.alphas_cumprod[t]
    alpha_prod_t_prev = self.alphas_cumprod[prev_t] if prev_t >= 0 else self.one
    beta_prod_t = 1 - alpha_prod_t
    beta_prod_t_prev = 1 - alpha_prod_t_prev
    current_alpha_t = alpha_prod_t / alpha_prod_t_prev
    current_beta_t = 1 - current_alpha_t

    # 2. compute predicted original sample from predicted noise also called
    # "predicted x_0" of formula (15) from https://arxiv.org/pdf/2006.11239.pdf
    if self.config.prediction_type == "epsilon":
        pred_original_sample = (sample - beta_prod_t ** (0.5) * model_output) / alpha_prod_t ** (0.5)
    elif self.config.prediction_type == "sample":
        pred_original_sample = model_output
    elif self.config.prediction_type == "v_prediction":
        pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
    else:
        raise ValueError(
            f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, `sample` or"
            " `v_prediction` for the DDPMScheduler."
        )

    # 3. Clip or threshold "predicted x_0"
    if self.config.thresholding:
        pred_original_sample = self._threshold_sample(pred_original_sample)
    elif self.config.clip_sample:
        pred_original_sample = pred_original_sample.clamp(
            -self.config.clip_sample_range, self.config.clip_sample_range
        )

    # 4. Compute coefficients for pred_original_sample x_0 and current sample x_t
    # See formula (7) from https://arxiv.org/pdf/2006.11239.pdf
    pred_original_sample_coeff = (alpha_prod_t_prev ** (0.5) * current_beta_t) / beta_prod_t
    current_sample_coeff = current_alpha_t ** (0.5) * beta_prod_t_prev / beta_prod_t

    # 5. Compute predicted previous sample µ_t
    # See formula (7) from https://arxiv.org/pdf/2006.11239.pdf
    pred_prev_sample = pred_original_sample_coeff * pred_original_sample + current_sample_coeff * sample

    # 6. Add noise (the posterior variance); no noise is added at the final step t = 0
    variance = 0
    if t > 0:
        device = model_output.device
        variance_noise = randn_tensor(
            model_output.shape, generator=generator, device=device, dtype=model_output.dtype
        )
        if self.variance_type == "fixed_small_log":
            variance = self._get_variance(t, predicted_variance=predicted_variance) * variance_noise
        elif self.variance_type == "learned_range":
            variance = self._get_variance(t, predicted_variance=predicted_variance)
            variance = torch.exp(0.5 * variance) * variance_noise
        else:
            variance = (self._get_variance(t, predicted_variance=predicted_variance) ** 0.5) * variance_noise

    pred_prev_sample = pred_prev_sample + variance

    if not return_dict:
        return (pred_prev_sample,)

    return DDPMSchedulerOutput(prev_sample=pred_prev_sample, pred_original_sample=pred_original_sample)
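
A minimal sketch of how these two methods are used from the outside (a standalone toy example with random tensors, not code from the training script):

import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

# training side: add_noise() is the forward process q(x_t | x_0)
x0 = torch.randn(1, 4, 64, 64)           # stand-in for a clean latent
noise = torch.randn_like(x0)
t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
xt = scheduler.add_noise(x0, noise, t)

# inference side: step() walks back from pure noise, one timestep at a time
sample = torch.randn(1, 4, 64, 64)
scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    noise_pred = torch.randn_like(sample)                      # placeholder for unet(sample, t, ...).sample
    sample = scheduler.step(noise_pred, t, sample).prev_sample
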
tokenizer:

Converts text into numeric tokens; the computer only understands numbers.

textModel:

Encodes the text tokens into feature embeddings.
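
A minimal sketch of these two stages together (assuming the tokenizer and text encoder shipped inside the CompVis/stable-diffusion-v1-4 repo):

from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

# text --> numeric token ids (padded to CLIP's max length of 77)
tokens = tokenizer(
    "A pokemon with green head and white legs.",
    padding="max_length", max_length=tokenizer.model_max_length,
    truncation=True, return_tensors="pt",
)

# token ids --> per-token feature embeddings, shape (1, 77, 768)
encoder_hidden_states = text_encoder(tokens.input_ids)[0]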

vae:

Encodes (maps) the image from pixel space into latent space, compressing the data.

Encoder = 4 × down-sampling ResNet blocks + mid block (attention + ResNet)

Decoder = 4 × up-sampling ResNet blocks + mid block (attention + ResNet)

VAE = encoder + decoder
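
A minimal sketch of the compression round trip (assuming the VAE from the same CompVis/stable-diffusion-v1-4 repo and a dummy image tensor):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)               # stand-in for a normalized RGB image
latents = vae.encode(image).latent_dist.sample()  # (1, 4, 64, 64): 8x spatial compression
latents = latents * vae.config.scaling_factor     # same scaling as in the training script

decoded = vae.decode(latents / vae.config.scaling_factor).sample  # back to (1, 3, 512, 512)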

unet:

Encodes the features of the (latent) image; given the noisy latents, the timestep and the text condition, it predicts the noise.

LoRA's trainable parameter count is only a small fraction of the original UNet's parameter count.

UNet = ResNet blocks + attention + LoRA (linear layers)
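
A minimal sketch of how the two parameter counts could be compared (hypothetical helper; it assumes `unet` already has LoRA adapters attached and the base weights frozen):

def count_parameters(model):
    # total parameters vs. parameters that will actually receive gradients (the LoRA layers)
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_parameters(unet)
print(f"total: {total / 1e6:.1f}M parameters, trainable (LoRA): {trainable / 1e6:.2f}M")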

The per-step training process:

# 1. encode the image into latent space
latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
latents = latents * vae.config.scaling_factor  # bs,3,512,512 --> bs,4,64,64

# 2. sample the noise
noise = torch.randn_like(latents)
bsz = latents.shape[0]

# 3. sample a random timestep for each image
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device)
timesteps = timesteps.long()  # a random integer in [0, 1000)

# 4. forward process: add noise
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)  # bs,4,64,64

# 5. encode the text (the condition)
encoder_hidden_states = text_encoder(batch["input_ids"], return_dict=False)[0]

# 6. the ground truth is the sampled noise
target = noise

# 7. the UNet predicts the noise, conditioned on the text embedding
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states, return_dict=False)[0]

# 8. compute the loss
if args.snr_gamma is None:
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
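
For completeness, a sketch of what typically follows each such step in the script's training loop (paraphrased rather than copied verbatim; `accelerator`, `optimizer` and `lr_scheduler` come from the surrounding script):

accelerator.backward(loss)   # backprop through the trainable LoRA parameters only
optimizer.step()             # update the LoRA weights
lr_scheduler.step()
optimizer.zero_grad()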
