        Code: GitHub - lllyasviel/ControlNet: Let us control diffusion models!

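        The core idea of a diffusion model is to generate images by denoising. During training, noise of a different "concentration" is mixed into the original image at each timestep. The timestep and the noised image are then fed to the model, which predicts the noise; subtracting that predicted noise from the input recovers the original image. As Michelangelo is said to have put it: the statue is already in the stone, I only remove the parts that are not needed. This is also why, in Stable Diffusion, a larger Sampling steps value is not automatically better: the value has to correspond to the timestep of the current noisy image.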
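As a minimal sketch of the idea above (not code from the ControlNet repository), the snippet below mixes noise into a toy "image" at one noise level and shows that subtracting a correct noise prediction recovers the original. The helper names `add_noise` and `remove_noise` and the scalar `alpha_bar` are illustrative assumptions; real models use a whole schedule of noise levels indexed by timestep and a trained network as the predictor.

```python
import math
import random

def add_noise(x0, noise, alpha_bar):
    """Forward step: blend the clean signal with noise at level alpha_bar in (0, 1]."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]

def remove_noise(xt, predicted_noise, alpha_bar):
    """Invert the blend, given a prediction of the noise that was added."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(xt, predicted_noise)]

random.seed(0)
x0 = [0.2, -0.5, 0.9, 0.0]                    # toy "image" with four pixels
noise = [random.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise
xt = add_noise(x0, noise, alpha_bar=0.3)      # heavily noised sample
# A perfect noise predictor would return `noise`; here it is passed in directly.
recovered = remove_noise(xt, noise, alpha_bar=0.3)
```

In practice the prediction is imperfect, which is why sampling removes the noise gradually over several timesteps instead of in one step.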

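        ControlNet builds on a large pretrained diffusion model (Stable Diffusion) and adds further input conditions: an image such as an edge map, segmentation map, or keypoint map is combined with text as the prompt to generate a new image. It is also an important extension for stable-diffusion-webui. Because ControlNet combines a Stable Diffusion whose parameters are frozen with zero convolutions, fine-tuning on a small dataset, even on a personal computer, does not degrade the original model, which makes it possible to learn task-specific conditions end to end.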


1. Use Stable Diffusion with its parameters frozen, and make a copy of the SD Encoder whose parameters are trainable. This has two benefits:



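2. Zero convolution: a convolution whose weights and bias are both initialised to zero. In the trainable copy, a zero convolution is added after each layer and connects it to the corresponding layer of the original network. At the first training step every zero convolution outputs zero, so the inputs and outputs of the network blocks are exactly what they would be if ControlNet did not exist. In other words, before any optimisation ControlNet has no effect on the deep features. Training therefore starts from the pretrained model's behaviour, further optimisation is meant to improve on it, and training is fast.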
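The zero-convolution argument can be illustrated with scalars. This is a toy sketch under stated assumptions: `locked_block`, `trainable_copy`, and `controlled_block` are hypothetical stand-ins, and a real zero convolution is a 1x1 convolution over feature maps, not a single multiplication.

```python
def locked_block(x):
    """Stand-in for a frozen Stable Diffusion block (hypothetical toy function)."""
    return 2.0 * x + 1.0

def trainable_copy(x, c):
    """Stand-in for the trainable copy, which also sees the condition c."""
    return 2.0 * x + 1.0 + 0.5 * c

def zero_conv(h, weight=0.0, bias=0.0):
    """A 1x1 'convolution' on a scalar: weight * h + bias, initialised to zero."""
    return weight * h + bias

def controlled_block(x, c, weight=0.0, bias=0.0):
    # Output of the locked block plus the zero-conv-gated output of the copy.
    return locked_block(x) + zero_conv(trainable_copy(x, c), weight, bias)

x, c = 0.7, 1.3
before_training = controlled_block(x, c)   # identical to locked_block(x)
# d(output)/d(weight) equals the copy's activation, which is non-zero,
# so the first gradient step can already move the weight away from zero.
grad_wrt_weight = trainable_copy(x, c)
```

The point of the sketch is that a zero output does not imply a zero gradient: the weight starts at zero but its gradient is the activation feeding it.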



        Download the pretrained model from lllyasviel/ControlNet at main: get control_sd15_canny.pth and put it in the models directory.


python gradio_canny2image.py
        Clicking Advanced options reveals additional settings. Briefly, this is what each one means:


Image Resolution: the resolution of the generated image.

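Control Strength: as described later, the model has a Stable Diffusion part and a ControlNet part, and this parameter is the weight given to the ControlNet part. When Guess Mode below is unchecked, every ControlNet output uses this value as its weight. When Guess Mode is checked, the weights of the 13 ControlNet outputs increase layer by layer, from roughly 0.1 × strength up to strength. The code is below, and the comment is rather amusing: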

# Location: gradio_canny2image.py
# Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)
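To see what these weights look like, here is a small standalone sketch that reproduces the expression above in a function. The name `control_scales` as a free function and the `num_layers` parameter are illustrative assumptions. Evaluating it shows that 0.825**12 is about 0.0994, so the boundary the upstream comment hints at appears to be 0.1 and not 0.01.

```python
def control_scales(strength, guess_mode, num_layers=13):
    """Per-layer weights for the ControlNet outputs, mirroring the line above."""
    if guess_mode:
        # Exponent runs from num_layers - 1 down to 0, so the weights increase.
        return [strength * (0.825 ** float(num_layers - 1 - i)) for i in range(num_layers)]
    return [strength] * num_layers

flat = control_scales(1.0, guess_mode=False)   # thirteen copies of strength
ramp = control_scales(1.0, guess_mode=True)    # geometric ramp ending at strength
smallest, largest = ramp[0], ramp[-1]
```

With `strength=1.0` the shallowest output is weighted by about 0.1 and the deepest by 1.0, so in Guess Mode the deeper layers dominate the control signal.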
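Guess Mode: when unchecked, both the Stable Diffusion branch and the ControlNet branch are active while the Negative Prompt is processed. When checked, the Negative Prompt pass goes through the Stable Diffusion branch only and the ControlNet branch is skipped. The code has two parts: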

# Location: gradio_canny2image.py
...
un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
...
# Location: cldm/cldm.py
if cond['c_concat'] is None:
    eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=None, only_mid_control=self.only_mid_control)
else:
    # control_model is ControlNet()
    control = self.control_model(x=x_noisy, hint=torch.cat(cond['c_concat'], 1), timesteps=t, context=cond_txt)
    control = [c * scale for c, scale in zip(control, self.control_scales)]
    # diffusion_model is ControlledUnetModel()
    eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=control, only_mid_control=self.only_mid_control)
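Canny low threshold: a Canny parameter. Pixels whose gradient magnitude is below the low threshold are suppressed.

Canny high threshold: a Canny parameter. Pixels whose gradient magnitude is above the high threshold are marked as strong edges; values between the two thresholds are kept only when they connect to a strong edge.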
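The double threshold can be sketched per pixel as follows. This is an illustration only: `classify_gradient` is a hypothetical helper, the thresholds 100 and 200 are example values, and a real Canny detector also performs smoothing, non-maximum suppression, and the connectivity check for weak pixels.

```python
def classify_gradient(magnitude, low, high):
    """Label one pixel's gradient magnitude the way a double threshold does."""
    if magnitude >= high:
        return "strong"      # always kept as an edge
    if magnitude >= low:
        return "weak"        # kept only if connected to a strong edge
    return "suppressed"      # discarded

labels = [classify_gradient(m, low=100, high=200) for m in (50, 150, 250)]
```

Raising both thresholds yields a sparser edge map and therefore a looser constraint on the generated image; lowering them keeps more detail.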


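Guidance Scale: how strongly the positive prompt is weighted. It is the unconditional_guidance_scale in the code below, where model_t is the prediction made with the positive prompt plus the Added Prompt, and model_uncond is the prediction made with the Negative Prompt: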

# Location: cldm/ddim_hacked.py
model_output = model_uncond + unconditional_guidance_scale * (model_t - model_uncond)
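A toy version of this weighting on plain lists may make its behaviour clearer. The function name `classifier_free_guidance` and the numbers are illustrative assumptions; in the repository the operands are tensors of predicted noise.

```python
def classifier_free_guidance(model_t, model_uncond, scale):
    """Combine conditional and unconditional predictions element by element."""
    return [u + scale * (t - u) for t, u in zip(model_t, model_uncond)]

model_t = [0.5, -0.2, 0.1]        # toy prediction with the positive prompt
model_uncond = [0.4, 0.0, 0.1]    # toy prediction with the negative prompt
same_as_cond = classifier_free_guidance(model_t, model_uncond, scale=1.0)
pushed = classifier_free_guidance(model_t, model_uncond, scale=9.0)
```

A scale of 1 simply returns the positive-prompt prediction, while larger values extrapolate away from the negative-prompt prediction; where the two predictions agree, the scale has no effect.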


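eta (DDIM): the eta value used in DDIM sampling. It controls how much random noise is added at each sampling step; 0 makes sampling deterministic.

Added Prompt: extra positive prompt text, for example best quality, extremely detailed

Negative Prompt: extra negative prompt text. If part of the generated image is unsatisfactory, describe that part here, for example longbody, lowres, bad anatomy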






        The model's inputs are the Canny map (Map Input), the Prompt, the Added Prompt, the Negative Prompt, and a random noise image (Random Input).

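        The Prompt and Added Prompt strings are concatenated and passed through the CLIP embedder to obtain a text embedding (the two FrozenCLIPEmbedder instances share parameters). Together with Map Input and Random Input, this embedding is fed to ControlLDM (Latent Diffusion), the core module of ControlNet, which then loops 20 times (the Steps setting on the page). Each iteration uses a different timestep; with Steps=20 the timesteps are [1,51,101,151,201,251,301,351,401,451,501,551,601,651,701,751,801,851,901,951].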
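The timestep list above follows a simple pattern: evenly spaced values over the training range, shifted by one. A minimal sketch, assuming 1000 training timesteps and a step count that divides 1000 evenly (the function name `ddim_timesteps` is chosen here for illustration):

```python
def ddim_timesteps(num_ddim_steps, num_train_steps=1000):
    """Evenly spaced timesteps, shifted by one, as in the list above."""
    stride = num_train_steps // num_ddim_steps
    return [t + 1 for t in range(0, num_train_steps, stride)]

steps_20 = ddim_timesteps(20)
# Denoising starts from the noisiest timestep and works down to the cleanest.
sampling_order = list(reversed(steps_20))
```

Halving the number of steps doubles the stride, so each denoising step has to bridge a larger gap in noise level.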

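        The Negative Prompt goes through the same process, and the outputs for the Prompt and the Negative Prompt are then combined with the weighting below, where GuidanceScale is the page setting (default 9):

        output = output_negative + GuidanceScale * (output_prompt - output_negative)

        Finally, Decode First Stage restores the result to the original image size.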




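a. The timestep is converted to a feature vector by an embedding and fed to both Stable Diffusion and ControlNet;

b. Random noise is fed to Stable Diffusion;


d. The Prompt embedding is fed to both Stable Diffusion and ControlNet;

e. All Stable Diffusion parameters are frozen and take no part in training. Stable Diffusion consists of three SDEncoderBlocks, two SDEncoders, one SDMiddleBlock, two SDDecoders, and three SDDecoderBlocks;

f. ControlNet has the same structure as the Stable Diffusion blocks it copies, except that a zero convolution is added after each layer;

g. The ResBlocks in Stable Diffusion and ControlNet take the previous layer's output and the timestep embedding as input;

h. The SpatialTransformers in Stable Diffusion and ControlNet take the previous layer's output and the Prompt Embedding as input.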


3.Timestep Embedding

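        The timestep is an important model input and directly affects denoising quality. It enters as a single number; Timestep Embedding turns it into a sinusoidal vector, which the time_embed layers then project to an embedding of length 1280.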


# Location: ldm/modules/diffusionmodules/util.py
def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False):
    """
    Create sinusoidal timestep embeddings.
    :param timesteps: a 1-D Tensor of N indices, one per batch element.
                      These may be fractional.
    :param dim: the dimension of the output.
    :param max_period: controls the minimum frequency of the embeddings.
    :return: an [N x dim] Tensor of positional embeddings.
    """
    if not repeat_only:
        half = dim // 2
        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
        ).to(device=timesteps.device)
        args = timesteps[:, None].float() * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        if dim % 2:
            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
    else:
        embedding = repeat(timesteps, 'b -> b d', d=dim)
    return embedding
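For readers without PyTorch at hand, the same computation can be written for a single scalar timestep in plain Python. This is a sketch of the sinusoidal part only; the function name `sinusoidal_embedding` is an assumption, and the learned projection that follows in the model is not included.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000):
    """Embed one scalar timestep as [cos(t * f_i) ..., sin(t * f_i) ...]."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    emb = [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]
    if dim % 2:
        emb.append(0.0)   # pad odd dimensions with a zero, as in the code above
    return emb

emb_t0 = sinusoidal_embedding(0, 8)        # cosines are 1, sines are 0 at t = 0
emb_t951 = sinusoidal_embedding(951, 320)  # 320 channels is an example width
```

Because each channel oscillates at a different frequency, nearby timesteps get similar vectors while distant ones differ across many channels.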
# Location: cldm/cldm.py
self.input_hint_block = TimestepEmbedSequential(
    conv_nd(dims, hint_channels, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 32, 3, padding=1, stride=2),
    nn.SiLU(),
    conv_nd(dims, 32, 32, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 32, 96, 3, padding=1, stride=2),
    nn.SiLU(),
    conv_nd(dims, 96, 96, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 96, 256, 3, padding=1, stride=2),
    nn.SiLU(),
    zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))
)
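The three stride-2 convolutions in this block shrink the hint image by a factor of eight, which brings a 512×512 hint down to the 64×64 latent resolution. The sketch below traces the shapes with the standard convolution size formula; the 512 input size and the final channel count of 320 (standing in for model_channels) are assumptions for illustration.

```python
def conv_out_size(size, kernel=3, padding=1, stride=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# (out_channels, stride) for each conv in input_hint_block; 320 is assumed.
layers = [(16, 1), (16, 1), (32, 2), (32, 1), (96, 2), (96, 1), (256, 2), (320, 1)]
size = 512
shapes = []
for channels, stride in layers:
    size = conv_out_size(size, stride=stride)
    shapes.append((channels, size, size))
```

The last convolution is wrapped in zero_module, so at initialisation the hint contributes nothing, in line with the zero-convolution idea described earlier.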
# Location: ldm/modules/diffusionmodules/openaimodel.py
class ResBlock(TimestepBlock):
    """
    A residual block that can optionally change the number of channels.
    :param channels: the number of input channels.
    :param emb_channels: the number of timestep embedding channels.
    :param dropout: the rate of dropout.
    :param out_channels: if specified, the number of out channels.
    :param use_conv: if True and out_channels is specified, use a spatial
        convolution instead of a smaller 1x1 convolution to change the
        channels in the skip connection.
    :param dims: determines if the signal is 1D, 2D, or 3D.
    :param use_checkpoint: if True, use gradient checkpointing on this module.
    :param up: if True, use this block for upsampling.
    :param down: if True, use this block for downsampling.
    """

    def __init__(
        self,
        channels,
        emb_channels,
        dropout,
        out_channels=None,
        use_conv=False,
        use_scale_shift_norm=False,
        dims=2,
        use_checkpoint=False,
        up=False,
        down=False,
    ):
        super().__init__()
        self.channels = channels
        self.emb_channels = emb_channels
        self.dropout = dropout
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.use_checkpoint = use_checkpoint
        self.use_scale_shift_norm = use_scale_shift_norm

        self.in_layers = nn.Sequential(
            normalization(channels),
            nn.SiLU(),
            conv_nd(dims, channels, self.out_channels, 3, padding=1),
        )

        self.updown = up or down

        if up:
            self.h_upd = Upsample(channels, False, dims)
            self.x_upd = Upsample(channels, False, dims)
        elif down:
            self.h_upd = Downsample(channels, False, dims)
            self.x_upd = Downsample(channels, False, dims)
        else:
            self.h_upd = self.x_upd = nn.Identity()

        # Projects the timestep embedding to one value (or a scale/shift pair) per channel.
        self.emb_layers = nn.Sequential(
            nn.SiLU(),
            linear(
                emb_channels,
                2 * self.out_channels if use_scale_shift_norm else self.out_channels,
            ),
        )
        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            nn.SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
            ),
        )

        if self.out_channels == channels:
            self.skip_connection = nn.Identity()
        elif use_conv:
            self.skip_connection = conv_nd(
                dims, channels, self.out_channels, 3, padding=1
            )
        else:
            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)

    def forward(self, x, emb):
        """
        Apply the block to a Tensor, conditioned on a timestep embedding.
        :param x: an [N x C x ...] Tensor of features.
        :param emb: an [N x emb_channels] Tensor of timestep embeddings.
        :return: an [N x C x ...] Tensor of outputs.
        """
        return checkpoint(
            self._forward, (x, emb), self.parameters(), self.use_checkpoint
        )

    def _forward(self, x, emb):
        if self.updown:
            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
            h = in_rest(x)
            h = self.h_upd(h)
            x = self.x_upd(x)
            h = in_conv(h)
        else:
            h = self.in_layers(x)
        emb_out = self.emb_layers(emb).type(h.dtype)
        # Append trailing axes so emb_out broadcasts over the spatial dimensions.
        while len(emb_out.shape) < len(h.shape):
            emb_out = emb_out[..., None]
        if self.use_scale_shift_norm:
            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
            scale, shift = th.chunk(emb_out, 2, dim=1)
            h = out_norm(h) * (1 + scale) + shift
            h = out_rest(h)
        else:
            h = h + emb_out
            h = self.out_layers(h)
        return self.skip_connection(x) + h
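The two ways `_forward` injects the timestep embedding can be sketched on small lists. This is a simplified illustration: the function names are hypothetical, each channel holds a single value, and the normalisation, activation, and convolution layers are left out.

```python
def add_conditioning(h, emb_out):
    """Default path: the projected timestep embedding is added channel-wise."""
    return [v + e for v, e in zip(h, emb_out)]

def scale_shift_conditioning(h_normed, emb_out):
    """use_scale_shift_norm path: emb_out holds [scale..., shift...]."""
    half = len(emb_out) // 2
    scale, shift = emb_out[:half], emb_out[half:]
    return [v * (1 + s) + b for v, s, b in zip(h_normed, scale, shift)]

h = [1.0, 2.0]                                   # two channels, one value each
added = add_conditioning(h, [0.5, -1.0])
modulated = scale_shift_conditioning(h, [1.0, 0.0, 0.25, 0.25])
```

The scale-shift variant needs twice as many embedding outputs, which is why `emb_layers` projects to `2 * self.out_channels` in that case.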
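        SpatialTransformer fuses the Prompt Embedding with the previous layer's output. Its structure is as follows:


        CrossAttention1 takes the previous layer's output as input and projects it through three separate linear layers to obtain Q, K, and V. The product of Q and K goes through a softmax to give an attention map, which is then multiplied by V. This is a fairly standard attention structure; since Q, K, and V all come from the same input, it is really a self-attention.

        CrossAttention2 has broadly the same structure as CrossAttention1, except that K and V are generated from the Prompt Embedding. After these two attention layers, the image features and the Prompt Embedding have been fused.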
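The difference between the two attention layers comes down to where K and V originate. The sketch below is a single-head scaled dot-product attention on plain lists; it is an illustration under simplifying assumptions, leaving out the learned Q/K/V projections, multiple heads, and the output projection of the real CrossAttention class. All names and values are made up for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    out, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        all_weights.append(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out, all_weights

image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 spatial positions
text_tokens = [[0.5, 0.5], [2.0, -1.0]]               # 2 prompt tokens

# First layer: Q, K, V all come from the image tokens (self-attention).
self_out, self_w = attention(image_tokens, image_tokens, image_tokens)
# Second layer: Q from the image, K and V from the prompt (cross-attention).
cross_out, cross_w = attention(image_tokens, text_tokens, text_tokens)
```

In both cases the output has one vector per image position, so the spatial layout is preserved; only the source of the information mixed into each position changes.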



# Location: ldm/modules/attention.py
class BasicTransformerBlock(nn.Module):
    ATTENTION_MODES = {
        "softmax": CrossAttention,  # vanilla attention
        "softmax-xformers": MemoryEfficientCrossAttention
    }

    def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
                 disable_self_attn=False):
        super().__init__()
        attn_mode = "softmax-xformers" if XFORMERS_IS_AVAILBLE else "softmax"
        assert attn_mode in self.ATTENTION_MODES
        attn_cls = self.ATTENTION_MODES[attn_mode]
        self.disable_self_attn = disable_self_attn
        self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
                              context_dim=context_dim if self.disable_self_attn else None)  # is a self-attention if not self.disable_self_attn
        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
        self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim,
                              heads=n_heads, dim_head=d_head, dropout=dropout)  # is self-attn if context is none
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.checkpoint = checkpoint

    def forward(self, x, context=None):
        return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)

    def _forward(self, x, context=None):
        x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
        x = self.attn2(self.norm2(x), context=context) + x
        x = self.ff(self.norm3(x)) + x
        return x


class SpatialTransformer(nn.Module):
    """
    Transformer block for image-like data.
    First, project the input (aka embedding)
    and reshape to b, t, d.
    Then apply standard transformer action.
    Finally, reshape to image
    NEW: use_linear for more efficiency instead of the 1x1 convs
    """

    def __init__(self, in_channels, n_heads, d_head,
                 depth=1, dropout=0., context_dim=None,
                 disable_self_attn=False, use_linear=False,
                 use_checkpoint=True):
        super().__init__()
        if exists(context_dim) and not isinstance(context_dim, list):
            context_dim = [context_dim]
        self.in_channels = in_channels
        inner_dim = n_heads * d_head
        self.norm = Normalize(in_channels)
        if not use_linear:
            self.proj_in = nn.Conv2d(in_channels,
                                     inner_dim,
                                     kernel_size=1,
                                     stride=1,
                                     padding=0)
        else:
            self.proj_in = nn.Linear(in_channels, inner_dim)

        self.transformer_blocks = nn.ModuleList(
            [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
                                   disable_self_attn=disable_self_attn, checkpoint=use_checkpoint)
             for d in range(depth)]
        )
        if not use_linear:
            self.proj_out = zero_module(nn.Conv2d(inner_dim,
                                                  in_channels,
                                                  kernel_size=1,
                                                  stride=1,
                                                  padding=0))
        else:
            self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
        self.use_linear = use_linear

    def forward(self, x, context=None):
        # note: if no context is given, cross-attention defaults to self-attention
        if not isinstance(context, list):
            context = [context]
        b, c, h, w = x.shape
        x_in = x
        x = self.norm(x)
        if not self.use_linear:
            x = self.proj_in(x)
        x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
        if self.use_linear:
            x = self.proj_in(x)
        for i, block in enumerate(self.transformer_blocks):
            x = block(x, context=context[i])
        if self.use_linear:
            x = self.proj_out(x)
        x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
        if not self.use_linear:
            x = self.proj_out(x)
        return x + x_in
7.SD Encoder Block

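        SD Encoder Block is the building block of Stable Diffusion's encoding stage. It is mainly a stack of ResBlock and SpatialTransformer layers that fuses the timestep, hint map, and Prompt Embedding features, while downsampling and increasing the number of feature-map channels. Note that the parameters of this part are frozen. The structure diagram is below: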

8.SD Decoder Block

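        SD Decoder Block is the building block of Stable Diffusion's decoding stage. It is also mainly a stack of ResBlock and SpatialTransformer layers that fuses the timestep, hint map, and Prompt Embedding features, while upsampling and reducing the number of feature-map channels. The parameters of this part are frozen as well. The structure diagram is below: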

Code for SD Encoder Block + SD Decoder Block:

# Location: cldm/cldm.py
class ControlledUnetModel(UNetModel):
    def forward(self, x, timesteps=None, context=None, control=None, only_mid_control=False, **kwargs):
        hs = []
        # The encoder and middle block run without gradients: they are frozen.
        with torch.no_grad():
            t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
            emb = self.time_embed(t_emb)
            h = x.type(self.dtype)
            for module in self.input_blocks:
                h = module(h, emb, context)
                hs.append(h)
            h = self.middle_block(h, emb, context)

        # The last ControlNet output is added to the middle block's output.
        if control is not None:
            h += control.pop()

        # The remaining ControlNet outputs are added to the skip connections.
        for i, module in enumerate(self.output_blocks):
            if only_mid_control or control is None:
                h = torch.cat([h, hs.pop()], dim=1)
            else:
                h = torch.cat([h, hs.pop() + control.pop()], dim=1)
            h = module(h, emb, context)

        h = h.type(x.dtype)
        return self.out(h)
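The order in which the control outputs are consumed is easy to miss, because both lists are popped from the end. The sketch below replays that bookkeeping with string labels instead of tensors; the label names and the count of 12 encoder blocks (giving 13 control outputs, matching the 13 layers mentioned for Control Strength) are assumptions for illustration.

```python
def merge_order(num_input_blocks=12):
    """Return the control output used at the middle block and at each decoder step."""
    hs = [f"skip{i}" for i in range(num_input_blocks)]           # encoder outputs
    control = [f"ctrl{i}" for i in range(num_input_blocks + 1)]  # 12 encoder + 1 middle
    middle = control.pop()                                       # last entry -> middle block
    pairs = []
    while hs:
        pairs.append((hs.pop(), control.pop()))                  # deepest skip first
    return middle, pairs

middle, pairs = merge_order()
```

Each decoder step therefore receives the skip connection and the control output from the same encoder depth, starting with the deepest.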
9.ControlNet Encoder Block

        ControlNet Encoder Block is cloned from SD Encoder Block, with zero convolutions added and with trainable parameters. The structure diagram is below:

10.Stable Diffusion

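        By default (sd_locked=True) all Stable Diffusion parameters are frozen and not trained: only the ControlNet parameters are handed to the optimizer. The code that does this is: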

# Location: cldm/cldm.py
def configure_optimizers(self):
    lr = self.learning_rate
    # Only the ControlNet parameters are optimised by default.
    params = list(self.control_model.parameters())
    if not self.sd_locked:
        # Optionally unlock the Stable Diffusion decoder and output layers.
        params += list(self.model.diffusion_model.output_blocks.parameters())
        params += list(self.model.diffusion_model.out.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    return opt
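The selection logic in configure_optimizers can be summarised without PyTorch. In this sketch the group names are plain strings standing in for the parameter lists, and `trainable_groups` is a hypothetical helper, not a function from the repository.

```python
def trainable_groups(sd_locked):
    """Names of the parameter groups that would be passed to the optimizer."""
    groups = ["control_model"]
    if not sd_locked:
        groups += ["diffusion_model.output_blocks", "diffusion_model.out"]
    return groups

locked = trainable_groups(sd_locked=True)
unlocked = trainable_groups(sd_locked=False)
```

Note that even with sd_locked=False the Stable Diffusion encoder and middle block never appear in the list, so they stay frozen in both modes.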
a. Generate the Canny map


python gradio_annotator.py
        Open the address printed to the console in a browser.


# Location: gradio_annotator.py
def canny(img, res, l, h):
    img = resize_image(HWC3(img), res)
    global model_canny
    # Create the detector lazily on first use.
    if model_canny is None:
        from annotator.canny import CannyDetector
        model_canny = CannyDetector()
    result = model_canny(img, l, h)
    return [result]
1girl, asian, bangs, black_eyes, blunt_bangs, closed_mouth, lips, long_hair, looking_at_viewer, realistic, shirt, smile, solo, white_shirt







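        Download the pretrained Stable Diffusion model and put it in the models directory, then generate the initial ControlNet model with the command below. This step mainly copies the structure and parameters of the Stable Diffusion encoder: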

python tool_add_control.py  ./models/v1-5-pruned.ckpt ./models/control_sd15_ini.ckpt
python tutorial_train.py
# Location: ldm/models/diffusion/ddpm.py
def get_loss(self, pred, target, mean=True):
    if self.loss_type == 'l1':
        loss = (target - pred).abs()
        if mean:
            loss = loss.mean()
    elif self.loss_type == 'l2':
        if mean:
            loss = torch.nn.functional.mse_loss(target, pred)
        else:
            loss = torch.nn.functional.mse_loss(target, pred, reduction='none')
    else:
        # The message is not an f-string, so the placeholder is printed literally.
        raise NotImplementedError("unknown loss type '{loss_type}'")
    return loss
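For intuition, the two loss types reduce to the following when written for plain lists with mean reduction. The function names and the toy numbers are assumptions for illustration; in training, `pred` is the model's predicted noise and `target` is the noise that was actually added.

```python
def l1_loss(pred, target):
    """Mean absolute error between predicted and true noise."""
    return sum(abs(t - p) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((t - p) ** 2 for p, t in zip(pred, target)) / len(pred)

predicted_noise = [0.1, -0.4, 0.3]
true_noise = [0.0, -0.5, 0.5]
l1 = l1_loss(predicted_noise, true_noise)
l2 = l2_loss(predicted_noise, true_noise)
```

The squared error penalises the largest deviation (0.2 here) more heavily relative to the smaller ones than the absolute error does.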
# Location: tutorial_train.py
sd_locked = True
only_mid_control = True

        On an average machine, you can simply use the standard training procedure, which freezes Stable Diffusion and trains ControlNet. This is also the default configuration:

# Location: tutorial_train.py
sd_locked = True
only_mid_control = False

# Location: tutorial_train.py
sd_locked = False
only_mid_control = False

        That covers the key points of ControlNet. I will keep posting about Stable Diffusion, so follow along if you want to keep up.

