
A Survey of huggingface-diffusers: What Can Diffusers Do?

1. What Diffusers Offers

1.1 Overview

Diffusers is a library of state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures.

The library emphasizes usability over raw performance.

Diffusers provides three main building blocks:

  • State-of-the-art diffusion pipelines for running inference with just a few lines of code.
  • Interchangeable noise schedulers for trading off generation speed against output quality (see the sketch after this list).
  • Pretrained models that can be used as building blocks for your own diffusion systems.
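
A minimal sketch of how the first two pieces fit together. The checkpoint name runwayml/stable-diffusion-v1-5 and the CUDA device are assumptions for illustration; any text-to-image checkpoint on the Hub works the same way.

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Load an entire pipeline (models, scheduler, tokenizer) from the Hub in one call.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default noise scheduler for a faster one without touching the models.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of a red panda in the snow", num_inference_steps=25).images[0]
image.save("red_panda.png")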

1.2 Supported Pipelines

Pipeline | Paper / Project | Task
alt_diffusion | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Image-to-Image Text-Guided Generation
audio_diffusion | Audio Diffusion | Unconditional Audio Generation
controlnet | Adding Conditional Control to Text-to-Image Diffusion Models | Image-to-Image Text-Guided Generation
cycle_diffusion | Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance | Image-to-Image Text-Guided Generation
dance_diffusion | Dance Diffusion | Unconditional Audio Generation
ddpm | Denoising Diffusion Probabilistic Models | Unconditional Image Generation
ddim | Denoising Diffusion Implicit Models | Unconditional Image Generation
if | IF | Image Generation
if_img2img | IF | Image-to-Image Generation
if_inpainting | IF | Image-to-Image Generation
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Text-to-Image Generation
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Super Resolution Image-to-Image
latent_diffusion_uncond | High-Resolution Image Synthesis with Latent Diffusion Models | Unconditional Image Generation
paint_by_example | Paint by Example: Exemplar-based Image Editing with Diffusion Models | Image-Guided Image Inpainting
pndm | Pseudo Numerical Methods for Diffusion Models on Manifolds | Unconditional Image Generation
score_sde_ve | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation
score_sde_vp | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation
semantic_stable_diffusion | Semantic Guidance | Text-Guided Generation
stable_diffusion_text2img | Stable Diffusion | Text-to-Image Generation
stable_diffusion_img2img | Stable Diffusion | Image-to-Image Text-Guided Generation
stable_diffusion_inpaint | Stable Diffusion | Text-Guided Image Inpainting
stable_diffusion_panorama | MultiDiffusion | Text-to-Panorama Generation
stable_diffusion_pix2pix | InstructPix2Pix: Learning to Follow Image Editing Instructions | Text-Guided Image Editing
stable_diffusion_pix2pix_zero | Zero-shot Image-to-Image Translation | Text-Guided Image Editing
stable_diffusion_attend_and_excite | Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models | Text-to-Image Generation
stable_diffusion_self_attention_guidance | Improving Sample Quality of Diffusion Models Using Self-Attention Guidance | Text-to-Image Generation, Unconditional Image Generation
stable_diffusion_image_variation | Stable Diffusion Image Variations | Image-to-Image Generation
stable_diffusion_latent_upscale | Stable Diffusion Latent Upscaler | Text-Guided Super Resolution Image-to-Image
stable_diffusion_model_editing | Editing Implicit Assumptions in Text-to-Image Diffusion Models | Text-to-Image Model Editing
stable_diffusion_2 | Stable Diffusion 2 | Text-to-Image Generation
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Image Inpainting
stable_diffusion_2 | Depth-Conditional Stable Diffusion | Depth-to-Image Generation
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Super Resolution Image-to-Image
stable_diffusion_safe | Safe Stable Diffusion | Text-Guided Generation
stable_unclip | Stable unCLIP | Text-to-Image Generation
stable_unclip | Stable unCLIP | Image-to-Image Text-Guided Generation
stochastic_karras_ve | Elucidating the Design Space of Diffusion-Based Generative Models | Unconditional Image Generation
text_to_video_sd | Modelscope's Text-to-video-synthesis Model in Open Domain | Text-to-Video Generation
unclip | Hierarchical Text-Conditional Image Generation with CLIP Latents (implementation by kakaobrain) | Text-to-Image Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Text-to-Image Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Image Variations Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Dual Image and Text Guided Generation
vq_diffusion | Vector Quantized Diffusion Model for Text-to-Image Synthesis | Text-to-Image Generation

1.3 DiffusionPipeline

DiffusionPipeline is a highly abstracted end-to-end interface that bundles all of the models and schedulers in huggingface-diffusers, making it straightforward to launch inference (a short sketch follows the task table below).

Task | Description | Pipeline
Unconditional Image Generation | generate an image from Gaussian noise | unconditional_image_generation
Text-Guided Image Generation | generate an image given a text prompt | conditional_image_generation
Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | img2img
Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | inpaint
Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | depth2img
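
As a brief sketch of how the table above maps to code: the generic DiffusionPipeline.from_pretrained call reads the checkpoint's configuration and instantiates the matching task-specific pipeline class, so the generic entry point is usually all that is needed. The checkpoint name runwayml/stable-diffusion-v1-5 is only an illustrative assumption.

from diffusers import DiffusionPipeline

# The generic entry point resolves to the concrete pipeline class recorded in the checkpoint.
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
print(type(pipe).__name__)    # StableDiffusionPipeline
print(list(pipe.components))  # unet, vae, text_encoder, tokenizer, scheduler, ...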

2. Task Pipelines

One important point that deserves special attention when using diffusers is distinguishing between its inference pipelines and its training pipelines.

2.1 Direct Inference Pipelines

2.1.1 Unconditional image generation

Unconditional image generation is relatively simple: the model in the pipeline generates images without any additional context (text, images, and so on).
The images produced by an unconditional image generation pipeline depend only on the training data.

Pipeline: DiffusionPipeline
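
A minimal inference sketch. The checkpoint name google/ddpm-celebahq-256 and the CUDA device are assumptions for illustration; any unconditional checkpoint on the Hub can be substituted.

from diffusers import DiffusionPipeline

# Unconditional generation: the call takes no prompt, only sampling settings.
pipe = DiffusionPipeline.from_pretrained("google/ddpm-celebahq-256").to("cuda")
image = pipe(num_inference_steps=1000).images[0]  # DDPM denoises over many small steps
image.save("unconditional_sample.png")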

2.1.2 Text-to-image generation

Text-to-image generation, also known as conditional image generation, generates an image from a text prompt. The text is converted into embeddings, which are used to condition the model as it generates an image from noise.

Pipeline: DiffusionPipeline
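
A minimal sketch. The checkpoint name runwayml/stable-diffusion-v1-5, the prompt, and the sampling settings are only illustrative assumptions.

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The text prompt conditions the denoising process; guidance_scale controls
# how strongly the result is pushed toward the prompt.
image = pipe(
    "an astronaut riding a horse on the moon",
    negative_prompt="blurry, low quality",
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
image.save("astronaut.png")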

2.1.3 Text-guided image-to-image generation

Text-guided image-to-image generation produces a new image conditioned on both a text prompt and an initial image.

Pipeline: StableDiffusionImg2ImgPipeline
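
A minimal sketch. The checkpoint name runwayml/stable-diffusion-v1-5 and the local file sketch.png are illustrative assumptions.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png").resize((768, 512))  # hypothetical input image
# strength controls how far the result may drift from the initial image (0 = keep, 1 = ignore).
image = pipe(
    prompt="a fantasy landscape, matte painting",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
image.save("fantasy_landscape.png")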

2.1.4 Text-guided image-inpainting

Text-guided image inpainting edits specific regions of an image by means of a mask and a text prompt.

Pipeline: StableDiffusionInpaintPipeline
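
A minimal sketch. The checkpoint name runwayml/stable-diffusion-inpainting and the local files bench.png and bench_mask.png are illustrative assumptions.

import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("bench.png")       # hypothetical input image
mask_image = load_image("bench_mask.png")  # white pixels mark the region to repaint
image = pipe(
    prompt="a yellow cat sitting on a park bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
image.save("inpainted.png")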

2.1.5 Text-guided depth-to-image generation

Text-guided depth-to-image generation produces a new image conditioned on a text prompt and an initial image. The depth structure of the image can be preserved via the depth_map argument; if no depth_map is passed, depth is estimated with a depth-estimation model.

Pipeline: StableDiffusionDepth2ImgPipeline
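
A minimal sketch. The checkpoint name stabilityai/stable-diffusion-2-depth and the local file room.png are illustrative assumptions.

import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("room.png")  # hypothetical input image
# No depth_map is passed here, so the pipeline estimates depth itself;
# a precomputed map could be supplied through the depth_map argument instead.
image = pipe(
    prompt="a cozy wooden cabin interior",
    image=init_image,
    strength=0.7,
).images[0]
image.save("depth_guided.png")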

2.2 Training Pipelines

2.2.1 Overview

2.2.2 Unconditional Image Generation

Training is not conditioned on any text or image; the model simply learns to generate images that resemble the distribution of its training data.

accelerate launch train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --resolution=64 \
  --output_dir="ddpm-ema-flowers-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no \
  --push_to_hub

2.2.3 Text-to-Image fine-tuning

The training flow for generating images from a text prompt, as used to fine-tune models such as Stable Diffusion.

accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" 

2.2.4 Textual Inversion

Textual Inversion captures novel concepts from a small number of example images. The learned concepts can then be used in prompts to personalize image generation and gain finer control over the generated images.

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"

2.2.5 DreamBooth

DreamBooth is a method for personalizing a text-to-image model such as Stable Diffusion given only a few (3-5) images of a subject. It lets the model generate contextualized images of that subject in different scenes, poses, and views.

python train_dreambooth_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=5e-6 \
  --max_train_steps=400

2.2.6 LoRA

LoRA (Low-Rank Adaptation of Large Language Models) is a training method that accelerates the training of large models while consuming less memory. It adds pairs of rank-decomposition weight matrices (called update matrices) to the existing weights and trains only those newly added weights.

accelerate launch --mixed_precision="fp16"  train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dataloader_num_workers=8 \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --push_to_hub \
  --hub_model_id=${HUB_MODEL_ID} \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337

2.2.7 ControlNet

Compared with plain img2img, ControlNet is more precise and effective: it can directly extract the composition of a picture, the pose of a person, or the depth information of a scene, and use them as conditions to constrain image generation.

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=fusing/fill50k \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --train_batch_size=4

2.2.8 InstructPix2Pix

Using InstructPix2Pix: given an input image and an editing instruction telling the model what to do, the model follows the instruction to edit the image.

accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42 

2.2.9 Custom Diffusion

New concepts can be learned efficiently by optimizing only the parameters in the cross-attention layers of a text-to-image diffusion model. When multiple concepts need to be combined, each concept can first be trained separately, and the fine-tuned models can then be merged into one through constrained optimization.

accelerate launch train_custom_diffusion.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --class_data_dir=./real_reg/samples_cat/ \
  --with_prior_preservation --real_prior --prior_loss_weight=1.0 \
  --class_prompt="cat" --num_class_images=200 \
  --instance_prompt="photo of a <new1> cat"  \
  --resolution=512  \
  --train_batch_size=2  \
  --learning_rate=1e-5  \
  --lr_warmup_steps=0 \
  --max_train_steps=250 \
  --scale_lr --hflip  \
  --modifier_token "<new1>" \
  --validation_prompt="<new1> cat sitting in a bucket" \
  --report_to="wandb"

3. Prompt Engineering

Weighting prompts

The functionality diffusers provides is essentially text2img: generating an image from a given prompt. The prompt should express all of the concepts the model is expected to generate, but things rarely work out exactly as intended, so parts of the prompt often need to be weighted up or down to add or remove emphasis.
Diffusion models work by conditioning the cross-attention layers of the diffusion model on contextualized text embeddings. A simple way to emphasize (or de-emphasize) certain parts of the prompt is therefore to scale up or down the text embedding vectors that correspond to the relevant parts of the prompt.

Prompt weighting makes it possible to go from

prompt = "a red cat playing with a ball"

to

prompt = "a red cat playing with a ball++"

in order to emphasize that part of the prompt.
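
In practice the scaled embeddings are passed to the pipeline through its prompt_embeds argument, and the "++" syntax shown above is the one used by the Compel library. A minimal sketch, assuming the compel package is installed and the runwayml/stable-diffusion-v1-5 checkpoint is available:

import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compel parses the weighting syntax and returns conditioning embeddings.
compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt_embeds = compel("a red cat playing with a ball++")  # "++" upweights "ball"

image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]
image.save("weighted_prompt.png")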
