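Stable Diffusion is a large generative model in computer vision that handles image-generation tasks such as text-to-image (txt2img) and image-to-image (img2img). Its open-source release, and the wave of work built on top of it, have brought AI painting to an unprecedented level of quality and creativity.
In July 2023, Stability AI officially released Stable Diffusion XL (SDXL) 1.0, billed at release as the best open-source image-generation model available. It marks another important iteration for text-to-image models: SDXL 1.0 can generate high-quality images in almost any art style and is the strongest open model for photorealism. Color vibrancy and accuracy have been tuned, contrast, lighting, and shadows are better than in the previous generation, and everything is produced at a native 1024x1024 resolution. SDXL 1.0 also improves considerably on concepts that are hard to generate, such as hands, text, and spatial arrangement.
Most current training tutorials for text-to-image (text2img) models focus on LoRA, DreamBooth, Textual Inversion, and similar methods, and most of them rely on GUI tools such as SD WebUI or one-click AI painting launchers. Detailed tutorials on full fine-tuning are almost nonexistent, so here I record the material I consulted while fine-tuning the SDXL Base model, along with notes on some of the training parameters.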
Paper: TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design, ACM MM 2023.
The AutoPoster-Dataset targets the automatic generation of e-commerce poster images. It contains 76,000 records and is provided by Alibaba Group.
Paper: AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation ACM MM 2023
Paper: Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs IJCAI 2022
Github: https://github.com/minzhouGithub/CGL-GAN
Paper: A New Dataset and Benchmark for Content-aware Visual-Textual Presentation Layout CVPR 2023
Github: https://github.com/PKU-ICST-MIPL/PosterLayout-CVPR2023
    import csv
    import os
    import warnings

    import requests

    # SSL verification is disabled below, so silence the resulting warnings.
    warnings.filterwarnings('ignore')

    csv_file = r"C:\Users\xxx\Downloads\tvs.csv"
    url_prefix = 'https://www.themoviedb.org/t/p/w600_and_h900_bestv2'
    save_root_path = r"D:\dataset\download_data\tv_series"


    def parse_csv(path):
        cnt = 0
        s = requests.Session()
        s.verify = False  # disable SSL verification for the whole session
        with open(path, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                raw_img_url = row['poster_path']  # relative poster path
                if raw_img_url == '':
                    continue
                img_url = url_prefix + raw_img_url
                try:
                    img_file = s.get(img_url, verify=False)
                except Exception as e:
                    # The request itself failed, so there is no response to inspect.
                    print(repr(e))
                    continue
                if img_file.status_code != 200:
                    print("Unexpected status code: {}".format(img_file.status_code))
                    continue
                img_name = raw_img_url.split('/')[-1]
                # img_name = row['url'].split('/')[-2] + '.jpg'
                save_path = os.path.join(save_root_path, img_name)
                with open(save_path, 'wb') as img:
                    img.write(img_file.content)
                cnt += 1
                print(cnt, 'saved!')
        print("Done!")


    if __name__ == '__main__':
        if not os.path.exists(save_root_path):
            os.makedirs(save_root_path)
        parse_csv(csv_file)
File size / image resolution ratio of the SD-generated images: 0.00129, 0.0012, 0.0011, 0.00136, 0.0014, 0.0015, 0.0013, 0.00149
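A ratio like the ones above can be computed in a few lines. This is only a minimal sketch: the helper name `size_per_pixel` is hypothetical, the example file size is made up, and the unit (file size in KB divided by pixel count) is an assumption, since the text does not state which unit was used.

```python
from statistics import mean


def size_per_pixel(size_bytes, width, height, unit=1024):
    """Return the file size (in `unit`-byte units, KB by default) per pixel."""
    return size_bytes / unit / (width * height)


# Ratios listed in the text for the SD-generated images.
ratios = [0.00129, 0.0012, 0.0011, 0.00136, 0.0014, 0.0015, 0.0013, 0.00149]
average_ratio = mean(ratios)

# Hypothetical example: a 1.4 MiB image file at 1024x1024.
example_ratio = size_per_pixel(1.4 * 1024 * 1024, 1024, 1024)

print(f"average ratio: {average_ratio:.5f}")
print(f"example ratio: {example_ratio:.5f}")
```

Comparing the ratio of a candidate image against the average gives a quick check on whether its file size is in the same range as the SD outputs.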
SDXL’s VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely --pretrained_vae_model_name_or_path that lets you specify the location of a better VAE (such as this one).
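To make the flag concrete, the sketch below assembles a training command that passes a separate VAE. Everything in it is an assumption for illustration: the script name `train_text_to_image_sdxl.py` (from the diffusers text-to-image examples), the VAE repository `madebyollin/sdxl-vae-fp16-fix`, the data and output paths, and the hyperparameter values. None of these are confirmed by the text above as the settings actually used.

```python
import shlex

# Illustrative values only; the paths and hyperparameters are hypothetical.
options = {
    "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0",
    # Assumed replacement VAE; swap in whichever VAE you trust.
    "pretrained_vae_model_name_or_path": "madebyollin/sdxl-vae-fp16-fix",
    "train_data_dir": "./data/posters",
    "resolution": 1024,
    "train_batch_size": 1,
    "mixed_precision": "fp16",
    "output_dir": "./sdxl-finetuned",
}


def build_command(script, opts):
    """Turn an options dict into an `accelerate launch` argument list."""
    cmd = ["accelerate", "launch", script]
    for name, value in opts.items():
        cmd.append(f"--{name}={value}")
    return cmd


command = build_command("train_text_to_image_sdxl.py", options)
print(shlex.join(command))
```

Building the command as a list keeps each flag visible and makes it easy to pass to `subprocess.run` later, if you prefer launching from Python rather than from a shell.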
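… this parameter is newly added there, with a default value of 1.0. Reference: https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
At present, evaluation in the AIGC field remains fairly subjective overall. Even so, this post uses an aesthetics score and the CLIP score to measure, respectively, the quality of the generated images and how well they match the text. The evaluation code is based on GhostReview, developed by the author of GhostMix. I take only part of it and make a few optimizations, so please read it alongside the original author's code. The code is as follows: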
    import os
    import sys

    import clip
    import numpy as np
    import pandas as pd
    import pytorch_lightning as pl
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from PIL import Image
    from scipy import stats


    class MLP(pl.LightningModule):
        """Regression head of the LAION aesthetics scorer, applied to CLIP image embeddings."""

        def __init__(self, input_size, xcol='emb', ycol='avg_rating'):
            super().__init__()
            self.input_size = input_size
            self.xcol = xcol
            self.ycol = ycol
            self.layers = nn.Sequential(
                nn.Linear(self.input_size, 1024),
                # nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(1024, 128),
                # nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(128, 64),
                # nn.ReLU(),
                nn.Dropout(0.1),
                nn.Linear(64, 16),
                # nn.ReLU(),
                nn.Linear(16, 1)
            )

        def forward(self, x):
            return self.layers(x)

        def training_step(self, batch, batch_idx):
            x = batch[self.xcol]
            y = batch[self.ycol].reshape(-1, 1)
            x_hat = self.layers(x)
            loss = F.mse_loss(x_hat, y)
            return loss

        def validation_step(self, batch, batch_idx):
            x = batch[self.xcol]
            y = batch[self.ycol].reshape(-1, 1)
            x_hat = self.layers(x)
            loss = F.mse_loss(x_hat, y)
            return loss

        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
            return optimizer


    def normalized(a, axis=-1, order=2):
        """L2-normalize `a` along `axis`, leaving all-zero vectors unchanged."""
        l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
        l2[l2 == 0] = 1
        return a / np.expand_dims(l2, axis)


    def PredictionLAION(image, laion_model, clip_model, clip_process, device='cpu'):
        """LAION aesthetics score for one image."""
        image = clip_process(image).unsqueeze(0).to(device)
        with torch.no_grad():
            image_features = clip_model.encode_image(image)
            im_emb_arr = normalized(image_features.cpu().detach().numpy())
            # Cast to float32 on the model's device. The original
            # .type(torch.FloatTensor) moved the tensor back to the CPU,
            # which fails when the model lives on the GPU.
            emb = torch.from_numpy(im_emb_arr).to(device=device, dtype=torch.float32)
            prediction = laion_model(emb)
        return float(prediction)


    def get_clip_score(image, text, clip_model, preprocess, device='cpu'):
        """CLIP score (image-text cosine similarity) for one image."""
        # Preprocess the image and tokenize the text
        image_input = preprocess(image).unsqueeze(0)
        text_input = clip.tokenize([text], truncate=True)

        # Move the inputs to the GPU if available
        image_input = image_input.to(device)
        text_input = text_input.to(device)

        # Generate embeddings for the image and text
        with torch.no_grad():
            image_features = clip_model.encode_image(image_input)
            text_features = clip_model.encode_text(text_input)

        # Normalize the features
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # The cosine similarity of the normalized features is the CLIP score
        clip_score = torch.matmul(image_features, text_features.T).item()
        return clip_score


    if __name__ == '__main__':
        # Image folders: one sub-folder per model under ImgRoot
        ImgRoot = './Image/ImageRating'
        DataFramePath = './dataresult/MyImageRating'  # per-model results for all prompts
        ModelSummaryFile = './ImageRatingSummary/MyModelSummary_Total.csv'
        PromptsFolder = os.listdir(ImgRoot)
        os.makedirs(DataFramePath, exist_ok=True)
        os.makedirs(os.path.dirname(ModelSummaryFile), exist_ok=True)

        # Prompts that the images were generated from
        PromptDataFrame = pd.read_csv('./PromptsForReviews/mytest.csv')
        PromptsList = list(PromptDataFrame['Prompts'])

        # Load the scoring models
        device = "cuda" if torch.cuda.is_available() else "cpu"
        MLP_Model = MLP(768)  # CLIP embedding dim is 768 for CLIP ViT-L/14
        # Load the LAION aesthetics model
        state_dict = torch.load("./models/sac+logos+ava1-l14-linearMSE.pth",
                                map_location=torch.device(device))
        MLP_Model.load_state_dict(state_dict)
        MLP_Model.to(device)
        MLP_Model.eval()
        # Load the pre-trained CLIP model and its image preprocessing
        CLIP_Model, CLIP_Preprocess = clip.load('ViT-L/14', device=device,
                                                download_root='./models/clip')  # RN50x64
        CLIP_Model.to(device)
        CLIP_Model.eval()

        # Skip folders that were already scored in a previous run
        try:
            DataSummaryDone = pd.read_csv(ModelSummaryFile)
            done = set(DataSummaryDone['Model'].astype(str))
            PromptsNotDone = [i for i in PromptsFolder if str(i) not in done]
        except (FileNotFoundError, pd.errors.EmptyDataError, KeyError):
            DataSummaryDone = pd.DataFrame()
            PromptsNotDone = list(PromptsFolder)

        if not PromptsNotDone:
            sys.exit("There are no models to analyze.")

        for i, name in enumerate(PromptsNotDone):
            FolderPath = os.path.join(ImgRoot, str(name))
            ImageInFolder = os.listdir(FolderPath)
            DataCollect = pd.DataFrame()
            for j, img in enumerate(ImageInFolder):
                # The second '-'-separated field of the file name is the prompt index
                prompt_index = int(img.split('-')[1])
                txt = PromptsList[prompt_index]
                ImagePath = os.path.join(FolderPath, img)
                Img = Image.open(ImagePath)
                # CLIP score
                ImgClipScore = get_clip_score(Img, txt, CLIP_Model, CLIP_Preprocess, device)
                # aesthetics scorer
                # ImageScore = predict(Img)
                # LAION aesthetics scorer
                ImageLAIONScore = PredictionLAION(Img, MLP_Model, CLIP_Model, CLIP_Preprocess, device)
                # temp = list(ImageScore)
                temp = list()
                temp.append(float(ImgClipScore))
                temp.append(ImageLAIONScore)
                temp = pd.DataFrame(temp)
                DataCollect = pd.concat([DataCollect, temp], axis=1)
                print("Model:{}/{}, image:{}/{}".format(i + 1, len(PromptsNotDone), j + 1, len(ImageInFolder)))
            DataCollect = DataCollect.T
            DataCollect['ImageIndex'] = [k + 1 for k in range(len(ImageInFolder))]
            DataCollect.columns = ['ClipScore', 'LAIONScore', 'ImageIndex']
            # Save the raw per-image scores
            DataCollect.to_csv(os.path.join(DataFramePath, str(name) + '.csv'), index=False)
            print("One Results File Saved!")
        print('Image rating complete!')

        # Summary statistics per model
        ModelSummary = pd.DataFrame()
        for name in PromptsNotDone:
            DataCollect = pd.read_csv(os.path.join(DataFramePath, str(name) + '.csv'))
            temp = pd.DataFrame(DataCollect['LAIONScore'].describe()).T
            # Skewness of the scores
            temp['skew'] = stats.skew(DataCollect['LAIONScore'], axis=0, bias=True, nan_policy="propagate")
            # Kurtosis of the scores
            temp['kurtosis'] = stats.kurtosis(DataCollect['LAIONScore'], axis=0, fisher=True, bias=True,
                                              nan_policy="propagate")
            temp.columns = [col + '_LAIONScore' for col in list(temp.columns)]
            # temp['RatingScore_mean']=np.mean(DataCollect['Rating'])
            # temp['RatingScore_std']=np.std(DataCollect['Rating'])
            temp['Clipscore_mean'] = np.mean(DataCollect['ClipScore'])
            temp['Clipscore_std'] = np.std(DataCollect['ClipScore'])
            # temp['Artifact_mean']=np.mean(DataCollect['Artifact'])
            # temp['Artifact_std']=np.std(DataCollect['Artifact'])
            temp['Model'] = str(name)
            ModelSummary = pd.concat([ModelSummary, temp], axis=0)

        # Save results
        new_order = ['Model', 'count_LAIONScore', 'mean_LAIONScore', 'std_LAIONScore', 'min_LAIONScore',
                     '25%_LAIONScore', '50%_LAIONScore', '75%_LAIONScore', 'max_LAIONScore',
                     'skew_LAIONScore', 'kurtosis_LAIONScore', 'Clipscore_mean', 'Clipscore_std']
        # Reorder the columns with reindex()
        ModelSummary = ModelSummary.reindex(columns=new_order)
        DataSummaryDone = pd.concat([DataSummaryDone, ModelSummary], axis=0)
        # index=False keeps a stray index column from accumulating across runs
        DataSummaryDone.to_csv(ModelSummaryFile, index=False)

        pd.set_option('display.max_rows', None)  # None means no limit
        pd.set_option('display.max_columns', None)  # None means no limit
        pd.set_option('display.width', 1000)  # display width of 1000 characters
        print(DataSummaryDone)
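The summary step reports the skewness and kurtosis of the aesthetics scores. As a rough illustration of what those two numbers measure, here is a standard-library sketch of the biased (population) estimators, which is what `bias=True` together with `fisher=True` selects in SciPy. The function names and the sample scores are made up for the example.

```python
def central_moment(values, k):
    """k-th central moment using the population (biased) normalization 1/n."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def skewness(values):
    """Biased sample skewness: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Biased Fisher (excess) kurtosis: m4 / m2**2 - 3, which is 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]       # evenly spread, no tail
right_tailed_scores = [5.0, 5.1, 5.2, 5.3, 8.0]    # one unusually high score

print(skewness(symmetric_scores), excess_kurtosis(symmetric_scores))
print(skewness(right_tailed_scores), excess_kurtosis(right_tailed_scores))
```

A positive skew means a few images score well above the rest, a negative skew means a few score well below, and a large excess kurtosis signals heavy tails, so the mean alone can hide how consistent a model is.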
A feline peering out from a striped transparent travel bag with a bicycle in the background. Outdoor setting, sunset ambiance. Product advertisement of pet bag, No humans, focus on cat and bag, vibrant colors, recreational theme
Four amber glass bottles with droppers placed side by side, arranged on a white background, skincare product promotion, no individuals present, still life setup
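If prompts like the two above are used for evaluation, the scoring script expects them in a CSV with a `Prompts` column and recovers the prompt index from the second `-`-separated field of each image file name. The sketch below shows that mapping with the standard library only. The file names (`sdxl-0-0001.png` and so on) and the temporary directory are hypothetical.

```python
import csv
import os
import tempfile

prompts = [
    "A feline peering out from a striped transparent travel bag with a bicycle "
    "in the background. Outdoor setting, sunset ambiance. Product advertisement "
    "of pet bag, No humans, focus on cat and bag, vibrant colors, recreational theme",
    "Four amber glass bottles with droppers placed side by side, arranged on a "
    "white background, skincare product promotion, no individuals present, "
    "still life setup",
]


def prompt_index(filename):
    """Mirror the scoring script: the second '-'-separated field is the prompt index."""
    return int(filename.split('-')[1])


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mytest.csv")
    # Write one prompt per row under the 'Prompts' header.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Prompts"])
        writer.writeheader()
        for p in prompts:
            writer.writerow({"Prompts": p})
    # Read the prompts back in order, as the scoring script would.
    with open(csv_path, newline="", encoding="utf-8") as f:
        loaded = [row["Prompts"] for row in csv.DictReader(f)]

# Hypothetical image file names following a <model>-<prompt index>-<seed> pattern.
image_files = ["sdxl-0-0001.png", "sdxl-1-0001.png", "sdxl-1-0002.png"]
prompt_for = {name: loaded[prompt_index(name)] for name in image_files}

for name, text in prompt_for.items():
    print(name, "->", text[:40])
```

Note that the index must be followed by another `-`, otherwise the extracted field would include the file extension and `int()` would fail.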
"Text rendering remains unreliable; they believe the model struggles to map word tokens to letters in the image."
