
[linux-sd-webui] Image to prompt: BLIP / DeepDanbooru (deepbooru)


References:

- pharmapsychotic/clip-interrogator (Image to prompt with BLIP and CLIP): https://github.com/pharmapsychotic/clip-interrogator
- salesforce/LAVIS (A One-stop Library for Language-Vision Intelligence): https://github.com/salesforce/LAVIS
- clip_interrogator tutorial (Zhihu): https://zhuanlan.zhihu.com/p/624066332
- Image-to-text with clip-interrogator (Zhihu): https://zhuanlan.zhihu.com/p/578505705
- clip-interrogator-ext, scripts/clip_interrogator_ext.py (Gitee): https://gitee.com/dbscholar0/clip-interrogator-ext/blob/main/scripts/clip_interrogator_ext.py

BLIP is a multimodal vision-language model. webui uses BLIP v1, although BLIP v2 is already available. DeepDanbooru is best suited to anime-style images; for everything else BLIP is the recommended choice. BLIP is available in two packagings: pharmapsychotic/clip-interrogator (image to prompt with BLIP and CLIP), and the original authors' consolidated library salesforce/LAVIS, which also hosts the training code and covers both BLIP v1 and v2.
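For reference, captioning with LAVIS boils down to a few lines. This is a minimal sketch, assuming the model/type identifiers from the LAVIS README; the image path is a placeholder:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("demo.jpg").convert("RGB")  # placeholder image path

# load a BLIP v1 captioning model together with its matching image preprocessor
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ['a photo of ...']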

1. GitHub - pharmapsychotic/clip-interrogator: Image to prompt with BLIP and CLIP

This repo also supports BLIP v1/v2 and mirrors the feature in sd-webui; it is inference-only, and at its core it uses BlipForConditionalGeneration and Blip2ForConditionalGeneration from the Hugging Face transformers library. The core call path is traced below, and a short usage sketch follows the trace.
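Before walking the trace, here is a rough sketch of the bare transformers captioning path that the library wraps. The checkpoint name is an assumption; use whichever BLIP weights you actually have locally:

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-large"  # assumed checkpoint
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("demo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")  # pixel_values: [1, 3, 384, 384]
tokens = model.generate(**inputs, max_length=32)       # greedy decoding by default
print(processor.batch_decode(tokens, skip_special_tokens=True)[0])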

1. config=Config->
2. ci=Interrogator(config)->clip_interrogator.clip_interrogator->load_caption_model()->load_clip_model()->
3. - tokenize=open_clip.get_tokenizer(clip_model_name)->
4. inference(ci,image,mode)->
5. Interrogator.interrogate()->
6. caption=caption or self.generate_caption(image)->
7. - self._prepare_caption()->
8. - self.caption_model=self.caption_model()->
9. inputs=self.caption_processor(pil_image)->
10. - transformers.models.blip.processing_blip.BlipProcessor.__call__->
11. - encoding_image_processor=self.image_processor(images)->
12. tokens=self.caption_model.generate(inputs[1,3,384,384],self.config.caption_max_length)[BlipForConditionalGeneration.generate()]->
13. - vision_outputs=self.vision_model(pixel_values)->
14. - image_embeds=vision_outputs[0] (1,577,1024)->
15. - outputs=self.text_decoder.generate() [transformers/generation/utils.py->GenerationMixin]->
16. -- model_kwargs['attention_mask']=self._prepare_attention_mask_for_generation(input)->
17. -- logits_processor=self._get_logits_processor()->
18. -- stopping_criteria=self._get_stopping_criteria()->
19. -- return self.greedy_search()->
20. --- outputs=self(**model_inputs,...)->
21. --- blip.modeling_blip_text.BlipTextLMHeadModel.forward()->
22. --- outputs=self.bert()->outputs:[last_hidden_state,past_key_values]->
23. --- sequence_output=outputs[0] [1,1,768]->
24. --- prediction_scores=self.cls(sequence_output)->
25. ---- BlipTextOnlyMLMHead.forward()->
26. ---- BlipTextLMPredictionHead().forward()->transform->decoder->
27. ---- prediction_scores [1,1,30524]
28. --- transformers.modeling_outputs.CausalLMOutputWithCrossAttentions()->
29. -- next_token_logits=outputs.logits[:,-1,:]->
30. -- next_tokens_scores=logits_processor(input_ids,next_token_logits)->
31. -- next_tokens=torch.argmax(next_tokens_scores,dim=-1) [1]->
32. -- input_ids [1,11]->
33. - self.caption_processor.batch_decode(tokens)->
34. -- blip.processing_blip.BlipProcessor.batch_decode()->
35. -- tokenization_utils_base.PreTrainedTokenizerBase.batch_decode()->
36. -- tokenization_utils_fast.PreTrainedTokenizerFast._decode()->
37. --- text=self._tokenizer.decode()->
38. image_features=self.image_to_features(image)->
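The whole trace above is reachable through a very small public API. A minimal usage sketch follows; the clip_model_name string is an assumption, match it to the CLIP weights you downloaded:

from PIL import Image
from clip_interrogator import Config, Interrogator

ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
image = Image.open("demo.jpg").convert("RGB")

print(ci.interrogate(image))       # BLIP caption + CLIP-ranked phrases ("best" mode)
print(ci.interrogate_fast(image))  # cheaper variant that skips most of the ranking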

The model's output has several parts:

people walking around a building with a glass facade, rendering of the windigo, stroopwaffel, grand library, inspired by Peter Fiore, detailled light, h 1024, archviz, inspired by Lodewijk Bruckman, the photo shows a large, librarian, soft curvy shape, phase 2, clogs

In the long description above, only the part before the first comma is generated from the objects and their layout in the image; everything after it is selected by ranking candidate phrases whose CLIP features are most similar to the image features, drawn from four phrase banks:

artists, flavors, mediums, movements

artists contains painter names, while mediums and movements describe the kind of medium/style and the art movement a picture belongs to.

flavors holds a large pool of descriptive phrases, which lets the CLIP similarity computation quickly find the best-matching additions. The first clause is the one generated by BLIP.

None of this extra ranking exists in stable-diffusion-webui: webui stops at the prompt produced by the language model and does not go on to use CLIP similarity to pull in words from these phrase-bank categories.
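The ranking step itself is just cosine similarity in CLIP space. A hedged sketch with open_clip; the flavor strings below are made-up placeholders, the real lists ship as data files inside clip-interrogator:

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("demo.jpg").convert("RGB")).unsqueeze(0)
flavors = ["detailled light", "archviz", "soft curvy shape"]  # placeholder phrases

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokenizer(flavors))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T  # [1, len(flavors)]

best = similarity[0].topk(2)
print([flavors[i] for i in best.indices])  # the highest-scoring phrases get appended to the caption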

install

The main install problem is with the open_clip_torch package: its openai.py insists on downloading the pretrained CLIP weights, so point it at a local path instead.

cp -r /home/sniss/.local/lib/python3.7/site-packages/open_clip_torch-2.16.0-py3.7.egg/open_clip/openai.py /home/sniss/local_disk/

Edit openai.py around line 57:

# if get_pretrained_url(name, 'openai'):
#     model_path = download_pretrained_from_url(get_pretrained_url(name, 'openai'), cache_dir=cache_dir)
# elif os.path.isfile(name):
#     model_path = name
# else:
#     raise RuntimeError(f"Model {name} not found; available models = {list_openai_models()}")
model_path = cache_dir
Then copy the patched file back:

cp -r openai.py /home/sniss/.local/lib/python3.7/site-packages/open_clip_torch-2.16.0-py3.7.egg/open_clip/openai.py

2. deepdanbooru

The wd 1.4 tagger uses this model as well. It is essentially a classification model: BLIP is a language model whose output is a natural-language passage, whereas DeepDanbooru is a classifier whose output is a set of individual tags; its author scraped roughly 6000 tags from the danbooru site to train it. There is a maintained wd1.4 tagger extension for stable-diffusion-webui that ships newer classification backbones such as ConvNeXt and ViT.

# from AUTOMATIC1111
# maybe modified by Nyanko Lepsoni
# modified by crosstyan
import os.path
import re
import tempfile
import argparse
import glob
import zipfile

import deepdanbooru as dd
import tensorflow as tf
import numpy as np

from basicsr.utils.download_util import load_file_from_url
from PIL import Image
from tqdm import tqdm

re_special = re.compile(r"([\\()])")


def get_deepbooru_tags_model(model_path: str):
    # download and unpack the DeepDanbooru v3 release if the project files are missing
    if not os.path.exists(os.path.join(model_path, "project.json")):
        is_abs = os.path.isabs(model_path)
        if not is_abs:
            model_path = os.path.abspath(model_path)

        load_file_from_url(
            r"https://github.com/KichangKim/DeepDanbooru/releases/download/v3-20211112-sgd-e28/deepdanbooru-v3-20211112-sgd-e28.zip",
            model_path,
        )
        with zipfile.ZipFile(
            os.path.join(model_path, "deepdanbooru-v3-20211112-sgd-e28.zip"), "r"
        ) as zip_ref:
            zip_ref.extractall(model_path)
        os.remove(os.path.join(model_path, "deepdanbooru-v3-20211112-sgd-e28.zip"))

    tags = dd.project.load_tags_from_project(model_path)
    model = dd.project.load_model_from_project(model_path, compile_model=False)
    return model, tags


def get_deepbooru_tags_from_model(
    model,
    tags,
    pil_image,
    threshold,
    alpha_sort=False,
    use_spaces=True,
    use_escape=True,
    include_ranks=False,
):
    # resize and pad the image to the model's expected input resolution
    width = model.input_shape[2]
    height = model.input_shape[1]
    image = np.array(pil_image)
    image = tf.image.resize(
        image,
        size=(height, width),
        method=tf.image.ResizeMethod.AREA,
        preserve_aspect_ratio=True,
    )
    image = image.numpy()  # EagerTensor to np.array
    image = dd.image.transform_and_pad_image(image, width, height)
    image = image / 255.0
    image_shape = image.shape
    image = image.reshape((1, image_shape[0], image_shape[1], image_shape[2]))

    y = model.predict(image)[0]  # one confidence score per tag

    result_dict = {}
    for i, tag in enumerate(tags):
        result_dict[tag] = y[i]

    # keep tags above the confidence threshold, skipping the "rating:" meta tags
    unsorted_tags_in_threshold = []
    result_tags_print = []
    for tag in tags:
        if result_dict[tag] >= threshold:
            if tag.startswith("rating:"):
                continue
            unsorted_tags_in_threshold.append((result_dict[tag], tag))
            result_tags_print.append(f"{result_dict[tag]} {tag}")

    # sort tags
    result_tags_out = []
    sort_ndx = 0
    if alpha_sort:
        sort_ndx = 1

    # sort by reverse by likelihood and normal for alpha, and format tag text as requested
    unsorted_tags_in_threshold.sort(key=lambda y: y[sort_ndx], reverse=(not alpha_sort))
    for weight, tag in unsorted_tags_in_threshold:
        tag_outformat = tag
        if use_spaces:
            tag_outformat = tag_outformat.replace("_", " ")
        if use_escape:
            tag_outformat = re.sub(re_special, r"\\\1", tag_outformat)
        if include_ranks:
            tag_outformat = f"({tag_outformat}:{weight:.3f})"
        result_tags_out.append(tag_outformat)

    # print("\n".join(sorted(result_tags_print, reverse=True)))
    return ", ".join(result_tags_out)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", type=str, default="./images/")
    parser.add_argument("--threshold", type=float, default=0.75)
    parser.add_argument("--alpha_sort", type=bool, default=False)
    parser.add_argument("--use_spaces", type=bool, default=True)
    parser.add_argument("--use_escape", type=bool, default=True)
    parser.add_argument("--model_path", type=str, default="./deepdanbooru-models")
    parser.add_argument("--include_ranks", type=bool, default=False)
    args = parser.parse_args()

    # global model_path
    # model_path:str
    if args.model_path == "":
        script_path = os.path.realpath(__file__)
        default_model_path = os.path.join(os.path.dirname(script_path), "deepdanbooru-models")
        # print("No model path specified, using default model path: {}".format(default_model_path))
        model_path = default_model_path
    else:
        model_path = args.model_path

    types = ('*.jpg', '*.png', '*.jpeg', '*.gif', '*.webp', '*.bmp')
    files_grabbed = []
    for files in types:
        files_grabbed.extend(glob.glob(os.path.join(args.path, files)))
        # print(glob.glob(args.path + files))

    model, tags = get_deepbooru_tags_model(model_path)
    for image_path in tqdm(files_grabbed, desc="Processing"):
        image = Image.open(image_path).convert("RGB")
        prompt = get_deepbooru_tags_from_model(
            model,
            tags,
            image,
            args.threshold,
            alpha_sort=args.alpha_sort,
            use_spaces=args.use_spaces,
            use_escape=args.use_escape,
            include_ranks=args.include_ranks,
        )
        image_name = os.path.splitext(os.path.basename(image_path))[0]
        txt_filename = os.path.join(args.path, f"{image_name}.txt")
        # print(f"writing {txt_filename}: {prompt}")
        with open(txt_filename, 'w') as f:
            f.write(prompt)
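If the script above is saved as, say, deepbooru_tag.py (the filename is arbitrary), a batch run over a folder could look like the line below; each image then gets a sibling .txt file containing its comma-separated tags:

python deepbooru_tag.py --path ./images/ --threshold 0.6 --model_path ./deepdanbooru-models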
