
Important parameters in the textual_inversion config file


Config file:

Most of these extra parameters are outlined in the actual paper. They are just flags for enabling our baselines.

Here's what they do (a config sketch follows the list):

  • placeholder_strings: The actual string that will represent the concept in future prompts. This must be a single-token string, so don't use things like "myplaceholderword".
  • initializer_words: The words used as the initialization for the 'new word' embedding. This is not the placeholder, and will be overwritten by whatever you provide with --init_word when running main.py. It's just a starting point for the optimization. It should be some rough description of your concept's main class, for example 'dog', 'face', 'painting' etc.
  • per_image_tokens: This gives each image in your training set (up to ~20ish images) its own unique token, in addition to the shared placeholder. This way, the model can try to capture the shared information in "*", and the image-specific information (background etc.) in the specific tokens. For LDM this led to worse results (some information about the shared concept was instead stored in the per-image tokens).
  • num_vectors_per_token: This is the number of embedding vectors used to represent the concept. Basically, how many 'words' the concept is represented as behind the scenes. More words = more expressivity, but also more overfitting, so it's harder to edit images later.
  • progressive_words: If you are using more than one vector per token, you can enable this to increase the number of vectors progressively over training. You'd start with 1 word, which will capture the concept as best it can, and after a set number of training iterations, the model will move to using more and more vectors. This is meant to give the same flexibility as the multi-vector approach while possibly reducing overfitting. This also didn't work well with LDM, so it's off by default.
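To see these side by side, here is a minimal sketch of how the parameters above might be grouped in the training config, written with OmegaConf (the config library that LDM-style training scripts typically build on). The nesting under model.params.personalization_config and the example values are assumptions for illustration, not taken from this post; check your local YAML before editing it.

```python
# A sketch only: the nesting (model.params.personalization_config.params)
# mirrors LDM-style finetune configs, but verify it against your own YAML.
# The values shown are just the defaults/examples discussed above.
from omegaconf import OmegaConf

personalization = OmegaConf.create({
    "model": {
        "params": {
            "personalization_config": {
                "params": {
                    "placeholder_strings": ["*"],   # single-token placeholder used in prompts
                    "initializer_words": ["dog"],   # rough class word; --init_word overrides it
                    "per_image_tokens": False,      # per-image tokens hurt results for LDM
                    "num_vectors_per_token": 1,     # more vectors = more expressive, more overfitting
                    "progressive_words": False,     # grow the vector count during training (off by default)
                }
            }
        }
    }
})

print(OmegaConf.to_yaml(personalization))
```

Printing it back as YAML makes it easy to diff against the checked-in config and spot which keys you actually changed.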

Output folder images:

gs stands for global step; it is the training step at which the images were produced.

  • inputs: These are just the input images provided to the model at this iteration.
  • reconstruction: If you read the LDM/SD paper, you'll see that part of the model is a network that knows how to compress and uncompress the images into some latent space, where the actual diffusion is performed. The reconstructions show the input images after this compression and uncompression step, so you can make sure they are loaded correctly / see what small details might be lost at this step.
  • conditioning: These are the conditioning sentences used when sampling images for validation. It literally just prints out the text that was used to produce each of the samples in the same step.
  • samples: Samples produced at this step with the conditioning texts but without classifier-free guidance (i.e., a guidance scale of 1.0). These will be mostly noise.
  • samples_scaled: Samples produced at this step with the conditioning texts and with classifier-free guidance (a guidance scale of 5.0). These should look like your concept.

The output you want to track is samples_scaled. Everything else is mostly for debugging purposes.
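To make the difference between samples and samples_scaled concrete, here is a minimal sketch of the standard classifier-free guidance combination (eps_uncond + scale * (eps_cond - eps_uncond)). The function and tensor names are illustrative, not the repo's actual sampler code.

```python
# Illustrative only: shows why a guidance scale of 1.0 gives the plain
# conditional prediction ("samples") while a larger scale pushes the
# prediction toward the text condition ("samples_scaled").
import torch

def guided_noise_pred(model, x_t, t, cond, uncond, guidance_scale):
    eps_cond = model(x_t, t, cond)      # noise prediction with the prompt
    eps_uncond = model(x_t, t, uncond)  # noise prediction with an empty prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

if __name__ == "__main__":
    # Toy stand-in "model" so the snippet runs on its own.
    dummy_model = lambda x, t, c: 0.1 * x + c
    x_t = torch.randn(1, 4, 8, 8)
    cond, uncond = torch.ones_like(x_t), torch.zeros_like(x_t)
    for scale in (1.0, 5.0):
        eps = guided_noise_pred(dummy_model, x_t, 0, cond, uncond, scale)
        print(f"scale={scale}: mean |eps| = {eps.abs().mean().item():.3f}")
```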

Reference: https://github.com/rinongal/textual_inversion/issues/19
