Stable Diffusion

From Spanking Art
The main user interface of AUTOMATIC1111's open-source web UI implementation of the Stable Diffusion AI image generation model, with the sd-1-5-pruned-emaonly.ckpt model loaded, in the txt2img tab, in dark mode.
This plot serves to demonstrate the noise diffusion process, using the v1-5 model, the prompt "a (european castle:1.3) in japan. by Albert Bierstadt, ray traced, octane render, 8k" and the DDIM sampling method.

Stable Diffusion is an open source AI art generator software released in 2022. It is primarily used to generate detailed images conditioned on text descriptions (text-to-image, "txt2img"), though it also has image-to-image ("img2img") features such as inpainting and outpainting. It can generate output in practically any artistic style as well as photorealistic images.

Stable Diffusion's code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services. Stable Diffusion can run on Windows, Linux, macOS and other operating systems. There are also ways to use Stable Diffusion without a GPU (graphics card), and ways to use Stable Diffusion online on any computer or mobile device; see the links below.

The many different implementations and distributions of Stable Diffusion each have their own special features, and these change rapidly as new versions of the software are released. The rest of this page describes the AUTOMATIC1111 web UI implementation and some of its extensions, unless otherwise noted.

Common issues[edit]

Complexity and frustration[edit]

Stable Diffusion has so many settings and ways to influence the results that learning to use it takes endless hours of experimenting to get a feeling for what might perhaps help, and what most likely doesn't, in a particular situation. Its complexity can be overwhelming and is ever growing.

The user will experience moments of happy surprise when something amazing is produced, seemingly like magic. But depending on what they are trying to do, especially in img2img mode, these moments will be more than counterbalanced by endless setbacks when a less-than-perfect output only becomes worse, in one way or another, when they are trying to improve it.

Bad anatomy[edit]

Image generated by Stable Diffusion with several anatomic issues

A common issue with Stable Diffusion's generated images is that characters often have flawed anatomy, resulting in grotesque, "freaky" looking output.

The image on the right is a typical example. Stable Diffusion was asked to generate the following:

  • Prompt: "Realistic photo of a beautiful sexy chubby , muddy and wet Japan girl (bra-less) wearing green lacy bikini , sitting on top of moldy rock in flooded rural area in Malaysia ,"
  • Negative prompt: "cartoon, anime, drawing, painting, 3D , 3D render"

The result looks photorealistic, but has a number of anatomic issues:

  1. The woman has three arms.
  2. She has two belly buttons.
  3. Her right hand has six fingers.
  4. Her right foot looks deformed.

Such issues can be fixed by various methods, such as:

  • Generating a large number of images with the same prompt and random seeds until you get one that is flawless.
  • Using a fixed seed plus a variation seed to produce variants of the image you like best, until all of its issues are repaired.
  • Alternatively, image variants can be combined using an image editor, keeping those parts that are the best.
  • Inpainting can be used to change flawed parts until they are correct, and to blend parts of a collage.
  • ControlNet can be used to control a character's anatomy.

Stable Diffusion also often has difficulties with hands, bare feet, and characters in somewhat unusual poses that the model has not seen enough in its training images.

Problems with too big image resolutions[edit]

Image in 512x768 pixels resolution illustrating the "double character" issue. The prompt asked for a "naked little boy", but the AI generated two on top of another, merging the upper one's legs with the lower one's arms.

Stable Diffusion's original model was trained on 512x512 pixel images, and works best at this resolution. You can increase the resolution of the output image, but if it is too large in any direction, you are likely to end up with double characters, often monstrosities where part of a person is mounted on top of another, possibly absurdly merged. This issue can already appear at moderate resolutions such as 512x768 pixels, and becomes more likely the higher the resolution gets.

There are workarounds for this problem, such as "Hires. fix" (see below). Another common workaround is to generate the image at low resolution, then scale it up (see below) and refine it via inpainting (see below) at higher resolution. Or to generate the core part of the image at low resolution and then use outpainting (see below) to expand the image. ControlNet (see below) can also be used to prevent the problem from occurring.

Problems with multiple characters[edit]

When Stable Diffusion generates two or more characters in the same output image, it tends to make the characters very similar to each other. The attributes of one character will tend to be transferred to the other character(s), such as: pose, gender, age, hair, skin colour, clothing and nudity. Even with prompts that very clearly describe the distinct attributes of the characters, Stable Diffusion will typically be unable to fulfil the request satisfactorily.

This issue can only be overcome when the training images provide strong (stereo)typical two-character scenes, such as a bride and a groom, or a mother and a baby.

Spanking art with Stable Diffusion[edit]

Spanking art with Stable Diffusion: A tutorial by Spankart (11/2022).

The AI image generation technology has great potential for the spanking art scene, comparable to the appearance of rendered spanking art in the mid-1990s. It has a totally different approach from rendered art and thus can complement that existing technology.

Compared to rendered art, AI-generated art is much more like painting and much less 3D oriented, although it can output realistic 3D-looking images. While rendered art and any other traditional digital art creation is predictable, AI image generation has a great element of randomness and surprise for the artist. There will be moments of joy when the software produces fairly good or fantastic output, and long phases of frustration when the software fails again and again at seemingly simple tasks. On the downside, the artist has a considerably reduced influence on the details of the produced image, although the artist is still the one in charge, somewhat like an "animal trainer". On the upside, AI image generation is more art-focused and versatile in terms of artistic style, and can overcome some issues and limitations that 3D image rendering technology has.

Comparatively weak but still highly relevant for creating spanking art is Stable Diffusion's img2img feature as a tool to transform an input image. The possibilities include:

  • A painted or sketched drawing (digital or analog) can be transformed into a more realistic, perhaps even photorealistic, image.
  • A photo or rendered art can be transformed into a more painterly artwork, and this can be done in countless artistic styles.
  • Things that are difficult to do in rendered art (such as complex backgrounds, interior designs, outdoor scenes) can be easier to do with an AI.
  • Rendered art can be improved upon, for example in terms of lighting, clothes, hair, skin, surface materials or textures.
  • Objects or characters can be added to, or removed from, the image.
  • A character can be transformed, for example into a different age, different facial expression, a different gender, different hairstyle, different body type, different clothes, and/or states of nudity.

The extension ControlNet (see below) is highly useful for creating spanking art. It gives the user much better control over the output, especially in terms of composing the scene and posing characters.

Many artists who create spanking images with Stable Diffusion also make use of image editors, for example to add redness or marks to bottoms, or to add spanking implements or other objects that Stable Diffusion has difficulty creating.


In October 2022, CarthagFall published "Method for AI-assisted spankings", a short tutorial on his method to create AI-generated spanking art with the assistance of an AI image generator (Stable Diffusion).[1]

In November 2022, Spankart, too, explored AI technology for spanking artwork creation and began demonstrating his technique in a variety of styles. His method is similar: posing characters and improving a generated image in successive iterations, keeping a current working version of the image in an image editor (such as Gimp or Photoshop), using an exported image as input for img2img, generating variations with the AI, selecting variations that improve the image, blending them as new layers with the working version, and using the image editor's other tools to make manual edits wherever this helps. Spankart also wrote this wiki article about Stable Diffusion which is hopefully a useful resource for other spanking artists to get started with this new technology.

NAI uses a somewhat different approach: he trains custom Stable Diffusion models to generate spanking art directly.

For a list of more artists, see AI-generated spanking art.


The following gallery shows artwork created by Spankart using Stable Diffusion and Gimp, with up to 1000 manual inpainting iterations for each work:

Basic concepts and features[edit]


Models[edit]

In Stable Diffusion, a model is a (typically quite large) file that contains the weights of a trained neural network. These models are also called checkpoints, probably for historic reasons, and have the file extension .ckpt.

The model is what the AI has "learned" by analyzing millions or billions of images, and what enables it to produce new images. Stable Diffusion's first widely distributed model was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, using 5 billion image-text pairs.

There are also other Stable Diffusion models that were trained on other images, which greatly impacts the generated outputs. For example, WaifuDiffusion was trained on 250k anime-styled images. WaifuDiffusion is therefore better able to create anime-style art, but less able to create other artistic styles or photographic-looking output. See the links about Stable Diffusion models below.

Example images[edit]

The influence of the model on the output can be demonstrated with the following example images. All are batches of 6 images generated in txt2img mode with random seeds. In the first image set, we let Stable Diffusion generate images with an empty prompt. These are quite interesting. They give pointers to the kind of images the model was most trained on, for example landscapes, portraits, indoor scenes, artwork or photos. In a way, they show what the model "dreams in sleep" when there is no prompt at all to guide its output in any direction.

In the second set of example images, we did the same with the prompt "girl". This prompt is intentionally minimalistic and ambiguous as it can refer to a younger or older female person. Not specifying any style in the prompt (such as "photo", "artwork", "painting" or "anime") and not specifying any view like "close-up", "full body" or "portrait" will show what kind of style and view the model most likes to produce.

Prompts and negative prompts[edit]

A batch of twelve 512x512 images were generated with txt2img using the following prompts:
Prompt: 19-year-old blonde busty girl, short straight undercut hair, dynamic posing, art style of Albert Lynch, Carne Griffiths, Franz Xaver Winterhal
Negative prompt: (((deformed))), [blurry], bad anatomy, disfigured, poorly drawn face, mutation, mutated, (extra_limb), (ugly), (poorly drawn hands), messy drawing, ((((mutated hands and fingers)))).
Settings: Steps: 50, Sampler: Euler a, CFG scale: 7
Photorealistic image generated with the prompt "Photographic portrait of beautiful women wearing white lace dress, glowing skin, Sony α7, 35mm Lens, f1.8, film grain, golden hour, soft lighting, by Daniel F Gerhartz".

Prompts are used to tell the AI what to produce. Stable Diffusion generates noise (or an input image plus noise) and then modifies this noise gradually towards a clearer image, based on the prompt and its model.

Negative prompts can be optionally used to do the opposite. They direct the process away from whatever the negative prompt says.

Stable Diffusion works best with prompts in English, as the standard model was trained with English-language captions.

Prompts and negative prompts can optionally use weights to strengthen or reduce the weight of specific words or phrases, making the AI draw more, or less, attention to them. This feature is also called emphasis markers.

  • Round brackets add weight: a (word) increases attention to word by a factor of 1.1.
  • Square brackets reduce weight: a [word] decreases attention to word by a factor of 1.1, i.e. a weight of about 0.91.
  • Brackets can be nested to further increase or decrease the weight: a ((word)) increases attention to word by a factor of 1.21 (= 1.1 * 1.1).
  • Alternative weight syntax: a (word:1.5) increases attention to word by a factor of 1.5.
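The arithmetic of these emphasis markers can be sketched in a few lines of Python. This is an illustration of the rules listed above, not AUTOMATIC1111's actual parser, and the function name is hypothetical:

```python
def token_weight(token):
    """Return (word, weight) implied by emphasis markers around a single token."""
    increase = 0
    while token.startswith("(") and token.endswith(")"):
        token, increase = token[1:-1], increase + 1
    decrease = 0
    while token.startswith("[") and token.endswith("]"):
        token, decrease = token[1:-1], decrease + 1
    if ":" in token:  # explicit "(word:1.5)" syntax overrides the bracket factor
        token, weight = token.rsplit(":", 1)
        return token, float(weight)
    # each "(" level multiplies by 1.1, each "[" level divides by 1.1
    return token, round(1.1 ** increase / 1.1 ** decrease, 4)
```

For example, token_weight("((word))") returns ("word", 1.21), and token_weight("[word]") returns ("word", 0.9091).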

Fast way to add and change prompt weights[edit]

In AUTOMATIC1111, you can easily add and change weights by selecting a word or phrase from the prompt and then pressing CTRL+Up or CTRL+Down. Once the brackets are there, you no longer need to mark the word or phrase if you want to change its weight. Just place the cursor anywhere between the brackets and then press CTRL+Up or CTRL+Down.

Animated GIF showing how to add and change prompt weights in AUTOMATIC1111.

Image generation far beyond the prompt[edit]

Most of the image generation is actually not influenced by the prompt, but by the training images and by the rules the AI has extracted from them. For example, Stable Diffusion has a good understanding of the following concepts and applies them automatically all the time:

  • what the projection of common three-dimensional objects into a two-dimensional image looks like (although the AI does not actually model anything in three dimensions and works only in 2D space)
  • symmetry (e.g. in a character's face, body, clothing and shoes; in furniture, buildings, etc.)
  • continuation (e.g. an object can be partially hidden by another object and continue on the other side)
  • perspective (e.g. if one or more objects have certain vanishing points, other objects tend to get roughly the same ones)
  • light (e.g. light has colour and changes an object's colours, objects cast shadows, surfaces reflect light, things can be solid, transparent, shiny)
  • repetition (e.g. patterns repeat)

Stable Diffusion will also apply countless stereotypes it has gathered from the training images. For example, if you use "princess" in the prompt, chances are high that you will get a beautiful young female wearing jewellery and a crown, although you didn't specify any of that. Getting away from such stereotypes can be a challenge. In simple cases, a modification of the prompt and the negative prompt might suffice; in other cases it will require trickier methods.


Styles[edit]

You can save your current prompt as a style. In Stable Diffusion's terminology, a style is nothing but a simple way to reuse prompts or prompt parts. You can reuse a style simply by selecting it from a drop-down list and clicking "apply", which will append it to your current prompt. If you use the special string {prompt} in your style, it will insert your current prompt into that position, rather than appending the style to your current prompt.

Sampling steps[edit]

Stable Diffusion uses a loop of sampling steps to gradually change a starting image consisting of noise into the final result. The user can control how many steps they want to use. This is also called the step count. The default value is 20.

The larger the step count, the smaller the change each step does, and the smaller the step count, the larger the change each step does. The first steps always change the most, and further steps change less and less. More steps can produce more detail and can help improve some issues. But more steps require significantly more calculation time and can also produce certain issues such as ugly artifacts. So a higher step count is not always good.

Tip: You can generate images at medium step counts, e.g. 15-20, for greater speed. Once you have an image you like, you can re-generate it (with the exact same seed, prompt, etc.) but with more sampling steps (e.g. 25-50) for better quality. However, this strategy does not work with very low step counts, because then the image will change significantly when you increase the step count.

Tip 2: The results from ancestral samplers such as "Euler a" tend to change significantly if you add or subtract as little as 1 or 2 from the step count (unless the step count is already quite high). So you can experiment with the sampling steps to see if you can get a variant that you like better.
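As a toy illustration of why early steps matter most, each step can be thought of as removing a fixed fraction of the remaining "noise". The numbers here are purely illustrative and not the real sampler:

```python
def toy_denoise(start_noise, steps, rate=0.3):
    """Toy model: each step removes a fixed fraction of the remaining noise,
    so early steps change the image the most and later steps less and less."""
    value = start_noise
    step_changes = []
    for _ in range(steps):
        change = value * rate
        value -= change
        step_changes.append(change)
    return value, step_changes
```

With toy_denoise(1.0, 5), the first step changes the value by 0.3 and every later step by 30% less than the one before, mirroring how real samplers converge.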

Sampling methods[edit]

Sampling methods, or samplers for short, are difficult to explain. They influence the output and each has its own characteristics, such as "creativity", detail, blurriness, etc. Samplers also differ in speed. The result of different samplers is highly dependent on the number of sampling steps, and with lower step counts, the difference of a single step can considerably change the output. When a very high number of sampling steps is used (such as 50 or 100), most samplers converge towards the same output. Their differences show mainly in the low to mid range of sampling steps.

Perhaps the most popular samplers are:

  • Euler a (this is currently the default and probably the most used sampler)
  • DPM2 Karras
  • DPM++ 2M Karras (popular for quality results, but at the cost of higher VRAM compared to Euler a)
  • DDIM (works well with as little as 10 sampling steps, making it fast. A useful choice if you want to run big batches in a reasonable time)

Batches and batch counts[edit]

Batches can make the AI produce more than one output image in one go, and are controlled by two numbers:

  • The batch count sets how many batches are produced, one after another (sequentially).
  • The batch size sets how many output images are produced (and displayed) in parallel within each batch. This is useful to watch several output images (with different seeds) develop simultaneously.

These two can be combined. For example, if the batch count is set to 3 and the batch size is set to 4, Stable Diffusion will create a total of 12 images: a first batch of 4, a second batch of 4, and a third batch of 4.

CFG scale[edit]

The CFG scale (classifier-free guidance scale) controls how strongly the AI will apply its algorithm. Too small a CFG scale will result in images that have low contrast and little to do with the prompt. They tend to be aesthetically beautiful, but in the way of abstract art, not realism. Too great a CFG scale will produce unappealing images with artifacts, exaggerated features, oversaturated colours and too much contrast. The default value is 7.

For the best output, it is crucial to find the right CFG scale value, neither too low nor too high. This will take experimentation. Some samplers work better at low CFG scale ranges and other samplers at higher ones. The best value will also depend on the model, the prompt, and other factors.


Seed[edit]

The seed feeds the pseudorandom number generator that produces the initial noise. When the seed is set to -1, a random seed is used for each created image. The actual seed that was used is saved with the image. If a user likes a particular output image, they can copy this seed number and use it as an input seed to experiment with how variations of the model, the prompt, the negative prompt, the number of sampling steps, the sampling method or the CFG scale influence the output.

Variation seed[edit]

This feature is optional and by default off. It works in both txt2img and img2img mode and is activated by clicking the "Extra" checkbox near the seed input box.

With this feature, Stable Diffusion will use a mix of the noise generated by two seeds, the main seed and the variation seed. This can be useful when you have generated an image that you like and want to improve it by creating variations. These variations can use the same prompt and other settings, but will generate a different output image because they are starting from a slightly different noise.

In a typical use case, you will use a fixed seed as your main seed, taken from an image that you generated with a random seed. Set the Variation seed to -1 to mix the main seed and a random variation seed, giving you an unpredictable variation each time an image is generated. To create many variations, use the Batch feature (see below). The Variation strength determines the ratio of the mix (between 0 and 1). For small variations, try a variation strength of 0.05 or 0.1. When the variation strength is too high, the output image will be as different as if you had used a random seed for your main seed, so while such a setting is possible, it makes little sense.

Unfortunately, it is not possible to repeat the process to refine the image further. The blended noise does not correspond to any single seed, so the result of a variation cannot itself be used as a new main seed.
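The idea of mixing two seeds' noise can be sketched as follows. This is a simplified linear blend with hypothetical function names; the real implementation interpolates the actual latent noise tensors:

```python
import random

def seed_noise(seed, n=4):
    """Deterministic pseudo-noise from a seed (stand-in for the initial latent noise)."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

def blended_noise(main_seed, variation_seed, strength, n=4):
    """Blend the two seeds' noise: strength 0 keeps the main seed's noise,
    strength 1 uses only the variation seed's noise."""
    main = seed_noise(main_seed, n)
    var = seed_noise(variation_seed, n)
    return [(1 - strength) * m + strength * v for m, v in zip(main, var)]
```

At a small strength such as 0.05 or 0.1, the blended noise stays close to the main seed's noise, which is why the output is only a slight variation of the original image.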

Quality features[edit]

Restore faces[edit]

Restore faces is a feature to improve the quality of faces.

Highres. fix[edit]

By default, txt2img makes horrible images at higher resolutions. The output will typically feature the prompt's main object/subject/character multiple times instead of once, and combine these parts in a messy, jumbled composition. This is because the AI's model was trained on 512x512 pixel images (for v1-5). The Stable Diffusion algorithm works best at that same resolution, and really poorly when the resolution is 1.5 times that or more.

Highres. fix is a feature to work around this problem. Highres. fix is effectively like a chain of txt2img, resizing and img2img (see below). With highres. fix, the AI generates the image at a lower resolution, upscales it, and uses img2img with denoising to add details at high resolution. Highres. fix has a setting "Denoising strength" that works the same as the denoising strength in img2img mode: the higher it is set, the less detail of the input image will be used. A relatively low value will work best in order not to lose too much from the original image.

The Settings have an option to save a copy of the image before applying highres fix.

Note: Highres. fix is not upscaling. Probably the best method to create high resolution AI art of good quality is to

  1. use the normal txt2img mode to create the best image you can at low resolution,
  2. and then use either of:
      • img2img at the same resolution (to refine the image and add details), followed by upscaling (see below) to increase the resolution, or
      • img2img with resizing to a higher resolution (to refine the image and add details at that resolution).

This way, the user gets more control over the two individual steps and can achieve better quality output. Highres. fix is simply a method to automate the above process. It has the advantage of less manual work, but the disadvantage of less control and typically poorer output. If you generate a lot of images in batch mode, highres. fix will slow your process down because it costs time for every image, even when the high resolution isn't actually needed.


Prompt matrix[edit]

The script Prompt matrix can be used to experiment how variations of the prompt influence the output. Prompt matrix will operate on a prompt such as "a busy city street in a modern city | illustration | cinematic lighting" and produce a grid of the four combinations:

  • "a busy city street in a modern city"
  • "a busy city street in a modern city, illustration"
  • "a busy city street in a modern city, cinematic lighting"
  • "a busy city street in a modern city, illustration, cinematic lighting"

The output images will all use the same seed, so that only the prompt difference will show its effect. With this script, the Batch count and Batch size should be set to 1, to avoid generating the same output multiple times.

2 prompt suffixes produce 4 combinations, 3 prompt suffixes produce 8, and 4 prompt suffixes produce 16 (2^n combinations for n suffixes).
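The combination logic of Prompt matrix can be reproduced in a few lines of Python (an illustrative re-implementation, not the script's actual code):

```python
from itertools import product

def prompt_matrix(prompt):
    """Expand 'base | option1 | option2' into all 2**n on/off combinations of the options."""
    base, *options = [part.strip() for part in prompt.split("|")]
    combos = []
    for mask in product([False, True], repeat=len(options)):
        parts = [base] + [opt for opt, on in zip(options, mask) if on]
        combos.append(", ".join(parts))
    return combos
```

For "a busy city street in a modern city | illustration | cinematic lighting" this yields exactly the four prompts listed above.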

Advanced prompt matrix[edit]

The script Advanced prompt matrix works similarly to Prompt matrix, but has a different syntax, produces no grid, and can be used with the batch count to create all combinations of the prompt matrix multiple times. When using a batch count > 1, each prompt variation will be generated for each seed. The value of batch size is ignored.

With this script, a prompt of "a <corgi|cat> wearing <goggles|a hat>" produces 4 output images for the prompts:

  • "a corgi wearing goggles"
  • "a corgi wearing a hat"
  • "a cat wearing goggles"
  • "a cat wearing a hat".
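The <corgi|cat> expansion can be sketched like this (again an illustrative re-implementation, not the script's actual code):

```python
import re
from itertools import product

def expand_prompt(prompt):
    """Expand every '<a|b|...>' group into all combinations of the choices."""
    groups = re.findall(r"<([^<>]+)>", prompt)
    template = re.sub(r"<[^<>]+>", "{}", prompt)
    choice_lists = [group.split("|") for group in groups]
    return [template.format(*choices) for choices in product(*choice_lists)]
```

The first group varies slowest, so the output order matches the list above.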

Prompts from file or textbox[edit]

The script Prompts from file or textbox allows you to generate a batch of images from pre-written prompts. The list of prompts can be loaded from a file or edited in a textbox in the user interface.

Dynamic Prompts[edit]

The extension Dynamic Prompts implements an expressive template language for random or combinatorial prompt generation, along with features to support deep wildcard directory structures.

X/Y plot[edit]

The script X/Y plot creates a grid of images with varying parameters. You can use it to test the variation of practically any parameter Stable Diffusion has. You can select one or two parameters, which will be used for the rows and columns (X and Y) of the grid.

  • Number ranges can be specified like "1-5", which means the plot will use the values 1, 2, 3, 4, 5.
  • For ranges you can optionally specify an increment in round brackets like:
    • 1-5 (+2) = 1, 3, 5
    • 10-5 (-3) = 10, 7
    • 1-3 (+0.5) = 1, 1.5, 2, 2.5, 3
  • Alternatively, for ranges you can specify in square brackets how many values you want:
    • 1-10 [5] = 1, 3, 5, 7, 10
    • 0.0-1.0 [6] = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0
  • Some parameters take comma-separated words as values. For example, you can choose "Sampler" and give it the values "euler a, dpm2, dpm adaptive, dpm2 a karras, ddim" to generate output for each of these 5 samplers.
  • The parameter "Prompt S/R" does search-and-replace on the prompt. It takes a list of two or more words (or phrases). It searches the prompt for the first word and replaces all instances of it with each of the other entries from the list. For example, if your prompt is "a man holding an apple, 8k clean" and you select Prompt S/R and give it the value "an apple, a watermelon, a gun", the plot will use the prompt variants:
    • "a man holding an apple, 8k clean"
    • "a man holding a watermelon, 8k clean"
    • "a man holding a gun, 8k clean"
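The range syntax can be sketched with a small parser that reproduces the examples above. This is an illustration, not the script's actual code; note that integer ranges with the [n] form are truncated to whole numbers, matching the "1-10 [5]" example:

```python
import re

def expand_range(spec):
    """Expand X/Y plot range syntax: '1-5', '1-5 (+2)', '10-5 (-3)', '1-10 [5]'."""
    m = re.fullmatch(r"([\d.]+)-([\d.]+)(?:\s*\(([+-][\d.]+)\)|\s*\[(\d+)\])?", spec.strip())
    start, end = float(m.group(1)), float(m.group(2))
    is_int = "." not in spec  # no decimal point anywhere: treat as an integer range
    if m.group(4):  # "[n]": n evenly spaced values
        n = int(m.group(4))
        values = [start + (end - start) * i / (n - 1) for i in range(n)]
    else:  # "(+s)" / "(-s)" increment; plain "a-b" steps by 1
        step = float(m.group(3)) if m.group(3) else 1.0
        values, current = [], start
        while (step > 0 and current <= end) or (step < 0 and current >= end):
            values.append(current)
            current += step
    return [int(v) for v in values] if is_int else values
```

For example, expand_range("10-5 (-3)") gives [10, 7] and expand_range("1-10 [5]") gives [1, 3, 5, 7, 10].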

Special prompt features[edit]

Prompt editing[edit]

Prompt editing allows you to change the prompt after a certain number of steps. The AI will begin its process with an initial prompt, and then, mid-way after a certain number of steps, switch to another prompt. This makes it possible to blend two concepts, and can work better than using both concepts in the same prompt, where they conflict with each other. It also allows one prompt to influence the beginning of the process (where foggy colour patches are formed) and the other prompt to influence the end of the process (where details are formed). The base syntax for this is: [from:to:when]

Example: a [fantasy:cyberpunk:16] landscape

  • At start, the model will be drawing a fantasy landscape.
  • After step 16, it will switch to drawing a cyberpunk landscape, continuing from where it stopped with fantasy.
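A sketch of how such a prompt could be resolved for a given sampling step (illustrative only, not the web UI's parser):

```python
import re

def prompt_at_step(prompt, step):
    """Resolve '[from:to:when]' syntax: up to and including step 'when' the first
    prompt is used, afterwards the second (matching the example above)."""
    def swap(match):
        before, after, when = match.group(1), match.group(2), int(match.group(3))
        return before if step <= when else after
    return re.sub(r"\[([^:\[\]]*):([^:\[\]]*):(\d+)\]", swap, prompt)
```

For example, prompt_at_step("a [fantasy:cyberpunk:16] landscape", 10) gives "a fantasy landscape", while step 17 gives "a cyberpunk landscape".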

Alternating Words[edit]

Alternating Words is similar to prompt editing, but here the AI swaps between two prompt variants every other step (or likewise with three or more variants). For example, the prompt "[cow|horse] in a field" will alternate between "a cow in a field" and "a horse in a field", creating an image of an animal that looks somewhat like a cross between a cow and a horse.
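A matching sketch for the alternating syntax (illustrative; the chosen variant simply cycles with the step number):

```python
import re

def alternate_at_step(prompt, step):
    """Resolve '[a|b|...]' alternating-words syntax for a given sampling step.
    The pattern requires at least one '|', so it does not clash with the
    '[from:to:when]' prompt-editing syntax."""
    def pick(match):
        options = match.group(1).split("|")
        return options[step % len(options)]
    return re.sub(r"\[([^\[\]|]+(?:\|[^\[\]|]+)+)\]", pick, prompt)
```

For example, alternate_at_step("[cow|horse] in a field", 0) gives "cow in a field" and step 1 gives "horse in a field".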

Composable Diffusion[edit]

Composable Diffusion is a method to allow the combination of multiple prompts using an uppercase AND, such as: "a cat AND a dog". This is different from "a cat and a dog", which is just one prompt and will most likely create an image of two animals, while "a cat AND a dog" will create an image of one animal that looks, sort of, half-cat, half-dog.

Weights can be added, for example: "a cat :1.2 AND a dog AND a penguin :2.2". Very low weights, such as below 0.1, have a very small effect on the outcome. This can be useful to fine-tune an already good prompt, modifying it just a little but not too much.
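Splitting such a prompt into its weighted sub-prompts can be sketched as follows (illustrative; the function name is hypothetical):

```python
def split_composable(prompt):
    """Split a Composable Diffusion prompt on uppercase AND into (text, weight)
    pairs; a trailing ':w' sets the weight, which defaults to 1.0."""
    parts = []
    for chunk in prompt.split(" AND "):
        text, sep, weight = chunk.rpartition(":")
        if sep and weight.strip().replace(".", "", 1).isdigit():
            parts.append((text.strip(), float(weight)))
        else:
            parts.append((chunk.strip(), 1.0))
    return parts
```

For example, split_composable("a cat :1.2 AND a dog AND a penguin :2.2") returns [("a cat", 1.2), ("a dog", 1.0), ("a penguin", 2.2)].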


Wildcards[edit]

The extension Wildcards allows you to use the syntax "__name__" in your prompt to get a random line from a file named name.txt in the wildcards directory. This way, you can introduce an element of surprise in your prompt and control the possible values for each wildcard parameter.


Unprompted[edit]

Unprompted is a highly modular extension that allows you to include various shortcodes in your prompts. You can pull text from files, set up your own variables, process text through conditional functions, and much more.

img2img features[edit]


With img2img, nearly everything works the same as with txt2img, but additionally the output is influenced by an input image and a denoising strength. The denoising strength (between 0 and 1) controls how strongly the input image is denoised before the AI runs its algorithm. The smaller the denoising strength, the more closely the output will match the input image, and the larger the denoising strength, the more the output will differ from the input image.

The denoising strength is critical, and users will often have to adjust it up and down as needed for the particular case. When it is set to a low value (such as 0.3), the output will remain close to the original, but the quality will often be poor (blurry, bad features). When it is set to a high value (such as 0.8), the output will differ dramatically, it will often not be what the user wanted, and the quality can be either exceptionally good or exceptionally bad. When the input image is bad, it is easy to improve it, but the better the input image, the harder it can be to find a denoising strength setting that will improve and not deteriorate it. The same also applies to inpainting when using masked content = original (see below).
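As a rough sketch of the relationship, the denoising strength can be thought of as setting what fraction of the sampling steps is re-run on the noised input image. This is an approximation for illustration, not the exact implementation:

```python
def img2img_effective_steps(total_steps, denoising_strength):
    """Sketch: img2img noises the input image to a depth proportional to the
    denoising strength and then runs roughly that fraction of the sampling
    steps, so low strength means few, small changes and high strength means
    an almost fresh generation."""
    return max(1, round(total_steps * denoising_strength))
```

At a denoising strength of 0.3 with 20 steps, roughly 6 steps are re-run; at 0.8, roughly 16, which is why the output then differs so dramatically from the input.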

All of the following grids of images were rendered using the same input image, the same prompt "a nude woman lying in a meadow with wildflowers" and the exact same settings, except for the denoising strength:

Challenges with img2img[edit]

One might think that if txt2img produces pretty good images from pure noise, then img2img should produce images at least as good from a mixture of an already fairly good input image plus a certain amount of noise. But that is unfortunately not the case. Typically, img2img will produce output that is inferior to txt2img output: perhaps 10% of the image may improve, but at the same time 90% of the image will deteriorate. There are workarounds for this problem, such as:

  • Using inpainting (see below) to mask an area in the image to be changed, while the rest of the image remains unchanged.
  • Adjusting the denoising strength for each inpainting job to tell the AI how closely it should stay with the input, or how boldly it should create new image material.
  • Using an external image editor to make manual blends of an output image with the input image, accepting only those image areas that you actually find to be an improvement.
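The manual-blend workaround in the last point amounts to a per-pixel alpha blend. A minimal sketch in plain Python (grayscale pixel lists for brevity; a real blend would operate on RGB images in an image editor or with an imaging library):

```python
def blend(input_px, output_px, mask):
    """Keep the input where mask is 0, take the img2img output where mask is 1.

    Fractional mask values give a soft transition at the seam.
    """
    return [i * (1 - m) + o * m for i, o, m in zip(input_px, output_px, mask)]

# Accept the AI's output only in the right half of a 4-pixel row:
print(blend([10, 10, 10, 10], [200, 200, 200, 200], [0, 0, 1, 1]))
# [10, 10, 200, 200]
```

In practice you would paint the mask with a soft brush so the seam between accepted and rejected areas is invisible.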
img2img and models[edit]

Some models can produce excellent images in txt2img mode, but do a much worse job in img2img mode, especially for inpainting and outpainting. RunwayML published, alongside the 1.5 release on Hugging Face, a special inpainting model that works better than the standard 1.5 model for these tasks.

Color drift issue[edit]

Stable Diffusion's standard img2img feature has the issue that the colours in the output tend to become slightly more purple. This is especially noticeable when img2img is applied iteratively. The AUTOMATIC1111 user interface has the option "Apply color correction to img2img results to match original colors" to fix the color drift issue.
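The idea behind such a correction can be sketched as matching the output's per-channel colour statistics to the input's. The actual webui implementation uses full histogram matching; this simplified mean-shift version only illustrates the principle:

```python
def correct_channel(output_ch, reference_ch):
    """Shift one color channel so its mean matches the reference image's mean.

    A crude stand-in for histogram matching: it removes a uniform color
    cast (such as purple drift) but not more complex tone shifts.
    """
    shift = sum(reference_ch) / len(reference_ch) - sum(output_ch) / len(output_ch)
    return [max(0, min(255, round(v + shift))) for v in output_ch]

# A drifted output channel is 20 levels too high; pull it back to the reference:
print(correct_channel([120, 130, 140], [100, 110, 120]))  # [100, 110, 120]
```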

Color Sketch[edit]

Color Sketch is an optional feature that adds a minimalistic control to the GUI. It allows you to sketch simple lines or color patches on top of the input image. This can help guide the AI in img2img mode. For example, the AI tends to find it difficult to add something new to an empty space in the image. You can "break the void" with a sketched blob to overcome this obstacle. The feature Color Sketch can be activated via commandline args, for normal img2img mode and/or for inpainting mode (see below).

Interrogate CLIP and Interrogate DeepBooru[edit]

Interrogate CLIP tries to guess a possible prompt from an input image. By default, it also tries to guess the artist's name and ends the prompt with "..., by <artist name>". The returned prompt can be a starting point if you want to create a somewhat similar image, either in txt2img or img2img mode. The interrogation feature is also useful for learning the AI's way of wording things.

Interrogate DeepBooru works similarly, but returns so-called Danbooru (DeepBooru/DeepDanbooru) tags from an image. These tags come from the imageboard Danbooru. They were originally used mainly to tag anime-style girl images, but also work on non-anime-style images. Danbooru tags are in English and use underscores and brackets. They can be used in Stable Diffusion prompts to create similar or derived images, and they work not only with anime-style models such as WaifuDiffusion, but with any model.
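Because Stable Diffusion's prompt syntax uses brackets for emphasis, tags containing brackets must be escaped before use in a prompt, as seen in the DeepBooru output below. A small sketch of that escaping step:

```python
def escape_tag(tag: str) -> str:
    """Escape brackets so the prompt parser treats them literally
    instead of as attention/emphasis syntax."""
    for ch in '()[]':
        tag = tag.replace(ch, '\\' + ch)
    return tag

tags = ['blonde_hair', 'canvas_(object)', 'photo_(medium)']
print(', '.join(escape_tag(t) for t in tags))
# blonde_hair, canvas_\(object\), photo_\(medium\)
```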

Input image (a photo, not AI-generated)

With this as an input image, the following prompts are returned by the AI:

Interrogate CLIP
man is being filmed by another woman in a classroom with a cross on her back and a cross on her arm, by Billie Waters
Interrogate DeepBooru
ass, barefoot, blonde_hair, blood, canvas_\(object\), chair, desk, dress, easel, injury, long_hair, multiple_girls, paintbrush, painting, painting_\(object\), photo_\(medium\), photorealistic, pussy, realistic, traditional_media

This demonstrates how the AI recognizes some things correctly while failing to recognize, or misrecognizing, others.


Inpainting[edit]

Inpainting can be used to change only a masked part of the input image, and keep the rest of the input image unchanged. Or alternatively, you can also change only the unmasked part and keep the masked part unchanged.

The unchanged parts will affect and guide the parts that are changed. There are four options for "masked content":

  • fill: the masked area is filled with a blurred grey from the edge of the masked area towards the center. This option is a good choice whenever you want to make something disappear. Fill should be used with a high denoising strength, otherwise the output remains a greyish blur.
  • original: the original pixels are used. This option is a good choice whenever you want to guide the output by the original under the mask and make smaller, less dramatic changes to it. The lower the denoising strength, the smaller the change the AI will make to the input.
  • latent noise: the masked area is filled with noise. This option is useful when you want to add something radically new, unguided by the original under the mask, although it will be guided by the unmasked parts. Latent noise is best used with a high or maximum (1) denoising strength, otherwise the output will be just noise.
  • latent nothing: the masked area is filled with flat grey. This option is somewhat similar to "fill" but with sharp contours. Latent nothing is best used with a high or maximum (1) denoising strength, otherwise the output will be just blankness.
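The four options above can be illustrated with a toy sketch of what happens to the masked pixels before denoising begins (grayscale values; "fill" is approximated here with flat grey, whereas the real option blurs inward from the mask's edge):

```python
import random

def prepare_masked(pixels, mask, mode, grey=127):
    """Return the image handed to the sampler, per 'masked content' mode."""
    result = []
    for p, m in zip(pixels, mask):
        if not m or mode == 'original':
            result.append(p)            # unmasked pixels always survive
        elif mode == 'latent noise':
            result.append(random.randint(0, 255))
        else:                           # 'latent nothing' / 'fill' (approximated)
            result.append(grey)
    return result

print(prepare_masked([10, 20, 30], [0, 1, 1], 'latent nothing'))  # [10, 127, 127]
print(prepare_masked([10, 20, 30], [0, 1, 1], 'original'))        # [10, 20, 30]
```

This makes clear why "latent noise" and "latent nothing" need a high denoising strength: the sampler starts from content that has no relation to the original image.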

See here for images that illustrate these options.

Then the masked area is denoised according to the denoising strength, and then the AI does its work, guided by both the input image and the prompt.

When latent noise or latent nothing is used with a denoising strength lower than the maximum (1), the AI will tend to use the mask contours as the contours of whatever it creates. This can be used intentionally in certain cases, but otherwise these two options work best with the maximum denoising strength.

Inpainting at a reduced width and height for "Only masked"[edit]

When the inpaint area is set to "Whole picture", the width and height of the generated output is determined by the settings for width and height, as you would expect. Typically, you will want to set these to the same as the input image's dimensions.

When the inpaint area is set to "Only masked", the behaviour is different. Then the width and height of the generated output are always identical to those of the input image, and the settings for width and height determine the resolution at which Stable Diffusion generates the intermediary output. Stable Diffusion first scales the masked area up (or down) to the width and height setting, then generates the output at that resolution, then scales it back down (or up) to match the original size, and finally merges it with the unchanged rest of the image to produce the output.

With that knowledge, you can use the behaviour to your advantage. A lower resolution can significantly reduce the memory load, speed up generation time and can also in some cases improve the results. For example, suppose you have an image at 960x1344 pixels and want to inpaint a certain detail in it, which is smaller than 512x512 pixels. You can mask the detail's area, set the inpaint area to "only masked" and reduce the width and height to 512x512 pixels (Stable Diffusion's favourite resolution). This method will often work well. If you used the full 960x1344 resolution, the process would take considerably longer, the higher resolution would be lost when the intermediary result is scaled down, and on top of that the image's quality would usually be worse because Stable Diffusion is not very good at such high resolutions.
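The scale-up/scale-down logic can be made explicit. A sketch of when "Only masked" inpainting loses pixel detail (a hypothetical helper, not part of the UI):

```python
def loses_detail(mask_w, mask_h, gen_w, gen_h):
    """True when the generated output must be scaled UP back to the mask
    size, i.e. generation ran at a lower resolution than the masked area."""
    return gen_w < mask_w or gen_h < mask_h

# A detail smaller than 512x512 inpainted at 512x512: scaled down, no loss.
print(loses_detail(400, 420, 512, 512))  # False
# A large 700x900 area inpainted at 512x512: scaled up, detail is lost.
print(loses_detail(700, 900, 512, 512))  # True
```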

If you use "Only masked" and reduce the width and height too much, to values smaller than the masked area, you will experience a loss of pixel detail because the output is then scaled up instead of scaled down.

If you are inpainting a large area of the image, using a reduced width and height will speed up the process too, but at a loss of pixel detail because the output is then scaled up instead of scaled down.

Challenges with inpainting[edit]

The original image's painting style (or photographic properties) will strongly affect the output as the AI will want to match it with the generated output.

Inpainting can sometimes be a joy and sometimes a real hardship. Whenever Stable Diffusion recognizes something in the original image, it will tend to inpaint matching output with relatively little difficulty, often producing surprising variations that may, or may not, be an improvement. Whenever Stable Diffusion misrecognizes what is in the original image, wild horses couldn't drag it to inpaint what the user wants.

The way the outside of the masked area affects the output for the masked area can be a challenge for inpainting. The issue described above, "Problems with multiple characters", also applies to inpainting. For example, suppose you have two different characters in the picture. Inpainting one character, or certain body parts of one, will tend to generate output that looks similar to the features of the other, even if the prompt explicitly asks the AI for different output. This applies to everything: hair, clothing, states of undress, etc.

One trick to overcome the problem is to cover up whatever is disturbing the output. Once you got a satisfactory result for the inpainted area, you can manually restore the covered-up area.

A related common problem with inpainting is blur. Stable Diffusion often generates blur in the image's background because background blur is a very common feature of photographs, especially professional high-quality photos, so the AI learned it in its training. This kind of blur doesn't do much harm: parts of the image are in focus and parts are out of focus, which is fine. But when you then inpaint a larger part of the foreground, you usually want that to be sharp. However, Stable Diffusion will often inpaint it blurry because it sees the blur of the background and tries to match that(!). The AI's drive to match the features of the surrounding image areas is unfortunately stronger than its knowledge about focused fore- and middle grounds. Trying to counter this inpainting blur problem via the prompt (e.g. "in focus", "sharp") and/or negative prompt ("blur", "blurry") has little effect.



Outpainting[edit]

Girl with a Pearl Earring, by Spankart based on the famous painting by Vermeer, demonstrating the use of outpainting.

Outpainting can be used to expand an image beyond its original dimensions. Outpainting can be done in any combination of the directions left, right, up and down. In the AUTOMATIC1111 webui interface, the outpainting feature is somewhat hidden in the "Inpaint" tab by using the script "Outpainting mk2" or "Poor man's outpainting".

Unlike normal image generation, outpainting seems to benefit greatly from a large step count. Outpainting is best used with a high or maximum (1) denoising strength.

RunwayML published, alongside the 1.5 release on Hugging Face, a special checkpoint finetuned for inpainting that works better than the standard 1.5 model for inpainting and outpainting.

Batch img2img[edit]

Batch img2img can be used to process multiple images in a directory.


img2img alternative test[edit]

The img2img alternative test script deconstructs a given input image using a starting "input prompt" given by the user. You need to use the right settings for the script to work; then you should be able to hit "Generate" and get back a result that is a very close approximation of the original. When this works, you can modify your prompt to generate variations of the image that keep much of it unchanged. For example, you can change the hair color, hairstyle or clothes of a person/character, or their facial expression.

For more explanation, see here.

Color Sketch[edit]

Color Sketch is a simple tool to sketch in an input image for img2img mode. This is useful to guide the AI not only via the prompt (and optionally an inpainting mask), but also visually where and in what approximate colour to make the requested change in the image.

To enable the Color Sketch feature in AUTOMATIC1111's web UI implementation, add --gradio-img2img-tool color-sketch to the commandline args.

Training the AI[edit]


Dreambooth[edit]

Dreambooth (an extension) allows you to train your own model from a given existing model and a set of training images provided by you. The new model can then be used to create art, subjects and styles based on what the AI learned from the training images.

Using Dreambooth requires ~20GB VRAM and takes a lot of GPU time. It creates a model, which is typically a quite large file of several GB.

Textual Inversion / Embedding[edit]

Textual Inversion allows you to fine-tune Stable Diffusion to objects or styles without changing the model itself (unlike Dreambooth). It uses only a few training images and is basically used "to teach the AI a new word" that stands for an object in an image, or an art style. Its result is called an embedding. Embeddings are relatively small files and can easily be shared among users.

  • Objects can be actual objects, or for example a human face. By training the software with just a handful of suitable input images, it can learn the visual essence of the object and apply that to newly generated images.
  • Likewise, by training the software with just a handful of suitable input images, it can learn the visual essence of an artistic style.

Once an embedding has been created, or downloaded and installed, one can use it simply by using the embedding's file name as a word in the prompt.


LoRA[edit]

LoRA (Low-Rank Adaptation of Large Language Models, also often spelled lora) is another method to train Stable Diffusion on one's own images. It combines features of DreamBooth and Textual Inversion. Training LoRA requires only 6 to 7 GB of VRAM. Like Textual Inversion, LoRA creates relatively small files (typically 100-300 MB) that you can use together with any model. Also like Textual Inversion, you can use LoRA to train objects or art styles.

A good way to train LoRA is to use the kohya-ss GUI.

LoRA is added to the prompt by inserting the following text at any location: <lora:filename:multiplier>, where filename is the name of the LoRA file on disk (without extension) and multiplier is a number, generally from 0 to 1, that controls how strongly the LoRA will affect the output. LoRA cannot be added to the negative prompt.
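The prompt syntax can be illustrated with a small parser sketch (the multiplier defaults to 1 when omitted; treat the exact grammar here as an assumption, not the webui's actual parser):

```python
import re

# Matches <lora:name> or <lora:name:multiplier>
LORA_TAG = re.compile(r'<lora:([^:>]+)(?::([0-9.]+))?>')

def extract_loras(prompt):
    """Return (lora_name, multiplier) pairs and the prompt with tags removed."""
    loras = [(name, float(mult) if mult else 1.0)
             for name, mult in LORA_TAG.findall(prompt)]
    cleaned = LORA_TAG.sub('', prompt).strip()
    return loras, cleaned

print(extract_loras('a nude woman in a meadow <lora:myStyle:0.7>'))
# ([('myStyle', 0.7)], 'a nude woman in a meadow')
```

Here `myStyle` is a hypothetical LoRA filename used purely for illustration.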

LoRAs for spanking scenes, with trigger words:


Hypernetworks[edit]

An X/Y plot of an anime-style woman in various different settings. This plot serves to demonstrate the usage of Hypernetworks which allows Stable Diffusion-based image generation models to imitate the art style of specific artists, even if the artist is not recognised by the original diffusion model, by applying a small neural network at various points within the larger network.

Hypernetworks are small pre-trained neural networks that steer results towards a particular direction, for example applying visual styles and motifs, when used in conjunction with a larger neural network. The Hypernetwork processes the image by finding key areas of importance such as hair and eyes, and patches them in secondary latent space. They are significantly smaller in file size than DreamBooth models, another method for fine-tuning AI diffusion models, making Hypernetworks a viable alternative to DreamBooth models in some, but not all, use-cases.

Hypernetwork training also requires only 6GB of VRAM, compared to the ~20GB VRAM required for DreamBooth training (although this VRAM requirement can be lowered using DeepSpeed). A downside to Hypernetworks is that they are comparatively less flexible and accurate, and can sometimes lead to unpredictable results. For this reason, Hypernetworks are suited towards applying visual style or cleaning up blemishes in human anatomy, while DreamBooth models are more adept at depicting specific user-defined subjects.


ControlNet[edit]

ControlNet [2] is an extension that can be used to control Stable Diffusion's output through a control image provided by the user. For example, ControlNet can use edge detection to extract contours from the control image. SD will then generate output images that try to match these contours.

With ControlNet, the task of converting a drawing into a digital painting or photorealistic image can become easier than with the standard img2img/sketch approach. With the artist's tighter control of contours, positions and poses, the AI can be permitted greater freedom to "do its magic" with colours, lighting and details, and possibly create image output that better meets what the user wants to achieve.

How does it work?[edit]

In img2img mode, an input image's pixels are used to guide the output (in addition to the prompt). ControlNet takes a related but fundamentally different approach: certain compositional features of an input image are used to influence the output, but none of its actual pixels.

ControlNet works by extracting certain features (e.g. edges, depth, pose) from the control image and feeding them through a trainable copy of the diffusion model's encoder, whose outputs are injected into the diffusion model. This happens at each diffusion step, so that while the image is gradually developing, it is constantly pulled towards the control features.

It is possible to combine the guiding effects of ControlNet and img2img if you wish, but ControlNet is especially powerful in txt2img mode, where its tight control of certain features, combined with complete pixel freedom in all other respects, can produce astounding and surprising results.

ControlNet preprocessors and models[edit]

ControlNet comes with a number of preprocessors and models. The preprocessors are optional: if your control image is already in the form the model expects (for example, an edge map), set the preprocessor to "none" and provide that image directly.

The ControlNet models are trained for certain sub-variants of the process. They are completely distinct from the normal SD models and can be used in combination with any SD model (recommended are v1.4 and v1.5 models and their derivatives).

It is also possible to use multiple ControlNet models in combination. There are special CoAdapter models (Composable Adapter) that allow different models with various conditions to be aware of each other and synergize to achieve more powerful composability.


ControlNet has the following parameters, among others:

  • Weight: How strongly ControlNet influences Stable Diffusion's generated output. Default: 1.
  • Guidance Start (T): At which point (regarding the series of diffusion steps) ControlNet begins influencing. Default: 0 = from the very beginning
  • Guidance End (T): At which point (regarding the series of diffusion steps) ControlNet ends influencing. Default: 1 = until the very end
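Guidance Start and End define a window over the sampling schedule during which ControlNet is active. A sketch of that check (a hypothetical helper for illustration):

```python
def controlnet_active(step: int, total_steps: int, start: float, end: float) -> bool:
    """Whether ControlNet influences this diffusion step.

    start/end are fractions of the schedule: 0 = first step, 1 = last step.
    """
    fraction = step / total_steps
    return start <= fraction <= end

print(controlnet_active(5, 20, 0.0, 1.0))  # True: defaults cover every step
print(controlnet_active(2, 20, 0.5, 1.0))  # False: guidance starts halfway in
```

Ending guidance early (e.g. at 0.6) lets ControlNet fix the composition in the early steps while giving the model free rein over fine detail in the late steps.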

ControlNet can be used in both txt2img and img2img mode.

  • In txt2img mode, ControlNet influences the output while everything else can be influenced as usual by the artist, e.g. by their choice of model and prompt.
  • In img2img mode, Stable Diffusion will additionally base its output on the input image provided. This input image can be, but need not be, the same as the control image. This method can be used, among other things, to improve an existing image via inpainting while using ControlNet for additional output control.


After the extension and one or more ControlNet models have been installed and SD has been restarted, a new collapsible section "ControlNet" appears in the GUI below the "Seed" section. There the user finds various settings to influence the process.

Further features[edit]


Upscaling[edit]

Upscaling can be used to scale an image up to a larger pixel size. Stable Diffusion comes with different upscalers, and has technologies such as GFPGAN and CodeFormer to improve details in human faces when upscaling. In the AUTOMATIC1111 webui interface, the upscaling feature is found in the "Extras" tab.

PNG Info[edit]

Stable Diffusion can be configured to save metadata with every image created. This is helpful when you want to continue working on an image later. With PNG Info, the metadata of any such image can be displayed.

Checkpoint Merger[edit]

An X/Y plot of algorithmically-generated AI portrait artworks depicting the aesthetics of different science-fiction subgenres, created using a custom merged Stable Diffusion AI diffusion model checkpoint featuring wd-v1-3-full.ckpt merged with F111 and Stable Diffusion V1-5 at 0.5 sigmoid, and then merged with R34_e4 at 0.25 weighted sum. This plot also serves to illustrate how the replacement of a single keyword within a text prompt can alter the aesthetic style of an AI-generated artwork.

Checkpoint Merger can be used to merge two models (checkpoints), resulting in a new model that has a combination of their abilities (and their issues).
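The "weighted sum" merge mode mentioned in the caption above can be sketched as a per-parameter interpolation between two checkpoints (shown here on plain Python floats; the real operation runs over the models' tensors):

```python
def weighted_sum(theta_a, theta_b, alpha):
    """Merge two checkpoints: result = (1 - alpha) * A + alpha * B, per key."""
    return {k: (1 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

# Toy "checkpoints" with one layer each; alpha = 0.25 keeps 75% of model A:
a = {'layer.weight': 1.0, 'layer.bias': 0.0}
b = {'layer.weight': 3.0, 'layer.bias': 2.0}
print(weighted_sum(a, b, 0.25))  # {'layer.weight': 1.5, 'layer.bias': 0.5}
```

The "sigmoid" mode works similarly but passes alpha through a sigmoid-shaped curve instead of interpolating linearly.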


Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model provided that the image content is not illegal or harmful to individuals. The freedom provided to users over image usage has caused controversy over the ethics of ownership, as Stable Diffusion and other generative models are trained from copyrighted images without the owner’s consent.

Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, compared to commercial generative AI products. Some models for Stable Diffusion come with terms of use that forbid certain kinds of output, while other models are shared freely among users and carry no such terms.


Stable Diffusion makes its source code available, along with pretrained weights. Its license prohibits certain use cases, including crime, libel, harassment, doxxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories". The user owns the rights to their generated output images, and is free to use them commercially.

See also[edit]




NSFW tutorials:

Software to download:

Other distributions:

Simple web user interfaces to try it out, require no user installation and no GPU:

Example images:

Smallwikipedialogo.png This page uses content from Wikipedia. The original article was at Stable Diffusion. The list of authors can be seen in the page history. As with Spanking Art, the text of Wikipedia is available under a copyleft license, the Creative Commons Attribution Sharealike license.