---
license: openrail++
base_model: stabilityai/stable-diffusion-xl-base-1.0
tags:
- stable-diffusion-xl
- stable-diffusion-xl-diffusers
- text-to-image
- diffusers
- inpainting
inference: false
---

# SD-XL Inpainting 0.1 Model Card

![inpaint-example](inpaint-examples-min.png)

SD-XL Inpainting 0.1 is a latent text-to-image diffusion model that generates photo-realistic images from any text input and can additionally inpaint images using a mask.

SD-XL Inpainting 0.1 was initialized with the `stable-diffusion-xl-base-1.0` weights. The model was trained for 40k steps at resolution 1024x1024, with 5% dropping of the text conditioning to improve classifier-free guidance sampling. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and, in 25% of cases, mask everything.
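
The training-time choices described above can be illustrated with a small, dependency-free sketch. It is not the training code: the helper names `drop_caption` and `sample_mask` are hypothetical, and the rectangle-shaped mask is an assumption standing in for the synthetic mask generator, which is not specified here. Only the 5% caption dropout, the 25% full-mask rate, and the channel counts come from the description.

```python
import random

def drop_caption(caption, rng, drop_prob=0.05):
    """Replace the caption with an empty string with probability `drop_prob`.

    Training on some unconditional samples is what lets classifier-free
    guidance contrast conditional and unconditional predictions at sampling time.
    """
    return "" if rng.random() < drop_prob else caption

def sample_mask(height, width, rng, full_mask_prob=0.25):
    """Return a height x width grid of 0/1 values, where 1 marks a pixel to inpaint."""
    if rng.random() < full_mask_prob:
        # "Mask everything" case, used for roughly a quarter of the samples.
        return [[1] * width for _ in range(height)]
    # Assumption: one random axis-aligned rectangle stands in for the
    # unspecified synthetic mask generator.
    top = rng.randrange(height)
    left = rng.randrange(width)
    bottom = rng.randrange(top + 1, height + 1)
    right = rng.randrange(left + 1, width + 1)
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]

# Channel layout of the inpainting UNet input: the base model's 4 latent
# channels plus the 5 additional, zero-initialized ones.
LATENT_CHANNELS = 4
MASK_CHANNELS = 1
MASKED_IMAGE_LATENT_CHANNELS = 4
UNET_IN_CHANNELS = LATENT_CHANNELS + MASK_CHANNELS + MASKED_IMAGE_LATENT_CHANNELS

rng = random.Random(0)
print(repr(drop_caption("a tiger sitting on a park bench", rng)))
print(sample_mask(4, 6, rng))
print(UNET_IN_CHANNELS)
```

Because the extra input weights start at zero, the freshly initialized inpainting UNet behaves like the base checkpoint until training teaches it to use the mask and masked-image channels.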


## How to use

```py
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

# The model was trained at 1024x1024, so resize both inputs to match.
image = load_image(img_url).resize((1024, 1024))
mask_image = load_image(mask_url).resize((1024, 1024))

prompt = "a tiger sitting on a park bench"
generator = torch.Generator(device="cuda").manual_seed(0)

image = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    guidance_scale=8.0,
    num_inference_steps=20,  # steps between 15 and 30 work well for us
    strength=0.99,  # make sure to use `strength` below 1.0
    generator=generator,
).images[0]
```
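
`strength` and `num_inference_steps` interact: with `strength` below 1.0, only part of the noise schedule is actually run. The sketch below reproduces that arithmetic as we understand it from the diffusers image-to-image and inpainting pipelines; the helper name `effective_steps` is hypothetical and not part of the diffusers API, so treat it as an approximation rather than the library's implementation.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps run for a given `strength`.

    Assumption: the pipeline skips the first (1 - strength) share of the
    schedule, rounding the number of kept steps down.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    kept = min(int(num_inference_steps * strength), num_inference_steps)
    skipped = max(num_inference_steps - kept, 0)
    return num_inference_steps - skipped

for strength in (0.5, 0.99, 1.0):
    print(strength, effective_steps(20, strength))
```

Under this assumption, the settings used above (`num_inference_steps=20`, `strength=0.99`) run 19 denoising steps, while `strength=1.0` would run all 20 starting from pure noise.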

**How it works:**
`image` | `mask_image`
:-------------------------:|:-------------------------:|
<img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" alt="drawing" width="300"/> | <img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" alt="drawing" width="300"/>


`prompt` | `Output`
:-------------------------:|:-------------------------:|
<span style="position: relative;bottom: 150px;">a tiger sitting on a park bench</span> | <img src="https://huggingface.co/datasets/valhalla/images/resolve/main/tiger.png" alt="drawing" width="300"/>
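
In diffusers, white pixels in `mask_image` mark the region to repaint and black pixels mark the region to keep. As a purely conceptual illustration of that convention, the sketch below blends two tiny grayscale images in pixel space. The function `composite` is hypothetical, and the actual pipeline works on latents rather than pixels, so this shows only what the mask means, not how the model computes its output.

```python
def composite(original, generated, mask):
    """Blend two same-sized grayscale images given as lists of rows.

    `mask` holds values in [0, 1]: 1.0 takes the generated pixel (repaint),
    0.0 keeps the original pixel, and values in between mix the two.
    """
    return [
        [m * g + (1 - m) * o for o, g, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]

original = [[10, 10], [10, 10]]
generated = [[200, 200], [200, 200]]
mask = [[0.0, 1.0], [0.0, 0.5]]  # right column is (partly) repainted

print(composite(original, generated, mask))
```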

## Model Description

- **Developed by:** The Diffusers team
- **Model type:** Diffusion-based text-to-image generative model
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md)
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)).


## Uses

### Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

### Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

## Limitations and Bias

### Limitations

- The model does not achieve perfect photorealism.
- The model cannot render legible text.
- The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.
- When the `strength` parameter is set to 1 (i.e. starting inpainting from a fully masked image), the quality of the image is degraded. The model retains the non-masked contents of the image, but images look less sharp. We're investigating this and working on the next version.

### Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.