---
license: openrail++
base_model: stabilityai/stable-diffusion-xl-base-1.0
tags:
- stable-diffusion-xl
- stable-diffusion-xl-diffusers
- text-to-image
- diffusers
- inpainting
inference: false
---

# SD-XL Inpainting 0.1 Model Card

![inpaint-example](inpaint-examples-min.png)

SD-XL Inpainting 0.1 is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input, with the extra capability of inpainting images by using a mask.

SD-XL Inpainting 0.1 was initialized with the `stable-diffusion-xl-base-1.0` weights. The model was trained for 40k steps at resolution 1024x1024, with 5% dropping of the text-conditioning to improve classifier-free guidance sampling. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and, in 25% of cases, mask everything.
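
The zero-initialization of the extra input channels can be illustrated with a small PyTorch sketch. It is not the training code: the layer shape (a 3x3 convolution from 4 latent channels to 320 features) and the channel order (noisy latents, then mask, then masked-image latents) are assumptions made for illustration. It shows why the expanded layer initially behaves exactly like the base checkpoint.

```python
import torch
from torch import nn

# Assumed shape of the base UNet input convolution: 4 latent channels in, 320 features out.
base_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)

# Inpainting variant: 4 (noisy latents) + 1 (mask) + 4 (encoded masked image) = 9 channels.
inpaint_conv = nn.Conv2d(9, 320, kernel_size=3, padding=1)

with torch.no_grad():
    inpaint_conv.weight.zero_()                         # all input channels start at zero
    inpaint_conv.weight[:, :4].copy_(base_conv.weight)  # restore the pretrained 4-channel weights
    inpaint_conv.bias.copy_(base_conv.bias)

# A 1024x1024 image corresponds to 128x128 latents (VAE downsampling factor of 8).
latents = torch.randn(1, 4, 128, 128)
mask = torch.rand(1, 1, 128, 128)
masked_image_latents = torch.randn(1, 4, 128, 128)

# Assumed channel order for the concatenated UNet input.
unet_input = torch.cat([latents, mask, masked_image_latents], dim=1)

with torch.no_grad():
    # The zeroed channels contribute nothing, so both layers agree at initialization.
    same = torch.allclose(base_conv(latents), inpaint_conv(unet_input), atol=1e-4)

print(unet_input.shape, same)
```

Because the new channels start at zero, fine-tuning begins from the behavior of the non-inpainting checkpoint and only gradually learns to use the mask and the masked image.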

## How to use

```py
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch

# Load the fp16 weights and move the pipeline to the GPU
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

# Resize both inputs to the 1024x1024 training resolution
image = load_image(img_url).resize((1024, 1024))
mask_image = load_image(mask_url).resize((1024, 1024))

prompt = "a tiger sitting on a park bench"
generator = torch.Generator(device="cuda").manual_seed(0)  # fixed seed for reproducibility

image = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    guidance_scale=8.0,
    num_inference_steps=20,  # steps between 15 and 30 work well for us
    strength=0.99,  # make sure to use `strength` below 1.0
    generator=generator,
).images[0]
```
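
To see how `strength` interacts with `num_inference_steps`, the sketch below reproduces the step-count convention commonly used by diffusers image-to-image style pipelines. The helper name `effective_steps` is hypothetical, and the formula is an assumption about the pipeline's internals rather than a documented guarantee.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps that actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Assumed convention: only the last `strength` fraction of the schedule is executed.
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.5, 0.99, 1.0):
    print(strength, effective_steps(20, strength))
```

Under this assumption, `strength=0.99` with 20 steps skips only the very first step, so the result stays close to a full denoising run while avoiding `strength=1.0`.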

**How it works:**

`image` | `mask_image`
:-------------------------:|:-------------------------:|
<img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" alt="drawing" width="300"/> | <img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" alt="drawing" width="300"/>


`prompt` | `Output`
:-------------------------:|:-------------------------:|
<span style="position: relative;bottom: 150px;">a tiger sitting on a park bench</span> | <img src="https://huggingface.co/datasets/valhalla/images/resolve/main/tiger.png" alt="drawing" width="300"/>
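
When no ready-made mask is available, one can be drawn programmatically. The sketch below uses Pillow and follows the usual diffusers convention that white pixels are repainted and black pixels are kept. The rectangle coordinates are hypothetical and only serve as an example.

```python
from PIL import Image, ImageDraw

# Start with an all-black mask: black pixels are kept as they are.
mask_image = Image.new("L", (1024, 1024), 0)

# Paint the region to regenerate in white (hypothetical box: left, top, right, bottom).
draw = ImageDraw.Draw(mask_image)
draw.rectangle((256, 384, 768, 896), fill=255)

# Fraction of the image that will be repainted.
white_pixels = mask_image.histogram()[255]
coverage = white_pixels / (mask_image.width * mask_image.height)
print(f"{coverage:.1%} of the image is masked")
```

The resulting `mask_image` can be passed to the pipeline in place of a downloaded mask, as long as it has the same size as `image`.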

## Model Description

- **Developed by:** The Diffusers team
- **Model type:** Diffusion-based text-to-image generative model
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/LICENSE.md)
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pretrained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)).

## Uses

### Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

### Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.

## Limitations and Bias

### Limitations

- The model does not achieve perfect photorealism.
- The model cannot render legible text.
- The model struggles with more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”.
- Faces and people in general may not be generated properly.
- The autoencoding part of the model is lossy.
- When the strength parameter is set to 1 (i.e. starting in-painting from a fully masked image), the quality of the image is degraded. The model retains the non-masked contents of the image, but images look less sharp. We're investigating this and working on the next version.
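
Because the autoencoder is lossy, pixels outside the mask may not match the input exactly. One possible workaround, sketched below under the assumption that exact preservation of the unmasked area matters more than a seamless boundary, is to paste the original pixels back with Pillow. The helper name `restore_unmasked` is hypothetical, and the small solid-color images are stand-ins for a real input and pipeline output.

```python
from PIL import Image


def restore_unmasked(original: Image.Image, generated: Image.Image, mask: Image.Image) -> Image.Image:
    """Keep generated pixels where the mask is white and original pixels elsewhere."""
    mask_l = mask.convert("L").resize(original.size)
    return Image.composite(generated.resize(original.size), original, mask_l)


# Tiny synthetic stand-ins for the input image, the pipeline output, and the mask.
original = Image.new("RGB", (64, 64), (10, 20, 30))
generated = Image.new("RGB", (64, 64), (200, 100, 50))
mask = Image.new("L", (64, 64), 0)
mask.paste(255, (16, 16, 48, 48))  # white square marks the repainted region

result = restore_unmasked(original, generated, mask)
print(result.getpixel((0, 0)), result.getpixel((32, 32)))
```

A hard paste like this can leave a visible seam at the mask edge; blurring the mask before compositing is a common way to soften it.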

### Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
