---
tags:
- text-to-image
- stable-diffusion

language:
- en
library_name: diffusers
---

# IP-Adapter-FaceID Model Card

<div align="center">

[**Project Page**](https://ip-adapter.github.io) **|** [**Paper (ArXiv)**](https://arxiv.org/abs/2308.06721) **|** [**Code**](https://github.com/tencent-ailab/IP-Adapter)
</div>

---

## Introduction

An experimental version of IP-Adapter-FaceID: instead of CLIP image embeddings, it uses face ID embeddings from a face recognition model, and it additionally uses LoRA to improve ID consistency. IP-Adapter-FaceID can generate images in various styles conditioned on a face, using only text prompts.

![results](./ip-adapter-faceid.jpg)

**Update 2023/12/27**:

IP-Adapter-FaceID-Plus: face ID embedding (for face ID) + CLIP image embedding (for face structure)

<div align="center">

![results](./faceid-plus.jpg)
</div>

**Update 2023/12/28**:

IP-Adapter-FaceID-PlusV2: face ID embedding (for face ID) + controllable CLIP image embedding (for face structure)

You can adjust the weight of the face structure to get different generations (see the IP-Adapter-FaceID-Plus usage below).
<div align="center">

![results](./faceid_plusv2.jpg)
</div>

**Update 2024/01/04**:

IP-Adapter-FaceID-SDXL: An experimental SDXL version of IP-Adapter-FaceID

<div align="center">

![sdxl_results](./sdxl_faceid.jpg)
</div>

**Update 2024/01/17**:

IP-Adapter-FaceID-PlusV2-SDXL: An experimental SDXL version of IP-Adapter-FaceID-PlusV2

**Update 2024/01/19**:

IP-Adapter-FaceID-Portrait: the same as IP-Adapter-FaceID, but for portrait generation (no LoRA, no ControlNet). Specifically, it accepts multiple face images to enhance similarity (the default is 5).
<div align="center">

![portrait_results](./faceid_portrait_sd15.jpg)
</div>

## Usage

### IP-Adapter-FaceID

First, use [insightface](https://github.com/deepinsight/insightface) to extract the face ID embedding:

```python
import cv2
from insightface.app import FaceAnalysis
import torch

app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))

image = cv2.imread("person.jpg")
faces = app.get(image)

# take the normalized ID embedding of the first detected face
faceid_embeds = torch.from_numpy(faces[0].normed_embedding).unsqueeze(0)
```
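
`app.get` returns an empty list when no face is found, so `faces[0]` will fail on bad inputs. A minimal guard (the largest-bounding-box heuristic is just one reasonable choice, not part of the original example):

```python
if not faces:
    raise ValueError("no face detected in person.jpg")

# if several faces are present, e.g. keep the one with the largest bounding box
faces = sorted(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]), reverse=True)
```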

Then, you can generate images conditioned on the face embeddings:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoencoderKL
from PIL import Image

from ip_adapter.ip_adapter_faceid import IPAdapterFaceID

base_model_path = "SG161222/Realistic_Vision_V4.0_noVAE"
vae_model_path = "stabilityai/sd-vae-ft-mse"
ip_ckpt = "ip-adapter-faceid_sd15.bin"
device = "cuda"

noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
vae = AutoencoderKL.from_pretrained(vae_model_path).to(dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None
)

# load ip-adapter
ip_model = IPAdapterFaceID(pipe, ip_ckpt, device)

# generate image
prompt = "photo of a woman in red dress in a garden"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality, blurry"

images = ip_model.generate(
    prompt=prompt, negative_prompt=negative_prompt, faceid_embeds=faceid_embeds,
    num_samples=4, width=512, height=768, num_inference_steps=30, seed=2023
)
```
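
`generate` returns a list of PIL images, so the samples can be saved directly; for example:

```python
# save each generated sample to disk
for i, image in enumerate(images):
    image.save(f"faceid_sample_{i}.png")
```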

You can also load the model as a standard IP-Adapter together with a standard LoRA:
| 143 | |
| 144 | ```python |
| 145 | import torch |
| 146 | from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoencoderKL |
| 147 | from PIL import Image |
| 148 | |
| 149 | from ip_adapter.ip_adapter_faceid_separate import IPAdapterFaceID |
| 150 | |
| 151 | base_model_path = "SG161222/Realistic_Vision_V4.0_noVAE" |
| 152 | vae_model_path = "stabilityai/sd-vae-ft-mse" |
| 153 | ip_ckpt = "ip-adapter-faceid_sd15.bin" |
| 154 | lora_ckpt = "ip-adapter-faceid_sd15_lora.safetensors" |
| 155 | device = "cuda" |
| 156 | |
| 157 | noise_scheduler = DDIMScheduler( |
| 158 | num_train_timesteps=1000, |
| 159 | beta_start=0.00085, |
| 160 | beta_end=0.012, |
| 161 | beta_schedule="scaled_linear", |
| 162 | clip_sample=False, |
| 163 | set_alpha_to_one=False, |
| 164 | steps_offset=1, |
| 165 | ) |
| 166 | vae = AutoencoderKL.from_pretrained(vae_model_path).to(dtype=torch.float16) |
| 167 | pipe = StableDiffusionPipeline.from_pretrained( |
| 168 | base_model_path, |
| 169 | torch_dtype=torch.float16, |
| 170 | scheduler=noise_scheduler, |
| 171 | vae=vae, |
| 172 | feature_extractor=None, |
| 173 | safety_checker=None |
| 174 | ) |
| 175 | |
| 176 | # load lora and fuse |
| 177 | pipe.load_lora_weights(lora_ckpt) |
| 178 | pipe.fuse_lora() |
| 179 | |
| 180 | # load ip-adapter |
| 181 | ip_model = IPAdapterFaceID(pipe, ip_ckpt, device) |
| 182 | |
| 183 | # generate image |
| 184 | prompt = "photo of a woman in red dress in a garden" |
| 185 | negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality, blurry" |
| 186 | |
| 187 | images = ip_model.generate( |
| 188 | prompt=prompt, negative_prompt=negative_prompt, faceid_embeds=faceid_embeds, num_samples=4, width=512, height=768, num_inference_steps=30, seed=2023 |
| 189 | ) |
| 190 | |
| 191 | |
| 192 | ``` |
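
In the reference implementation, `generate` also accepts a `scale` argument (default `1.0`) controlling the overall strength of the adapter conditioning; treat the exact signature as an assumption and check your checkout of the IP-Adapter repo. A sketch:

```python
# weaker identity conditioning leaves more room for the text prompt
images = ip_model.generate(
    prompt=prompt, negative_prompt=negative_prompt, faceid_embeds=faceid_embeds,
    scale=0.7, num_samples=4, width=512, height=768, num_inference_steps=30, seed=2023
)
```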

### IP-Adapter-FaceID-SDXL

First, use [insightface](https://github.com/deepinsight/insightface) to extract the face ID embedding:

```python
import cv2
from insightface.app import FaceAnalysis
import torch

app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))

image = cv2.imread("person.jpg")
faces = app.get(image)

faceid_embeds = torch.from_numpy(faces[0].normed_embedding).unsqueeze(0)
```

Then, you can generate images conditioned on the face embeddings:

```python
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler
from PIL import Image

from ip_adapter.ip_adapter_faceid import IPAdapterFaceIDXL

base_model_path = "SG161222/RealVisXL_V3.0"
ip_ckpt = "ip-adapter-faceid_sdxl.bin"
device = "cuda"

noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    add_watermarker=False,
)

# load ip-adapter
ip_model = IPAdapterFaceIDXL(pipe, ip_ckpt, device)

# generate image
prompt = "A closeup shot of a beautiful Asian teenage girl in a white dress wearing small silver earrings in the garden, under the soft morning light"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality, blurry"

images = ip_model.generate(
    prompt=prompt, negative_prompt=negative_prompt, faceid_embeds=faceid_embeds, num_samples=2,
    width=1024, height=1024,
    num_inference_steps=30, guidance_scale=7.5, seed=2023
)
```
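
SDXL at 1024×1024 can be tight on smaller GPUs even in float16. One option is diffusers' model CPU offloading (requires `accelerate`; whether it interacts cleanly with the IP-Adapter wrapper moving the pipeline to `cuda` is something to verify on your setup):

```python
# keep idle submodules on the CPU and stream them to the GPU on demand
pipe.enable_model_cpu_offload()
```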

### IP-Adapter-FaceID-Plus

First, use [insightface](https://github.com/deepinsight/insightface) to extract the face ID embedding and an aligned face image:

```python
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align
import torch

app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))

image = cv2.imread("person.jpg")
faces = app.get(image)

faceid_embeds = torch.from_numpy(faces[0].normed_embedding).unsqueeze(0)
face_image = face_align.norm_crop(image, landmark=faces[0].kps, image_size=224)  # you can also segment the face
```
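
Before generating, it can help to eyeball the aligned 224×224 crop (note it is a NumPy array in OpenCV's BGR channel order):

```python
# write the aligned crop to disk for visual inspection
cv2.imwrite("face_crop.jpg", face_image)
```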

Then, you can generate images conditioned on the face embeddings:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoencoderKL
from PIL import Image

from ip_adapter.ip_adapter_faceid import IPAdapterFaceIDPlus

v2 = False  # set to True to use IP-Adapter-FaceID-PlusV2
base_model_path = "SG161222/Realistic_Vision_V4.0_noVAE"
vae_model_path = "stabilityai/sd-vae-ft-mse"
image_encoder_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
ip_ckpt = "ip-adapter-faceid-plus_sd15.bin" if not v2 else "ip-adapter-faceid-plusv2_sd15.bin"
device = "cuda"

noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
vae = AutoencoderKL.from_pretrained(vae_model_path).to(dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None
)

# load ip-adapter
ip_model = IPAdapterFaceIDPlus(pipe, image_encoder_path, ip_ckpt, device)

# generate image
prompt = "photo of a woman in red dress in a garden"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality, blurry"

images = ip_model.generate(
    prompt=prompt, negative_prompt=negative_prompt, face_image=face_image, faceid_embeds=faceid_embeds, shortcut=v2, s_scale=1.0,
    num_samples=4, width=512, height=768, num_inference_steps=30, seed=2023
)
```
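
For PlusV2 (`v2 = True` above), `s_scale` is the face-structure weight mentioned in the update notes; sweeping it shows the trade-off between structural fidelity and prompt freedom. A sketch, assuming the `ip_model` above was built with `v2 = True`:

```python
# lower s_scale -> weaker face-structure conditioning, more prompt influence
for s_scale in (0.5, 1.0, 1.5):
    images = ip_model.generate(
        prompt=prompt, negative_prompt=negative_prompt,
        face_image=face_image, faceid_embeds=faceid_embeds,
        shortcut=True, s_scale=s_scale,
        num_samples=1, width=512, height=768, num_inference_steps=30, seed=2023,
    )
    images[0].save(f"plusv2_s_scale_{s_scale}.png")
```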

### IP-Adapter-FaceID-Portrait

First, use [insightface](https://github.com/deepinsight/insightface) to extract face ID embeddings from multiple images of the same person:

```python
import cv2
from insightface.app import FaceAnalysis
import torch

app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))

images = ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"]

faceid_embeds = []
for image_path in images:
    image = cv2.imread(image_path)  # read each listed image
    faces = app.get(image)
    faceid_embeds.append(torch.from_numpy(faces[0].normed_embedding).unsqueeze(0).unsqueeze(0))
faceid_embeds = torch.cat(faceid_embeds, dim=1)  # shape: (1, num_images, embed_dim)
```

Then, you can generate images conditioned on the face embeddings:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoencoderKL
from PIL import Image

from ip_adapter.ip_adapter_faceid_separate import IPAdapterFaceID

base_model_path = "SG161222/Realistic_Vision_V4.0_noVAE"
vae_model_path = "stabilityai/sd-vae-ft-mse"
ip_ckpt = "ip-adapter-faceid-portrait_sd15.bin"
device = "cuda"

noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
vae = AutoencoderKL.from_pretrained(vae_model_path).to(dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    feature_extractor=None,
    safety_checker=None
)

# load ip-adapter; n_cond matches the number of input face images
ip_model = IPAdapterFaceID(pipe, ip_ckpt, device, num_tokens=16, n_cond=5)

# generate image
prompt = "photo of a woman in red dress in a garden"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality, blurry"

images = ip_model.generate(
    prompt=prompt, negative_prompt=negative_prompt, faceid_embeds=faceid_embeds,
    num_samples=4, width=512, height=512, num_inference_steps=30, seed=2023
)
```


## Limitations and Bias
- The models do not achieve perfect photorealism or ID consistency.
- Generalization is limited by the training data, the base model, and the face recognition model.

## Non-commercial use
**As InsightFace pretrained models are available for non-commercial research purposes only, IP-Adapter-FaceID models are released exclusively for research purposes and are not intended for commercial use.**