---
license: apache-2.0
tags:
- vision
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
---

# SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Zhai et al. and first released in [this repository](https://github.com/google-research/big_vision).

This model has the SoViT-400m architecture, which is the shape-optimized version presented in [Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design](https://arxiv.org/abs/2305.13035) by Alabdulmohsin et al.

Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

SigLIP is [CLIP](https://huggingface.co/docs/transformers/model_doc/clip), a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.
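
Below is a minimal PyTorch sketch of this pairwise sigmoid loss (illustrative only, not the original training code; the learnable temperature `t` and bias `b` follow the paper). Every image-text combination in the batch is scored with an independent binary objective, so no batch-wide softmax normalization is needed:

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(image_embeds, text_embeds, t, b):
    """Pairwise sigmoid loss for (N, D) L2-normalized image/text embeddings."""
    logits = image_embeds @ text_embeds.t() * t + b  # (N, N) similarity logits
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # independent binary loss per image-text pair, averaged over the batch size
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# toy usage with random, normalized embeddings
imgs = F.normalize(torch.randn(4, 8), dim=-1)
txts = F.normalize(torch.randn(4, 8), dim=-1)
print(sigmoid_loss(imgs, txts, t=torch.tensor(10.0), b=torch.tensor(-10.0)))
```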

A TLDR of SigLIP by one of the authors can be found [here](https://twitter.com/giffmana/status/1692641733459267713).

## Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the [model hub](https://huggingface.co/models?search=google/siglip) to look for other versions for a task that interests you.

### How to use

Here is how to use this model to perform zero-shot image classification:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# load the model and its processor
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# SigLIP was trained with texts padded to 64 tokens, hence padding="max_length"
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # probabilities; each label is scored independently, so they need not sum to 1
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
```

Alternatively, one can leverage the pipeline API, which abstracts away the complexity for the user:

```python
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
```
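
The model can also be used for image-text retrieval by comparing image and text embeddings directly. The following is a minimal sketch, assuming the `get_image_features`/`get_text_features` helpers of the Transformers SigLIP model and a toy set of candidate captions:

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# candidate captions to rank against the image
captions = ["a photo of 2 cats", "a photo of a dog", "a photo of an airplane"]

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=captions, padding="max_length", return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# cosine similarity between the image and each caption; higher = better match
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (image_embeds @ text_embeds.t()).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {captions[idx]}")
```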

For more code examples, please refer to the [documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).

## Training procedure

### Training data

SigLIP is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
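
In other words, raw pixel values are first rescaled to [0, 1] and then mapped to [-1, 1]; a small arithmetic sketch (illustrative values only):

```python
mean, std = 0.5, 0.5
for raw in (0, 128, 255):                   # example raw RGB values
    rescaled = raw / 255                    # rescale to [0, 1]
    normalized = (rescaled - mean) / std    # normalize with mean 0.5, std 0.5
    print(raw, "->", round(normalized, 4))  # 0 -> -1.0, 128 -> 0.0039, 255 -> 1.0
```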

Texts are tokenized and padded to the same length (64 tokens).
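
Both steps are applied automatically by the processor; a quick sanity check of the resulting tensor shapes (a sketch, reusing the checkpoint from the examples above):

```python
from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of 2 cats"], images=image, padding="max_length", return_tensors="pt")
print(inputs["input_ids"].shape)     # expected: torch.Size([1, 64])  -> padded to 64 tokens
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 384, 384]) -> resized to 384x384
```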

### Compute

The model was trained on 16 TPU-v4 chips for three days.

## Evaluation results

Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg"
alt="drawing" width="600"/>

### BibTeX entry and citation info

```bibtex
@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```