---
license: apache-2.0
tags:
- vision
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
---

# SigLIP (base-sized model)

SigLIP model pre-trained on WebLI at resolution 224x224. It was introduced in the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Zhai et al. and first released in [this repository](https://github.com/google-research/big_vision).

Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

SigLIP is [CLIP](https://huggingface.co/docs/transformers/model_doc/clip), a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.
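
Concretely, every image-text pair in the batch is scored as an independent binary classification problem (matching or not), rather than through a softmax over the whole batch. A minimal sketch of the loss, following the pseudocode in the paper (`t` and `b` are the learned temperature and bias):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (n, d) L2-normalized embeddings; t, b: learned scalars."""
    logits = img_emb @ txt_emb.t() * t + b  # (n, n) pairwise similarities
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # each entry is an independent binary term; no batch-wide normalization needed
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```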

A TLDR of SigLIP by one of the authors can be found [here](https://twitter.com/giffmana/status/1692641733459267713).

## Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the [model hub](https://huggingface.co/models?search=google/siglip) to look for other versions for a task that interests you.

### How to use

Here is how to use this model to perform zero-shot image classification:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# important: pad to max_length, since the model was trained with 64-token padded text
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
```

Alternatively, one can leverage the pipeline API, which abstracts away the complexity for the user:

```python
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-224")

# load image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
```
For more code examples, we refer to the [documentation](https://huggingface.co/docs/transformers/model_doc/siglip).
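
The examples above cover zero-shot classification. For the image-text retrieval use case mentioned under intended uses, one can embed images and texts separately and rank captions by cosine similarity. A minimal sketch, assuming the `get_image_features`/`get_text_features` helpers of the Transformers model class:

```python
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs", "a photo of a plane"]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=texts, padding="max_length", return_tensors="pt"))

# normalize embeddings and rank captions by cosine similarity to the image
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ txt_emb.t()).squeeze(0)
for text, score in sorted(zip(texts, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")
```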

## Training procedure

### Training data

SigLIP is pre-trained on the English image-text pairs of the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Preprocessing

Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

Texts are tokenized and padded to the same length (64 tokens).
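
In code, the image transform amounts to the following (a minimal sketch using torchvision; in practice the `AutoProcessor` applies this for you):

```python
from PIL import Image
from torchvision import transforms

# resize to the training resolution, scale pixels to [0, 1],
# then normalize each RGB channel to [-1, 1] with mean 0.5 and std 0.5
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))  # (3, 224, 224)
```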

### Compute

The model was trained on 16 TPU-v4 chips for three days.

## Evaluation results

Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg"
alt="drawing" width="600"/>

### BibTeX entry and citation info

```bibtex
@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training},
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```