---
license: apache-2.0
tags:
- vision
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
---

# SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Zhai et al. and first released in [this repository](https://github.com/google-research/big_vision).

This model has the SoViT-400m architecture, which is the shape-optimized version presented in [Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design](https://arxiv.org/abs/2305.13035) by Alabdulmohsin et al.

Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

SigLIP is [CLIP](https://huggingface.co/docs/transformers/model_doc/clip), a multimodal model, with a better loss function. The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows further scaling up the batch size, while also performing better at smaller batch sizes.
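As a concrete illustration of the loss (a minimal sketch, not the released implementation; the function and argument names below are made up for this example), every image-text pair in a batch is treated as an independent binary classification problem, so no softmax normalization over the whole batch is needed:

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(image_embeds, text_embeds, log_temperature, bias):
    # image_embeds, text_embeds: L2-normalized embeddings of shape (batch_size, dim)
    # log_temperature, bias: learnable scalar tensors (the paper learns a temperature t and a bias b)
    logits = image_embeds @ text_embeds.T * log_temperature.exp() + bias
    # +1 on the diagonal (matching image-text pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # each pair contributes an independent binary term; normalize by the batch size
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```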

A TLDR of SigLIP by one of the authors can be found [here](https://twitter.com/giffmana/status/1692641733459267713).

## Intended uses & limitations

You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the [model hub](https://huggingface.co/models?search=google/siglip) to look for other versions on a task that interests you.

### How to use

Here is how to use this model to perform zero-shot image classification:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
```
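
Since the intended uses also include image-text retrieval, here is a minimal sketch of that use case. It assumes the `get_image_features` and `get_text_features` methods exposed by the Transformers SigLIP model; the embeddings are L2-normalized and candidate texts are ranked by cosine similarity:

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of 2 cats", "a photo of 2 dogs", "a photo of a plane"]

with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=texts, padding="max_length", return_tensors="pt"))

# L2-normalize and rank the candidate texts by cosine similarity to the image
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = (image_embeds @ text_embeds.T).squeeze(0)
for score, text in sorted(zip(similarity.tolist(), texts), reverse=True):
    print(f"{score:.3f}  {text}")
```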

Alternatively, one can leverage the pipeline API, which abstracts away the complexity for the user:

```python
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
```

For more code examples, we refer to the [documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).

## Training procedure

### Training data

SigLIP is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Preprocessing

Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

Texts are tokenized and padded to the same length (64 tokens).
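
For reference, the image preprocessing above corresponds roughly to the following torchvision transform (an illustrative sketch; the `AutoProcessor` used in the examples above applies the equivalent steps, and interpolation details may differ):

```python
from torchvision import transforms

# Resize to 384x384, scale pixel values to [0, 1], then normalize with mean/std of 0.5 per channel,
# which maps inputs to the [-1, 1] range.
image_transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```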

### Compute

The model was trained on 16 TPU-v4 chips for three days.

## Evaluation results

Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg"
alt="drawing" width="600"/>

### BibTeX entry and citation info

```bibtex
@misc{zhai2023sigmoid,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
  year={2023},
  eprint={2303.15343},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```