---
license: other
license_name: nvidia-license
license_link: https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE
language:
- en
tags:
- nvidia
- eagle
- vision
- object-detection
- grounding
- locateanything
- arxiv:2605.27365
demo: https://huggingface.co/spaces/nvidia/LocateAnything
github: https://github.com/NVlabs/Eagle/tree/main/Embodied
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen2.5-3B-Instruct
---

# LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

<p align="center">
<img src="assets/teaser.jpg" alt="LocateAnything teaser" width="100%">
</p>

## 🔗 Quick Links

* 🚀 **Online Demo**: [LocateAnything (Hugging Face Spaces)](https://huggingface.co/spaces/nvidia/LocateAnything)
* 💻 **GitHub Code**: [NVlabs/Eagle/Embodied](https://github.com/NVlabs/Eagle/tree/main/Embodied)
* 📄 **Paper**: [arXiv:2605.27365](https://arxiv.org/abs/2605.27365)


# Model Overview

### Description:

LocateAnything is a vision-language model for fast and high-quality visual grounding, enabling precise object localization, dense detection, and point-based localization across diverse domains in both Enterprise Intelligence and Physical AI. The model adopts a generalist design, supporting tasks such as referring expression grounding, multi-object detection, GUI element grounding, and text localization, with strong performance in complex and cluttered scenes.

Its core innovation, Parallel Box Decoding (PBD), predicts complete bounding box coordinates in a single parallel step rather than decoding them autoregressively, token by token. This improves efficiency while preserving geometric consistency and enables up to 2.5× higher throughput compared to prior approaches.

The model is trained on a large-scale multi-domain dataset (12M images, 138M+ queries, 785M bounding boxes) spanning natural scenes, robotics, driving, GUI interaction, and document understanding. It serves as a foundation for generalist multimodal perception and has been integrated into NVIDIA’s frontier production-grade vision-language models, such as Nemotron 3 Nano Omni, supporting grounding, GUI understanding, and multimodal agentic capabilities.

LocateAnything is developed as part of the [Eagle VLM](https://github.com/NVlabs/EAGLE) model family. This released model is for research and development only. In addition, LocateAnything contributed to [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) and [Cosmos](https://www.nvidia.com/en-us/ai/cosmos/) as part of the Computer Use and Visual Grounding features. We give special thanks to the Nemotron and Cosmos teams for their time and effort in product integration.

### Demo Videos

<p align="left">
<video src="https://huggingface.co/nvidia/LocateAnything-3B/resolve/main/assets/demo.mp4" controls="controls" width="80%">
Your browser does not support the video tag.
</video>
</p>

<p align="left">
<video src="https://huggingface.co/nvidia/LocateAnything-3B/resolve/main/assets/decoding_demo.mp4" controls="controls" width="80%">
Your browser does not support the video tag.
</video>
</p>

### License/Terms of Use:

This model is released under the [NVIDIA License](https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE) for non-commercial use, which permits use, reproduction, and modification for **academic and non-profit research purposes only**. Commercial use is **not permitted**, except by NVIDIA and its affiliates. Redistribution must retain the license and all applicable copyright and attribution notices. The model is provided **“as is” without warranty of any kind**, and users assume all associated risks.

This model is built using components from third-party models with their respective licenses:
- Language model: [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) (Qwen Research License)
- Vision encoder: [MoonViT-SO-400M](https://huggingface.co/moonshotai/MoonViT-SO-400M) (MIT License)

Models are improved using Qwen.

### Deployment Geography:

Global

### Use Case:

LocateAnything-3B is intended for developers and researchers building vision-language models and applications that require fast and precise visual localization from natural language instructions.

Supported use cases include:
- Open-set, common, and long-tail object detection
- Dense multi-object detection in cluttered scenes
- Phrase and referring-expression grounding
- Automated dataset labeling and annotation (e.g., detection, grounding, pointing)
- GUI element grounding for interactive and agentic systems
- Robotics and autonomous driving perception
- Document understanding, layout grounding, and OCR localization
- Industrial inspection, surveillance, and remote sensing applications
- Point-based localization and fine-grained spatial reasoning

### Release Date:

- GitHub [05/26/2026] via https://github.com/NVlabs/Eagle/tree/main/Embodied
- Hugging Face [05/26/2026] via https://huggingface.co/nvidia/LocateAnything-3B
- Demo [05/26/2026] via https://huggingface.co/spaces/nvidia/LocateAnything
- Webpage [05/26/2026] via https://research.nvidia.com/labs/lpr/locate-anything/
- Tech Report [05/26/2026] via https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf

## Reference(s):
- Wang et al., [LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf), NVIDIA Tech Report, 2026.
- Kimi Team, [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491), arXiv:2504.07491, 2025.
- Qwen Team, [Qwen2.5: A Party of Foundation Models](https://qwen.ai/blog?id=qwen2.5), Qwen Blog, 2024.
- Chen et al., [Pix2Seq: A Language Modeling Framework for Object Detection](https://arxiv.org/abs/2109.10852), ICLR, 2022.
- Jiang et al., [Detect Anything via Next Point Prediction](https://arxiv.org/abs/2510.12798), arXiv:2510.12798, 2025.
- Liu et al., [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499), arXiv:2303.05499, 2023.
- Lin et al., [Microsoft COCO: Common Objects in Context](https://arxiv.org/abs/1405.0312), ECCV, 2014.
- Gupta et al., [LVIS: A Dataset for Large Vocabulary Instance Segmentation](https://arxiv.org/abs/1908.03195), CVPR, 2019.
- Li et al., [ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use](https://arxiv.org/abs/2504.07981), ACM MM, 2025.

## Model Architecture:

**Architecture Type:** Transformer-based vision-language model (VLM).

**Network Architecture:** Native-resolution VLM with the following components:
- Vision encoder: MoonViT
- Language model: Qwen2.5-3B-Instruct
- Multimodal projector: MLP projector
- Output formulation: Block-based structure for visual grounding

**Number of model parameters:** 3B.

LocateAnything extends a vision-language model with Parallel Box Decoding (PBD), a block-wise multi-token prediction framework for efficient visual grounding. Instead of autoregressive coordinate generation, the model predicts complete bounding boxes and points in parallel structured units, improving decoding efficiency while preserving geometric consistency. The architecture jointly optimizes next-token prediction and multi-token prediction to balance reasoning ability and parallel inference. Training follows a four-stage pipeline: initial multimodal knowledge adaptation using captioning, VQA, OCR, and related data, followed by grounding and dense-scene localization fine-tuning.

## Input(s):

**Input Type(s):** Image and Text.

**Input Format(s):**
- Image: RGB image at its original source resolution.
- Text: Natural-language prompt or task template, such as object categories, referring expressions, GUI instructions, OCR/layout requests, or pointing queries.

**Input Parameters:**
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)

**Other Properties Related to Input:**
- Image resolutions up to 2.5K are supported in production.
- Prompts of up to 24K tokens are supported.
- The detection and grounding training stages use a maximum sequence length of 25,600 tokens.
- Inference supports up to 8,192 newly generated tokens.

## Output(s):

**Output Type(s):** Text.

**Output Format(s):**
- Text: Model-generated token sequence containing semantic labels and structured coordinate tokens, such as bounding boxes (`<box> x1, y1, x2, y2 </box>`) and points (`<box> x, y </box>`).

**Output Parameters:**
- Text: One-Dimensional (1D)
- Bounding boxes/points: Two-Dimensional (2D) spatial coordinates

**Other Properties Related to Output:**
- Outputs are organized into fixed-length blocks (length 6), including Semantic, Box, Negative, and End blocks.
- A Box block encodes quantized spatial coordinates with structural tokens; unused positions are padded with `<null>`.
- Fast Mode predicts box-aligned blocks in parallel; Slow Mode uses autoregressive decoding; Hybrid Mode defaults to parallel decoding with fallback to autoregressive decoding for format irregularity or spatial ambiguity.

Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves improved training and inference performance compared to CPU-only solutions.
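
As a rough illustration of the Box block layout listed under the output properties, the sketch below quantizes a pixel-space box or point to integers in [0, 1000] (the range assumed by the parsing helpers in the worker code of this card) and pads the result to the fixed block length of 6 with `<null>`. The token spelling (`<box>`, `<500>`, `</box>`) is inferred from the regular expressions in those helpers, and placing the padding after the closing tag is an assumption; the released tokenizer may lay blocks out differently. `quantize` and `box_block` are hypothetical helper names.

```python
BLOCK_LEN = 6      # fixed block length stated in the output properties
NULL = "<null>"    # padding token named in the output properties


def quantize(value: float, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer in [0, bins]."""
    q = round(value / size * bins)
    return max(0, min(bins, q))


def box_block(coords: list[float], width: int, height: int) -> list[str]:
    """Build a hypothetical Box block from [x1, y1, x2, y2] or [x, y] pixels."""
    if len(coords) not in (2, 4):
        raise ValueError("expected 2 (point) or 4 (box) coordinates")
    # x values are scaled by the image width, y values by the height.
    sizes = [width, height] * (len(coords) // 2)
    tokens = ["<box>"]
    tokens += [f"<{quantize(c, s)}>" for c, s in zip(coords, sizes)]
    tokens.append("</box>")
    # Assumption: unused positions are padded after the closing tag.
    tokens += [NULL] * (BLOCK_LEN - len(tokens))
    return tokens


print(box_block([96, 54, 960, 540], 1920, 1080))  # a full box fills all 6 slots
print(box_block([960, 540], 1920, 1080))          # a point leaves 2 slots to pad
```

Under these assumptions a box occupies the block exactly (two structural tokens plus four coordinates), while a point needs two padding tokens, which is consistent with the fixed length of 6.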

## Software Integration:
**Runtime Engine(s):**
* Transformers. The inference setup uses standard VLM generation with BF16 precision and KV cache. TensorRT, TensorRT-LLM, and Triton are not yet supported.

**Supported Hardware Microarchitecture Compatibility:**

* NVIDIA Ampere (e.g., A100)
* NVIDIA Blackwell
* NVIDIA Hopper (e.g., H100)
* NVIDIA Lovelace (e.g., L40, RTX 4090)

Deployment on embedded platforms such as NVIDIA Thor is possible with additional model optimization, including quantization, compression, or distillation. Other architectures may be supported depending on available memory, precision support, and software configuration.

**Supported Operating System(s):**
* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):
LocateAnything-3B: 3B-parameter research model variant evaluated in Hybrid Mode by default. Fast, Hybrid, and Slow inference modes are supported by the same model formulation.

LocateAnything-3B can be integrated into systems that require spatial grounding from natural language, such as GUI agents, robotics/embodied agents, document-understanding pipelines, OCR/text localization, and open-world detection workflows.

## Training, Testing, and Evaluation Datasets:

### Data Modality:
Image and Text. <br>
* Image <br>
* Text <br>

### Training Data Size:
**Image Training Data Size:** <br>
* 1 Million to 1 Billion Images - 12M unique images. <br>

**Text Training Data Size:** <br>
* 1 Billion to 10 Trillion Tokens - Derived from approximately 140M natural-language queries. <br>

**Data Collection Method by dataset:** <br>
- Hybrid: Human, Automated <br>
Data is collected from human-curated and open-source datasets, as well as automated ingestion of publicly available data sources.

**Labeling Method by dataset:** <br>
- Hybrid: Human, Synthetic, Automated <br>
Labeling includes original human or open-source annotations, along with model-assisted and synthetic annotation generation using Qwen3-VL, Molmo, SAM 3, and Rex-Omni, with automated post-verification.

**Properties:** The training data consists of supervised fine-tuning (SFT) datasets with multimodal inputs, primarily image-text pairs and structured annotations such as bounding boxes, points, and negative samples.

The data spans multiple domains, including grounding, open-world grounding, general and dense object detection, scene text detection, GUI understanding and grounding, document layout understanding, and OCR.

Modalities include visual inputs (images) and natural-language queries or instructions. The dataset is derived from a mixture of publicly available academic datasets, along with model-assisted and synthetic annotations. It may include publicly available and potentially copyrighted content; users are responsible for ensuring compliance with applicable usage rights.

The linguistic content primarily consists of short, task-oriented natural-language expressions, such as object categories, referring expressions, GUI instructions, OCR queries, and grounding prompts, typically in English.

## Evaluation Dataset:

**Data Collection Method by dataset:**
- Hybrid: Human, Automated

**Labeling Method by dataset:**
- Hybrid: Human, Synthetic, Automated

**Properties:** The evaluation datasets consist of publicly available benchmarks spanning visual grounding, object detection, document understanding, scene text detection, and GUI-related tasks. Modalities include image inputs paired with natural-language queries and structured annotations such as bounding boxes and points.

The evaluation suite covers both box-level and point-level grounding tasks, with approximately 48K images for box evaluation and 35K images for point evaluation across multiple datasets. These datasets span diverse domains including natural scenes, documents, aerial imagery, and human-centric interactions, enabling comprehensive assessment of localization accuracy and robustness.

Evaluation queries are typically short, task-oriented natural-language expressions such as referring phrases, object categories, and grounding prompts.

Performance is measured using box-based F1 at IoU thresholds of 0.5 and 0.95, as well as mean IoU for detection, layout, and OCR tasks. Point-based localization is evaluated based on whether predicted points fall within ground-truth segmentation masks or bounding boxes. Inference efficiency is reported in boxes per second (BPS) on a single NVIDIA H100 GPU with batch size 1.
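
The box and point metrics described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code used for the reported numbers: the greedy one-to-one matching strategy is an assumption, the function names are hypothetical, and the point check covers only the bounding-box case (not segmentation masks).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds, gts, thr=0.5):
    """F1 after greedy one-to-one matching (assumed strategy) at an IoU threshold."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        # Pick the best still-unmatched ground-truth box for this prediction.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(gts)
    return 2 * precision * recall / (precision + recall)


def point_hit(point, box):
    """True if a predicted (x, y) point falls inside a ground-truth box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


preds = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gts = [(0, 0, 10, 10), (21, 21, 31, 31)]
print(f1_at_iou(preds, gts, thr=0.5))   # two of three predictions match
print(f1_at_iou(preds, gts, thr=0.95))  # only the exact box still matches
```

In this toy case the slightly shifted second box (IoU of about 0.68) counts at the 0.5 threshold but not at 0.95, which is why the stricter threshold rewards tight localization.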

## Quantitative Evaluation Benchmarks

### General Object Detection
<p align="left">
<img src="assets/coco_lvis.png" width="700">
</p>

### Dense Object Detection
<p align="left">
<img src="assets/dense_object_detection.png" width="700">
</p>

### GUI Understanding
<p align="left">
<img src="assets/sspro.png" width="700">
</p>

### Layout Grounding and OCR
<p align="left">
<img src="assets/layout_ocr.png" width="700">
</p>

### Referring Expression Grounding
<p align="left">
<img src="assets/referring.png" width="700">
</p>

### Pointing
<p align="left">
<img src="assets/pointing.png" width="700">
</p>

## Inference:

Test Hardware: H100 and A100

We suggest using `max_new_tokens=8192` and `generation_mode="hybrid"` to avoid truncated responses and to balance speed with accuracy.

### Batch Hybrid Inference

This release includes `batch_infer.py`, `batch_utils`, and `kernel_utils` for
high-throughput detection and grounding. The `la_flash` backend is a pure
FlashAttention-varlen sparse range executor: it keeps LocateAnything's hybrid
MTP decoding path, avoids dense `[B,H,Q,K]` SDPA masks, and does not require a
custom CUDA extension build.

Use it with:

```bash
python batch_infer.py \
  --model . \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image /path/to/image.jpg \
  --query "person</c>car"
```

Benchmark on an A100 with a real 3840x2160 (4K) street image, using
`query=vehicle`, `batch_size=4`, raw PIL input, `in_token_limit=25600`, and
hybrid MTP inference:

| Backend | Attention Path | Time | Peak Reserved Memory |
| --- | --- | ---: | ---: |
| `sdpa` | Dense SDPA masks | 8.2600 s | 35.12 GB |
| `la_flash` | FlashAttention sparse range plan | 8.0314 s | 11.71 GB |

See `batch_utils/README.md` and `kernel_utils/README.md` for runtime knobs and
implementation details.
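
To put the two rows of the benchmark table in proportion, the short calculation below derives the memory and latency ratios directly from the reported figures. It uses only the numbers from the table; nothing here is a new measurement.

```python
# Figures copied from the A100 benchmark table above.
sdpa = {"time_s": 8.2600, "mem_gb": 35.12}
la_flash = {"time_s": 8.0314, "mem_gb": 11.71}

mem_ratio = sdpa["mem_gb"] / la_flash["mem_gb"]      # how many times less memory
mem_saved = 1 - la_flash["mem_gb"] / sdpa["mem_gb"]  # fraction of memory saved
speedup = sdpa["time_s"] / la_flash["time_s"]        # latency ratio

print(f"peak memory: {mem_ratio:.2f}x lower ({mem_saved:.1%} saved)")
print(f"latency:     {speedup:.3f}x")
```

On this probe the benefit of `la_flash` is almost entirely in memory: it reserves roughly a third of the peak memory of `sdpa` at nearly the same latency.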

### Installation

```bash
pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5
```

> PyTorch (`torch`) must be installed separately according to your CUDA version. See [pytorch.org/get-started](https://pytorch.org/get-started/locally/).

Optional: [MagiAttention](https://sandai-org.github.io/MagiAttention/docs/main/user_guide/install.html) (Hopper / Blackwell GPUs only, recommended for faster MTP inference):

```bash
git clone https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
```

If MagiAttention is installed, the model automatically uses it for efficient MTP block-diffusion attention. Otherwise it falls back to PyTorch SDPA, which is fully functional but slower for MTP decoding.
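
Before a long run it can help to predict which of the two attention paths described above will be taken. The pre-flight check below is only a sketch: the model's own detection logic may differ, the import name `magi_attention` is an assumption, and treating CUDA compute capability 9.x and above as "Hopper / Blackwell" is also an assumption. `pick_attention_backend` is a hypothetical helper, not part of the release.

```python
import importlib.util


def pick_attention_backend(capability: tuple[int, int], magi_installed: bool) -> str:
    """Return "magi" when MagiAttention looks usable, else "sdpa"."""
    major, _minor = capability
    # Assumption: Hopper reports compute capability 9.x, Blackwell 10.x and up.
    supported_gpu = major >= 9
    return "magi" if (magi_installed and supported_gpu) else "sdpa"


# Assumption: the package is importable under the name `magi_attention`.
magi_installed = importlib.util.find_spec("magi_attention") is not None

print(pick_attention_backend((9, 0), magi_installed))  # e.g. H100
print(pick_attention_backend((8, 0), magi_installed))  # e.g. A100, always "sdpa"
```

With PyTorch installed, the capability tuple for the current GPU can be obtained from `torch.cuda.get_device_capability()`.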

### Worker (recommended)

Below is a self-contained worker that loads the model once and serves perception queries via a unified `predict()` plus task-specific convenience methods. You can drop this class into any FastAPI / gRPC / Triton serving framework.

```python
import re
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor


class LocateAnythingWorker:
    """Stateful worker that loads the model once and serves perception queries."""

    def __init__(self, model_path: str, device: str = "cuda", dtype=torch.bfloat16):
        self.device = device
        self.dtype = dtype

        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=dtype,
            trust_remote_code=True,
        ).to(device).eval()

    @torch.no_grad()
    def predict(
        self,
        image: Image.Image,
        question: str,
        generation_mode: str = "hybrid",  # "fast" (MTP) | "slow" (NTP/AR) | "hybrid"
        max_new_tokens: int = 2048,  # pass 8192 for dense scenes, as suggested above
        temperature: float = 0.7,
        verbose: bool = True,
    ) -> dict:
        # Single-turn chat message with one image and one text prompt.
        messages = [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]}
        ]

        text = self.processor.py_apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        images, videos = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text], images=images, videos=videos, return_tensors="pt"
        ).to(self.device)

        pixel_values = inputs["pixel_values"].to(self.dtype)
        input_ids = inputs["input_ids"]
        image_grid_hws = inputs.get("image_grid_hws", None)

        response = self.model.generate(
            pixel_values=pixel_values,
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"],
            image_grid_hws=image_grid_hws,
            tokenizer=self.tokenizer,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            generation_mode=generation_mode,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            verbose=verbose,
        )

        # generate() may return a plain answer or an (answer, history, stats) tuple.
        result = {"answer": response[0] if isinstance(response, tuple) else response}
        if isinstance(response, tuple) and len(response) >= 3:
            result["history"] = response[1]
            result["stats"] = response[2]
        return result

    # ---- Convenience methods for each task ----

    def detect(self, image: Image.Image, categories: list[str], **kwargs) -> dict:
        """Object detection / document layout analysis."""
        cats = "</c>".join(categories)
        prompt = f"Locate all the instances that matches the following description: {cats}."
        return self.predict(image, prompt, **kwargs)

    def ground_single(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding (single instance)."""
        prompt = f"Locate a single instance that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_multi(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding (multiple instances)."""
        prompt = f"Locate all the instances that match the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_text(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Text grounding."""
        prompt = f"Please locate the text referred as {phrase}."
        return self.predict(image, prompt, **kwargs)

    def detect_text(self, image: Image.Image, **kwargs) -> dict:
        """Scene text detection."""
        prompt = "Detect all the text in box format."
        return self.predict(image, prompt, **kwargs)

    def ground_gui(self, image: Image.Image, phrase: str, output_type: str = "box", **kwargs) -> dict:
        """GUI grounding (box or point)."""
        if output_type == "point":
            prompt = f"Point to: {phrase}."
        else:
            prompt = f"Locate the region that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def point(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Pointing."""
        prompt = f"Point to: {phrase}."
        return self.predict(image, prompt, **kwargs)

    # ---- Utility: parse model output ----

    @staticmethod
    def parse_boxes(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate bounding boxes.

        Coordinates in model output are normalized integers in [0, 1000].
        """
        boxes = []
        for m in re.finditer(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>", answer):
            x1, y1, x2, y2 = [int(g) for g in m.groups()]
            boxes.append({
                "x1": x1 / 1000 * image_width,
                "y1": y1 / 1000 * image_height,
                "x2": x2 / 1000 * image_width,
                "y2": y2 / 1000 * image_height,
            })
        return boxes

    @staticmethod
    def parse_points(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate points."""
        points = []
        for m in re.finditer(r"<box><(\d+)><(\d+)></box>", answer):
            x, y = int(m.group(1)), int(m.group(2))
            points.append({
                "x": x / 1000 * image_width,
                "y": y / 1000 * image_height,
            })
        return points
```

### Usage Example

```python
from PIL import Image

# Assumes the LocateAnythingWorker class defined above is in scope.
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")
w, h = img.size

# Object Detection
det = worker.detect(img, ["person", "car", "bicycle"])
print("Detection:", det["answer"])

# Phrase Grounding (multiple)
result = worker.ground_multi(img, "people wearing red shirts")
print("Grounding:", result["answer"])

# Scene Text Detection
result = worker.detect_text(img)
print("Text Detection:", result["answer"])

# Pointing
pt = worker.point(img, "the traffic light")
print("Pointing:", pt["answer"])

# GUI Grounding (point)
result = worker.ground_gui(img, "the search button", output_type="point")
print("GUI Point:", result["answer"])

# Parse structured output into pixel coordinates:
# boxes come from a box-producing task, points from a pointing task.
boxes = LocateAnythingWorker.parse_boxes(det["answer"], w, h)
points = LocateAnythingWorker.parse_points(pt["answer"], w, h)
```
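
For the dataset-labeling use case, parsed boxes often need to be exported in an annotation format. The helper below is a minimal sketch that converts one dictionary in the shape returned by `parse_boxes` into a COCO-style `[x, y, width, height]` list, clamping to the image bounds first. `to_coco_bbox` is a hypothetical name and is not part of the release; the clamping and corner-sorting steps are defensive assumptions about possible model output.

```python
def to_coco_bbox(box: dict, image_width: int, image_height: int) -> list[float]:
    """Convert {"x1", "y1", "x2", "y2"} pixel corners to COCO [x, y, w, h]."""
    # Clamp each corner to the image bounds.
    x1 = min(max(box["x1"], 0.0), image_width)
    x2 = min(max(box["x2"], 0.0), image_width)
    y1 = min(max(box["y1"], 0.0), image_height)
    y2 = min(max(box["y2"], 0.0), image_height)
    # Sort the corners so width and height are never negative.
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return [left, top, right - left, bottom - top]


example = {"x1": 96.0, "y1": 54.0, "x2": 960.0, "y2": 540.0}
print(to_coco_bbox(example, 1920, 1080))
```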

### Supported Tasks & Prompt Templates

| Task | Worker Method | Output | Prompt Template |
| --- | --- | --- | --- |
| Object Detection | `worker.detect(img, [...])` | Box | `Locate all the instances that matches the following description: [CATEGORIES].` |
| Phrase Grounding (single) | `worker.ground_single(img, phrase)` | Single Box | `Locate a single instance that matches the following description: [PHRASE].` |
| Phrase Grounding (multi) | `worker.ground_multi(img, phrase)` | Multiple Boxes | `Locate all the instances that match the following description: [PHRASE].` |
| Text Grounding | `worker.ground_text(img, phrase)` | Box | `Please locate the text referred as [PHRASE].` |
| Scene Text Detection | `worker.detect_text(img)` | Box | `Detect all the text in box format.` |
| Document Layout Analysis | `worker.detect(img, [...])` | Box | `Locate all the instances that matches the following description: [CATEGORIES].` |
| GUI Grounding (box) | `worker.ground_gui(img, phrase, "box")` | Box | `Locate the region that matches the following description: [PHRASE].` |
| GUI Grounding (point) / Pointing | `worker.ground_gui(img, phrase, "point")` / `worker.point(img, phrase)` | Point | `Point to: [PHRASE].` |

`[PHRASE]` is a free-form natural-language description. `[CATEGORIES]` is a list of category names, either comma-separated or joined with `</c>` (the worker's `detect()` method uses `</c>`).

### Generation Modes

| Mode | Description | Speed | Accuracy |
| --- | --- | --- | --- |
| `fast` | MTP only, never falls back to AR | Fastest | Good for simple scenes |
| `slow` | Pure auto-regressive decoding | Slowest | Most robust |
| `hybrid` (default) | MTP first, falls back to AR on uncertain boxes, switches back after box boundary | Balanced | Best overall |
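
The control flow of `hybrid` mode in the table above can be pictured with a toy loop. This is not the model's decoder: the per-block confidence score, the threshold, and the function name `hybrid_decode` are all hypothetical, and the fallback branch merely stands in for real token-by-token generation.

```python
def hybrid_decode(blocks, threshold=0.5):
    """Toy control flow: accept a parallel block or fall back to AR for it.

    `blocks` is a list of (tokens, confidence) pairs standing in for
    parallel (MTP) proposals; both fields are hypothetical.
    """
    output, trace = [], []
    for tokens, confidence in blocks:
        if confidence >= threshold:
            # Confident proposal: accept the whole block in one step.
            output.extend(tokens)
            trace.append("mtp")
        else:
            # Uncertain proposal: stand-in for AR decoding, where a real
            # decoder would re-generate these tokens one at a time.
            for tok in tokens:
                output.append(tok)
            trace.append("ar")
        # At the block boundary the loop returns to the parallel path.
    return output, trace


box = ["<box>", "<1>", "<2>", "<3>", "<4>", "</box>"]
tokens, trace = hybrid_decode([(box, 0.9), (box, 0.2), (box, 0.8)])
print(trace)
```

In this picture, `fast` mode corresponds to always taking the first branch and `slow` mode to always taking the second.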

## Batch Utils and Kernel Utils

This repository also includes optional utilities for high-throughput detection
runs:

- `batch_infer.py`: JSONL/image-query batch inference CLI.
- `batch_utils/`: batched hybrid generation runtime. See
  `batch_utils/README.md`.
- `kernel_utils/`: LA Flash sparse range utilities. See
  `kernel_utils/README.md`.

Run a small batch inference job:

```bash
python batch_infer.py \
  --model . \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image assets/pointing.png \
  --query "the object being pointed at"
```

The batched sparse-plan decode runtime is intended for inference/evaluation and
does not support the training `labels` path. Training remains on the
MagiAttention backend.

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
