---
license: other
license_name: nvidia-license
license_link: https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE
language:
- en
tags:
- nvidia
- eagle
- vision
- object-detection
- grounding
- locateanything
- arxiv:2605.27365
demo: https://huggingface.co/spaces/nvidia/LocateAnything
github: https://github.com/NVlabs/Eagle/tree/main/Embodied
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen2.5-3B-Instruct
---

# LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

<img src="assets/teaser.jpg" alt="LocateAnything teaser" width="100%">

## 🔗 Quick Links

* 🚀 **Online Demo**: [LocateAnything (Hugging Face Spaces)](https://huggingface.co/spaces/nvidia/LocateAnything)
* 💻 **GitHub Code**: [NVlabs/Eagle/Embodied](https://github.com/NVlabs/Eagle/tree/main/Embodied)
* 📄 **Paper**: [arXiv:2605.27365](https://arxiv.org/abs/2605.27365)

# Model Overview

### Description:

LocateAnything is a vision-language model for fast and high-quality visual grounding, enabling precise object localization, dense detection, and point-based localization across diverse domains in both Enterprise Intelligence and Physical AI. The model adopts a generalist design, supporting tasks such as referring expression grounding, multi-object detection, GUI element grounding, and text localization, with strong performance in complex and cluttered scenes.

Its core innovation, Parallel Box Decoding (PBD), predicts complete bounding box coordinates in a single parallel step rather than through autoregressive token-by-token decoding, improving efficiency while preserving geometric consistency. This enables up to 2.5× higher throughput compared to prior approaches.

The model is trained on a large-scale multi-domain dataset (12M images, 138M+ queries, 785M bounding boxes) spanning natural scenes, robotics, driving, GUI interaction, and document understanding. It serves as a foundation for generalist multimodal perception and has been integrated into NVIDIA’s frontier production-grade vision-language models, such as Nemotron 3 Nano Omni, supporting grounding, GUI understanding, and multimodal agentic capabilities.

LocateAnything is developed as part of the [Eagle VLM](https://github.com/NVlabs/EAGLE) model family. This released model is for research and development only. In addition, LocateAnything contributed to [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) and [Cosmos](https://www.nvidia.com/en-us/ai/cosmos/) as part of the Computer Use and Visual Grounding features. We give special thanks to the Nemotron and Cosmos teams for their time and effort in product integration.

### Demo Videos

<video src="https://huggingface.co/nvidia/LocateAnything-3B/resolve/main/assets/demo.mp4" controls="controls" width="80%">
Your browser does not support the video tag.
</video>

<video src="https://huggingface.co/nvidia/LocateAnything-3B/resolve/main/assets/decoding_demo.mp4" controls="controls" width="80%">
Your browser does not support the video tag.
</video>

### License/Terms of Use:

This model is released under the [NVIDIA License](https://huggingface.co/nvidia/LocateAnything-3B/blob/main/LICENSE) for non-commercial use, which permits use, reproduction, and modification for **academic and non-profit research purposes only**. Commercial use is **not permitted**, except by NVIDIA and its affiliates. Redistribution must retain the license and all applicable copyright and attribution notices. The model is provided **“as is” without warranty of any kind**, and users assume all associated risks.

This model is built using components from third-party models with their respective licenses:
- Language model: [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) (Qwen Research License)
- Vision encoder: [MoonViT-SO-400M](https://huggingface.co/moonshotai/MoonViT-SO-400M) (MIT License)

Models are improved using Qwen.

### Deployment Geography:

Global

### Use Case:

LocateAnything-3B is intended for developers and researchers building vision-language models and applications that require fast and precise visual localization from natural language instructions.

Supported use cases include:
- Open-set, common, and long-tail object detection
- Dense multi-object detection in cluttered scenes
- Phrase and referring-expression grounding
- Automated dataset labeling and annotation (e.g., detection, grounding, pointing)
- GUI element grounding for interactive and agentic systems
- Robotics and autonomous driving perception
- Document understanding, layout grounding, and OCR localization
- Industrial inspection, surveillance, and remote sensing applications
- Point-based localization and fine-grained spatial reasoning

### Release Date:

- GitHub [05/26/2026] via https://github.com/NVlabs/Eagle/tree/main/Embodied.
- Hugging Face [05/26/2026] via https://huggingface.co/nvidia/LocateAnything-3B.
- Demo [05/26/2026] via https://huggingface.co/spaces/nvidia/LocateAnything.
- Webpage [05/26/2026] via https://research.nvidia.com/labs/lpr/locate-anything/.
- Tech Report [05/26/2026] via https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf

## Reference(s):
- Wang et al., [LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf), NVIDIA Tech Report, 2026.
- Kimi Team, [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491), arXiv:2504.07491, 2025.
- Qwen Team, [Qwen2.5: A Party of Foundation Models](https://qwen.ai/blog?id=qwen2.5), Qwen Blog, 2024.
- Chen et al., [Pix2Seq: A Language Modeling Framework for Object Detection](https://arxiv.org/abs/2109.10852), ICLR, 2022.
- Jiang et al., [Detect Anything via Next Point Prediction](https://arxiv.org/abs/2510.12798), arXiv:2510.12798, 2025.
- Liu et al., [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499), arXiv:2303.05499, 2023.
- Lin et al., [Microsoft COCO: Common Objects in Context](https://arxiv.org/abs/1405.0312), ECCV, 2014.
- Gupta et al., [LVIS: A Dataset for Large Vocabulary Instance Segmentation](https://arxiv.org/abs/1908.03195), CVPR, 2019.
- Li et al., [ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use](https://arxiv.org/abs/2504.07981), ACM MM, 2025.

## Model Architecture:

**Architecture Type:** Transformer-based vision-language model (VLM).

**Network Architecture:** Native-resolution VLM with the following components:
- Vision encoder: MoonViT
- Language model: Qwen2.5-3B-Instruct
- Multimodal projector: MLP projector
- Output formulation: Block-based structure for visual grounding

**Number of model parameters:** 3B.

LocateAnything extends a vision-language model with Parallel Box Decoding (PBD), a block-wise multi-token prediction framework for efficient visual grounding. Instead of autoregressive coordinate generation, the model predicts complete bounding boxes and points in parallel structured units, improving decoding efficiency while preserving geometric consistency. The architecture jointly optimizes next-token prediction and multi-token prediction to balance reasoning ability and parallel inference. Training follows a four-stage pipeline: initial multimodal knowledge adaptation using captioning, VQA, OCR, and related data, followed by grounding and dense-scene localization fine-tuning.

## Input(s):

**Input Type(s):** Image and Text.

**Input Format(s):**
- Image: RGB image input with original source resolution.
- Text: Natural-language prompt or task template, such as object categories, referring expressions, GUI instructions, OCR/layout requests, or pointing queries.

**Input Parameters:**
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)

**Other Properties Related to Input:**
- Images up to 2.5K resolution are supported in production use.
- Prompts up to 24K tokens are supported.
- Training detection and grounding stages use a maximum sequence length of 25,600 tokens.
- Inference supports up to 8,192 newly generated tokens.

## Output(s):

**Output Type(s):** Text.

**Output Format(s):**
- Text: Model-generated token sequence containing semantic labels and structured coordinate tokens, such as bounding boxes (`<box> x1, y1, x2, y2 </box>`) and points (`<box> x, y </box>`).

**Output Parameters:**
- Text: One-Dimensional (1D)
- Bounding boxes/points: Two-Dimensional (2D) spatial coordinates

**Other Properties Related to Output:**
- Outputs are organized into fixed-length blocks (length 6), including Semantic, Box, Negative, and End blocks.
- A Box block encodes quantized spatial coordinates with structural tokens; unused positions are padded with `<null>`.
- Fast Mode predicts box-aligned blocks in parallel; Slow Mode uses autoregressive decoding; Hybrid Mode defaults to parallel decoding with fallback to autoregressive decoding for format irregularity or spatial ambiguity.
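
As a rough illustration of the fixed-length block layout described above, the sketch below pads a coordinate sequence to six slots with `<null>`. This card does not specify the token order inside a block, so the layout used here (`<box>`, coordinate tokens, `</box>`, then padding) and the helper name `make_block` are assumptions for illustration only, not the model's actual tokenization.

```python
BLOCK_LEN = 6   # block length stated in this card
NULL = "<null>"  # padding token named in this card


def make_block(coords):
    """Pack 2 (point) or 4 (box) quantized coordinates into one block.

    Assumed layout: <box>, coordinate tokens, </box>, then <null> padding.
    """
    if len(coords) not in (2, 4):
        raise ValueError("expected 2 (point) or 4 (box) coordinates")
    tokens = ["<box>"] + [f"<{c}>" for c in coords] + ["</box>"]
    # Pad the remaining slots so every block has the same length.
    return tokens + [NULL] * (BLOCK_LEN - len(tokens))


print(make_block([120, 340, 560, 780]))  # box: fills all six slots
print(make_block([500, 500]))            # point: two slots are padding
```

Under this assumed layout a box fills the block exactly, while a point leaves two padded positions, which is one way a fixed block length can cover both output types.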

Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves improved training and inference performance compared to CPU-only solutions.

## Software Integration:
**Runtime Engine(s):**
* Transformers. The inference setup uses standard VLM generation with BF16 precision and KV cache. TensorRT, TensorRT-LLM, and Triton are not yet supported.

**Supported Hardware Microarchitecture Compatibility:**

* NVIDIA Ampere (e.g., A100)
* NVIDIA Blackwell
* NVIDIA Hopper (e.g., H100)
* NVIDIA Lovelace (e.g., L40, RTX 4090)

Deployment on embedded platforms such as NVIDIA Thor is possible with additional model optimization, including quantization, compression, or distillation. Other architectures may be supported depending on available memory, precision support, and software configuration.

**Supported Operating System(s):**
* Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):
LocateAnything-3B: 3B-parameter research model variant evaluated in Hybrid Mode by default. Fast, Hybrid, and Slow inference modes are supported by the same model formulation.

LocateAnything-3B can be integrated into systems that require spatial grounding from natural language, such as GUI agents, robotics/embodied agents, document-understanding pipelines, OCR/text localization, and open-world detection workflows.

## Training, Testing, and Evaluation Datasets:

### Data Modality:
Image and Text.
* Image
* Text

### Training Data Size:
**Image Training Data Size:**
* 1 Million to 1 Billion Images - 12M unique images.

**Text Training Data Size:**
* 1 Billion to 10 Trillion Tokens - Derived from approximately 140M natural-language queries.

**Data Collection Method by dataset:**
- Hybrid: Human, Automated
  Data is collected from human-curated and open-source datasets, as well as automated ingestion of publicly available data sources.

**Labeling Method by dataset:**
- Hybrid: Human, Synthetic, Automated
  Labeling includes original human or open-source annotations, along with model-assisted and synthetic annotation generation using Qwen3-VL, Molmo, SAM 3, and Rex-Omni, with automated post-verification.

**Properties:** The training data consists of supervised fine-tuning (SFT) datasets with multimodal inputs, primarily image-text pairs and structured annotations such as bounding boxes, points, and negative samples.

The data spans multiple domains, including grounding, open-world grounding, general and dense object detection, scene text detection, GUI understanding and grounding, document layout understanding, and OCR.

Modalities include visual inputs (images) and natural-language queries or instructions. The dataset is derived from a mixture of publicly available academic datasets, along with model-assisted and synthetic annotations. It may include publicly available and potentially copyrighted content; users are responsible for ensuring compliance with applicable usage rights.

The linguistic content primarily consists of short, task-oriented natural-language expressions, such as object categories, referring expressions, GUI instructions, OCR queries, and grounding prompts, typically in English.

## Evaluation Dataset:

**Data Collection Method by dataset:**
- Hybrid: Human, Automated

**Labeling Method by dataset:**
- Hybrid: Human, Synthetic, Automated

**Properties:** The evaluation datasets consist of publicly available benchmarks spanning visual grounding, object detection, document understanding, scene text detection, and GUI-related tasks. Modalities include image inputs paired with natural-language queries and structured annotations such as bounding boxes and points.

The evaluation suite covers both box-level and point-level grounding tasks, with approximately 48K images for box evaluation and 35K images for point evaluation across multiple datasets. These datasets span diverse domains including natural scenes, documents, aerial imagery, and human-centric interactions, enabling comprehensive assessment of localization accuracy and robustness.

Evaluation queries are typically short, task-oriented natural-language expressions such as referring phrases, object categories, and grounding prompts.

Performance is measured using box-based F1 at IoU thresholds of 0.5 and 0.95, as well as mean IoU for detection, layout, and OCR tasks. Point-based localization is evaluated based on whether predicted points fall within ground-truth segmentation masks or bounding boxes. Inference efficiency is reported in boxes per second (BPS) on a single NVIDIA H100 GPU with batch size 1.
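
To make the metrics above concrete, here is a minimal sketch of box IoU, box-level F1 at a given IoU threshold, and a point-in-box hit test. The greedy one-to-one matching rule and the function names `iou`, `box_f1`, and `point_hit` are assumptions chosen for illustration; the official evaluation scripts may match predictions to ground truth differently.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_f1(preds, gts, thr=0.5):
    """F1 with greedy one-to-one matching at IoU >= thr (assumed matching rule)."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        # Pick the still-unmatched ground-truth box that overlaps most.
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(gts)
    return 2 * precision * recall / (precision + recall)


def point_hit(point, box):
    """True if a predicted (x, y) point falls inside a ground-truth box."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]
```

Calling `box_f1(preds, gts, thr=0.95)` gives the stricter variant of the same metric; only the threshold changes.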

## Quantitative Evaluation Benchmarks

### General Object Detection

<img src="assets/coco_lvis.png" width="700">

### Dense Object Detection

<img src="assets/dense_object_detection.png" width="700">

### GUI Understanding

<img src="assets/sspro.png" width="700">

### Layout Grounding and OCR

<img src="assets/layout_ocr.png" width="700">

### Referring Expression Grounding

<img src="assets/referring.png" width="700">

### Pointing

<img src="assets/pointing.png" width="700">

## Inference:

Test Hardware: H100 & A100

We suggest using `max_new_tokens=8192` and `generation_mode="hybrid"` to avoid truncated responses and to balance speed with accuracy.

### Batch Hybrid Inference

This release includes `batch_infer.py`, `batch_utils`, and `kernel_utils` for
high-throughput detection and grounding. The `la_flash` backend is a pure
FlashAttention-varlen sparse range executor: it keeps LocateAnything's hybrid
MTP decoding path, avoids dense `[B,H,Q,K]` SDPA masks, and does not require a
custom CUDA extension build.

Use it with:

```bash
python batch_infer.py \
  --model . \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image /path/to/image.jpg \
  --query "person</c>car"
```

A100 4K probe, real 3840x2160 street image, `query=vehicle`,
`batch_size=4`, raw PIL input, `in_token_limit=25600`, hybrid MTP inference:

| Backend (`--attn`) | Attention path | Latency | Peak memory |
| --- | --- | ---: | ---: |
| `sdpa` | Dense SDPA masks | 8.2600 s | 35.12 GB |
| `la_flash` | FlashAttention sparse range plan | 8.0314 s | 11.71 GB |

See `batch_utils/README.md` and `kernel_utils/README.md` for runtime knobs and
implementation details.
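
The arithmetic below puts the A100 probe numbers side by side. It only restates the two table rows as relative reductions; the helper name `reduction` is introduced here for illustration, and a single probe should not be read as a general benchmark.

```python
# Values copied from the A100 4K probe table above.
sdpa = {"latency_s": 8.2600, "peak_gb": 35.12}
la_flash = {"latency_s": 8.0314, "peak_gb": 11.71}


def reduction(before, after):
    """Relative reduction as a percentage of `before`."""
    return (before - after) / before * 100


mem_saved = reduction(sdpa["peak_gb"], la_flash["peak_gb"])
lat_saved = reduction(sdpa["latency_s"], la_flash["latency_s"])
print(f"peak memory: -{mem_saved:.1f}%  latency: -{lat_saved:.1f}%")
```

In this probe the gain from `la_flash` is mostly in memory (roughly two thirds less peak usage), while latency improves by only about 3%.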

### Installation

```bash
pip install opencv-python-headless==4.11.0.86 transformers==4.57.1 numpy==1.25.0 Pillow==11.1.0 peft torchvision decord==0.6.0 lmdb==1.7.5
```

> PyTorch (`torch`) must be installed separately according to your CUDA version. See [pytorch.org/get-started](https://pytorch.org/get-started/locally/).

Optional: [MagiAttention](https://sandai-org.github.io/MagiAttention/docs/main/user_guide/install.html) (Hopper / Blackwell GPUs only, recommended for faster MTP inference):

```bash
git clone https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
```

If MagiAttention is installed, the model automatically uses it for efficient MTP block-diffusion attention. If it is not installed, the model falls back to PyTorch SDPA, which is fully functional but slower for MTP decoding.
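
Before running a long job, it can help to check which attention path is likely to be picked up. The sketch below only tests whether a module can be imported; the import name `magi_attention` and the helper `attention_backend` are assumptions made for illustration, and the model's own detection logic may differ.

```python
import importlib.util


def attention_backend(module_name: str = "magi_attention") -> str:
    """Report which attention path is likely to be used.

    `module_name` is an assumed import name for MagiAttention.
    """
    installed = importlib.util.find_spec(module_name) is not None
    return "magi_attention" if installed else "sdpa"


print(attention_backend())
```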

### Worker (recommended)

Below is a self-contained worker that loads the model once and serves perception queries via a unified `predict()` plus task-specific convenience methods. You can drop this class into any FastAPI / gRPC / Triton serving framework.

```python
import re
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoProcessor


class LocateAnythingWorker:
    """Stateful worker that loads the model once and serves perception queries."""

    def __init__(self, model_path: str, device: str = "cuda", dtype=torch.bfloat16):
        self.device = device
        self.dtype = dtype

        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=dtype,
            trust_remote_code=True,
        ).to(device).eval()

    @torch.no_grad()
    def predict(
        self,
        image: Image.Image,
        question: str,
        generation_mode: str = "hybrid",  # "fast" (MTP) | "slow" (NTP/AR) | "hybrid"
        max_new_tokens: int = 2048,
        temperature: float = 0.7,
        verbose: bool = True,
    ) -> dict:
        messages = [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]}
        ]

        text = self.processor.py_apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        images, videos = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text], images=images, videos=videos, return_tensors="pt"
        ).to(self.device)

        pixel_values = inputs["pixel_values"].to(self.dtype)
        input_ids = inputs["input_ids"]
        image_grid_hws = inputs.get("image_grid_hws", None)

        response = self.model.generate(
            pixel_values=pixel_values,
            input_ids=input_ids,
            attention_mask=inputs["attention_mask"],
            image_grid_hws=image_grid_hws,
            tokenizer=self.tokenizer,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            generation_mode=generation_mode,
            temperature=temperature,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1,
            verbose=verbose,
        )

        result = {"answer": response[0] if isinstance(response, tuple) else response}
        if isinstance(response, tuple) and len(response) >= 3:
            result["history"] = response[1]
            result["stats"] = response[2]
        return result

    # ---- Convenience methods for each task ----

    def detect(self, image: Image.Image, categories: list[str], **kwargs) -> dict:
        """Object detection / document layout analysis."""
        cats = "</c>".join(categories)
        prompt = f"Locate all the instances that matches the following description: {cats}."
        return self.predict(image, prompt, **kwargs)

    def ground_single(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding (single instance)."""
        prompt = f"Locate a single instance that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_multi(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Phrase grounding (multiple instances)."""
        prompt = f"Locate all the instances that match the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def ground_text(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Text grounding."""
        prompt = f"Please locate the text referred as {phrase}."
        return self.predict(image, prompt, **kwargs)

    def detect_text(self, image: Image.Image, **kwargs) -> dict:
        """Scene text detection."""
        prompt = "Detect all the text in box format."
        return self.predict(image, prompt, **kwargs)

    def ground_gui(self, image: Image.Image, phrase: str, output_type: str = "box", **kwargs) -> dict:
        """GUI grounding (box or point)."""
        if output_type == "point":
            prompt = f"Point to: {phrase}."
        else:
            prompt = f"Locate the region that matches the following description: {phrase}."
        return self.predict(image, prompt, **kwargs)

    def point(self, image: Image.Image, phrase: str, **kwargs) -> dict:
        """Pointing."""
        prompt = f"Point to: {phrase}."
        return self.predict(image, prompt, **kwargs)

    # ---- Utility: parse model output ----

    @staticmethod
    def parse_boxes(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate bounding boxes.

        Coordinates in model output are normalized integers in [0, 1000].
        """
        boxes = []
        for m in re.finditer(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>", answer):
            x1, y1, x2, y2 = [int(g) for g in m.groups()]
            boxes.append({
                "x1": x1 / 1000 * image_width,
                "y1": y1 / 1000 * image_height,
                "x2": x2 / 1000 * image_width,
                "y2": y2 / 1000 * image_height,
            })
        return boxes

    @staticmethod
    def parse_points(answer: str, image_width: int, image_height: int) -> list[dict]:
        """Parse model output into pixel-coordinate points."""
        points = []
        for m in re.finditer(r"<box><(\d+)><(\d+)></box>", answer):
            x, y = int(m.group(1)), int(m.group(2))
            points.append({
                "x": x / 1000 * image_width,
                "y": y / 1000 * image_height,
            })
        return points
```

### Usage Example

```python
from PIL import Image

# Assumes the LocateAnythingWorker class defined above is in scope.
worker = LocateAnythingWorker("nvidia/LocateAnything-3B")
img = Image.open("example.jpg").convert("RGB")

# Object Detection
result = worker.detect(img, ["person", "car", "bicycle"])
print("Detection:", result["answer"])

# Phrase Grounding (multiple)
result = worker.ground_multi(img, "people wearing red shirts")
print("Grounding:", result["answer"])

# Scene Text Detection
result = worker.detect_text(img)
print("Text Detection:", result["answer"])

# Pointing
result = worker.point(img, "the traffic light")
print("Pointing:", result["answer"])

# GUI Grounding (point)
result = worker.ground_gui(img, "the search button", output_type="point")
print("GUI Point:", result["answer"])

# Parse structured output into pixel coordinates.
# `result` is the most recent answer (the GUI point query), so use
# parse_points for it and parse_boxes for answers that contain boxes.
w, h = img.size
boxes = LocateAnythingWorker.parse_boxes(result["answer"], w, h)
points = LocateAnythingWorker.parse_points(result["answer"], w, h)
```
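
For the dataset-labeling use case, parsed boxes usually need to be written in an annotation format. The sketch below converts the dictionaries returned by `parse_boxes` into COCO-style annotations, where `bbox` is `[x, y, width, height]`. The helper name `to_coco_annotations` and the choice of one `category_id` per call are assumptions for illustration; mapping labels in the model answer to category ids is left out because that part of the answer format is not described here.

```python
def to_coco_annotations(boxes, image_id, category_id, start_id=1):
    """Convert parse_boxes-style dicts (x1, y1, x2, y2) to COCO annotation dicts."""
    annotations = []
    for offset, b in enumerate(boxes):
        width, height = b["x2"] - b["x1"], b["y2"] - b["y1"]
        if width <= 0 or height <= 0:
            continue  # skip degenerate boxes
        annotations.append({
            "id": start_id + offset,
            "image_id": image_id,
            "category_id": category_id,
            "bbox": [b["x1"], b["y1"], width, height],  # COCO uses [x, y, w, h]
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


example = [{"x1": 10.0, "y1": 20.0, "x2": 110.0, "y2": 70.0}]
print(to_coco_annotations(example, image_id=1, category_id=3))
```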

### Supported Tasks & Prompt Templates

| Task | Worker method | Output | Prompt template |
| --- | --- | --- | --- |
| Object Detection | `worker.detect(img, [...])` | Box | `Locate all the instances that matches the following description: [CATEGORIES].` |
| Phrase Grounding (single) | `worker.ground_single(img, phrase)` | Single Box | `Locate a single instance that matches the following description: [PHRASE].` |
| Phrase Grounding (multi) | `worker.ground_multi(img, phrase)` | Multiple Boxes | `Locate all the instances that match the following description: [PHRASE].` |
| Text Grounding | `worker.ground_text(img, phrase)` | Box | `Please locate the text referred as [PHRASE].` |
| Scene Text Detection | `worker.detect_text(img)` | Box | `Detect all the text in box format.` |
| Document Layout Analysis | `worker.detect(img, [...])` | Box | `Locate all the instances that matches the following description: [CATEGORIES].` |
| GUI Grounding (box) | `worker.ground_gui(img, phrase, "box")` | Box | `Locate the region that matches the following description: [PHRASE].` |
| GUI Grounding (point) / Pointing | `worker.ground_gui(img, phrase, "point")` / `worker.point(img, phrase)` | Point | `Point to: [PHRASE].` |

`[PHRASE]` is a free-form natural-language description; `[CATEGORIES]` is a comma-separated list (multiple categories may also be joined with `</c>`).

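### Generation Modes

| Mode | Strategy | Speed | Quality |
| --- | --- | --- | --- |
| `hybrid` (default) | MTP first, falls back to AR on uncertain boxes, switches back after box boundary | Balanced | Best overall |

The other two values of `generation_mode` are `fast` (MTP, parallel block decoding only) and `slow` (NTP, fully autoregressive decoding).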
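
The hybrid strategy can be pictured as a per-box control loop: try the parallel step first, fall back to autoregressive decoding if the result looks unreliable, then return to parallel decoding for the next box. The toy simulation below shows only that control flow. The function `hybrid_decode`, the stub decoders, and the uncertainty test are hypothetical stand-ins, not the model's actual decoding code.

```python
def hybrid_decode(num_boxes, parallel_step, ar_step, is_uncertain):
    """Toy control flow: parallel first, autoregressive fallback per box."""
    boxes, modes = [], []
    for i in range(num_boxes):
        candidate = parallel_step(i)      # one parallel (MTP) step
        if is_uncertain(candidate):
            candidate = ar_step(i)        # re-decode this box token by token
            modes.append("ar")
        else:
            modes.append("mtp")
        boxes.append(candidate)           # box boundary: the next box starts in MTP again
    return boxes, modes


# Stub decoders standing in for the real model (hypothetical).
def parallel(i):
    return None if i == 1 else (i, i, i + 10, i + 10)  # box 1 fails in parallel


def autoregressive(i):
    return (i, i, i + 10, i + 10)


boxes, modes = hybrid_decode(3, parallel, autoregressive, lambda b: b is None)
print(modes)
```

In this run only the second box takes the slower path, which is the intended trade-off: most boxes are decoded in parallel and the autoregressive path is reserved for the uncertain ones.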

## Batch Utils and Kernel Utils

This repository also includes optional utilities for high-throughput detection
runs:

- `batch_infer.py`: JSONL/image-query batch inference CLI.
- `batch_utils/`: batched hybrid generation runtime. See
  `batch_utils/README.md`.
- `kernel_utils/`: LA Flash sparse range utilities. See
  `kernel_utils/README.md`.

Run a small batch inference job:

```bash
python batch_infer.py \
  --model . \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image assets/pointing.png \
  --query "the object being pointed at"
```

The batched sparse-plan decode runtime is intended for inference/evaluation and
does not support the training `labels` path. Training remains on the
MagiAttention backend.

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).