# Batch Utils

`batch_utils` contains the optional batched hybrid generation runtime for
LocateAnything. It keeps the model loading, tokenization, image feature caching,
sampling, and scheduler code used by `batch_infer.py` and the detection
experiments.

## Runtime Modes

- `LA_FLASH_ATTN=sdpa`: stock PyTorch SDPA path.
- `LA_FLASH_ATTN=eager`: eager attention path for debugging.
- `LA_FLASH_ATTN=magi`: MagiAttention path when MagiAttention is installed.
- `LA_FLASH_ATTN=la_flash`: LA Flash sparse range backend from `kernel_utils`.

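As a minimal sketch of how a launcher script might pick one of these modes, the snippet below reads `LA_FLASH_ATTN` from the environment and rejects values outside the four listed above. The helper name `resolve_attn_mode` and the case-insensitive matching are assumptions for illustration; the runtime's own parsing may differ.

```python
import os

# Mode names taken from the Runtime Modes list.
KNOWN_ATTN_MODES = ("sdpa", "eager", "magi", "la_flash")


def resolve_attn_mode(env=None):
    """Return the requested LLM attention mode, or raise on an unknown value.

    Hypothetical helper: not part of batch_utils.
    """
    env = os.environ if env is None else env
    # Assumption: values are matched case-insensitively; "sdpa" is the
    # documented default when the variable is unset.
    mode = env.get("LA_FLASH_ATTN", "sdpa").strip().lower()
    if mode not in KNOWN_ATTN_MODES:
        raise ValueError(
            f"LA_FLASH_ATTN={mode!r} is not one of {', '.join(KNOWN_ATTN_MODES)}"
        )
    return mode
```

Failing early on a typo such as `la-flash` avoids discovering a mismatched backend only after the model has loaded.
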
## Common Knobs

| Variable | Default | Meaning |
| --- | --- | --- |
| `LA_FLASH_MODEL` | `nvidia/LocateAnything-3B` | HF model id or local model directory. |
| `LA_FLASH_ATTN` | `sdpa` | LLM attention backend. |
| `LA_FLASH_VISION_ATTN` | `auto` | Vision encoder attention: `auto`, `flash_attention_2`, `sdpa`, or `eager`. |
| `LA_FLASH_STRICT_ATTN` | `0` | Set `1` to fail instead of falling back to SDPA. |
| `LA_FLASH_HYBRID_SCHEDULER` | `eager` | Hybrid decode scheduler. |
| `LA_FLASH_HYBRID_GROUP_SIZE` | `0` | Scheduler group size; `0` lets the runtime decide. |
| `LA_FLASH_VISION_ENCODE_BATCH_SIZE` | `8` | Maximum images per MoonViT encode micro-batch. |
| `LA_FLASH_KV_PACK_TOKEN_BUDGET` | `0` | Optional KV packing memory cap for long-tail batches. |
| `LA_FLASH_DENSE_BACKEND` | `sdpa` | Dense worker/prefill attention backend. Keep this as `sdpa`; LA Flash is used for sparse range plans. |
| `LA_FLASH_SEGMENT_FASTPATH` | `auto` | Sparse MTP decode uses FlashAttention varlen multi-segment merge by default. |

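To make the defaults in the table concrete, here is a hedged sketch that collects a few of the knobs into one object. The `BatchKnobs` class and `load_knobs` function are hypothetical names, not part of `batch_utils`; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchKnobs:
    """Illustrative container for a subset of the documented knobs."""

    model: str
    attn: str
    strict_attn: bool
    hybrid_group_size: int
    vision_encode_batch_size: int
    kv_pack_token_budget: int


def load_knobs(env=None):
    """Read knobs from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return BatchKnobs(
        model=env.get("LA_FLASH_MODEL", "nvidia/LocateAnything-3B"),
        attn=env.get("LA_FLASH_ATTN", "sdpa"),
        # The table documents "1" as the value that enables strict mode.
        strict_attn=env.get("LA_FLASH_STRICT_ATTN", "0") == "1",
        hybrid_group_size=int(env.get("LA_FLASH_HYBRID_GROUP_SIZE", "0")),
        vision_encode_batch_size=int(
            env.get("LA_FLASH_VISION_ENCODE_BATCH_SIZE", "8")
        ),
        kv_pack_token_budget=int(env.get("LA_FLASH_KV_PACK_TOKEN_BUDGET", "0")),
    )
```

Calling `load_knobs()` with no argument reads `os.environ`; passing a plain dict makes the same logic easy to exercise in tests without touching the process environment.
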
## CLI Example

```bash
python batch_infer.py \
  --model nvidia/LocateAnything-3B \
  --attn la_flash \
  --scheduler pipeline \
  --batch-size 4 \
  --image /path/to/image.jpg \
  --query "person</c>car"
```

For JSONL input, each row should contain:

```json
{"image": "/path/to/image.jpg", "query": "person</c>car"}
```

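A small sketch of producing and checking such rows with the standard `json` module follows. The helper names are hypothetical, and joining category names with `</c>` is an assumption inferred from the example query rather than a documented rule.

```python
import json

REQUIRED_KEYS = ("image", "query")


def make_row(image_path, categories):
    """Build one input row.

    Assumption: multiple categories are joined with "</c>", as in the
    example query above.
    """
    return {"image": image_path, "query": "</c>".join(categories)}


def dumps_jsonl(rows):
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def loads_jsonl(text):
    """Parse JSONL text, checking that each row has the required keys."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in row]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        rows.append(row)
    return rows
```

Validating the file before launching a batch run catches a malformed row up front instead of partway through inference.
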
## Training Boundary

This package is for inference and evaluation. Training remains on the
MagiAttention backend; the batched sparse-plan decode runtime does not support
the `labels` training path.