LLM inference benchmarks on the NVIDIA Jetson AGX Orin 64GB.
| Spec | Value |
|---|---|
| Device | NVIDIA Jetson AGX Orin 64GB |
| Memory | 61.37 GiB unified (shared between CPU and GPU) |
| Platform | JetPack L4T r36.4, tegra, aarch64 |
Docker image: `narandill/vllm:0.17.0-r36.4.tegra-aarch64-cp312-cu129-24.04`
Config: FP16, eager mode (CUDA graphs disabled), FlashAttention v2
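For reference, the configuration above maps onto a server launch roughly like the one below. This is an assumed sketch, not the verbatim benchmark invocation: the model name is one of the benchmarked models, and the docker networking/runtime flags are typical for Jetson, not confirmed by this repo.

```shell
# Sketch: launch vLLM on Jetson with the settings above (flags assumed, not verbatim).
docker run --rm -it --runtime nvidia --network host \
  narandill/vllm:0.17.0-r36.4.tegra-aarch64-cp312-cu129-24.04 \
  vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
    --dtype float16 \
    --enforce-eager \
    --max-model-len 4096
```

`--enforce-eager` disables CUDA graph capture, matching the eager-mode config used for these numbers.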
| Model | Quant | TTFT avg (ms) | TTFT min (ms) | Decode (tok/s) | max_model_len |
|---|---|---|---|---|---|
| Qwen3.5-2B | none | 232.9 | 231.8 | 12.3 | 4096 |
| Qwen3.5-9B | none | 310.4 | 304.6 | 8.5 | 2048 |
| Qwen2.5-32B-Instruct-AWQ | AWQ 4-bit | 383.0 | 298.1 | 4.9 | 4096 |
See `results/` for the structured JSON benchmark data.
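A minimal sketch for loading and ranking those JSON results. The schema here (`model`, `ttft_avg_ms`, `decode_tok_s` keys, one run per file) is an assumption for illustration, not the actual format under `results/`:

```python
import json
from pathlib import Path


def load_results(results_dir="results"):
    # Assumption: each *.json file under results/ holds one benchmark run.
    return [json.loads(p.read_text()) for p in Path(results_dir).glob("*.json")]


def summarize(results):
    # Sort runs by decode throughput, fastest first, and keep the headline numbers.
    rows = sorted(results, key=lambda r: r["decode_tok_s"], reverse=True)
    return [(r["model"], r["ttft_avg_ms"], r["decode_tok_s"]) for r in rows]


if __name__ == "__main__":
    # Inline sample mirroring the table above, so the sketch runs without results/.
    sample = [
        {"model": "Qwen3.5-2B", "ttft_avg_ms": 232.9, "decode_tok_s": 12.3},
        {"model": "Qwen2.5-32B-Instruct-AWQ", "ttft_avg_ms": 383.0, "decode_tok_s": 4.9},
    ]
    for model, ttft, tps in summarize(sample):
        print(f"{model}: TTFT {ttft} ms, decode {tps} tok/s")
```

Swap `sample` for `load_results()` once the real key names are confirmed against the files in `results/`.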