LLM inference benchmarks on the NVIDIA Jetson AGX Orin 64GB.
| Spec | Value |
|---|---|
| Device | NVIDIA Jetson AGX Orin 64GB |
| Memory | 61.37 GiB unified (shared between CPU and GPU) |
| Platform | JetPack L4T r36.4, tegra, aarch64 |
Docker image: `narandill/vllm:0.17.0-r36.4.tegra-aarch64-cp312-cu129-24.04`
Config: FP16, eager mode (CUDA graphs disabled), FlashAttention v2
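For reference, the configuration above maps onto a server launch roughly like the one below. This is an assumed sketch, not the verbatim benchmark invocation: the model name is one of the benchmarked models, and the docker networking/runtime flags are typical for Jetson, not confirmed by this repo.

```shell
# Sketch: launch vLLM on Jetson with the settings above (flags assumed, not verbatim).
docker run --rm -it --runtime nvidia --network host \
  narandill/vllm:0.17.0-r36.4.tegra-aarch64-cp312-cu129-24.04 \
  vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
    --dtype float16 \
    --enforce-eager \
    --max-model-len 4096
```

`--enforce-eager` disables CUDA graph capture, matching the eager-mode config used for these numbers.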
| Model | Quant | TTFT avg (ms) | TTFT min (ms) | Decode (tok/s) | max_model_len |
|---|---|---|---|---|---|
| Qwen3.5-2B | none | 232.9 | 231.8 | 12.3 | 4096 |
| Qwen3.5-9B | none | 310.4 | 304.6 | 8.5 | 2048 |
| Qwen2.5-32B-Instruct-AWQ | AWQ 4-bit | 383.0 | 298.1 | 4.9 | 4096 |
See `results/` for the structured JSON benchmark data.
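A minimal sketch for loading and ranking those JSON results. The schema here (`model`, `ttft_avg_ms`, `decode_tok_s` keys, one run per file) is an assumption for illustration, not the actual format under `results/`:

```python
import json
from pathlib import Path


def load_results(results_dir="results"):
    # Assumption: each *.json file under results/ holds one benchmark run.
    return [json.loads(p.read_text()) for p in Path(results_dir).glob("*.json")]


def summarize(results):
    # Sort runs by decode throughput, fastest first, and keep the headline numbers.
    rows = sorted(results, key=lambda r: r["decode_tok_s"], reverse=True)
    return [(r["model"], r["ttft_avg_ms"], r["decode_tok_s"]) for r in rows]


if __name__ == "__main__":
    # Inline sample mirroring the table above, so the sketch runs without results/.
    sample = [
        {"model": "Qwen3.5-2B", "ttft_avg_ms": 232.9, "decode_tok_s": 12.3},
        {"model": "Qwen2.5-32B-Instruct-AWQ", "ttft_avg_ms": 383.0, "decode_tok_s": 4.9},
    ]
    for model, ttft, tps in summarize(sample):
        print(f"{model}: TTFT {ttft} ms, decode {tps} tok/s")
```

Swap `sample` for `load_results()` once the real key names are confirmed against the files in `results/`.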