#+TITLE: RTX Pro 6000 for Local LLM Inference
#+author: User
#+created: [2026-03-16 Mon 14:28]
#+ID: 20260314_rtx_pro_6000_llm
#+FILETAGS: hardware gpu ai llm inference nvidia

* RTX Pro 6000 Blackwell for Local LLM Inference

** The Headline

The RTX Pro 6000 Workstation Edition is a single GPU with 96GB of GDDR7. In the benchmark below it comes within about 6% of a 4x RTX 4090 setup on a 30B-parameter model, and it has room for 70B models at FP8.

** Key Specifications

| Spec | RTX Pro 6000 | RTX 5090 | H100 PCIe |
|------|--------------|----------|-----------|
| VRAM | 96GB GDDR7 | 32GB GDDR7 | 80GB HBM2e |
| Memory Bandwidth | 1.792 TB/s | 1.792 TB/s | 2.0 TB/s |
| Architecture | Blackwell | Blackwell | Hopper |
| FP4 Support | Yes | Yes | No |
| FP8 Support | Yes | Yes | Yes |
| NVLink | No | No | Bridge only (2 GPUs) |
| ECC Memory | Yes | No | Yes |
| Form Factor | Workstation PCIe | Consumer PCIe | PCIe/SXM |

** What Fits in 96GB VRAM

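Footprints are weight-only estimates (parameters x bytes per parameter). KV cache, activations, and runtime overhead come out of the headroom.

| Model | Parameters | Precision | Fits? | Notes |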
|-------|------|-----------|-------|-------|
| Llama 3.1 | 8B | FP16 | Yes | ~16GB (17% of VRAM) |
| Qwen2.5 | 14B | FP16 | Yes | ~28GB, ~68GB headroom |
| Qwen2.5 | 32B | FP8 | Yes | ~32GB, ~64GB headroom |
| Qwen2.5 | 32B | FP16 | Yes | ~64GB, ~32GB headroom |
| Llama 3.3 | 70B | Q4 (AWQ) | Yes | ~35-40GB, 56-61GB headroom |
| Llama 3.3 | 70B | FP8 | Yes | ~70GB, ~26GB headroom |
| Llama 3.3 | 70B | FP16 | No | ~140GB needed |
| Mixtral 8x7B | ~47B | Q4 | Yes | ~24GB, comfortable |
| Mixtral 8x7B | ~47B | FP16 | No | ~94GB, no KV cache headroom |
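
The figures in the table follow from simple arithmetic, sketched below. The bytes-per-parameter values are assumptions (Q4 is taken as a flat 0.5 bytes, ignoring the scale and zero-point overhead that AWQ adds), and the helper names =weight_gb= and =headroom_gb= are made up for this note.

```python
# Rough weight-only VRAM estimate: parameters x bytes per parameter.
# Assumed values; Q4 ignores quantization metadata (scales, zero points).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion: float, precision: str, vram_gb: float = 96.0) -> float:
    """VRAM left after loading weights; negative means the weights do not fit."""
    return vram_gb - weight_gb(params_billion, precision)


if __name__ == "__main__":
    models = [
        ("Llama 3.1 8B", 8, "fp16"),
        ("Qwen2.5 32B", 32, "fp8"),
        ("Llama 3.3 70B", 70, "fp8"),
        ("Llama 3.3 70B", 70, "fp16"),
        ("Mixtral 8x7B", 47, "fp16"),
    ]
    for name, size, prec in models:
        print(f"{name:14s} {prec:5s} weights={weight_gb(size, prec):6.1f} GB "
              f"headroom={headroom_gb(size, prec):6.1f} GB")
```

A small positive headroom, as with Mixtral 8x7B at FP16, still counts as "No" in practice because nothing is left for the KV cache.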

** Benchmark Highlights

CloudRift benchmarks (Oct 2025):

| GPU | Model | Precision | Tokens/sec |
|-----|-------|-----------|------------|
| RTX Pro 6000 | Qwen3-Coder-30B-AWQ | AWQ | ~8,400 |
| 4x RTX 4090 | Qwen3-Coder-30B-AWQ | AWQ | ~8,900 |

*One GPU versus four, with throughput within about 6% and lower power draw.*
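
To put the two rows side by side, here is a quick calculation using only the approximate figures from the table. The function names are illustrative.

```python
# Approximate throughput figures copied from the benchmark table above.
PRO6000_TPS = 8400    # 1x RTX Pro 6000
QUAD_4090_TPS = 8900  # 4x RTX 4090 combined


def relative_gap(slower: float, faster: float) -> float:
    """Fractional shortfall of `slower` relative to `faster`."""
    return (faster - slower) / faster


def per_gpu(total_tps: float, gpu_count: int) -> float:
    """Average throughput contributed by each GPU in a multi-GPU setup."""
    return total_tps / gpu_count


if __name__ == "__main__":
    gap = relative_gap(PRO6000_TPS, QUAD_4090_TPS)
    each_4090 = per_gpu(QUAD_4090_TPS, 4)
    print(f"Shortfall vs 4x 4090: {gap:.1%}")
    print(f"Per-4090 throughput:  {each_4090:.0f} tok/s")
    print(f"One Pro 6000 is worth about {PRO6000_TPS / each_4090:.2f} 4090s here")
```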

** Cost Comparison (Cloud Pricing, March 2026)

| GPU | On-demand $/hr | Spot $/hr | VRAM | Fits 70B FP8? |
|-----|---------------|-----------|------|---------------|
| RTX 5090 | $0.76 | — | 32GB | No |
| RTX Pro 6000 | $1.65 | $0.72 | 96GB | Yes (~26GB headroom) |
| H100 PCIe | $2.01 | — | 80GB | Yes (~10GB headroom) |
| H200 SXM5 | $4.23 | $1.43 | 141GB | Yes (extensive) |
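
Hourly prices become comparable once they are converted to cost per million tokens. The sketch below pairs the RTX Pro 6000 prices above with the ~8,400 tokens/sec benchmark figure. That pairing is an assumption: the two numbers come from different sources and the throughput applies to one specific model and load. The other GPUs are left out because this note has no matching throughput for them.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    tps = 8400  # approximate, from the Qwen3-Coder-30B-AWQ benchmark row
    for label, price in [("on-demand", 1.65), ("spot", 0.72)]:
        cost = cost_per_million_tokens(price, tps)
        print(f"RTX Pro 6000 {label:9s}: ${cost:.4f} per 1M tokens")
```

This assumes the GPU is kept saturated for the full hour. Idle time raises the effective cost proportionally.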

** Best Use Cases

1. *30B model inference at high throughput*: replaces a 4x RTX 4090 setup
2. *32B models in FP8/FP16*: single GPU, no tensor parallelism needed
3. *70B models in Q4/FP8*: Q4 leaves ~56-61GB for KV cache, FP8 leaves ~26GB
4. *Development/testing*: $0.72/hr spot pricing for experimentation
5. *Diffusion pipelines*: SDXL + ControlNet + LoRA stacks loaded simultaneously
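
How far the KV cache headroom goes can be estimated from the model's attention layout. The sketch below assumes the published Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache. Those numbers are assumptions to check against the actual model config, and runtime overhead is ignored.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(headroom_gb: float, per_token_bytes: int) -> int:
    """Total tokens (summed over all concurrent sequences) that fit in the headroom."""
    return int(headroom_gb * 1e9 // per_token_bytes)


if __name__ == "__main__":
    # Assumed Llama 3 70B layout: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    for label, headroom in [("FP8 weights", 26), ("Q4 weights", 56)]:
        print(f"{label}: about {max_cached_tokens(headroom, per_token):,} cached tokens")
```

The token budget is shared across all in-flight requests, so it can be spent on one long context or on many short ones.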

** Limitations

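- No NVLink: a second card can only be reached over PCIe, so tensor parallelism beyond 96GB pays a communication penalty
- No MIG partitioning: not suited to multi-tenant enterprise serving
- 70B FP16 won't fit: ~140GB of weights needs multi-GPU, and even a single H200 (141GB) would leave almost no room for KV cache
- High-batch throughput: H100 SXM wins at maximum concurrency (3.35 TB/s HBM3 vs 1.792 TB/s GDDR7)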
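
The bandwidth point can be made concrete with a common rule of thumb: during single-stream decoding every generated token reads all the weights once, so memory bandwidth divided by weight size gives a rough ceiling on tokens/sec. This is a simplification that ignores KV cache reads, compute limits, and batching, and the function name is illustrative.

```python
def decode_ceiling_tps(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Rough upper bound on single-stream decode speed (tokens/sec).

    Assumes each token requires one full pass over the weights and that
    memory bandwidth is the only bottleneck.
    """
    return bandwidth_tb_s * 1000 / weights_gb


if __name__ == "__main__":
    weights_gb = 70  # Llama 3.3 70B at FP8, weight-only estimate
    for gpu, bw in [("RTX Pro 6000 (GDDR7)", 1.792), ("H100 SXM (HBM3)", 3.35)]:
        print(f"{gpu:22s} ceiling: {decode_ceiling_tps(bw, weights_gb):5.1f} tok/s per stream")
```

Batching raises aggregate throughput well above this per-stream figure, but the ratio between the two GPUs is a reasonable first guess at how they compare once both are bandwidth-bound.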

** Strategic Assessment

For a personal/enthusiast rack-mounted setup:

*Pros:*
- 96GB VRAM covers most practical local LLM use cases (every configuration in the table above except Llama 3.3 70B and Mixtral 8x7B at FP16)
- Single-GPU simplicity (no multi-GPU orchestration)
- Blackwell FP4 support (halves weight memory vs FP8, with up to 2x peak tensor throughput in theory)
- Lower cost per token than H100 for 30B workloads
- ECC memory for reliability

*Cons:*
- No NVLink, so scaling past 96GB means multi-GPU over PCIe
- GDDR7 bandwidth (1.792 TB/s) vs H100 SXM HBM3 (3.35 TB/s) matters at extreme concurrency
- Workstation GPU, not datacenter (no MIG, no cluster integration)

** Verdict

The RTX Pro 6000 is the sweet spot for a *modular* home lab, though without NVLink the upgrade path past 96GB is limited. It won't match an 8x H100 cluster, but it doesn't need to. For single-user inference, development, and experimentation with models up to 70B parameters, it's the most practical high-VRAM option available.

** Sources

- Spheron Network: "NVIDIA RTX PRO 6000 Blackwell for AI" (March 2026)
- CloudRift: "RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark" (Oct 2025)
- Medium/Data Science Collective: "Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000" (Feb 2026)

** Related

- [[file:20260313_gemini_embedding_2.org][Gemini Embedding-2 Research]]
- [[file:20260311_agora_v2_gap_analysis.org][Agora v2 Gap Analysis]]