RTX Pro 6000 for Local LLM Inference

RTX Pro 6000 Blackwell for Local LLM Inference

The Headline

The RTX Pro 6000 Workstation Edition is a single GPU with 96GB of GDDR7 that can replace a 4x RTX 4090 setup for 30B-parameter models, with headroom for 70B models in FP8 quantization.

Key Specifications

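| Spec             | RTX Pro 6000     | RTX 5090      | H100 PCIe  |
|------------------+------------------+---------------+------------|
| VRAM             | 96GB GDDR7       | 32GB GDDR7    | 80GB HBM2e |
| Memory Bandwidth | 1.792 TB/s       | 1.792 TB/s    | 2.0 TB/s   |
| Architecture     | Blackwell        | Blackwell     | Hopper     |
| FP4 Support      | Yes              | Yes           | No         |
| FP8 Support      | Yes              | Yes           | Yes        |
| NVLink           | No               | No            | No (PCIe)  |
| ECC Memory       | Yes              | No            | Yes        |
| Form Factor      | Workstation PCIe | Consumer PCIe | PCIe/SXM   |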

What Fits in 96GB VRAM

| Model        | Size | Precision | Fits? | Notes                        |
|--------------+------+-----------+-------+------------------------------|
| Llama 3.1    | 8B   | FP16      | Yes   | ~16GB (17% of VRAM)          |
| Qwen2.5      | 14B  | FP16      | Yes   | ~28GB, ~68GB headroom        |
| Qwen2.5      | 32B  | FP8       | Yes   | ~32GB, ~64GB headroom        |
| Qwen2.5      | 32B  | FP16      | Yes   | ~64GB, ~32GB headroom        |
| Llama 3.3    | 70B  | Q4 (AWQ)  | Yes   | ~35-40GB, ~56-61GB headroom  |
| Llama 3.3    | 70B  | FP8       | Yes   | ~70GB, ~26GB headroom        |
| Llama 3.3    | 70B  | FP16      | No    | ~140GB needed                |
| Mixtral 8x7B | ~47B | Q4        | Yes   | ~24GB, comfortable           |
| Mixtral 8x7B | ~47B | FP16      | No    | ~94GB, no KV cache headroom  |
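The figures in the table follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter. Below is a minimal sketch of that arithmetic, assuming decimal gigabytes and ignoring KV cache, activations, and framework overhead. The function names and the bytes-per-parameter values are illustrative assumptions, not taken from any library.

```python
# Rough weight-memory estimate: parameters (in billions) x bytes per parameter.
# Assumptions: 1 GB = 1e9 bytes; KV cache, activations, and framework overhead
# are NOT included, so real usage will be higher than these figures.

VRAM_GB = 96  # RTX Pro 6000, from the spec table

# Approximate bytes per parameter (illustrative; 4-bit formats such as AWQ
# also store scales and zero points, which is why Q4 lands above 0.5 in practice).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "Q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Estimated size of the model weights in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion: float, precision: str, vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading weights; negative means the weights do not fit."""
    return vram_gb - weight_gb(params_billion, precision)


if __name__ == "__main__":
    rows = [
        ("Llama 3.1 8B", 8, "FP16"),
        ("Qwen2.5 32B", 32, "FP16"),
        ("Llama 3.3 70B", 70, "FP8"),
        ("Llama 3.3 70B", 70, "FP16"),
    ]
    for name, size_b, precision in rows:
        left = headroom_gb(size_b, precision)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{name} {precision}: ~{weight_gb(size_b, precision):.0f} GB, {verdict} ({left:+.0f} GB)")
```

The estimate is a lower bound on real usage. The Q4 row in the table quotes ~35-40GB for a 70B model rather than a flat 35GB, which is consistent with quantization metadata adding to the raw weight size.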

Benchmark Highlights

CloudRift benchmarks (Oct 2025):

| GPU          | Model               | Precision | Tokens/sec |
|--------------+---------------------+-----------+------------|
| RTX Pro 6000 | Qwen3-Coder-30B-AWQ | AWQ       | ~8,400     |
| 4x RTX 4090  | Qwen3-Coder-30B-AWQ | AWQ       | ~8,900     |

One GPU delivers about 94% of the throughput of four (~8,400 vs ~8,900 tokens/sec), with lower power draw.
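To put the two benchmark rows on a common footing, it can help to look at throughput per GPU as well as in aggregate. The sketch below only does arithmetic on the two CloudRift figures quoted above; the helper names are hypothetical.

```python
# Arithmetic on the two CloudRift figures quoted in the table above.
# Helper names are illustrative only.

PRO_6000_TPS = 8_400      # 1x RTX Pro 6000, tokens/sec (approximate)
QUAD_4090_TPS = 8_900     # 4x RTX 4090, tokens/sec (approximate)


def relative_throughput(single_tps: float, multi_tps: float) -> float:
    """Throughput of the single-GPU setup as a fraction of the multi-GPU setup."""
    return single_tps / multi_tps


def per_gpu_tps(total_tps: float, gpu_count: int) -> float:
    """Average tokens/sec contributed by each GPU in a multi-GPU setup."""
    return total_tps / gpu_count


if __name__ == "__main__":
    ratio = relative_throughput(PRO_6000_TPS, QUAD_4090_TPS)
    each_4090 = per_gpu_tps(QUAD_4090_TPS, 4)
    print(f"RTX Pro 6000 vs 4x RTX 4090: {ratio:.1%} of aggregate throughput")
    print(f"Per-GPU throughput in the 4090 rig: ~{each_4090:.0f} tokens/sec")
    print(f"One RTX Pro 6000 is worth ~{PRO_6000_TPS / each_4090:.1f} RTX 4090s here")
```

These ratios apply only to this one model and benchmark configuration, so they should not be read as a general scaling law.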

Cost Comparison (Cloud Pricing, March 2026)

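| GPU          | On-demand $/hr | Spot $/hr | VRAM  | Fits 70B FP8?         |
|--------------+----------------+-----------+-------+-----------------------|
| RTX 5090     | $0.76          |           | 32GB  | No                    |
| RTX Pro 6000 | $1.65          | $0.72     | 96GB  | Yes (~26GB headroom)  |
| H100 PCIe    | $2.01          |           | 80GB  | Yes (~10GB headroom)  |
| H200 SXM5    | $4.23          | $1.43     | 141GB | Yes (extensive)       |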
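Hourly prices become easier to compare once converted to cost per million tokens. The sketch below combines the RTX Pro 6000 prices from this table with the ~8,400 tokens/sec CloudRift figure from the benchmark section. That pairing is an assumption made for illustration: the price and the benchmark come from different sources and dates, and sustained throughput on a rented instance may differ.

```python
# Convert an hourly GPU price into cost per million generated tokens.
# Assumption: the GPU sustains the quoted throughput for the whole hour.


def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per 1M tokens at a given hourly price and sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    tps = 8_400  # RTX Pro 6000, Qwen3-Coder-30B-AWQ (approximate benchmark figure)
    for label, price in [("on-demand", 1.65), ("spot", 0.72)]:
        cost = cost_per_million_tokens(price, tps)
        print(f"RTX Pro 6000 {label}: ${cost:.4f} per 1M tokens")
```

The same function can be applied to the other rows once a measured throughput for the same model is available. No such figure is given in this note for the RTX 5090, H100 PCIe, or H200.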

Best Use Cases

  1. 30B model inference at high throughput — Replaces 4x RTX 4090 setup
  2. 32B models in FP8/FP16 — Single GPU, no tensor parallelism needed
  3. 70B models in Q4/FP8 — Fits comfortably with KV cache headroom
  4. Development/testing — $0.72/hr spot pricing for experimentation
  5. Diffusion pipelines — SDXL + ControlNet + LoRA stacks simultaneously
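How much "KV cache headroom" is enough depends on the model's attention layout. The sketch below estimates KV cache size per token and how many tokens fit in the ~26GB left after loading a 70B model in FP8. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions based on the published Llama 3 70B configuration. The FP16 cache precision and the helper names are also assumptions.

```python
# KV cache sizing: each token stores one key and one value vector
# per layer per KV head.
# Assumed Llama 3.3 70B layout (from the public Llama 3 70B config):
#   80 layers, 8 KV heads (grouped-query attention), head_dim 128.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token; the leading 2 covers key + value."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(headroom_gb: float, per_token_bytes: int) -> int:
    """Tokens that fit in the given headroom (1 GB = 1e9 bytes), across all sequences."""
    return int(headroom_gb * 1e9 // per_token_bytes)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    total = max_cached_tokens(26, per_token)  # ~26GB headroom at FP8 weights
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"Tokens that fit in 26 GB: ~{total:,}")
    print(f"Concurrent 8K-token sequences: ~{total // 8192}")
```

This is an upper bound: activations, CUDA context, and allocator fragmentation also draw on the same headroom, so the usable token budget will be lower in practice.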

Limitations

  • No NVLink: scaling beyond 96GB means tensor parallelism over PCIe, with slower inter-GPU communication
  • No MIG partitioning: not suited to multi-tenant enterprise serving
  • 70B FP16 won't fit: needs an H200 (141GB) or multiple GPUs
  • High-batch throughput: H100 SXM wins at maximum concurrency (3.35 TB/s HBM3)

Strategic Assessment

For a personal/enthusiast rack-mounted setup:

Pros:

  • 96GB VRAM covers 95% of practical local LLM use cases
  • Single-GPU simplicity (no multi-GPU orchestration)
  • Blackwell FP4 support (up to about 2x peak tensor throughput vs FP8, depending on the workload)
  • Lower cost per token than H100 for 30B workloads
  • ECC memory for reliability

Cons:

  • No upgrade path beyond 96GB (no NVLink)
  • GDDR7 bandwidth (1.792 TB/s) vs HBM3 (3.35 TB/s) matters at extreme concurrency
  • Workstation GPU, not datacenter (no MIG, no cluster integration)
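The bandwidth gap in the cons list can be made concrete with a common rule of thumb: in memory-bound single-stream decoding, every generated token requires reading all model weights once, so tokens/sec is capped at roughly bandwidth divided by weight size. The sketch below applies that rule to a 70B model in FP8 (~70GB). It is a rough ceiling under that assumption, not a benchmark: it ignores KV cache reads, batching, and kernel efficiency, and the function name is hypothetical.

```python
# Rule-of-thumb ceiling for memory-bound, single-stream decoding:
#   tokens/sec <= memory bandwidth / bytes of weights read per token.
# Ignores KV cache traffic, batching, and kernel efficiency.


def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec for one sequence when weights are read once per token."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 70  # Llama 3.3 70B at FP8, from the fit table
    gpus = {
        "RTX Pro 6000 (GDDR7)": 1792,  # GB/s
        "H100 SXM (HBM3)": 3350,       # GB/s
    }
    for name, bandwidth in gpus.items():
        ceiling = decode_ceiling_tps(bandwidth, weights_gb)
        print(f"{name}: <= {ceiling:.1f} tokens/sec per stream")
    print(f"Bandwidth ratio: {3350 / 1792:.2f}x in favor of HBM3")
```

Batching amortizes the weight reads across many sequences, which is why aggregate throughput figures such as those in the benchmark section are far above this per-stream ceiling.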

Verdict

The RTX Pro 6000 is the sweet spot for a modular, upgradable home lab. It won't match an 8x H100 cluster, but it doesn't need to. For single-user inference, development, and experimentation with models up to 70B parameters, it's the most practical high-VRAM option available.

Sources

  • Spheron Network: "NVIDIA RTX PRO 6000 Blackwell for AI" (March 2026)
  • CloudRift: "RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark" (Oct 2025)
  • Medium/Data Science Collective: "Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000" (Feb 2026)