#+TITLE: RTX Pro 6000 for Local LLM Inference
#+author: User
#+created: [2026-03-16 Mon 14:28]
#+ID: 20260314_rtx_pro_6000_llm
#+FILETAGS: hardware gpu ai llm inference nvidia @personal

* RTX Pro 6000 Blackwell for Local LLM Inference

** The Headline

The RTX Pro 6000 Workstation Edition is a single GPU with 96GB of GDDR7 that can replace a 4x RTX 4090 setup for 30B-parameter models, with headroom for 70B models quantized to FP8.

** Key Specifications

| Spec             | RTX Pro 6000     | RTX 5090      | H100 PCIe  |
|------------------+------------------+---------------+------------|
| VRAM             | 96GB GDDR7       | 32GB GDDR7    | 80GB HBM2e |
| Memory Bandwidth | 1.792 TB/s       | 1.792 TB/s    | 2.0 TB/s   |
| Architecture     | Blackwell        | Blackwell     | Hopper     |
| FP4 Support      | Yes              | Yes           | No         |
| FP8 Support      | Yes              | Yes           | Yes        |
| NVLink           | No               | No            | No (PCIe)  |
| ECC Memory       | Yes              | No            | Yes        |
| Form Factor      | Workstation PCIe | Consumer PCIe | PCIe/SXM   |

** What Fits in 96GB VRAM

Sizes are approximate weight footprints; headroom is what remains of the 96GB for KV cache and activations.

| Model        | Parameters | Precision | Fits? | Notes                                 |
|--------------+------------+-----------+-------+---------------------------------------|
| Llama 3.1    | 8B         | FP16      | Yes   | ~16GB (17% of VRAM)                   |
| Qwen2.5      | 14B        | FP16      | Yes   | ~28GB, ~68GB headroom                 |
| Qwen2.5      | 32B        | FP8       | Yes   | ~32GB, ~64GB headroom                 |
| Qwen2.5      | 32B        | FP16      | Yes   | ~64GB, ~32GB headroom                 |
| Llama 3.3    | 70B        | Q4 (AWQ)  | Yes   | ~35-40GB, ~56-61GB headroom           |
| Llama 3.3    | 70B        | FP8       | Yes   | ~70GB, ~26GB headroom                 |
| Llama 3.3    | 70B        | FP16      | No    | ~140GB needed                         |
| Mixtral 8x7B | ~47B       | Q4        | Yes   | ~24GB, comfortable                    |
| Mixtral 8x7B | ~47B       | FP16      | No    | ~94GB of weights, no KV cache headroom |

** Benchmark Highlights

CloudRift benchmarks (Oct 2025):

| GPU          | Model               | Precision | Tokens/sec |
|--------------+---------------------+-----------+------------|
| RTX Pro 6000 | Qwen3-Coder-30B-AWQ | AWQ       | ~8,400     |
| 4x RTX 4090  | Qwen3-Coder-30B-AWQ | AWQ       | ~8,900     |

*One GPU versus four, at about 94% of the throughput and with lower power draw.*

** Cost Comparison (Cloud Pricing, March 2026)

| GPU          | On-demand $/hr | Spot $/hr | VRAM  | Fits 70B FP8?        |
|--------------+----------------+-----------+-------+----------------------|
| RTX 5090     | $0.76          | n/a       | 32GB  | No                   |
| RTX Pro 6000 | $1.65          | $0.72     | 96GB  | Yes (~26GB headroom) |
| H100 PCIe    | $2.01          | n/a       | 80GB  | Yes (~10GB headroom) |
| H200 SXM5    | $4.23          | $1.43     | 141GB | Yes (extensive)      |

** Best Use Cases

1. *30B model inference at high throughput*: replaces a 4x RTX 4090 setup
2. *32B models in FP8/FP16*: single GPU, no tensor parallelism needed
3. *70B models in Q4/FP8*: fits with ~26GB (FP8) or more (Q4) left for KV cache
4. *Development/testing*: $0.72/hr spot pricing for experimentation
5. *Diffusion pipelines*: SDXL + ControlNet + LoRA stacks loaded simultaneously

** Limitations

- No NVLink: scaling beyond 96GB means multi-GPU parallelism over PCIe rather than a high-bandwidth interconnect
- No MIG partitioning: not suited to multi-tenant enterprise serving
- 70B FP16 won't fit: needs an H200 (141GB) or multiple GPUs
- High-batch throughput: the H100 SXM (3.35 TB/s HBM3) wins at maximum concurrency

** Strategic Assessment

For a personal/enthusiast rack-mounted setup:

*Pros:*
- 96GB VRAM covers the large majority of practical local LLM use cases (models up to 70B at FP8)
- Single-GPU simplicity (no multi-GPU orchestration)
- Blackwell FP4 support (up to roughly double the peak throughput of FP8)
- Lower cost per token than H100 for 30B workloads
- ECC memory for reliability

*Cons:*
- No high-bandwidth upgrade path beyond 96GB (no NVLink)
- GDDR7 bandwidth (1.792 TB/s) vs HBM3 (3.35 TB/s) matters at extreme concurrency
- Workstation GPU, not datacenter (no MIG, no cluster integration)

** Verdict

The RTX Pro 6000 is the sweet spot for a *modular, upgradable* home lab. It won't match an 8x H100 cluster, but it doesn't need to. For single-user inference, development, and experimentation with models up to 70B parameters, it's the most practical high-VRAM option available.
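
The weight sizes in the "What Fits in 96GB VRAM" table follow from a simple rule of thumb: parameter count times bytes per parameter. Below is a minimal Python sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only. KV cache, activations, and quantization metadata are ignored, which is why real Q4 checkpoints land somewhat above the computed floor. The helper names (=weight_gb=, =headroom_gb=) are hypothetical and not part of any library.

```python
# Rough weight-footprint estimate: params (billions) * bits per param / 8.
# Assumption: 1 GB = 1e9 bytes; weights only, no KV cache or runtime overhead.

VRAM_GB = 96  # RTX Pro 6000 capacity, from the spec table

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "Q4": 4}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * BITS_PER_PARAM[precision] / 8


def headroom_gb(params_billion: float, precision: str,
                vram_gb: float = VRAM_GB) -> float:
    """VRAM left after loading weights; a negative value means it does not fit."""
    return vram_gb - weight_gb(params_billion, precision)


if __name__ == "__main__":
    rows = [
        ("Llama 3.1 8B", 8, "FP16"),
        ("Qwen2.5 32B", 32, "FP8"),
        ("Llama 3.3 70B", 70, "FP8"),
        ("Llama 3.3 70B", 70, "FP16"),
        ("Mixtral 8x7B", 47, "FP16"),
    ]
    for name, size, precision in rows:
        w = weight_gb(size, precision)
        h = headroom_gb(size, precision)
        print(f"{name:14s} {precision:5s} weights ~{w:.0f}GB, headroom {h:.0f}GB")
```

A positive headroom is necessary but not sufficient: the Mixtral 8x7B FP16 row leaves about 2GB, which is too little for a usable KV cache, so the table marks it as not fitting.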

** Sources

- Spheron Network: "NVIDIA RTX PRO 6000 Blackwell for AI" (March 2026)
- CloudRift: "RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark" (Oct 2025)
- Medium/Data Science Collective: "Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000" (Feb 2026)

** Related

- [[file:20260313_gemini_embedding_2.org][Gemini Embedding-2 Research]]
- [[file:20260311_agora_v2_gap_analysis.org][Agora v2 Gap Analysis]]