hermes-brain/projects/server-rack-build.org
2026-06-01 03:03:11 +00:00

:PROPERTIES:
:CREATED: [2026-05-31 Sun]
:ID: a1b2c3d4-e5f6-7a8b-9c0d-1e2f3a4b5c6d
:END:
#+title: Server Rack Build — Working Note
#+filetags: :infrastructure:rack:build:
#+STATUS: draft
* Overview
Building out a 10-20U open rack with server-grade components bought individually over several months. The first racked node pulls triple duty as Passepartout host, Proxmox home server, and ZFS array. Node-1 (Protectli, i7, 6 NICs) stays as the network edge.
10Gb networking is already in place and stable.
* Current topology
- *Node-1 (Protectli)*: Small form factor, i7, 6 NICs, no PCIe, no GPU, limited RAM. Network appliance / router.
- *Node-2 (racked)*: First rack server. Passepartout + Proxmox + ZFS + GPU for local Hermes inference.
* Chassis
- 3U or 4U rackmount
- Room for full-height GPU, hot-swap drive bays, sufficient airflow
- Open rack design, 10-20U growable
* Platform decision (TBD)
| Option | Pros | Cons |
|--------|------|------|
| Intel Xeon 6 (Granite Rapids) | Newest arch, up to 136 PCIe 5.0 lanes (single socket), AMX AI accelerators; 8-ch DDR5 on LGA 4710 (12-ch only on the 6900P series, LGA 7529) | New socket (new mobo cost), DDR5 only, expensive |
| AMD EPYC 7002 (Rome) | 128 PCIe 4.0 lanes, 8-ch DDR4, cheap on used market | Older gen, DDR4 (slower, but cheap), no AMX |
| AMD EPYC 9004/9005 (Genoa/Turin) | Current gen, 128 PCIe 5.0 lanes per socket (up to 160 in dual-socket), 12-ch DDR5 | More expensive than 7002, DDR5 only |
* GPU decision (TBD)
Local inference for Hermes. Candidates:
| Option | VRAM | Price | Notes |
|--------|------|-------|-------|
| Intel Arc Pro B70 | 32 GB GDDR6 | ~$949 MSRP | Battlemage workstation, air-cooled, 230W, PCIe 5.0 x16. Plug-and-play with standard toolchains. |
| Tenstorrent P150 (Blackhole) | 32 GB GDDR6 | ~$1,399 | RISC-V Tensix, open source stack, 300W. Software less mature, needs tt-forge compilation. 4x QSFP-DD for linking cards. |
| RTX 5090 | 32 GB GDDR7 | ~$2,000 | CUDA, best software ecosystem. Consumer card, may need blower mod for rack. |
| RTX 6000 Ada (used) | 48 GB GDDR6 | ~$4-5K used | More VRAM, enterprise. Higher price even used. |
Key consideration for the P150: it does not run CUDA and is not a GPU in the conventional sense. The main cost is software maturity, not the hardware price.
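One rough way to compare the candidates is cost per GB of VRAM. The sketch below uses the VRAM and price figures from the table; the used RTX 6000 Ada price is assumed to be the midpoint of the quoted $4-5K range, and the function names are illustrative only. It ignores software maturity, power draw, and memory bandwidth, so treat it as one input to the decision rather than a verdict.
#+begin_src python
# VRAM (GB) and price (USD) copied from the candidate table.
# Assumption: the used RTX 6000 Ada is priced at the midpoint of $4-5K.
CANDIDATES = {
    "Intel Arc Pro B70": (32, 949),
    "Tenstorrent P150": (32, 1399),
    "RTX 5090": (32, 2000),
    "RTX 6000 Ada (used)": (48, 4500),
}


def usd_per_gb(vram_gb, price_usd):
    """Price divided by VRAM capacity."""
    return price_usd / vram_gb


def rank_by_cost(candidates):
    """Return (name, USD per GB) pairs, cheapest per GB first."""
    rows = [(name, usd_per_gb(*spec)) for name, spec in candidates.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, cost in rank_by_cost(CANDIDATES):
        print(f"{name:22s} ${cost:6.2f}/GB")
#+end_src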
* Memory plan
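Start with 2×64GB DDR5 ECC RDIMM (DDR4 if the platform ends up being EPYC 7002), then grow to 4×64GB and finally 8×64GB, which is 512GB and fills every channel on an 8-channel platform. On a 12-channel platform, filling every channel with 64GB DIMMs takes 12×64GB = 768GB; 384GB there would need 32GB DIMMs or leave half the channels empty.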
Tradeoff: populating fewer DIMMs than the channel count cuts peak memory bandwidth roughly in proportion. 2 DIMMs on an 8-channel platform gives about 25% of peak. ZFS ARC performance and VM responsiveness suffer first. LLM inference is largely unaffected as long as the model fits in the GPU's own VRAM.
Alternative: start with 4×64GB for half of peak bandwidth on 8-channel, which avoids crippling storage I/O, then grow to 8×64GB.
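The capacity and bandwidth arithmetic above can be sketched as follows. This assumes one DIMM per channel and that peak bandwidth scales linearly with populated channels, which is a simplification: real boards have specific supported population layouts. The function name is illustrative.
#+begin_src python
def population(dimms, channels, dimm_gb=64):
    """Return (capacity in GB, fraction of peak bandwidth).

    Assumes one DIMM per channel and linear scaling with populated channels.
    """
    if not 0 < dimms <= channels:
        raise ValueError("expected between 1 and `channels` DIMMs")
    return dimms * dimm_gb, dimms / channels


if __name__ == "__main__":
    for channels in (8, 12):
        for dimms in (2, 4, 8, 12):
            if dimms <= channels:
                capacity, fraction = population(dimms, channels)
                print(f"{dimms:2d} x 64GB on {channels:2d}-ch: "
                      f"{capacity:3d} GB, {fraction:.0%} of peak")
#+end_src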
* Build order (over months)
1. Rack + chassis + PSU
2. Motherboard + CPU + RAM + boot drives (runs Proxmox + ZFS immediately)
3. HDDs for ZFS array (start with 2, grow)
4. GPU (last piece — when inference workload justifies it)
* Questions still open
- Intel Xeon 6 vs AMD EPYC (which gen)?
- DDR4 (EPYC 7002) vs DDR5 (everything else)?
- GPU: Intel Arc Pro B70 vs Tenstorrent P150 vs RTX 5090?
- Start with 2×64GB or 4×64GB on memory?
- Water cooling for CPU (Xeon 6 TDP may need it) or just air?
- Specific rack model / chassis model?
* Strategic framing
This node is a bootstrap between Stage 0 (current, conventional) and Stages 3-4 (Lisp machine, bare-metal, in-process LLM on dedicated silicon). DDR4's bandwidth ceiling won't matter because:
- Proxmox + ZFS + the Gate (Stage 2) don't stress 8-channel DDR4-3200
- GPU inference uses its own VRAM, not system memory
- By the time the Lisp machine arrives (different hardware entirely), this node graduates to NAS / Proxmox host duty
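To put a number on the DDR4 ceiling, theoretical peak bandwidth is channels × transfer rate × 8 bytes per transfer (64-bit data bus per channel). The sketch below computes it for 8-channel DDR4-3200; the DDR5-4800 comparison figure is an assumption added here for scale, not something this note specifies. Sustained real-world bandwidth is lower than these peaks.
#+begin_src python
def peak_gb_per_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


if __name__ == "__main__":
    # 8-channel DDR4-3200, as on EPYC 7002.
    print(f"8-ch DDR4-3200:  {peak_gb_per_s(8, 3200):.1f} GB/s")
    # Same platform with only 2 of 8 channels populated.
    print(f"2-ch DDR4-3200:  {peak_gb_per_s(2, 3200):.1f} GB/s")
    # Assumed comparison point: 12-channel DDR5-4800.
    print(f"12-ch DDR5-4800: {peak_gb_per_s(12, 4800):.1f} GB/s")
#+end_src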
Part availability risk is acceptable: by 7+ years of service the build will have paid for itself many times over, and a motherboard failure at that point means re-platforming onto whatever is current rather than trying to resurrect DDR4 infrastructure.