Choosing a Desktop Computer or Home Server for a Private Local LLM
What to buy (and why) for running private local LLMs at home: GPUs, RAM, storage, CPUs, future‑proofing, and private ways to connect from your other laptops.
What matters most for local LLMs
If your goal is fast, private inference at home, prioritize components in this order: GPU VRAM, system RAM, fast NVMe storage, then CPU. More VRAM lets you run larger models and/or less aggressive quantization, which improves quality and speed. System RAM and NVMe affect how many models you can keep locally and how quickly they load. CPU still matters for data prep, tokenization, and non‑GPU backends, but the GPU dominates LLM inference performance.
- GPU VRAM (most future‑proof): 24 GB is a great long‑term target for single‑GPU setups. It comfortably runs popular 7B–13B models with higher precision and longer context, and many 30B variants with quantization. See practical VRAM sizing in community and tool docs (e.g., the quantization and GGUF notes for llama.cpp), and the rough estimate after this list.
- System RAM: 64 GB is a comfortable baseline; 128 GB adds headroom for larger context windows, embeddings, RAG indexes, and running multiple services. CPU‑only inference with llama.cpp benefits from more RAM and AVX2/AVX‑512 instruction support on modern CPUs.
- Storage: Prefer 2–4 TB NVMe (PCIe Gen4/Gen5) for models and vector stores; add HDDs for bulk archives. Faster NVMe reduces model load times and improves overall responsiveness when swapping models.
- CPU: 12–16 performance cores (e.g., Ryzen 9 / Core i9) is ample for inference rigs. If you plan heavy multitasking, container stacks, or multi‑GPU, higher‑lane workstation platforms (Threadripper PRO / Xeon W) add stability and I/O bandwidth.
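For a rough sense of what fits a given VRAM budget, a weights‑only estimate of parameters times bits per weight is a useful starting point. The sketch below is a back‑of‑envelope calculation under stated assumptions (about 5 bits per weight approximates a 4‑bit K‑quant GGUF); the KV cache and runtime overhead add several more GB on top, especially at long context.

# Weights-only memory estimate (an assumption-laden rule of thumb, not a guarantee):
# a 13B model at ~5 bits/weight needs roughly 13e9 * 5 / 8 bytes ≈ 8.1 GB,
# before KV cache and runtime overhead are added.
params_billions=13
bits_per_weight=5
awk -v p="$params_billions" -v b="$bits_per_weight" \
    'BEGIN { printf "approx. weight memory: %.1f GB\n", p * 1e9 * b / 8 / 1e9 }'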
Recommended configurations
Balanced private LLM desktop (quiet, powerful)
- GPU: 24 GB VRAM (e.g., RTX 4090 class) for long‑term viability. Note that the consumer RTX 4090 lacks NVLink, so memory cannot be pooled across GPUs (see NVIDIA's specs).
- CPU: AMD Ryzen 9 7950X or Intel Core i9‑14900K
- Memory: 64–128 GB DDR5
- Storage: 2–4 TB NVMe Gen4/Gen5 for models + optional HDDs for cold storage
- PSU: Quality 1000 W (80+ Gold/Platinum). RTX 4090 boards are commonly rated ~450 W; size PSU with healthy headroom (NVIDIA).
- Cooling & case: High‑airflow case, premium air or 360 mm AIO; prioritize low noise if the system is in your office.
Home server for always‑on workloads
- Platform: Threadripper PRO / Xeon W for more PCIe lanes, ECC memory support, and multi‑GPU headroom (consumer platforms often drop GPUs to x8/x4 when slots are populated; workstation boards avoid many of these limits).
- Memory: 128–256 GB ECC
- GPU: 24–48 GB VRAM class; blower‑style or server‑oriented cooling for rackmounts
- Storage: Multiple NVMe (models, indexes) + RAID/ZFS for reliability
- Networking: 2.5/10 GbE if you’ll stream large embeddings or files to other machines
- PSU: 1200–1600 W for multi‑GPU headroom
Low‑power/entry option (CPU‑first)
- CPU‑only inference: Use llama.cpp with quantized GGUF models; expect lower throughput but excellent privacy and simplicity (a minimal run is sketched after this list).
- Memory/Storage: 32–64 GB RAM, 1–2 TB NVMe
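That minimal run might look like the following, assuming you build llama.cpp from source and already have a quantized GGUF file; the model path and prompt are placeholders.

# Build llama.cpp and run a quantized GGUF model on the CPU only.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j
# -t sets CPU threads, -c the context length; the model filename is a placeholder.
./build/bin/llama-cli \
  -m ./models/model-7b-q4_k_m.gguf \
  -t "$(nproc)" -c 4096 \
  -p "Summarize why local inference helps privacy."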
How large a model can I run?
Practical model size depends on VRAM, quantization, context length, and runtime. Tools like llama.cpp support quantization via GGUF to shrink memory needs, allowing 7B–13B models on modest GPUs/CPUs. Frameworks that serve transformer models with paged key‑value cache management (e.g., vLLM) can dramatically improve throughput and memory efficiency using PagedAttention.
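As a concrete instance of that quantization step, llama.cpp ships a quantize tool that converts a full‑precision GGUF into a smaller variant. This is a sketch with placeholder filenames, run from the llama.cpp checkout built in the earlier example:

# Shrink an F16 GGUF to a 4-bit K-quant; Q4_K_M is one of llama.cpp's built-in types.
./build/bin/llama-quantize \
  ./models/model-13b-f16.gguf \
  ./models/model-13b-q4_k_m.gguf \
  Q4_K_M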
Future‑proofing tips
- Prioritize VRAM: More VRAM extends useful life as models and context windows grow.
- Leave RAM slots free: Start with 2×32 GB or 2×64 GB to keep upgrade paths open.
- M.2 expansion: Choose boards with 3–4 M.2 NVMe slots and consider a U.2/U.3 path if you’ll scale storage.
- Platform lanes: If you might add multiple GPUs/NVMe cards/NICs, pick workstation platforms with ample PCIe lanes so devices stay at full bandwidth.
- Power and thermals: Over‑spec the PSU and cooling now to avoid replacements later; check GPU power specs (e.g., RTX 4090 ~450 W board power, NVIDIA).
Private ways to connect from your other laptops
Rule #1: Don’t expose your model API directly to the public Internet. Use your LAN or a private network overlay, and authenticate clients.
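On the server itself, a host firewall can keep the model port scoped to your LAN or VPN subnet as an extra layer. The sketch below uses ufw with a placeholder subnet and the default Ollama port; adapt it to your firewall and addressing.

# Allow SSH first, then restrict the model API port to the local subnet only.
# 192.168.1.0/24 and port 11434 are placeholders for your LAN and model server.
sudo ufw allow OpenSSH
sudo ufw default deny incoming
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw enable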
Option A — VPN overlay (WireGuard/Tailscale)
- WireGuard: Lightweight, audited VPN. Follow the official Quickstart to create keys and peer tunnels.
- Tailscale: Zero‑config WireGuard with device auth, ACLs, and MagicDNS for easy names like llm-box.tailnet-name.ts.net. See MagicDNS and ACLs. A minimal sketch of both options follows this list.
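A minimal point‑to‑point setup might look like the following sketch; keys, addresses, and the endpoint are placeholders, and Tailscale collapses the whole flow into one command per machine.

# WireGuard: generate a key pair for the laptop (repeat on the server).
wg genkey | tee laptop.key | wg pubkey > laptop.pub

# /etc/wireguard/wg0.conf on the laptop (placeholder keys and addresses):
# [Interface]
# PrivateKey = <contents of laptop.key>
# Address    = 10.10.0.2/24
#
# [Peer]
# PublicKey  = <server public key>
# AllowedIPs = 10.10.0.1/32
# Endpoint   = your-home-ip-or-ddns:51820

sudo wg-quick up wg0   # the LLM box is now reachable at 10.10.0.1

# Tailscale alternative: install it on both machines, then simply run:
sudo tailscale up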
Option B — SSH port‑forward (ad‑hoc, simple)
# Forward local port 11434 to the Ollama host over SSH
ssh -N -L 11434:127.0.0.1:11434 user@llm-box
Now point apps to http://127.0.0.1:11434 on your laptop; traffic is encrypted over SSH.
Option C — Reverse proxy with mTLS (inside LAN/VPN only)
Use Caddy to terminate TLS and require client certificates before proxying to your model server. See client certificate auth and reverse_proxy docs.
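A Caddyfile along these lines requires a client certificate before proxying to a model server on port 11434. The hostname, CA path, and upstream are placeholders, and client_auth options vary slightly across Caddy versions, so verify against the current tls and reverse_proxy docs.

# Write a minimal mTLS reverse-proxy config (placeholder hostname, CA, and upstream).
sudo tee /etc/caddy/Caddyfile >/dev/null <<'EOF'
llm.home.internal {
    tls internal {
        client_auth {
            mode require_and_verify
            trusted_ca_cert_file /etc/caddy/client-ca.crt
        }
    }
    reverse_proxy 127.0.0.1:11434
}
EOF
sudo systemctl reload caddy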
Serving your local model to clients
Ollama
- API endpoints: generation and chat APIs documented at Ollama API.
- Remote access: set OLLAMA_HOST=0.0.0.0:11434 on the server and connect over your LAN/VPN; secure it with a firewall/VPN and never expose it publicly. See the Ollama docs; a minimal sketch follows this list.
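The sketch below assumes Ollama is installed on the server and a model has already been pulled; the model name and hostname are placeholders.

# On the server: bind Ollama to all interfaces (reachable only over LAN/VPN).
# Systemd installs usually set this via an Environment= override instead.
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# From a laptop on the same LAN/VPN:
curl http://llm-box:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why keep inference local?",
  "stream": false
}'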
vLLM (OpenAI‑compatible)
- Run the OpenAI‑compatible server: see vLLM serving and the sketch after this list. Clients can point their OpenAI SDK at your vLLM host URL.
- Under the hood, vLLM’s PagedAttention improves KV‑cache memory efficiency and throughput.
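A serving sketch, assuming vLLM is installed on a machine with a suitable GPU; the model ID, hostname, and port are placeholders, and flags should be checked against the current vLLM docs.

# Start the OpenAI-compatible server (placeholder model ID and port).
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

# From a client on the LAN/VPN; any OpenAI SDK can use this base URL.
curl http://llm-box:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello from my laptop"}]
  }'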
Quick checklist
- Target 24 GB VRAM if budget allows; 12–16 GB works well for 7B–13B with quantization.
- 64–128 GB RAM for comfort; more if you run RAG, embeddings, or multiple services.
- 2–4 TB NVMe for models; keep spare M.2 slots for growth.
- Choose platforms with enough PCIe lanes if you may add GPUs, NICs, or NVMe risers later.
- Size PSU and cooling with headroom; verify GPU physical fit and connectors.
- Expose model endpoints only on LAN/VPN; prefer WireGuard/Tailscale, SSH tunnels, or mTLS proxies.