// GUIDE
Local LLM Hardware Guide
A practical guide to the hardware side of running large language models locally. What GPU for what model size, how much VRAM you actually need, when Apple Silicon makes sense, and how the cost math compares to cloud APIs. Written by the people who spec these deployments for a living — no vendor kickbacks, no sponsored parts.
The single most important question: what VRAM do I need?
VRAM is the binding constraint for running a local LLM. Everything else — system RAM, CPU, storage — matters, but VRAM is what determines whether a given model can run at all. Rough rule:
- 7B models: ~6GB at 4-bit (an 8GB card works; 12GB is comfortable)
- 13B: ~10GB at 4-bit (a 16GB card, or a tight fit on 12GB)
- 30B–34B: ~20GB at 4-bit (a 24GB card)
- 70B: ~40–48GB at 4-bit (two 24GB cards or one 80GB card); ~140GB unquantized at FP16
These are rough minimums for inference with reasonable context windows (8K–16K tokens). For longer contexts, add 2–4GB per additional 8K tokens depending on the model. For fine-tuning, budget 3–5× these numbers.
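To put the rule in code: a back-of-envelope estimator, assuming the ~2GB-per-8K-tokens KV-cache allowance above plus a ~20% runtime overhead for activations and buffers (that overhead factor is our assumption, not a measured constant):

```python
def vram_estimate_gb(params_b: float, bits: int, ctx_tokens: int = 8192) -> float:
    """Rough VRAM estimate in GB for inference.

    params_b: parameter count in billions; bits: weight precision (4, 8, 16).
    Assumes ~2GB of KV cache per 8K tokens and ~20% runtime overhead.
    """
    weights_gb = params_b * bits / 8       # 1B params = 1GB at 8-bit
    kv_gb = 2.0 * ctx_tokens / 8192        # KV-cache allowance from the rule above
    return weights_gb * 1.2 + kv_gb        # assumed 20% overhead

print(round(vram_estimate_gb(13, 4), 1))   # 13B at 4-bit, 8K context -> ~10GB
print(round(vram_estimate_gb(70, 16), 1))  # 70B at FP16 -> ~170GB with overhead
```

Treat the output as a sizing sanity check, not a guarantee — actual usage varies by model architecture and inference engine.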
GPU selection by budget
Under $2,000 — RTX 4090 (24GB)
The best single-GPU value in 2026 for LLM inference. Handles 7B–13B models at 4-bit comfortably, runs 30B models with tight context windows, offers real production throughput. The 4090 Super rumored for late 2026 may change the calculus; for now, the 4090 is the default recommendation for single-GPU workstation deployments.
$5K–$15K — multi-4090 or used A100 (40GB)
Two or four 4090s in a workstation chassis give you 48–96GB of aggregate VRAM at a meaningfully lower cost than a single A100. Catch: consumer GPUs aren't designed for 24/7 data-center operation, and model parallelism across them requires a tensor-parallel-capable inference engine (vLLM, TensorRT-LLM). Works well; requires more engineering than a single bigger card.
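What "tensor-parallel-capable" means in practice: vLLM shards each layer across GPUs behind a single flag. A sketch of a two-GPU launch (the model ID is illustrative, and flag names can shift between vLLM releases — check your version's docs):

```shell
# Serve a 13B model sharded across two GPUs with vLLM's OpenAI-compatible server.
# --tensor-parallel-size splits each layer's weight matrices across the GPUs,
# so aggregate VRAM (2 x 24GB) bounds model size, not a single card.
vllm serve meta-llama/Llama-2-13b-hf \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```
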
Used A100 40GB cards show up on secondary markets; if you can find one with a warranty, it's often the simplest path to 40GB of VRAM on a single card.
$15K–$40K per card — A100 80GB, H100, H200
Enterprise GPUs with ECC memory, data-center thermals, NVLink for multi-GPU scaling. The right answer for serious production load: hundreds of concurrent users, 70B models, fine-tuning. H100 and H200 are meaningfully faster than A100 for inference; H200 gives you 141GB VRAM on a single card. If your workload justifies it, these are worth the money. If it doesn't, don't.
Apple Silicon — M-series Ultra
Mac Studio M2 Ultra (192GB unified memory) and M3 Ultra run 70B models at respectable speeds, and the unified-memory architecture means there's no CPU-to-GPU transfer tax. Great for Mac-first organizations, solo developers, and workloads that don't require extreme concurrency. Slower than a single 4090 on small models, substantially better than a 4090 on large ones (where the 4090 simply can't hold the weights). M4 Ultra should land in 2026 and extend this lead.
System RAM, CPU, and storage
Everything else matters less than VRAM, but "less" doesn't mean "not at all."
- System RAM: at least 2× your VRAM if you plan to load/unload models frequently. 64GB is a comfortable minimum for a 24GB-VRAM workstation; 128GB for a 48GB setup; 256GB+ for enterprise deployments.
- CPU: mostly irrelevant for inference throughput on GPU, but matters for data preprocessing and for CPU-only fallback. A modest Ryzen or Xeon with plenty of PCIe lanes is fine.
- Storage: NVMe SSD. A 70B model is 40–140GB of weights on disk depending on quantization; loading from HDD or SATA SSD is painful. Plan for 2–4TB of fast storage for a modest model zoo.
- Network: 10GbE is nice for multi-node setups. 1GbE is fine if inference traffic is all local.
- Power and cooling: actually think about this. A dual-4090 workstation pulls 900W+ under load. Commercial office PDUs often tap out at 1800W on a single circuit.
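That circuit ceiling deserves a quick sanity check. A minimal sketch, assuming a standard 120V/15A branch circuit and the common 80% continuous-load derating (your electrical code and region may differ):

```python
def circuit_headroom_w(volts: float = 120, amps: float = 15,
                       derate: float = 0.8) -> float:
    """Usable continuous wattage on one branch circuit.

    derate=0.8 reflects the common rule that continuous loads should stay
    under 80% of a circuit's rated capacity.
    """
    return volts * amps * derate

# A 120V/15A circuit is rated 1800W but only ~1440W continuous --
# a dual-4090 rig at 900W+ leaves little room for anything else on it.
print(circuit_headroom_w())
```
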
Inference engines: what runs on what
- Ollama. Default recommendation for simplicity. Runs on macOS, Linux, Windows. Good enough for most single-node deployments. Not optimal for maximum throughput on enterprise GPUs.
- vLLM. Production-grade inference server. Supports continuous batching, tensor parallelism, and PagedAttention. Best throughput on NVIDIA hardware. Default for multi-GPU and concurrent-user deployments.
- llama.cpp. CPU-capable, quantization-rich. Good for edge devices, CPU fallback, and experimentation. Underlies many higher-level tools.
- TensorRT-LLM. NVIDIA's optimized engine. Highest throughput and lowest latency on Ampere/Hopper GPUs. Worth the setup complexity for enterprise loads.
- MLX (Apple). Apple's ML framework. Best performance on Apple Silicon; integrates with Ollama and llama.cpp.
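For a feel of the integration surface, here's a minimal non-streaming call to a local Ollama server's /api/generate endpoint using only the standard library (the model name and default port 11434 are assumptions about your install; adjust as needed):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt: str, model: str = "llama3") -> bytes:
    """Request body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a locally running Ollama server and return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The same pattern works against vLLM's server by switching to its OpenAI-compatible endpoint and request schema.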
The cloud-vs-local math, honestly
A single RTX 4090 at ~$1,800 runs inference on 7B–13B models indefinitely. Equivalent cloud API spend at 50K queries/day (roughly 5K tokens per query at 70B-equivalent quality) is in the $15K–$40K/month range. Break-even is measured in weeks, not years.
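The break-even claim is one division. A sketch using the conservative end of the figures above:

```python
def breakeven_days(hardware_cost_usd: float, monthly_api_usd: float) -> float:
    """Days until a one-time hardware purchase beats an ongoing API bill."""
    return hardware_cost_usd / (monthly_api_usd / 30)  # assumes a 30-day month

# $1,800 RTX 4090 vs. the low end of the $15K-$40K/month API range:
print(round(breakeven_days(1800, 15_000), 1))  # -> 3.6 days
```

Even padding the hardware side with a full workstation build and ops time, the payback period stays well inside a quarter.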
The math changes if: your load is very small (cloud wins), your load is extremely spiky (cloud may win during the spikes), you have no in-house ops capacity (cloud is less operationally demanding), or you genuinely need frontier model capability that no open-weight model currently provides (cloud is your only option). Otherwise, local wins on cost, compliance, and control.
Where we come in
We spec hardware and deployment architecture as part of the free AI Readiness Assessment or Tier 02 Deep Discovery — based on your actual projected load, latency targets, model requirements, and compliance posture. No affiliate links, no spec-sheet theater. See also: local AI deployment service.
Frequently asked questions
Do I need an NVIDIA GPU, or will AMD work?
NVIDIA is the default because the tooling (CUDA, cuDNN, TensorRT) is substantially more mature. AMD ROCm has improved and works for inference on most open models via llama.cpp or vLLM, but you'll hit rougher edges. For production, NVIDIA is the lower-friction choice. For experimentation or lab work where cost matters more than convenience, AMD is viable.
Is Apple Silicon really competitive with NVIDIA for LLMs?
For large models on a single machine, yes — sometimes. The unified-memory architecture means a Mac Studio with 192GB can hold a 70B model entirely in "VRAM" where even an H100 (80GB) can't. For small models and high concurrency, NVIDIA wins on raw throughput. The honest answer: different tools for different workloads.
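The arithmetic behind that claim: at FP16, every parameter takes 2 bytes, so every billion parameters takes about 2GB — a property of the number format, not a benchmark:

```python
def fp16_weights_gb(params_b: float) -> float:
    """Weight memory at FP16: 2 bytes per parameter, params in billions."""
    return params_b * 2

print(fp16_weights_gb(70))  # -> 140 GB: over an H100's 80GB, under a 192GB Mac Studio
```
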
What's the cheapest way to get started?
For experimentation: a Mac Studio M2 Ultra or a single RTX 4090 in a workstation. For production: depends on your load. A single 4090 handles real tier-1 support automation volume for mid-size businesses. We size hardware in discovery — over-speccing is a common early mistake.
How much does the power cost matter?
Less than most people assume. A dual-4090 rig averaging 900W for 12 hours a day at commercial rates costs roughly $50/month in electricity. Trivial compared to the cloud API bill it replaces. Cooling and circuit capacity matter more than the electricity itself.
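The electricity figure checks out with basic arithmetic (the $0.15/kWh rate is an assumed commercial average; substitute your own):

```python
def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float = 0.15) -> float:
    """Monthly electricity cost for a rig at a given average draw."""
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

print(round(monthly_power_cost(900, 12), 2))  # 324 kWh/month -> ~$48.60
```
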
Should I buy new H100s or wait for Blackwell (B100/B200)?
If you have a production workload today, buy what ships today. Blackwell is materially faster when it's available in volume, but "wait for the next generation" is a game you can play forever and never deploy. For most enterprise workloads, a used A100 or a current H100 is plenty; Blackwell is relevant mainly if you're training frontier models or running extreme inference concurrency.
Ready to start?
Book a free 30-minute AI Readiness Assessment. No pitch deck. No retainer ask. Just a working session to map your stack and surface the two or three highest-ROI AI interventions for your situation.