LLMs & AI Infrastructure
Learn how LLMs work — from neural networks and transformers to quantization, inference engines, and GPU hardware for self-hosted AI.
Introduction — Why Run Your Own AI?
The Cloud AI Problem
Every time you ask ChatGPT a question, type a prompt into Claude, or use Google's Gemini, your words travel across the internet to a datacenter you do not control. Someone else's servers process your thoughts. Someone else's policies decide what you can and cannot ask.
This is convenient. It is also a compromise.
Three problems define the cloud AI experience: privacy, cost, and freedom.
The Privacy Problem
When you send a message to a cloud AI, you are sharing that data with the provider. Your business plans, personal conversations, code, medical questions — all of it passes through servers owned by OpenAI, Google, or Anthropic. Even with privacy policies, your data exists on infrastructure you cannot audit.
For individuals, this might be acceptable. For businesses handling sensitive data — medical records, legal documents, proprietary code — it can be a dealbreaker.
The Cost Problem
Cloud AI is not free, and it adds up fast.
| Service | Monthly Cost | Annual Cost |
|---|---|---|
| ChatGPT Plus | $20/month | $240/year |
| Claude Pro | $20/month | $240/year |
| GPT-4o API (moderate) | $50-200/month | $600-2,400/year |
| Enterprise API (heavy) | $500-5,000/month | $6,000-60,000/year |
Subscription prices as of March 2026. API cost ranges are estimates based on moderate to heavy usage patterns. OpenAI ChatGPT pricing, Anthropic Claude pricing. Last verified 2026-03-06.
Three Tiers of AI Cost
When comparing cloud vs self-hosted AI, keep these three cost tiers distinct:
- Per-developer API token spend — what one person pays for raw API calls (the table above and the breakdown below).
- Team token spend — per-developer cost × headcount. A 50-developer team at the “Heavy” tier = ~$7,500/month in tokens alone.
- Full program TCO — tokens + seat licenses ($20–$125/user/month) + platform tooling + governance overhead. Typically 2–5× the token spend. Section 11 breaks this down by team size.
API costs depend on how many tokens you process. Tokens are the units LLMs work with — roughly 1.3 tokens per English word, or about 0.75 words per token. Code is more token-dense than prose. Section 4 explains tokenization in depth; for now, think of tokens as “word-pieces” that determine both what the model processes and what you pay. The formula:
monthly_cost = (tokens_per_day × 30 / 1,000,000) × cost_per_1M_tokens

The blended rate: API providers charge differently for input and output tokens. GPT-4o costs $2.50/1M input and $10/1M output. At a typical 2:1 input:output ratio, the blended rate is ~$5/1M tokens (0.67 × $2.50 + 0.33 × $10 ≈ $5.00). More expensive models raise the blended rate: Claude Opus at $5/$25 blends to ~$12/1M at the same ratio. Output-heavy coding workflows shift toward 1:1, raising GPT-4o’s blended rate to ~$6-7/1M. This lesson uses the ~$5/1M rate consistently, including the enterprise crossover analysis in Section 11.[17][18]
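A minimal sketch of this arithmetic in Python, using the GPT-4o list prices quoted above (the function names are illustrative):

```python
def blended_rate(input_rate, output_rate, input_share=2/3):
    """Blend per-1M-token input/output rates at a given input:output ratio."""
    return input_share * input_rate + (1 - input_share) * output_rate

def monthly_cost(tokens_per_day, rate_per_1m):
    """monthly_cost = (tokens_per_day × 30 / 1M) × cost_per_1M_tokens"""
    return tokens_per_day * 30 / 1_000_000 * rate_per_1m

# GPT-4o list prices from above: $2.50/1M input, $10/1M output, 2:1 ratio
gpt4o = blended_rate(2.50, 10.00)
print(f"blended: ${gpt4o:.2f}/1M")                      # blended: $5.00/1M
print(f"heavy dev: ${monthly_cost(1_000_000, gpt4o):.0f}/mo")  # heavy dev: $150/mo
```

Swapping in Claude Opus rates (`blended_rate(5.00, 25.00)`) reproduces the ~$12/1M figure.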
Individual API Token Spend (per developer)
| What You’re Doing | Tokens/Day | Daily | Monthly |
|---|---|---|---|
| Light (Q&A, docs) | ~100K | ~$0.50 | ~$15 |
| Moderate (active coding) | ~500K | ~$2.50 | ~$75 |
| Heavy (agentic workflows) | ~1M | ~$5 | ~$150 |
| Power (multi-agent) | ~5M | ~$25 | ~$750 |
All figures use the ~$5/1M blended rate (GPT-4o). For Claude Opus (~$12/1M), multiply by ~2.4. These are API token costs only — not full program operating costs, which include seat licenses, platform tooling, and governance overhead. Pricing as of March 2026; API rates change frequently.
Cost Sensitivity — What If Blended Rates Change?
The ~$5/1M rate above assumes GPT-4o. Different models, output-heavy workflows, or premium tiers shift the blended rate significantly. Here is what the same usage tiers cost at three representative rates:
| Usage Tier | Tokens/Day | At $5/1M | At $10/1M | At $15/1M |
|---|---|---|---|---|
| Light | 100K | $15/mo | $30/mo | $45/mo |
| Moderate | 500K | $75/mo | $150/mo | $225/mo |
| Heavy | 1M | $150/mo | $300/mo | $450/mo |
| Power | 5M | $750/mo | $1,500/mo | $2,250/mo |
| 50 Heavy devs | 50M | $7,500/mo | $15,000/mo | $22,500/mo |
Calculated using monthly_cost = (tokens_per_day × 30 / 1M) × blended_rate. Rates represent GPT-4o ($5), Claude Opus ($10–12), and premium/output-heavy ($15) blends. Last verified 2026-03-06.
These Are API Token Costs Only
Every number in the tables above is raw API token spend. Full program costs — including seat licenses ($20–$125/user/month), platform tooling, and governance overhead — are typically 2–5× higher. Section 11 breaks down the full operating-cost comparison and shows exactly when self-hosting becomes cheaper.
For a single user, $20/month for a subscription feels reasonable.[16][18] But modern agentic coding tools (Claude Code, Cursor, Aider) generate far more tokens than manual chat — system prompts, tool-call overhead, multi-turn reasoning chains, and iterative code-test-fix loops push a single active developer to 1M+ tokens/day. Multiply by your team size and the costs add up quickly: 50 heavy developers would spend ~$7,500/month in API tokens alone. Section 11 analyzes exactly when self-hosting becomes the better deal.
The Freedom Problem
Cloud providers can change their models, raise prices, add content filters, or discontinue services at any time. You have no control over model availability, response quality, or censorship policies.
When OpenAI deprecated GPT-3.5-turbo, applications that depended on it broke. When providers add new safety filters, workflows that previously worked stop functioning.
Self-hosted AI eliminates all three problems. Your data never leaves your hardware. The cost is fixed (electricity + hardware amortization). And the models you download today will work identically forever — no one can change them remotely.
The Self-Hosted Alternative
Running your own AI means downloading open-weight language models and serving them on your own GPUs. The software stack looks like this:
Your Applications
Chat UI, Coding Assistant, Voice
Inference Engine (vLLM)
Serves the model via HTTP API
GPU Hardware
Consumer, Prosumer, Datacenter
Real-World Analogy: Cloud AI is like renting a car every day. Self-hosted AI is like buying your own car — higher upfront cost, but it is always in your driveway, no one reads your GPS history, and it works even when the internet is down.
What Is an LLM? — The Brain Analogy
Language Models: The Core Idea
A Large Language Model is a program that predicts the next word in a sequence. That is it. Every seemingly intelligent conversation, every piece of generated code, every creative story — all of it emerges from a system that is extraordinarily good at one task: given some text, predict what comes next.
When you type "The capital of France is", the model predicts "Paris" because it has seen millions of examples where those words are followed by "Paris." But this simple mechanism, scaled to billions of parameters and trillions of training examples, produces behavior that looks remarkably like understanding.
Neural Networks: Layers of Math That Learn
Under the hood, an LLM is a neural network — a mathematical structure loosely inspired by the brain. A neural network consists of layers of artificial "neurons," each connected to neurons in the next layer.
Input Layer
(tokens)
Hidden Layers
billions of connections
Output Layer
(50,000+ options)
Each connection between neurons has a weight — a number that controls how strongly one neuron influences the next. A model with 32 billion parameters has 32 billion of these weights. During training, each weight is adjusted slightly to make the model's predictions better.
Real-World Analogy: Think of a neural network as a massive telephone switchboard. Each connection has a dial (weight) that controls how strong the signal is. During training, the AI turns these billions of dials slightly, getting better at routing information from question to answer. During inference (when you chat), the dials are locked in place — the AI just uses what it learned.
Neurons, Weights, and Activation Functions
Each neuron performs three simple operations:
- Multiply each input by its weight
- Sum all the weighted inputs together
- Apply an activation function to decide whether to "fire"
The activation function introduces non-linearity — without it, stacking layers would be mathematically equivalent to a single layer. Individually, these operations are trivially simple. The power comes from scale — 32 billion weights organized into hundreds of layers create emergent capabilities that no individual weight explains.
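The three operations above can be sketched as a single toy neuron (the numbers are made up; real models stack billions of these):

```python
def neuron(inputs, weights, bias=0.0):
    """One artificial neuron: weighted sum of inputs, then a non-linear activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # steps 1 and 2
    return max(0.0, z)                                       # step 3: ReLU activation

# Toy example: two inputs, two weights
print(neuron([1.0, 2.0], [0.5, 0.25]))  # 1.0*0.5 + 2.0*0.25 = 1.0
print(neuron([1.0], [-1.0]))            # negative sum, so ReLU outputs 0.0
```

ReLU (`max(0, z)`) is one common activation; without it, stacked layers would collapse into a single linear transformation, as the text notes.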
Training vs Inference: Learning vs Using
These are the two fundamental phases of any AI model's life:
Training is the learning phase. The model reads enormous amounts of text (trillions of tokens) and adjusts its weights to minimize prediction errors. Training a frontier model costs millions of dollars in GPU compute.
Inference is the using phase. The weights are frozen. You send a prompt, the model processes it through its layers, and it generates a response.
| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn patterns from data | Generate responses |
| Weights | Constantly changing | Frozen (read-only) |
| Cost | $1M-$100M+ (frontier) | $0.001-$0.06 per 1K tokens |
| Duration | Weeks to months | Milliseconds to seconds |
| Hardware | Thousands of GPUs | 1-8 GPUs (for most models) |
| Who does it | AI labs (OpenAI, Meta, Alibaba) | You (on your own hardware) |
Training cost range ($1M-$100M+) and inference cost range ($0.001-$0.06/1K tokens) are industry estimates reflecting frontier model training (e.g., GPT-4, Llama 3) and major API providers. GPU counts and durations are order-of-magnitude estimates. As of early 2026.
How Training Actually Works: A Simplified Example
Imagine the model sees: "The cat sat on the ___"
- The model predicts "table" with 60% confidence
- The correct answer was "mat"
- The error (loss) is calculated: the model was wrong
- Through backpropagation, every weight in the network is adjusted slightly to make "mat" more likely next time
- This happens billions of times across trillions of training examples
After seeing enough examples, the model learns not just facts but patterns, reasoning strategies, and even coding conventions.
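A toy version of this loop, using a three-word vocabulary and hand-picked starting scores (all numbers are invented for illustration; real training adjusts billions of weights, not three logits):

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["table", "mat", "roof"]   # tiny made-up vocabulary
logits = [2.0, 1.5, 0.5]           # model's raw scores for "The cat sat on the ___"
target = vocab.index("mat")        # the correct next token

for step in range(100):
    probs = softmax(logits)
    loss = -math.log(probs[target])  # cross-entropy: -log P(correct token)
    # gradient of the loss w.r.t. each logit: prob - 1 for the target, prob otherwise
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    logits = [l - 0.5 * g for l, g in zip(logits, grads)]  # small gradient step

probs = softmax(logits)
print(vocab[probs.index(max(probs))])  # prints "mat"
```

The model initially prefers "table"; each pass nudges the score for "mat" up and the others down, exactly the predict-compare-adjust cycle described above.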
The Transformer Revolution — How Modern AI Thinks
The Problem with Older Architectures
Before 2017, the dominant architecture for processing text was the Recurrent Neural Network (RNN). These models processed text one word at a time, from left to right, carrying a "hidden state" that summarized everything they had seen so far.
This had a fatal flaw: by the time the model reached the 500th word, information about the 1st word was severely degraded. Long documents turned into mush.
Self-Attention: Looking at Everything Simultaneously
In 2017, a research paper titled "Attention Is All You Need" introduced the transformer architecture. Its key innovation was self-attention — the ability for every word in a sequence to directly attend to every other word, regardless of distance.
Traditional (RNN)
Sequential: processes one word at a time. Information about early words degrades by the end.
Long-range memory is weak
Transformer (Self-Attention)
Parallel: every word sees every other word simultaneously. No information decay.
Full context preserved
Key Takeaway: Transformers eliminated the sequential bottleneck — enabling models to scale from thousands to millions of tokens.
Key, Query, and Value: The Attention Mechanism
Self-attention works through three learned transformations of each word's representation:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information should I share?"
Real-World Analogy: Imagine you are in a library doing research. Your Query is your research question. You scan the Key (title) of every book on the shelf simultaneously. The books with matching keys get high attention scores. Then you read the Value (content) of the highest-scoring books.
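A minimal scaled dot-product attention for a single query, in plain Python (toy 2-dimensional vectors, not a real model):

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query: score, softmax, blend."""
    d = len(query)
    # 1. Score each Key against the Query (dot product, scaled by sqrt(d))
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # 2. Softmax the scores into attention weights
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # 3. Blend the Values by their weights
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key most strongly, so the output leans
# toward the first value
out = attention(query=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
print(out)
```

In a transformer, Q, K, and V are produced by learned weight matrices and this runs for every token against every other token; the three-step shape is the same.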
Multi-Head Attention: Multiple Perspectives
Multi-head attention runs multiple attention mechanisms in parallel, each with different learned weights. A model with 64 attention heads examines the text from 64 different perspectives simultaneously.
Input: "The developer fixed the bug that crashed the server"
…up to 64+ heads analyzing different relationships
Real-World Analogy: Think of reading a mystery novel. Old AI read one word at a time, forgetting earlier clues by the end. Transformers read the entire book at once, with "spotlight teams" (attention heads). One team tracks the suspect, another tracks the murder weapon, another tracks the timeline. They all share notes to solve the mystery together.
Positional Encoding: Knowing Word Order
Since self-attention processes all words simultaneously, it has no inherent sense of order. Positional encoding adds a unique mathematical signature to each position in the sequence. Modern models use Rotary Position Embeddings (RoPE), which encode relative distances between tokens rather than absolute positions.
The AI Timeline
"Attention Is All You Need"
Transformers invented at Google — the architecture that powers all modern AI.
GPT-3 (175B parameters)
The scale breakthrough — in-context learning emerges for the first time.
ChatGPT — AI Goes Mainstream
Transformers meet 100 million users in 2 months.¹ The world changes.
The Open-Source Explosion
Llama, Mistral, Qwen release competitive models. MoE scales to 685B (DeepSeek V3.2) on consumer budgets.
Reasoning Models Change the Game
DeepSeek-R1, Qwen-QwQ, and Llama 4 launch with MoE architecture. Chain-of-thought reasoning emerges.
Specialists Beat Generalists
Open-weight models close the gap with proprietary (80.2% vs 80.9% on SWE-bench).² Model selection shifts from "biggest" to "best specialist for the task." Agentic coding agents go mainstream.
Tokens, Context Windows, and the KV Cache
What Are Tokens?
LLMs do not process words. They process tokens — subword pieces that balance vocabulary size with representational efficiency.
Input: "Hello, world!"
Punctuation splits into its own token
Input: "def fibonacci(n):"
Long words are split into subword pieces
Input: "Supercalifragilistic"
Rare words get broken into known fragments
Input: "192.168.1.20"
Each separator becomes a token
A typical English word averages about 1.3 tokens. Code tends to be more token-dense than prose.
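The rule of thumb can be sketched as a character-count heuristic (a rough approximation only; real BPE tokenizers give exact counts):

```python
import math

def estimate_tokens(text, is_code=False):
    """Rough token estimate: ~4 chars/token for prose, ~3 for code.
    A heuristic only -- real tokenizers (BPE) produce exact counts."""
    chars_per_token = 3 if is_code else 4
    return math.ceil(len(text) / chars_per_token)

prose = "The quick brown fox jumps over the lazy dog."  # 44 characters
print(estimate_tokens(prose))                # ceil(44/4) = 11
print(estimate_tokens(prose, is_code=True))  # ceil(44/3) = 15
```

The same text counted as code yields more tokens, matching the token-density note above.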
Token Calculator
How to use: Type or paste any text below to see how many tokens it would consume, or pick a preset to start with a realistic example.
Code is more token-dense than prose (÷3 vs ÷4). Switch the content type selector to see the difference.
Results
Characters: 115
Words: 19
Tokens ≈ 29
tokens ≈ ceil(characters / 4)
Context Windows: How Much the AI Remembers
The context window is the total amount of text the model can process in a single interaction — your input prompt plus the model's response. Everything outside the context window simply does not exist to the model.
| Model | Context Window | ~Word Count | Use Case |
|---|---|---|---|
| GPT-3 (2020) | 4,096 tokens | ~3,000 | Short conversations |
| GPT-4o | 128,000 tokens | ~96,000 | Documents, codebases |
| Llama 3.3 70B | 131,072 tokens | ~98,000 | Documents, entire codebases |
| Claude Opus 4 | 200,000 tokens (1M beta) | ~150,000 | Large codebases, long docs |
| Qwen2.5-Coder-32B | 131,072 tokens | ~98,000 | Entire codebases |
| Qwen3-Coder (Next) | 262,144 tokens | ~196,000 | Massive repositories |
| GPT-4.1 | 1,048,576 tokens | ~786,000 | Books, full repositories |
| Gemini 2.5 Pro | 1,000,000 tokens | ~750,000 | Books, video transcripts |
GPT-3 context is historical. Context windows are model-card-stated capabilities and may differ from effective quality at maximum length. Rankings are point-in-time as of March 2026 — context limits evolve rapidly. OpenAI GPT-4o model card, Anthropic Claude models, Meta Llama 3.3 model card, Qwen2.5-Coder model card, Google Gemini docs. Last verified 2026-03-06.
Real-World Analogy: The context window is like a desk. A 4K context window is a school desk — you can fit a few pages. A 128K context window is a conference table — you can spread out entire codebases. A 1M context window is a warehouse floor.
Context Budget Calculator
How to use: Pick a preset below to see a realistic context breakdown. Then adjust the numbers to match your own workflow — the calculator updates instantly.
If headroom turns red, your content exceeds the model’s context window. Either reduce retrieval tokens or choose a model with a larger window.
Total used: 6,700 / 131,072 (5%)
Headroom: 124,372
Fits
headroom = contextWindow - (system + prompt + retrieval + reserve)
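The headroom formula as a function; the per-component split below is a hypothetical breakdown that totals the 6,700 tokens in the example:

```python
def context_headroom(context_window, system, prompt, retrieval, reserve):
    """headroom = contextWindow - (system + prompt + retrieval + reserve)"""
    used = system + prompt + retrieval + reserve
    return context_window - used

# Hypothetical split totaling the 6,700 tokens shown above, against a
# 131,072-token window (Llama 3.3 / Qwen2.5-Coder class)
headroom = context_headroom(131_072, system=1_500, prompt=1_200,
                            retrieval=2_000, reserve=2_000)
print(headroom)        # 124372
print(headroom >= 0)   # True -> the content fits
```

A negative result is the "red headroom" case: reduce retrieval tokens or pick a model with a larger window.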
The KV Cache: Memory for Attention
Sticky Notes Analogy
Imagine reading a long document and placing a sticky note on every page you've read, summarizing what matters. When you need to refer back, you check the sticky notes instead of re-reading every page. That's the KV cache — it stores “summaries” (Key and Value vectors) for every token the model has already processed, so it never needs to re-read from scratch. The catch: those sticky notes take up desk space (GPU memory), and the more pages you've read, the more desk space you need.
When the model generates text, it needs to compute attention between the new token and every previous token. The Key-Value (KV) cache stores the Key and Value matrices for all previous tokens. When generating the next token, the model only needs to compute the new token's Query and compare it against cached Keys — no re-computation needed.
Without KV Cache
Token 1: process 1 token
Token 2: re-process 1+2 = 3 ops
Token 3: re-process 1+2+3 = 6 ops
…
Token 1000: 500,500 ops
O(n²) — catastrophically slow
With KV Cache
Token 1: process 1, cache K₁V₁
Token 2: process 1, lookup = 2 ops
Token 3: process 1, lookup = 2 ops
…
Token 1000: 2 ops
O(n) — linear, fast
Coding Workflow Example
When you load a repository into a coding assistant (e.g., 50,000 tokens of source files), the KV cache stores attention data for every token. On a 32B GQA model (8 KV heads, 128 head dim, 64 layers, FP16), the formula is: cache_bytes = 2 (K and V) × layers × kv_heads × head_dim × seq_len × 2 (bytes per FP16 value). At 50K tokens that’s roughly 6–13GB of KV cache VRAM (depending on GQA group count and precision). Load a full 128K-token context and the cache can exceed 15GB. This is why your GPU might fit the model weights (18GB AWQ) but run out of memory when you open a large codebase — the KV cache is the invisible cost on top of the weights. Section 9 covers VRAM budgeting in detail.
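The formula from the example, as a function (assumes FP16 KV values at 2 bytes each; lower-precision caches scale down proportionally):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """cache_bytes = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
    return cache_bytes / 1024**3  # bytes -> GB (binary)

# The 32B GQA model from the example: 64 layers, 8 KV heads, 128 head dim, FP16
print(f"{kv_cache_gb(64, 8, 128, 50_000):.1f} GB")   # ~12 GB at 50K tokens
print(f"{kv_cache_gb(64, 8, 128, 131_072):.1f} GB")  # grows linearly with context
```

Note that the cache grows linearly with sequence length, which is exactly why opening a larger codebase can exhaust VRAM that comfortably held the weights.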
Temperature and Sampling: Controlling Creativity
Temperature scales the probability distribution:
- Temperature 0.0: Always picks the highest-probability token (deterministic)
- Temperature 0.7: Moderate randomness (good default for most tasks)
- Temperature 1.5: High randomness (creative, sometimes incoherent)
Prompt: "The programmer wrote a function to…"
→ "calculate the sum of two numbers"
→ "parse JSON data from the API response"
→ "transcend the boundaries of recursion"
For coding tasks, low temperature (0.1-0.3) produces more reliable, deterministic output. For creative writing, higher temperature (0.7-1.0) adds variety.
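A minimal sampler showing how temperature reshapes the choice (the logit values are made up):

```python
import math
import random

def sample_token(logits, temperature=0.7):
    """Pick the next token index after temperature-scaling the logits."""
    if temperature == 0:
        # Greedy: always the highest-probability token (deterministic)
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [3.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print(sample_token(logits, temperature=0))    # always index 0
print(sample_token(logits, temperature=1.5))  # occasionally a lower-ranked token
```

Dividing by a small temperature sharpens the distribution toward the top token; dividing by a large one flattens it, which is where the "sometimes incoherent" behavior comes from.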
Temperature in Practice: Coding Examples
- Low temperature: Code generation, refactoring, type definitions — you want the most likely correct answer every time. A deterministic sort function should not vary between runs.
- Moderate temperature: Code review, commit messages, documentation — some variety in phrasing is fine, but the content should stay accurate and focused.
- High temperature: Brainstorming, naming, creative exploration — higher randomness helps generate diverse alternatives for variable names, project ideas, or architectural approaches.
Where to set temperature: In API calls, pass temperature: 0.2 as a parameter. In chat UIs (Open WebUI, LM Studio, etc.), look for a settings gear or “Model Parameters” panel — the temperature slider is usually there.
Parameters, Model Sizes, and the MoE Revolution
What Are Parameters?
Parameters are the learnable weights in a neural network — the billions of “dials” we discussed in Section 2. When someone says “Qwen2.5-Coder-32B,” the “32B” means 32 billion parameters.
More parameters means more capacity to store knowledge and patterns. But it also means more VRAM required, slower inference, and diminishing returns at extreme scale.
Parameter Count vs Intelligence
Here is the counterintuitive truth: more parameters does not always mean smarter.
A 32B coding specialist might score 92.7% on HumanEval (a coding benchmark). A model with far more parameters, like GPT-4o, scores roughly the same on coding tasks. How can a smaller model match a larger one? Specialization.
| Factor | Effect on Quality |
|---|---|
| Training data quality | More impactful than model size |
| Data specialization | Coding models trained on trillions of code tokens |
| Architecture optimization | Better attention patterns, efficient heads |
| Training recipe | Learning rate schedules, curriculum ordering |
| Parameter count | Important but not decisive |
A 32B model trained exclusively on high-quality code outperforms a 70B model trained on general internet text — at coding tasks. The specialist beats the generalist at its specialty.
Dense vs Mixture of Experts (MoE)
In a dense model, every parameter is active for every token. A 70B dense model processes every token through all 70 billion parameters.
In a Mixture of Experts (MoE) model, only a fraction of parameters are active per token. A router network decides which “experts” (subsets of the model) handle each token.
Dense Model (70B)
All 70B parameters active for every token
100% active — maximum VRAM, slower
MoE Model (Mixtral 8×7B)
Router picks 2 of 8 experts per token
~28% active — less compute per token, faster
MoE achieves the knowledge capacity of a large model with the compute cost of a smaller one. Mixtral 8x7B has 47B total parameters but only activates 13B per token, giving it the speed of a 13B model with the knowledge of a much larger one.
The MoE Tradeoff
| Aspect | Dense | MoE |
|---|---|---|
| Total parameters | All active | Many, but most idle |
| Active parameters | 100% | 10-30% per token |
| VRAM (weights) | Full model size | Full model size (all experts loaded) |
| VRAM (compute) | Full model | Only active experts |
| Inference speed | Proportional to total params | Proportional to active params |
| Knowledge capacity | Limited by total params | Larger (distributed across experts) |
| Best example | Llama 3.3 70B | DeepSeek V3.2 (685B total, ~37B active) |
Scaling Laws and Diminishing Returns
Research by OpenAI and others has established scaling laws: predictable relationships between model size, training data, compute budget, and performance.
The key finding: doubling model size does not double performance. Each doubling yields progressively smaller improvements. Going from 7B to 13B is a massive jump. Going from 70B to 140B is a modest one.
This is why the industry is shifting focus from “make it bigger” to “make it smarter”:
- Better training data (quality over quantity)
- Specialized training (code, math, reasoning)
- Architectural innovations (MoE, efficient attention)
- Post-training optimization (RLHF, DPO, reasoning chains)
Quantization Deep Dive — Shrinking Models Without Losing Intelligence
What Is Quantization?
Neural network weights are numbers. In full precision, each weight is stored as a 32-bit floating-point number (FP32), using 4 bytes of storage. Quantization reduces this precision — storing weights in fewer bits — to shrink the model and speed up inference.
A 32B parameter model in FP32 requires:
32,000,000,000 parameters × 4 bytes = 128 GB

That does not fit on any single consumer GPU. Quantization is what makes large models accessible.
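The same arithmetic for the whole precision ladder (weights only; real quantized checkpoints carry extra scale metadata, which is why an AWQ 32B file is ~18 GB rather than 16 GB):

```python
def model_size_gb(params_billions, bits):
    """Weight storage in decimal GB: params x bits / 8 bytes."""
    return params_billions * 1e9 * bits / 8 / 1e9

# A 32B-parameter model at each precision level
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(32, bits):.0f} GB")
# FP32: 128 GB, FP16: 64 GB, INT8: 32 GB, INT4: 16 GB
```

This is weights only; as the KV cache section showed, the runtime memory footprint adds several more GB on top.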
VRAM Estimator
How to use: Pick a preset below or enter your own model size and quantization level. The GPU cards show whether the model fits on common hardware.
If a GPU card turns red, the model won’t fit — try a lower quantization (fewer bits) or a larger GPU. Tensor parallel splits the model across multiple GPUs.
Model weights: 16.00 GB · Total VRAM: 19.60 GB · Per-GPU VRAM: 19.60 GB
RTX 4090 (24 GB) ✓ · RTX 5090 (32 GB) ✓ · H100 (80 GB) ✓
weightsGB = (params × bits) / 8 | totalGB = weights × overhead + kvCache | perGPU = total / TP
The Precision Ladder
Bar width = relative model size for a 32B parameter model
AWQ: Activation-Aware Weight Quantization
AWQ is a state-of-the-art 4-bit quantization method developed by MIT researchers. It works by observing which weights matter most during actual model inference (the “activation-aware” part) and protecting those weights from aggressive quantization.
How AWQ works:
- Run the model on calibration data (real prompts and responses)
- Identify which weights produce the largest activations (most important weights)
- Scale important weights up before quantization (protecting them from precision loss)
- Quantize all weights to 4-bit
- During inference, the scaling is reversed mathematically
The result: 75% reduction in model size with minimal quality degradation.
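A toy illustration of the scaling trick with one shared 4-bit quantization step (the numbers are invented; real AWQ searches per-channel scales against calibration activations and folds the inverse scale into the neighboring layer):

```python
def quantize(ws, step):
    """Round each weight to the nearest multiple of the shared quantization step."""
    return [round(w / step) * step for w in ws]

# One group of weights sharing a single step (max-abs / 7 for symmetric int4)
w = [0.8, -0.5, 0.15, 0.3]
step = max(abs(v) for v in w) / 7

x = 10.0  # a large activation flows through w[2], making it "salient"
naive_err = abs(quantize(w, step)[2] - w[2]) * x

# AWQ's trick: scale the salient weight up before quantizing, divide back after
s = 4.0
w_scaled = list(w)
w_scaled[2] *= s
step_s = max(abs(v) for v in w_scaled) / 7
awq_err = abs(quantize(w_scaled, step_s)[2] / s - w[2]) * x

print(naive_err, awq_err)  # the scaled version loses far less output accuracy
```

Scaling the salient weight up means its rounding error is divided back down by the same factor, so the output error on the channel that matters shrinks, which is the "protect important weights" idea in miniature.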
AWQ Quality Preservation
| Benchmark | FP16 (Baseline) | AWQ (4-bit) | Degradation |
|---|---|---|---|
| HumanEval (coding) | 92.7% | 92.7% | 0.0% |
| MBPP (coding) | 90.2% | 89.8% | 0.4% |
| Perplexity | 5.12 | 5.39 | 5.3% |
| Model Size | ~64 GB | ~18 GB | -72% |
Benchmark data for Qwen2.5-Coder-32B-Instruct. Qwen2.5-Coder-32B-Instruct-AWQ model card. Last verified 2026-03-06.
Zero degradation on HumanEval. The coding benchmark scores are identical because AWQ preserves the weights that matter for code generation.
Quantization Format Comparison
| Format | Bits | Quality | Speed | GPU Support | Best For |
|---|---|---|---|---|---|
| AWQ | 4-bit | Excellent | Very Fast (Marlin) | NVIDIA only | Production serving via vLLM |
| GPTQ | 4-bit | Good | Fast | NVIDIA only | Alternative to AWQ, broad support |
| GGUF | 2-8 bit | Variable | Medium | CPU + GPU | llama.cpp, Ollama, Apple Silicon |
| EXL2 | Variable | Excellent | Very Fast | NVIDIA only | ExLlamaV2, flexible bit allocation |
| BNB (NF4) | 4-bit | Good | Medium | NVIDIA | QLoRA fine-tuning |
When to use which:
- AWQ + vLLM: Best choice for NVIDIA GPU production serving. This is what most dual-GPU setups should use.
- GGUF + llama.cpp: Best for Apple Silicon (M-series Macs) or CPU-only servers. Also good for mixed CPU+GPU inference.
- EXL2: Best for single-GPU setups with ExLlamaV2. Offers per-layer bit allocation for maximum quality at a target size.
- GPTQ: Legacy format, still widely available. Use when AWQ is not available for your model.
Marlin Kernel Acceleration
Marlin is an NVIDIA-optimized CUDA kernel specifically designed for 4-bit quantized inference. It is not a quantization method — it is a speed optimization for already-quantized models.
Standard 4-bit inference: the GPU must unpack 4-bit weights to 16-bit, compute, then handle the results. Marlin performs the computation directly on 4-bit data using specialized CUDA instructions, eliminating the unpack step.
| Metric | Standard 4-bit | Marlin 4-bit | Improvement |
|---|---|---|---|
| Throughput (tokens/sec) | 68 | 741 | 10.9x faster |
| Latency per token | 14.7ms | 1.35ms | 10.9x faster |
| Quality | Baseline | Identical | 0% change |
Throughput measured on A100 80GB with Llama-class models. Exact numbers depend on model architecture, batch size, and GPU. IST-DASLab Marlin GitHub. Last verified 2026-03-06.
Critical point: Marlin changes speed, not quality. The model produces identical outputs whether using Marlin or standard kernels. It is purely a computational optimization.
vLLM automatically uses Marlin kernels when serving AWQ-quantized models on supported NVIDIA GPUs.
Inference Engines — The Software That Makes It Fast
What Is an Inference Engine?
You have the model weights. You have the GPU. But you still need software to bring them together — that’s the inference engine. It loads the model, accepts your prompts, and returns responses. Different engines suit different needs: some are dead simple for personal use, others handle dozens of concurrent users in production.
An inference engine is the software that loads a model into GPU memory (or CPU memory) and serves it via an API. Think of it as the “runtime” — just as Python is the runtime for Python scripts, an inference engine is the runtime for language models. Several engines exist, each optimized for different hardware and deployment scenarios.
The inference engine handles:
- Loading model weights into GPU VRAM
- Processing incoming prompts (tokenization, attention computation)
- Managing the KV cache across multiple concurrent requests
- Returning generated text via an API endpoint
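A request to such an engine, sketched as the OpenAI-compatible chat-completion payload that vLLM, SGLang, and llama.cpp's server all accept (the model name and endpoint below are placeholders for your own deployment):

```python
import json

# The OpenAI-compatible /v1/chat/completions request shape.
# Model name and URL are placeholders -- substitute your own deployment.
payload = {
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python one-liner to reverse a string."},
    ],
    "temperature": 0.2,   # low temperature for deterministic code output
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions (vLLM's default port)
```

Because every engine in this section speaks this same API shape, your applications can switch between Ollama, vLLM, and SGLang by changing only the base URL.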
The Four Engines You Need to Know
Ollama
Easiest start — download and run models with one command. Built on llama.cpp, hides all complexity.
llama.cpp
Maximum control — bare-metal C++ runtime that runs on any hardware, including CPU-only machines.
vLLM
Production throughput — high-concurrency server with PagedAttention for teams and multi-user APIs.
SGLang
Production alternative — optimized for structured generation, multi-turn caching, and repeated prompts.
Start with Ollama to experiment. Move to vLLM or SGLang when you need to serve multiple users.
- Ollama: Easiest to start. One command downloads and runs a model. Built on llama.cpp. Best for: personal experimentation.
- llama.cpp: Runs on any hardware — NVIDIA, AMD, Apple Silicon, even CPU-only. Best for: Mac users and CPU-based deployments.
- vLLM: Open-source engine from UC Berkeley. Production-grade — handles many concurrent users with high efficiency. Best for: teams and multi-user APIs.
- SGLang: Rising alternative to vLLM with structured generation and RadixAttention prefix caching. Best for: production serving with structured output or repeated prompts.
Maximum NVIDIA-optimized speed. Higher setup complexity. Best for: high-throughput production on NVIDIA GPUs.
The sections below explain how each engine works under the hood. If you just want to get started, Ollama is the fastest path.
vLLM: The Production Standard
vLLM (pronounced “v-L-L-M”) is the most widely used inference engine for GPU-based deployments. Created at UC Berkeley, it introduced two key innovations that make it dramatically faster than alternatives.
PagedAttention: Traditional inference engines allocate a contiguous block of GPU memory for each request’s KV cache. If the request might need up to 128K tokens, the engine reserves memory for 128K tokens — even if the actual conversation only uses 2K tokens. This wastes enormous amounts of VRAM.
PagedAttention borrows the concept of virtual memory from operating systems. Instead of contiguous allocation, it stores KV cache entries in fixed-size “pages” scattered across GPU memory. Pages are allocated on demand and freed immediately when no longer needed.
- Traditional contiguous allocation: ~25% total VRAM utilization
- PagedAttention: ~85% total VRAM utilization, with enough freed memory to accept an additional request
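The admission difference can be sketched with a toy allocator. The capacity, page size, and request lengths below are illustrative assumptions, not vLLM internals:

```python
# Toy model of KV-cache admission: worst-case contiguous reservation
# vs. PagedAttention-style on-demand pages. Illustrative numbers only.

CAPACITY = 8192      # total KV-cache capacity, in tokens
MAX_CTX = 4096       # worst-case context length reserved per request

def admit_contiguous(requests):
    """Each request reserves MAX_CTX tokens up front."""
    free, admitted = CAPACITY, []
    for tokens in requests:
        if free >= MAX_CTX:
            free -= MAX_CTX
            admitted.append(tokens)
    return admitted

def admit_paged(requests, page=16):
    """Each request only takes the pages it actually fills."""
    free, admitted = CAPACITY, []
    for tokens in requests:
        need = -(-tokens // page) * page   # round up to whole pages
        if free >= need:
            free -= need
            admitted.append(tokens)
    return admitted

reqs = [2000, 1500, 3000]             # actual conversation lengths
print(len(admit_contiguous(reqs)))    # 2 — third request rejected
print(len(admit_paged(reqs)))         # 3 — all fit
```

With the same 8K-token budget, worst-case reservation admits two requests while page-on-demand admits all three — the same effect the utilization figures above describe.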
Continuous Batching: Instead of waiting for all requests in a batch to finish before starting new ones, vLLM adds new requests to the running batch as old ones complete. This keeps GPU utilization high even with variable-length requests, typically delivering 2–4× higher throughput compared to static batching under mixed workloads (the exact gain depends on request-length variance and concurrency).
Sushi Bar Analogy: Static batching is like a traditional restaurant — the kitchen prepares all dishes for a table at once, and no new orders are accepted until the entire table is served. Even if one person orders a quick appetizer and another orders a slow-cooked steak, the appetizer person waits. Continuous batching is like a sushi bar with a conveyor belt — as soon as one dish is finished, the chef starts the next order immediately. Every seat at the bar gets served as fast as their individual order allows, and no seat sits idle while waiting for someone else’s order.
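The scheduling difference can be simulated in a few lines. This is a simplified decode-only model (hypothetical request lengths, no prefill), not a real scheduler:

```python
# Toy scheduler comparison: static vs. continuous batching.
# Each request needs `n` decode steps; the GPU serves `slots` requests at once.

import heapq

def static_batching(lengths, slots):
    """Admit a full batch, then wait for the slowest request before refilling."""
    t, queue = 0, list(lengths)
    while queue:
        batch, queue = queue[:slots], queue[slots:]
        t += max(batch)          # every slot idles until the longest finishes
    return t

def continuous_batching(lengths, slots):
    """Refill each slot the moment its request finishes."""
    free = [0] * slots           # time at which each slot becomes free
    heapq.heapify(free)
    for n in lengths:
        start = heapq.heappop(free)
        heapq.heappush(free, start + n)
    return max(free)

reqs = [10, 2, 2, 2]             # one long request, three short ones
print(static_batching(reqs, slots=2))      # 12 steps
print(continuous_batching(reqs, slots=2))  # 10 steps — short requests reuse a slot
```

Even in this tiny example, refilling slots early finishes the workload sooner; the gap widens as request lengths vary more, matching the 2–4× figure above.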
vLLM Performance
| Metric | vLLM | Ollama | Improvement |
|---|---|---|---|
| Concurrent throughput (16 users) | ~800 tokens/sec | ~40 tokens/sec | ~20x |
| P99 latency | ~80ms | ~1,200ms | ~15x |
| Max concurrent users | 50+ | 3-5 | 10x+ |
| GPU memory efficiency | 85-95% | 40-60% | ~2x |
Order-of-magnitude estimates based on community benchmarks of AWQ-quantized 32B-class models on a single A100 80GB GPU, batch size 16, mixed prompt/completion workload. Actual numbers vary widely by model architecture, quantization format, prompt length, and hardware. Treat multipliers as approximate ranges, not exact figures. vLLM documentation. Last verified 2026-03-06.
vLLM is designed for production serving — multiple concurrent users with guaranteed low latency. Ollama is designed for single-user simplicity.
Throughput Estimator
How to use: Pick a preset to see realistic numbers for common setups, then tweak the values to match your own hardware.
Higher batch sizes increase throughput but also VRAM usage. If latency exceeds 10–15 seconds, reduce concurrent users or add GPUs.
Example output:
- Aggregate tokens/sec: 34.00
- Requests/sec: 0.07
- Latency per request: 14.71 sec

Formulas: aggregate = baseline × GPUs × efficiency; latency = avgTokens / (aggregate / users)
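The estimator's formulas can be reproduced in a few lines. The default values below are assumptions chosen to match the example output shown above (they are not benchmark figures):

```python
# Back-of-envelope throughput estimator, matching the formulas above.
# Defaults are assumed: 40 tok/s baseline, 1 GPU at 85% efficiency,
# 1 concurrent user, 500-token average response.

def estimate(baseline_tps, gpus, efficiency, users, avg_tokens):
    aggregate = baseline_tps * gpus * efficiency      # total tokens/sec
    latency = avg_tokens / (aggregate / users)        # seconds per request
    requests_per_sec = aggregate / avg_tokens
    return aggregate, requests_per_sec, latency

agg, rps, lat = estimate(baseline_tps=40, gpus=1, efficiency=0.85,
                         users=1, avg_tokens=500)
print(f"{agg:.2f} tok/s, {rps:.2f} req/s, {lat:.2f} s")  # 34.00 tok/s, 0.07 req/s, 14.71 s
```

Doubling GPUs doubles the aggregate rate; doubling concurrent users doubles per-request latency — which is why the note above recommends reducing users or adding GPUs when latency climbs.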
Tensor Parallelism: Splitting Across GPUs
When a model is too large for a single GPU, tensor parallelism (TP) splits the model across multiple GPUs. Each GPU holds a portion of each layer and they communicate to produce the final result.
- Single GPU (model fits entirely): GPU 0 (32GB) holds all 64 layers and the full model weights.
- Tensor Parallel = 2 (model split across 2 GPUs): GPU 0 (32GB) holds the left half of each layer's weights; GPU 1 (32GB) holds the right half (64 layers × 50% each). Combined: 64GB usable VRAM.
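The "left half / right half" split can be shown in pure Python: a column-parallel linear layer where each hypothetical "GPU" holds half of the weight columns. (Real tensor parallelism also uses row-wise splits and an all-reduce step; this sketch shows only the column case.)

```python
# Column-parallel linear layer: each "GPU" holds half of W's columns.
# Pure-Python lists stand in for GPU tensors; toy sizes for clarity.

def matmul(x, W):        # x: length-k vector, W: k x n matrix
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

x = [1.0, 2.0, 3.0]
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]    # 3 x 4 weight matrix

W0 = [row[:2] for row in W]   # "GPU 0": left half of the columns
W1 = [row[2:] for row in W]   # "GPU 1": right half of the columns

full = matmul(x, W)
split = matmul(x, W0) + matmul(x, W1)   # concatenate the partial outputs
print(full == split)                     # True — identical results
```

Each device computes its slice independently; only the final concatenation (or, for row splits, a sum) requires communication — which is why interconnect bandwidth matters.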
Interconnect bandwidth determines how fast GPUs communicate during tensor parallelism. Consumer GPUs use PCIe 5.0 (64 GB/s), which is sufficient for inference. Datacenter GPUs use NVLink (900 GB/s on H100) for much faster inter-GPU communication. See the GPU Hardware section for the full interconnect comparison table.
vLLM Sleep Mode: Instant Model Switching
Plain-Language Summary
Normally, switching between AI models (e.g., a coding model during work hours and a chat model in the evening) means shutting one down and loading another from disk — a process that takes 1-3 minutes. Sleep mode keeps the idle model’s data in your computer’s regular RAM (not GPU memory), so switching takes just a few seconds. Think of it like putting an app to sleep vs. fully quitting and relaunching it.
Technically, sleep mode in vLLM 0.8+ offloads model weights from GPU VRAM to CPU RAM, freeing GPU memory for a different model. When the original model is needed again, weights are copied back from RAM — far faster than reloading from disk.
Traditional model switch:
1. Stop model (5–10s)
2. Unload VRAM (10–20s)
3. Load from disk (30–120s)
4. Warm up (5–10s)
Total: 50–160 seconds

Sleep Mode:
1. Offload to CPU RAM (1–3s)
2. Load new weights (3–10s)
Total: 4–13 seconds (10–40× faster)

Instant Resume (L1):
1. Offload to CPU RAM (0.1–0.5s)
2. Reload from RAM (0.5–6s)
Total: 0.6–6.5 seconds (18–200× faster)
# Put current model to sleep (offload to CPU RAM)
POST /v1/sleep
{"level": "l1"} # Instant Resume: weights stay in CPU RAM
# Wake up and reload
POST /v1/wake

Practical example: You run a coding assistant (Qwen3-Coder-Next, 48GB) during work hours and a general chat model (Llama 3.3 70B, 40GB) in the evening. Without sleep mode, switching means a 1–3 minute cold reload from disk each time. With sleep mode, the idle model sits in CPU RAM and reloads to the GPU in seconds when you need it.
Operational limits: Sleep mode requires enough CPU RAM to hold the offloaded weights (e.g., ~40GB for a 32B AWQ model). Systems with limited RAM cannot use this feature. Also, sleep/wake latency scales with model size — a 70B model takes longer to offload and reload than a 7B model.
Engine Comparison
| | vLLM | SGLang | llama.cpp | Ollama | TensorRT-LLM |
|---|---|---|---|---|---|
| Primary Use | Production GPU serving | Structured + fast GPU | Universal (CPU/GPU) | Easy local use | Maximum NVIDIA speed |
| GPU Support | NVIDIA (CUDA), AMD (ROCm) | NVIDIA (CUDA), AMD (ROCm) | NVIDIA, AMD, Apple | NVIDIA, Apple | NVIDIA only |
| CPU Support | No | No | Yes (excellent) | Yes (via llama.cpp) | No |
| Multi-GPU | Tensor parallelism | Tensor parallelism | Limited | No | Full support |
| Concurrent Users | 50+ | 50+ | 1-3 | 1-3 | 50+ |
| Quantization | AWQ, GPTQ, FP8 | AWQ, GPTQ, FP8 | GGUF (2-8 bit) | GGUF | FP8, INT4 |
| Setup Complexity | Medium | Medium | Low | Very Low | High |
| Best For | Multi-user production | Structured gen, caching | Mac/CPU deployments | Personal single-user | Maximum performance |
Feature comparison based on official documentation for each engine. Capabilities evolve rapidly — verify against current docs before deployment decisions. vLLM docs, SGLang docs, llama.cpp, Ollama, NVIDIA TensorRT-LLM. Last verified 2026-03-06.
Choosing the Right Engine
The right engine depends on your deployment scenario, not the engine’s feature list. Start from your use case:
- Personal experimentation (single user, single GPU or CPU, quick setup) → Ollama. One-command install, pulls models from the registry, no configuration needed. LM Studio offers a GUI alternative.
- Developer workstation on a Mac (Apple Silicon, unified memory, local coding assistant) → llama.cpp. Native Apple Silicon support; unified memory lets you load larger models than discrete GPU VRAM would allow.
- Multi-user internal API (5–50 concurrent users, NVIDIA GPU(s)) → vLLM. PagedAttention and continuous batching handle concurrent load efficiently, with an OpenAI-compatible API for easy integration.
- High-throughput production (50+ users, maximum GPU utilization, SLA requirements) → vLLM for flexibility and broad model support, or TensorRT-LLM for maximum NVIDIA-optimized throughput at higher setup cost.
Use Ollama if: you want to try a model in under 5 minutes with zero configuration.
Use llama.cpp if: you are on a Mac, have an AMD GPU, or need CPU-only inference.
Use vLLM if: you need to serve multiple users, want an OpenAI-compatible API, or need multi-GPU tensor parallelism.
Use TensorRT-LLM if: you need maximum throughput on NVIDIA hardware and can invest time in setup and optimization.
Model Families and How to Choose
Major Open-Weight Model Families
The open-weight model ecosystem has exploded since Meta released Llama 2 in 2023. Dozens of families now compete across coding, reasoning, and general intelligence. Here are the most significant families as of early 2026, organized by origin:
“Open innovation is the foundation of AI progress.”
Qwen — Alibaba Cloud, China
The Qwen family from Alibaba’s research lab has become a dominant force in open-weight AI. Qwen2.5 and Qwen3 models consistently top benchmarks across coding, math, and general intelligence.
- Qwen2.5-Coder-32B-Instruct: A strong code generation specialist. Trained on 5.5 trillion code tokens. Scores 92.7% on HumanEval. 128K context window fits entire codebases. Best suited for code completion and generation tasks rather than agentic tool-use workflows.
- Qwen3-32B: General-purpose powerhouse with hybrid thinking modes (can switch between fast response and deep reasoning).
- Qwen3-Coder-Next: 80B MoE model (only ~3B active per token) with 262K context. Explicitly trained for agentic coding with tool-use and recovery behaviors.
- Qwen3-Coder-30B-A3B: Mid-range agentic coding model (30.5B total, 3.3B active, 262K context). Strong tool-calling capability with smaller VRAM footprint.[6]
License: Apache 2.0 (fully permissive, commercial use allowed, no restrictions).
DeepSeek — DeepSeek AI, China
DeepSeek made headlines with V3 and the R1 reasoning model, both trained at a fraction of typical costs.
- DeepSeek-R1-Distill-Qwen-32B: A 32B model distilled from the 671B R1 reasoning model. Inherits deep chain-of-thought reasoning. Beats OpenAI’s o1-mini on math and logic benchmarks.[3]
- DeepSeek V3 (0324): 685B MoE (37B active per token). Scores 73.1% on SWE-bench Verified and 74.2% on Aider Polyglot — the highest agentic coding scores among open-weight models.[4]
- DeepSeek V3.2 variants: V3.2-Exp and V3.2-Speciale are newer iterations with enhanced reasoning. Benchmark values above are sourced from the V3-0324 release.[7]
License: MIT (fully permissive).
Llama — Meta, USA
Meta’s Llama family established the open-weight movement. Llama 3.3 and 4.0 represent the latest generations.
- Llama 3.3 70B: The general-purpose workhorse. Strong across all tasks, well-supported by every inference engine. Reliable tool-calling support with the llama3_json parser in vLLM.
- Llama 4 Scout (109B MoE): 109B total parameters, ~17B active, 10M context (model-card-stated capability). Latest-generation model with improved reasoning and native function calling support.[8]
- Llama 4 Maverick (400B MoE): 400B total parameters, ~17B active. Flagship model requiring multi-GPU or aggressive quantization for consumer deployment.
License: Llama Community License (free for commercial use under 700M monthly active users).
Mistral — Mistral AI, France
- Mixtral 8x7B: 47B total parameters, 13B active. The original “cheap but capable” MoE model.
- Mistral Large (2): 123B dense model, competitive with GPT-4.
- Mistral Small 3.1 24B: Efficient model for resource-constrained setups.
License: Apache 2.0.
Yi — 01.AI, China
- Yi-Coder-9B: Good coding quality for its size. Fits on a single 12GB GPU.
- Yi-34B: Strong general model at a moderate size.
License: Apache 2.0.
GLM — Zhipu AI, China
Zhipu AI’s GLM family focuses on bilingual Chinese/English capability and has emerged as a strong coding contender.
- GLM-4.7: 120B+ parameters with strong coding and reasoning. Scores 94.2% on HumanEval.
- GLM-4.7-Flash: Lighter variant with thinking and tool-calling capabilities. Scores 59.2% on SWE-bench Verified.
- GLM-5: 744B MoE (40B active). Scores 77.8% on SWE-bench — one of the highest among open models.[5]
License: MIT.
MiniMax — MiniMax AI, China
- MiniMax M2.5: Scores 80.2% on SWE-bench Verified — the highest among all open-weight models, within 0.7 points of Claude Opus 4.5.[2]
- MiniMax PRISM: Official uncensored variant for unrestricted use cases.
License: Modified MIT (see terms for exact grants/restrictions).[13]
Kimi — Moonshot AI, China
- Kimi-Dev-72B: Strong development-focused model with function-calling support.
- Kimi K2 / K2.5: Latest generation. K2.5 scores 76.8% on SWE-bench Verified.
License: Modified MIT (commercial use allowed; attribution required above 100M users).
GPT-OSS — OpenAI, USA
OpenAI’s first open-weight releases, focused on accessibility and transparency.
- gpt-oss-20b: Compact model requiring only 16GB VRAM in its native MXFP4 format. Designed for reliable tool calling.
- gpt-oss-120b: Larger variant requiring ~80GB VRAM. Stronger coding capability.
License: Apache 2.0.
IBM Granite — IBM, USA
- Granite 3.3 8B / 34B: Optimized for enterprise deployment, SQL generation, and business analytics.
- Granite 4.0: Latest generation with improved code generation.
- Trained exclusively on license-permissible data — the safest choice for IP-sensitive deployments.
License: Apache 2.0.
Microsoft Phi-4 — Microsoft, USA
- Phi-4 (14B): Matches much larger models on reasoning benchmarks despite its compact size. Only ~7GB VRAM with AWQ quantization.
- Phi-4-mini (3.8B): Ultra-compact with built-in function calling support.
License: MIT.
NVIDIA Nemotron — NVIDIA, USA
NVIDIA’s Nemotron family focuses on agentic and multi-agent workloads, combining NVIDIA’s inference optimization with open-weight accessibility.[9]
- Nemotron 3 Nano (30B MoE): 30B total parameters, ~3B active per token, 1M context. Hybrid latent MoE architecture optimized for efficient agentic workloads. Up to 4x higher token throughput compared with Nemotron 2 Nano.
- Nemotron 3 Super (~100B MoE): ~100B total, ~10B active. Mid-range model for reasoning and coding workloads.
- Nemotron 3 Ultra (~500B MoE): ~500B total, ~50B active. Flagship model requiring multi-GPU deployment on NVIDIA Blackwell architecture.
License: NVIDIA Open Model License (see terms for exact grants and restrictions).[10]
Gemma — Google, USA
Google’s Gemma family provides open-weight models optimized for edge deployment and multilingual tasks, with clear licensing terms.[11]
- Gemma 3 27B: Strong multilingual model (140+ languages) with 128K context. Trained on 14 trillion tokens. Good balance of capability and efficiency for on-device or edge deployment.
- Gemma 3 4B: Compact variant for resource-constrained environments.
License: Gemma Terms (see terms for exact commercial/redistribution/patent conditions).
OLMo — AllenAI, USA
AllenAI’s OLMo family releases weights, training code, training data, and evaluation code — one of the few model families to open-source the full training pipeline.[12]
- OLMo 2 14B: Research-grade model designed for scientific study of LLM behavior. Trained on the Dolma dataset with full reproducibility documentation.
- OLMo 2 7B: Smaller variant suitable for academic experimentation on consumer hardware.
License: Apache 2.0.[12]
The Comprehensive Model Comparison
| Model | Params | Active | Context | HumanEval | SWE-bench | VRAM (AWQ) | Best For |
|---|---|---|---|---|---|---|---|
| Qwen3-Coder-Next | 80B | ~3B | 262K | ~93% | 70.6% | ~48GB | Agentic coding |
| Qwen3-Coder-30B-A3B | 30.5B | 3.3B | 262K | ~90% | -- | ~18GB | Lightweight agentic |
| Qwen2.5-Coder-32B | 32B | 32B | 128K | 92.7% | -- | ~18GB | Code generation |
| DeepSeek-R1-Distill-32B | 32B | 32B | 32K | 79.2% | -- | ~18GB | Reasoning, math |
| Llama 3.3 70B | 70B | 70B | 128K | 88.4% | -- | ~40GB | General purpose |
| Llama 4 Scout | 109B | ~17B | 10M | ~89% | -- | ~55GB | Latest general |
| GLM-4.7 | 120B+ | 120B+ | 128K | 94.2% | -- | ~60GB+ | Bilingual coding |
| MiniMax M2.5 | -- | -- | -- | -- | 80.2% | -- | Maximum SWE-bench |
| Phi-4 | 14B | 14B | 16K | ~82% | -- | ~7GB | Compact reasoning |
| GPT-OSS-20B | 20B | 20B | -- | -- | 34.0% | ~16GB | Reliable tool calling |
| IBM Granite 34B | 34B | 34B | 32K | ~86% | -- | ~18GB | Enterprise, IP-safe |
| DeepSeek V3 (0324) | 685B | 37B | 128K | -- | 73.1% | ~350GB | Maximum quality |
| Nemotron 3 Nano | 30B | ~3B | 1M | -- | -- | ~18GB | Efficient agentic |
| Gemma 3 27B | 27B | 27B | 128K | -- | -- | ~16GB | Multilingual, edge |
| OLMo 2 14B | 14B | 14B | -- | -- | -- | ~8GB | Open-science research |
HuggingFace model cards for each model. SWE-bench scores from SWE-bench Verified leaderboard. VRAM estimates assume AWQ 4-bit quantization. Llama 3.3, DeepSeek V3, GLM-5, Qwen2.5-Coder, SWE-bench Verified leaderboard, Nemotron 3, Gemma 3, OLMo 2. Last verified 2026-03-06.
Commercial Model Comparison
| Model | Provider | HumanEval | Cost per 1M tokens | Privacy | Customizable |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | ~92% | $2.50 / $10 | No | No |
| Claude Opus | Anthropic | ~90% | $5 / $25 | No | No |
| Gemini 2.5 Pro | Google | ~88% | $1.25 / $10 | No | No |
| Qwen2.5-Coder-32B | Self-hosted | 92.7% | $0 (hardware cost) | Yes | Yes |
| DeepSeek-R1-32B | Self-hosted | 79.2% | $0 (hardware cost) | Yes | Yes |
API pricing (input/output per 1M tokens). HumanEval scores from respective model cards. Pricing is highly fluid — check provider pages for current rates. As of March 2026. OpenAI pricing, Anthropic pricing, Google Gemini pricing. Last verified 2026-03-06.
On coding benchmarks such as HumanEval, the leading self-hosted models match or exceed commercial APIs, with zero per-token costs and full data privacy. The tradeoff is upfront hardware investment and maintenance responsibility.
Understanding Licenses
| License | Commercial Use | Modify | Distribute | Patent Grant | Notable Models |
|---|---|---|---|---|---|
| Apache 2.0 | Yes | Yes | Yes | Yes | Qwen, Mistral, Yi, GPT-OSS, Granite, OLMo |
| MIT | Yes | Yes | Yes | No | DeepSeek, GLM, Phi-4 |
| Llama Community | Yes (< 700M MAU) | Yes | Yes | No | Llama 3 |
| Proprietary | Via API only | No | No | No | GPT-4, Claude |
| Modified MIT | Yes | See terms | See terms | See terms | MiniMax, Kimi |
| Gemma Terms | See terms | See terms | See terms | See terms | Gemma |
| NVIDIA Open Model | See terms | See terms | See terms | See terms | Nemotron |
Open-weight vs open-source: “Open-weight” means the model weights are publicly available. “Open-source” means the weights, training code, AND training data are all available. Most “open” models are actually open-weight — the training process is proprietary.
How to Choose: The Decision Framework
Step 1: Define Your Task
- Code generation (autocomplete, scaffolding) → Qwen2.5-Coder-32B or Yi-Coder-9B
- Agentic coding (tool use, file editing, debugging) → Qwen3-Coder-Next, Qwen3-Coder-30B-A3B, or Nemotron 3 Nano
- Reasoning/Math → DeepSeek-R1-Distill-32B (chain-of-thought reasoning)
- General chat → Qwen3-32B, Llama 4 Scout, or Llama 3.3 70B
- Lightweight/fast → Phi-4, Yi-Coder-9B, or Gemma 3 4B (5-7GB VRAM)
- Multilingual/edge → Gemma 3 27B (140+ languages, 128K context)
- Enterprise (IP-sensitive) → IBM Granite (license-permissible training data)
- Research/open-science → OLMo 2 (full training pipeline available)
Step 2: Check VRAM Budget
- Single 12GB GPU → Yi-Coder-9B, Phi-4, or smaller
- Single 24GB GPU → Qwen2.5-Coder-32B-AWQ, Qwen3-Coder-30B-A3B-AWQ, DeepSeek-R1-32B-AWQ
- Dual 32GB GPUs → Qwen3-Coder-Next-AWQ (80B MoE), Llama 3.3 70B-AWQ, Llama 4 Scout-AWQ (all TP=2)
- H100-class (80GB per GPU, single or multi-GPU) → up to ~40B FP16 on a single card; 120B FP16 or 400B+ in AWQ across multiple cards
Step 3: Consider Context Needs
- Short conversations (< 4K) → Any model
- Code files (8K-32K) → Need 32K+ context model
- Full codebases (32K-128K) → Qwen2.5-Coder, Qwen3-32B, Llama (128K native)
- Massive repositories (128K-262K) → Qwen3-Coder-Next (262K context)
- Ultra-long documents (1M+) → MiniMax, Kimi (API), Gemini API, Nemotron 3 Nano (1M), Llama 4 Scout (10M model-card capability)
GPU Hardware — From Gaming Cards to Datacenter Supercomputers
GPU Architecture Fundamentals
For a quick recommendation based on your use case and budget, skip to the decision framework in Section 8.
A GPU (Graphics Processing Unit) was originally designed to render video game graphics — thousands of simple calculations in parallel. It turns out this same architecture is perfect for neural network inference, which also requires massive parallelism.
Modern AI GPUs contain three types of processing units:
- CUDA Cores / Stream Processors: General-purpose parallel processors. Handle the basic matrix multiplications that drive neural network computation. Thousands per GPU.
- Tensor Cores / Matrix Accelerators: Specialized units that perform matrix multiply-and-accumulate operations in a single clock cycle. 4-16x faster than CUDA cores for AI workloads. This is where actual inference math happens.
- VRAM (Video RAM): High-bandwidth memory attached directly to the GPU. This is where model weights, the KV cache, and intermediate computations live. VRAM is the single most important specification for AI inference.
Why VRAM Is the Bottleneck
For AI inference, the processing pipeline is:
- Load model weights from VRAM into compute units
- Load input tokens
- Compute attention and feed-forward layers
- Store KV cache entries back to VRAM
- Output next token
The bottleneck is almost always step 1 — moving data from VRAM to compute units. This is called being memory-bandwidth-bound. More VRAM means you can load larger models. Faster VRAM bandwidth means you can feed the compute units faster.
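This gives a useful rule of thumb: at batch size 1, every generated token must stream the full set of weights from VRAM once, so peak decode speed is roughly bandwidth divided by weight size. The sketch below uses that simplification (it ignores KV-cache traffic and compute overlap):

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound GPU:
# each output token streams the full weights from VRAM once, so
# max tokens/sec ≈ bandwidth / model size. Batch size 1, no KV traffic.

def decode_ceiling(bandwidth_gb_s, weight_gb):
    return bandwidth_gb_s / weight_gb

# RTX 4090: ~1,008 GB/s; a 32B model in 4-bit AWQ is ~18 GB of weights
print(round(decode_ceiling(1008, 18)))   # ~56 tokens/sec upper bound
```

The same formula explains why quantization speeds up decoding: halving weight bytes roughly doubles the achievable tokens/sec on the same card.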
Consumer Tier: Gaming GPUs for AI
| GPU | VRAM | Bandwidth | Tensor Cores | FP16 TFLOPS | Power | Price |
|---|---|---|---|---|---|---|
| RTX 4060 Ti | 16GB GDDR6 | 288 GB/s | 136 | 22.1 | 160W | $400 |
| RTX 4080 | 16GB GDDR6X | 717 GB/s | 304 | 48.7 | 320W | $1,200 |
| RTX 4090 | 24GB GDDR6X | 1,008 GB/s | 512 | 82.6 | 450W | $1,600 |
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s | 680 | 104.8 | 575W | $2,000 |
MSRP prices; actual retail may vary. RTX 5090 specs based on launch specifications. NVIDIA GeForce product pages. Last verified 2026-03-06.
Key insights for consumer GPUs:
- RTX 4090 (24GB): Runs most 7B-13B models in FP16, 32B models in AWQ/4-bit. The workhorse of the hobbyist AI community.
- RTX 5090 (32GB): Runs 32B models in AWQ comfortably with room for large KV caches. Two in tandem (64GB total) can run 70B AWQ models.
- Multi-GPU: Consumer GPUs communicate via PCIe 5.0 (64 GB/s). Fast enough for inference but limits training efficiency.
- Limitation: No NVLink support. Consumer motherboards typically support 2 GPUs maximum for AI workloads.
Prosumer Tier: Apple Silicon and AMD
Apple Silicon (M-series)
| Chip | Unified Memory | Bandwidth | GPU Cores | Neural Engine | Best For |
|---|---|---|---|---|---|
| M3 Pro | 36GB | 150 GB/s | 18 | 16-core | Small models (7-13B) |
| M3 Max | 128GB | 400 GB/s | 40 | 16-core | Medium models (32-70B) |
| M3 Ultra | 192-512GB | 800 GB/s | 80 | 32-core | Large models (70B FP16) |
| M4 Ultra (2026) | 256GB+ | 1,000+ GB/s | 80+ | 32-core+ | Very large models |
M4 Ultra specs are pre-release estimates. Apple Mac Studio specs. Last verified 2026-03-06.
Apple Silicon’s unique advantage is unified memory — the CPU and GPU share the same memory pool. A Mac Studio with M3 Ultra (192GB base, configurable up to 512GB) can load a 70B model in FP16 without quantization. No NVIDIA GPU at any consumer price point can do this.
The downside: Apple’s GPU architecture is optimized for different workloads than NVIDIA’s Tensor Cores. Token-for-token, NVIDIA GPUs are faster, but Apple Silicon can load larger models.
AMD MI300X
| GPU | VRAM | Bandwidth | FP16 TFLOPS | Power | Price |
|---|---|---|---|---|---|
| MI300X | 192GB HBM3 | 5,300 GB/s | 653.7 | 750W | ~$10,000-15,000 |
Price is estimate from enterprise channels. AMD Instinct MI300X product page. Last verified 2026-03-06.
The MI300X is AMD’s datacenter GPU with a massive 192GB of HBM3 memory. It can run 70B models in full FP16 on a single card. However, software support (ROCm) lags behind NVIDIA’s CUDA ecosystem.
Datacenter Tier: The NVIDIA AI Factory
| GPU / System | VRAM | Bandwidth | Interconnect | TF32 TFLOPS | Power | Price |
|---|---|---|---|---|---|---|
| A100 (2020) | 80GB HBM2e | 2,039 GB/s | NVLink 600GB/s | 312 | 400W | ~$10,000 |
| H100 SXM (2023) | 80GB HBM3 | 3,350 GB/s | NVLink 900GB/s | 989 | 700W | ~$25,000-35,000 |
| H200 (2024) | 141GB HBM3e | 4,800 GB/s | NVLink 900GB/s | 989 | 700W | ~$30,000-40,000 |
| B200 (2025) | 192GB HBM3e | 8,000 GB/s | NVLink 1,800GB/s | 2,250+ | 1,000W | ~$35,000-50,000 |
| GB200 NVL72 | 13.5TB agg. | 72x NVLink | NVSwitch fabric | 162,000+ | 120kW | ~$3,000,000+ |
| GB300 NVL72 | 20.7TB agg. | 72x NVLink | NVSwitch fabric | ~180,000+ | ~120kW | TBD |
TF32 TFLOPS shown (dense Tensor Core); FP16 TFLOPS are approximately 2× these values. H100/H200 FP16 dense = 1,979 TFLOPS. GB300 NVL72 TF32 estimated from SM count increase (+11% vs B200). Prices are estimates — datacenter GPUs are typically sold through enterprise channels. NVIDIA A100, H100, H200, B200, GB300 NVL72. Last verified 2026-03-06.
NVIDIA Blackwell & Blackwell Ultra (2025)
- Memory: 288 GB HBM3e per GPU (GB300) vs 192 GB (GB200) — a 50% increase, enabling larger models per GPU.
- Aggregate rack memory: ~20.7 TB (GB300 NVL72) vs ~13.5 TB (GB200 NVL72) — fits quantized trillion-parameter models.
- Memory bandwidth: 8 TB/s per GPU on both GB200 and GB300 — sustained high throughput for memory-bound inference.
- Compute: 11% more SMs (160 vs 144), with architectural gains focused on FP4 inference and 2× attention throughput for reasoning workloads.
Key datacenter-only features:
- HBM (High Bandwidth Memory): Stacked memory chips with 3-8x the bandwidth of consumer GDDR. An H100’s 3,350 GB/s feeds data to Tensor Cores fast enough to keep them fully utilized.
- NVLink: A direct GPU-to-GPU interconnect with 14x the bandwidth of PCIe 5.0. Makes tensor parallelism across multiple GPUs nearly as efficient as a single larger GPU.
- NVSwitch: A fabric switch connecting up to 576 GPUs in a single domain with full bisection bandwidth. The GB200 NVL72 rack operates as a single logical machine with 13.5TB of aggregate GPU memory.
The Interconnect Hierarchy
| Interconnect | Bandwidth | Relative to PCIe 5.0 |
|---|---|---|
| PCIe 5.0 x16 (consumer, e.g. RTX 5090) | 64 GB/s | 1× |
| NVLink (datacenter, H100) | 900 GB/s | ~14× |
| NVLink 5 / NVSwitch fabric (multi-GPU) | 1,800 GB/s | ~28× |
- PCIe 5.0 (consumer): Fine for 2-GPU tensor parallelism on inference. The ~2% overhead is negligible.
- NVLink (datacenter): Essential for 4-8 GPU tensor parallelism and training.
- NVSwitch (mega-scale): Required for 72+ GPU configurations.
VRAM Budgeting: The Math That Matters
Every GPU deployment needs a VRAM budget:
Total VRAM = Weights + KV Cache + Activations + Overhead
Model Weights
Parameters × Bytes per param
FP16: params × 2 bytes
INT4/AWQ: params × 0.5 bytes
KV Cache
2 × layers × heads × head_dim × context × batch × 2B
Grows with context length and batch size
Activation Memory
≈ 1–2 GB
Varies by batch size
CUDA Overhead
≈ 0.5–1 GB
Driver, context, buffers
Example: Qwen3-Coder-Next-AWQ on Dual 32GB GPUs (TP=2)
Inputs
Model: 80B params (AWQ 4-bit)
GPUs: 2 × 32GB (TP=2)
Context: 32K tokens
Batch: 1 user
Note: All 80B params must be loaded even though only 3B are active per token.
Per-GPU Calculation
Weights: 80B × 0.5B / 2 GPUs = 20 GB
KV Cache: ~2 GB / 2 GPUs = 1 GB
Activations: ~1 GB
CUDA Overhead: ~0.5 GB
20 + 1 + 1 + 0.5 = 22.5 GB per GPU
9.5 GB headroom

Example: Llama 3.3 70B-AWQ on Dual 32GB GPUs (TP=2)
Inputs
Model: 70B params (AWQ 4-bit)
GPUs: 2 × 32GB (TP=2)
Context: 8K tokens
Batch: 1 user
Per-GPU Calculation
Weights: 70B × 0.5B / 2 GPUs = 17.5 GB
KV Cache: ~2.8 GB / 2 GPUs = 1.4 GB
Activations: ~1.5 GB
CUDA Overhead: ~0.5 GB
17.5 + 1.4 + 1.5 + 0.5 = 20.9 GB per GPU
11.1 GB headroom
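Both worked examples can be reproduced with a small helper. This is a sketch of the budget formula, not an exact profiler — real KV-cache and activation sizes depend on the model architecture:

```python
# VRAM budget helper: Total = Weights + KV cache + Activations + Overhead.
# Values in GB, computed per GPU under tensor parallelism (TP).

def vram_per_gpu(params_b, bytes_per_param, tp, kv_total_gb,
                 activations_gb, overhead_gb=0.5):
    weights = params_b * bytes_per_param / tp   # weights split across GPUs
    kv = kv_total_gb / tp                       # KV cache split across GPUs
    return weights + kv + activations_gb + overhead_gb

# Qwen3-Coder-Next: 80B params, AWQ 4-bit (0.5 bytes/param), TP=2, 32K context
print(round(vram_per_gpu(80, 0.5, 2, kv_total_gb=2, activations_gb=1), 1))    # 22.5

# Llama 3.3 70B: AWQ 4-bit, TP=2, 8K context
print(round(vram_per_gpu(70, 0.5, 2, kv_total_gb=2.8, activations_gb=1.5), 1))  # 20.9
```

Swapping bytes_per_param to 2.0 shows why FP16 deployments need roughly 4× the VRAM of AWQ for the same model.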
Power and Cooling at Scale
| System | Power Draw | Annual Electricity (24/7) | Cooling Requirement |
|---|---|---|---|
| RTX 4090 | 450W | ~$400/year | Standard air cooling |
| RTX 5090 | 575W | ~$500/year | Enhanced air cooling |
| Dual RTX 5090 | 1,150W | ~$1,000/year | Good airflow, possibly liquid |
| H100 SXM | 700W | ~$600/year | Liquid cooling recommended |
| 8x H100 DGX | 10,200W | ~$9,000/year | Dedicated liquid cooling |
| GB200 NVL72 | 120,000W | ~$105,000/year | Industrial liquid cooling |
| GB300 NVL72 | ~120,000W | ~$105,000/year | Industrial liquid cooling |
Derived calculations. Power draw from NVIDIA spec sheets. Annual electricity assumes 24/7 operation at $0.10/kWh (a U.S. low-cost-region rate; the national average of ~$0.17/kWh would raise these figures by roughly 70%). Actual costs vary by location and usage pattern. Last verified 2026-03-06.
At the consumer level, power is manageable. At the datacenter level, power and cooling become the dominant operating costs — often exceeding the hardware amortization cost.
Emerging Technologies and the Future
The Attention Problem
The transformer’s self-attention mechanism has a fundamental limitation: it scales quadratically with context length. Processing a 128K token context requires 128K × 128K = 16 billion attention computations. Doubling the context to 256K quadruples this to 64 billion.
This O(n²) scaling is why long-context inference is so expensive and why researchers are exploring alternatives.
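The arithmetic is easy to check directly:

```python
# Self-attention compares every token against every other token,
# so the work grows with the square of the context length.

def attn_ops(n_tokens):
    return n_tokens * n_tokens

print(attn_ops(128_000))                        # 16,384,000,000 — ~16 billion
print(attn_ops(256_000) // attn_ops(128_000))   # 4 — doubling context quadruples cost
```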
Flash Attention: Making Attention Faster (Today)
Flash Attention (by Tri Dao, Stanford) does not change the math of attention — it changes how the computation is organized in GPU memory. By reordering memory access patterns to maximize GPU cache utilization, Flash Attention computes exact attention 2-4x faster with 5-20x less memory overhead.
Flash Attention is already integrated into vLLM and is the default for all modern inference engines.
Gated DeltaNet: Beyond Transformers
Gated DeltaNet is a linear attention architecture that achieves O(n) scaling — doubling context length only doubles cost, not quadruples it.
At 1M tokens, quadratic attention is roughly 1,000,000× more expensive than a linear-time alternative.
Gated DeltaNet is still in the research phase, but early results show quality approaching transformers on standard benchmarks. If it works at scale, it could enable million-token context windows on consumer hardware.
YaRN: Extending Context Without Retraining
YaRN (Yet another RoPE extension method) allows models trained with short context windows to be extended to longer contexts without full retraining. It modifies the positional encoding (RoPE) to handle positions beyond the original training range.
A model trained with 4K context can be extended to 32K or even 128K using YaRN, with some quality degradation at extreme extensions. This is valuable because training long-context models is extremely expensive.
Speculative Decoding: Two Models, One Speed
Speculative decoding uses a small, fast model (the “draft” model) to predict multiple tokens ahead, then verifies them with the large, accurate model in a single forward pass.
Without Speculative Decoding
Large model generates one token at a time:
Total: 1,500ms
With Speculative Decoding
Small model drafts 3 tokens: [50ms]
Large model verifies all 3 in one pass: [600ms]
Total: 650ms (2.3× faster)
The key insight: verifying multiple tokens in parallel is nearly as fast as generating one token. If the draft model’s predictions are mostly correct, the total throughput improves 2-3x.
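The draft-then-verify loop can be sketched in a few lines. The "models" below are stand-in functions with hard-coded outputs purely for illustration; a real system runs a small draft LLM and verifies its proposals with one forward pass of the large model:

```python
# Toy sketch of speculative decoding's accept/verify step (illustrative,
# not a real inference loop).
def draft_tokens(prefix: list[str], k: int = 3) -> list[str]:
    # Cheap draft model proposes k tokens ahead (hard-coded here).
    return ["the", "quick", "fox"][:k]

def verify(prefix: list[str], proposal: list[str]) -> list[str]:
    # Large model checks the whole proposal in one pass and keeps the
    # longest prefix it agrees with; here it disagrees at token 3.
    agreed = ["the", "quick", "brown"]
    out = []
    for p, a in zip(proposal, agreed):
        if p != a:
            out.append(a)   # replace the first mismatch with its own token
            break
        out.append(p)
    return out

prefix: list[str] = []
accepted = verify(prefix, draft_tokens(prefix))
print(accepted)  # three tokens accepted for a single large-model pass
```

Even when the last draft token is rejected, the verify pass still yields a correct token in its place, so throughput never drops below the large model's baseline.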
Model Distillation: Teaching Small Models to Think Big
Distillation trains a small “student” model to mimic a large “teacher” model. The student learns not just the correct answers but the teacher’s probability distributions — capturing nuanced knowledge that would require a much larger dataset to learn directly.
Notable example: DeepSeek-R1-Distill-Qwen-32B is a 32B model distilled from the 671B DeepSeek-R1. It inherits the 671B model’s deep reasoning abilities despite being 21x smaller. This is why it beats OpenAI’s o1-mini on reasoning benchmarks.[5]
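The core training signal can be sketched numerically. The logits below are invented for illustration; the point is that the student is trained to match the teacher's softened probability distribution (via KL divergence), not just its top answer:

```python
import numpy as np

# Sketch of the distillation objective: soft targets, not hard labels.
def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T   # temperature T softens the distribution
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = [4.0, 2.0, 1.0, 0.5]   # illustrative next-token logits
student_logits = [3.5, 2.2, 0.8, 0.4]

T = 2.0  # higher temperature exposes the teacher's "dark knowledge"
p, q = softmax(teacher_logits, T), softmax(student_logits, T)
kl = float(np.sum(p * np.log(p / q)))   # training minimizes this divergence
print(round(kl, 4))
```

The second- and third-ranked probabilities carry information about how the teacher weighs alternatives — knowledge a hard argmax label would throw away.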
LoRA and QLoRA: Customizing Models
LoRA (Low-Rank Adaptation) allows you to fine-tune a model by training only a tiny fraction of its parameters. Instead of modifying all 32 billion weights, LoRA adds small trainable matrices (typically 0.1-1% of total parameters) that adjust the model’s behavior.
QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning on a single consumer GPU:
Full Fine-Tuning
Train all 32B parameters
Requires: 128GB+ VRAM
Hardware: Multiple A100/H100 GPUs
Cost: $1,000–10,000+
QLoRA Fine-Tuning
Train ~32M parameters (0.1%)
Model in 4-bit quantization
Requires: 24GB VRAM (single RTX 4090)
Cost: $5–50 electricity
Use cases for fine-tuning:
- Adapting a general model to your coding style
- Teaching a model domain-specific terminology
- Improving performance on a narrow task (e.g., SQL generation)
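The "small trainable matrices" idea reduces to a few lines of linear algebra. This is a minimal numeric sketch (dimensions and rank chosen for illustration), not a training loop:

```python
import numpy as np

# LoRA sketch: instead of updating the full weight matrix W (d_out x d_in),
# train two small matrices B (d_out x r) and A (r x d_in), rank r << d.
d_out, d_in, r = 4096, 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)).astype(np.float32)  # frozen base weights
A = rng.standard_normal((r, d_in)).astype(np.float32)      # trainable
B = np.zeros((d_out, r), dtype=np.float32)                 # trainable; zero init
                                                           # so W is unchanged at start
x = rng.standard_normal(d_in).astype(np.float32)
y = W @ x + B @ (A @ x)    # adapted forward pass: (W + BA) x

full = W.size
lora = A.size + B.size
print(f"trainable fraction: {lora / full:.4%}")  # ~0.78% of the full matrix
```

At rank 16, the adapter holds 131,072 parameters against the layer's 16.7 million — which is why LoRA checkpoints are megabytes instead of gigabytes.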
RAG: Giving AI Access to Your Documents
Retrieval-Augmented Generation (RAG) solves a fundamental limitation: LLMs only know what was in their training data. They cannot access your private documents, recent emails, or local files.
User Query
“What was our Q4 revenue?”
Embedding Model
Convert query to vector
Vector Database
Search similar docs
Retrieved Context
“Q4 2025 revenue was $12.3M, up 23% from Q3…”
LLM Prompt
Context + Question → Model
Answer
“Your Q4 revenue was $12.3M, a 23% increase…”
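The pipeline above can be sketched end to end. Word overlap stands in for vector similarity here; a real deployment uses an embedding model plus a vector database, and these documents are invented for illustration:

```python
import re

# Toy sketch of RAG's retrieval step (word overlap in place of embeddings).
docs = [
    "Q4 2025 revenue was $12.3M, up 23% from Q3.",
    "The office moves to Building 7 in May.",
    "Hiring plan targets 12 new engineers next quarter.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str) -> str:
    """Return the document sharing the most words with the query."""
    return max(docs, key=lambda d: len(tokens(query) & tokens(d)))

query = "What was our Q4 revenue?"
context = retrieve(query)   # -> the revenue document
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(context)
```

The final prompt bundles the retrieved context with the question, so the LLM answers from your documents rather than from its frozen training data.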
RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Changes Model? | Updates Real-Time? | Cost | Best For |
|---|---|---|---|---|
| Prompt Engineering | No | N/A | Free | Adjusting behavior, few-shot examples |
| RAG | No | Yes | Low | Access to private/changing documents |
| Fine-Tuning | Yes | No (requires retrain) | Medium | Permanent behavior changes |
Tool Use and Function Calling
Modern LLMs can use external tools through function calling. Instead of only generating text, the model can output structured requests to execute code, search the web, query databases, or call APIs.
How Function Calling Works
When a model supports tool use, it follows a four-step process:
- Detect when a tool is needed — the model recognizes that the user’s request requires external action
- Generate structured tool calls — instead of text, the model outputs JSON matching defined tool schemas
- Wait for tool results — the external system executes the tool and returns results
- Incorporate results — the model reads the tool output and crafts its final response
User asks
"What is the latest version of Python?"
LLM decides
I need current information. Let me call a search tool.
Tool call
search("latest Python version 2026")
Tool result
Python 3.13.2, released January 2026
LLM responds
"The latest stable version of Python is 3.13.2, released in January 2026."
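The host application's side of this loop — receiving structured JSON and dispatching it to a real function — looks roughly like the sketch below. The tool_call dict mimics the widely used OpenAI-compatible format; the `search` function is a hypothetical stand-in:

```python
import json

# Sketch of the "execute tool call" step in a function-calling loop.
def search(query: str) -> str:
    # Stand-in for a real web-search tool.
    return "Python 3.13.2, released January 2026"

TOOLS = {"search": search}

# What a tool-capable model emits instead of plain text:
tool_call = {
    "id": "call_001",
    "type": "function",
    "function": {
        "name": "search",
        "arguments": json.dumps({"query": "latest Python version 2026"}),
    },
}

fn = TOOLS[tool_call["function"]["name"]]
args = json.loads(tool_call["function"]["arguments"])
result = fn(**args)   # fed back to the model as a "tool" role message
print(result)
```

Note that the model never executes anything itself: it only emits JSON, and the host decides whether and how to run it.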
Models with Native Tool Use Support
Not all models are trained for function calling. Tool-use capability requires specific training. Models designed for tool use include:
- Qwen3-Coder-Next / Qwen3-Coder-30B-A3B: Trained specifically for agentic coding workflows with recovery behaviors
- Llama 4 Scout/Maverick: Native function calling support
- IBM Granite: Strong enterprise function-calling benchmarks
- Salesforce xLAM: Purpose-built for multi-turn tool use
- OpenAI gpt-oss: Designed for reliable agentic tasks
Critical Distinction
A model that generates excellent code in a chat window is not necessarily a model that can emit well-formed tool_calls JSON. This distinction — code generation vs agentic tool use — is one of the most important concepts in modern AI deployment.
Agentic Coding: Beyond Code Completion
Agentic coding is the next evolution beyond code completion. Instead of generating code snippets in a chat window, agentic systems use tool calls to read project structures, create and modify files, run tests, execute shell commands, debug iteratively, and use version control.
User request
"Fix the authentication bug in the login system"
Agent reads
read_file("src/auth/login.py") — examines the code
Agent identifies
Missing input validation on line 47
Agent writes
write_file("src/auth/login.py", fixed_code)
Agent tests
run_command("pytest tests/test_auth.py -v") — all pass
Agent responds
"Fixed! The issue was missing input validation on line 47."
Tool Use vs RAG: When to Use Which
| Aspect | RAG | Tool Use |
|---|---|---|
| Purpose | Access static documents | Execute dynamic operations |
| Response time | Fast (vector search) | Variable (depends on tool) |
| Data freshness | Depends on index updates | Real-time |
| Best for | Knowledge bases, manuals | Web search, code execution, APIs |
| Example | “What’s in our Q4 report?” | “What’s the weather right now?” |
Use RAG when the answer is in documents you control. Use tool calling when the answer requires real-time data or computation.
Deployment Patterns — From Single GPU to Enterprise Clusters
Level 1: Single GPU, Single User
User
(browser)
Inference Engine
(Ollama / vLLM)
GPU
(1×24GB)
Typical setup: Ollama on a single RTX 4090 running a 32B AWQ model. Perfect for personal use — chatting, coding assistance, document analysis. Setup time: 30 minutes.
Limitations: One user at a time. Cannot run models larger than VRAM allows. No redundancy.
Level 2: Multi-GPU, Single Node
User
(browser)
vLLM
(TP=2)
GPU 0
(32GB)
GPU 1
(32GB)
64GB combined VRAM
Benefits over Level 1:
- Run 70B models (AWQ) that do not fit on one GPU
- Model-switching strategies (sleep mode, multiple compose files)
- Can serve 5-10 concurrent users with vLLM
Model-switching pattern: When you need different models for different tasks, use mode-switching scripts. With vLLM sleep mode, switching takes 4-13 seconds instead of 50-160 seconds.
Parallelism Primer
Tensor parallelism (TP) splits each layer across multiple GPUs within one machine. Every GPU computes a portion of every layer and they synchronize via fast NVLink. This reduces per-GPU memory usage and works well when GPUs are tightly connected.
Pipeline parallelism (PP) assigns different layers to different machines. Machine 1 processes layers 1–32, then passes results to Machine 2 for layers 33–64 over the network. Higher latency than TP, but it’s the only option when the model doesn’t fit in a single node’s GPU memory.
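The tensor-parallel idea — every GPU holds a slice of each layer and the results are recombined — can be verified with plain matrix algebra. A minimal sketch, with NumPy arrays standing in for GPU shards:

```python
import numpy as np

# Toy sketch of tensor parallelism: split a weight matrix row-wise across
# two "GPUs", compute the halves independently, then concatenate.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
x = rng.standard_normal(8)

W0, W1 = np.split(W, 2, axis=0)        # each "GPU" holds half the rows
y_parallel = np.concatenate([W0 @ x, W1 @ x])

assert np.allclose(y_parallel, W @ x)  # identical result, half the weights each
print("tensor-parallel result matches single-GPU result")
```

In a real deployment the concatenate step is the NVLink synchronization — which is why TP wants fast intra-node links, while PP tolerates slower inter-node networks.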
Level 3: Multi-GPU, Multi-Node
Users
Load Balancer
Node 1
Layers 1–32 · 8× H100 (640GB)
Node 2
Layers 33–64 · 8× H100 (640GB)
1.28TB combined GPU memory
Pipeline parallelism has higher latency than tensor parallelism (network round-trips between nodes), but it is the only way to serve models that do not fit in a single node’s GPU memory.
Terminology You’ll Need
- Kubernetes (K8s): An open-source system for automatically deploying, scaling, and managing containers (lightweight, isolated application packages).
- Pod: The smallest deployable unit in Kubernetes — one or more containers running together. Think of it as a single “instance” of your application.
- HPA (Horizontal Pod Autoscaler): A Kubernetes feature that automatically adds or removes pods based on metrics like GPU utilization.
- NFS (Network File System): A way to share files over a network so multiple pods can read the same model weights without duplicating them.
Level 4: Kubernetes + vLLM Autoscaling
Kubernetes Cluster
vLLM Pod 1
(8×H100)
vLLM Pod 2
(8×H100)
vLLM Pod 3
(8×H100)
← Auto-scaled →
Shared Model Storage (NFS)
Horizontal Pod Autoscaler (HPA)
Scale up: >80% GPU · Scale down: <20% GPU
How it works:
- A Kubernetes Horizontal Pod Autoscaler monitors GPU utilization
- When demand exceeds capacity, new vLLM pods are launched
- Each pod attaches to a GPU node and loads the model from shared storage
- A load balancer distributes requests across all active pods
- When demand drops, pods are scaled down to save resources
Self-Hosted vs API: The Cost Analysis
In Section 1, we established that API costs scale linearly with token volume at a blended rate of ~$5/1M tokens (GPT-4o). Now that you understand the hardware stack, let’s calculate exactly when self-hosting becomes cheaper. Every number below is derivable from the same formula: monthly_cost = (tokens_per_day × 30 / 1,000,000) × $5.
What is colocation? Colocation means renting rack space in a datacenter where you install your own GPU servers. The datacenter provides power, cooling, and network connectivity; you own and manage the hardware. This is how most organizations run self-hosted AI at enterprise scale — it avoids building your own datacenter while keeping full control of the hardware.
8×H100 DGX system (~$200K) over 3-year lifecycle
If you already own the hardware (e.g., existing DGX nodes), your cost is colocation only: ~$3,500/mo. The crossover analysis below uses the full $9,000/mo figure.
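The crossover table that follows is generated entirely by the Section 1 formula; a short calculator makes the break-even point explicit (rates and fixed costs are the document's own assumptions):

```python
# Break-even sketch: API cost scales with token volume; self-hosted cost
# is a flat monthly figure.
API_RATE = 5.0          # blended $/1M tokens (GPT-4o, ~2:1 input:output)
SELF_HOSTED = 9_000.0   # $/mo: ~$5,500 amortization + ~$3,500 colocation

def api_monthly_cost(tokens_per_day: float) -> float:
    return tokens_per_day * 30 / 1_000_000 * API_RATE

def breakeven_tokens_per_day(rate_per_1m: float, fixed_monthly: float) -> float:
    return fixed_monthly / rate_per_1m * 1_000_000 / 30

print(api_monthly_cost(60_000_000))                       # 9000.0 -> break-even
print(breakeven_tokens_per_day(API_RATE, SELF_HOSTED))    # 60,000,000 tok/day
print(breakeven_tokens_per_day(12.0, SELF_HOSTED))        # 25,000,000 (Opus-class rate)
```

Swapping in a pricier model rate drops the break-even volume proportionally, which is the "premium models shift crossover" effect discussed below the table.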
| Tokens/Day | ~Team Size | API Token Spend (GPT-4o) | Self-Hosted Total (8×H100) | Winner |
|---|---|---|---|---|
| 1M | ~1 dev | $150/mo | $9,000/mo | API (60×) |
| 10M | ~10 devs | $1,500/mo | $9,000/mo | API (6×) |
| 30M | ~30 devs | $4,500/mo | $9,000/mo | API (2×) |
| 60M | ~60 devs | $9,000/mo | $9,000/mo | ~Break-even |
| 100M | ~100 devs | $15,000/mo | $9,000/mo | Self-hosted (1.7×) |
| 500M | Large org | $75,000/mo | $14,000/mo* | Self-hosted (5×) |
Derived calculations using the Section 1 formula. API costs based on GPT-4o pricing ($2.50/$10 per 1M tokens, blended ~$5/1M at ~2:1 input:output ratio). Self-hosted total cost: ~$5,500/mo hardware amortization (8xH100 DGX @ ~$200K over 3 years) + ~$3,500/mo colocation = ~$9,000/mo. The 500M tok/day row assumes scaling to two DGX nodes (~$14,000/mo). Last verified 2026-03-06.
The crossover point for an 8×H100 setup at $9,000/month total cost is approximately 60 million tokens per day (~1.8B tokens/month, or roughly 60 active developers). Below that, API is cheaper. Above that, self-hosting wins — and the savings compound rapidly with scale. For more expensive models (Claude Opus at ~$12/1M blended), the crossover drops to roughly 25 million tokens/day.
Important Context for the Crossover Analysis
- Quality gap: API gives you frontier models (GPT-4o, Claude Opus) while self-hosted runs open-weight models (typically 70B-class). The quality difference matters for complex reasoning tasks.
- Operational costs excluded: The self-hosted figure covers hardware + colocation only. Staffing, monitoring, security patches, and model updates add to the real cost.
- Premium models shift crossover: Using Claude Opus (~$12/1M) instead of GPT-4o (~$5/1M) moves the break-even to ~25M tok/day, making self-hosting attractive at smaller scale.
- Point-in-time data: API prices have trended downward. These figures reflect March 2026 pricing and will shift as providers compete.
API tokens are typically 20–40% of total AI program cost. The right column shows what engineering leaders actually budget.
| Scenario | Team | API Token Spend | Program TCO |
|---|---|---|---|
| Small startup | 5 | ~$750/mo | ~$2K–$8K/mo |
| Mid-size team | 30 | ~$4,500/mo | ~$15K–$55K/mo |
| Enterprise division | 100 | ~$15,000/mo | ~$50K–$200K+/mo |
API Token Spend = per-developer token cost from Section 1 (“Heavy” tier, ~$150/mo) × team size. Program TCO includes seat licenses ($20–$125/user/mo), platform tooling, governance overhead, and API spend. Ranges reflect variation in tool stack and organizational complexity.
For consumer hardware, the comparison depends on how you use AI:
| Usage Pattern | Cloud Cost | Self-Hosted Monthly | Winner |
|---|---|---|---|
| Casual chat | $20/month (subscription) | ~$33/month electricity | API |
| Active coding (1M tok/day) | ~$150/month (API) | ~$33/month electricity | Self-hosted* |
| Heavy / team (5M+ tok/day) | $750+/month (API) | ~$33/month electricity | Self-hosted* |
Derived calculations. Electricity assumes dual RTX 5090 at 1,150W total GPU power draw, ~8 hrs/day active usage at $0.12/kWh (U.S. low-cost region; national average is ~$0.17/kWh, which raises the figure to ~$47/month). Formula: 1,150W × 8hr × 30 days = 276 kWh × $0.12 ≈ $33/month. Hardware ($4,000-5,000) is a one-time purchase not included in monthly figures. API costs from OpenAI/Anthropic pricing pages, March 2026. Last verified 2026-03-06.
*Self-hosted requires a one-time hardware purchase of $4,000-5,000 for the dual RTX 5090 setup. Payback period: at the “Active coding” tier ($150/month API vs $33/month electricity), monthly savings of ~$117 pay back the hardware in roughly 3 years. At the “Heavy” tier ($750+/month API vs $33/month electricity), payback drops to under 7 months.
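The payback arithmetic generalizes to any usage tier. A quick sketch, using $4,500 as the midpoint of the $4,000–5,000 hardware range:

```python
# Payback period for the consumer dual-GPU setup (figures from the table).
def payback_months(hardware_cost: float, api_monthly: float,
                   electricity_monthly: float) -> float:
    savings = api_monthly - electricity_monthly
    return hardware_cost / savings

print(round(payback_months(4_500, 150, 33), 1))  # active coding: ~38.5 months
print(round(payback_months(4_500, 750, 33), 1))  # heavy tier: ~6.3 months
```

The asymmetry is the takeaway: at light usage the hardware barely pays for itself within its useful life, while at heavy usage it pays back within two GPU generations' worth of months.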
Beyond pure cost, self-hosted consumer GPUs provide unlimited usage with no rate limits, full data privacy, zero censorship, and offline operation. For developers and teams where these properties matter, the hardware investment is justified regardless of the pure cost comparison.
Data Sovereignty and Compliance
For regulated industries, self-hosting is not just about cost — it is about legal requirements:
- HIPAA (Healthcare): Patient data cannot leave your infrastructure without a BAA. Most AI API providers do not offer BAAs.
- GDPR (EU): Requires data to be processed within the EU. Self-hosting on EU-located servers guarantees compliance.
- ITAR (Defense): Certain technical data cannot leave the United States. Self-hosted AI on air-gapped networks is the only option.
- Financial Regulations: Many banks and funds require all data processing to occur on audited infrastructure.
Best Practice
Hands-On Exercises and Summary
Exercise 1: VRAM Budget Calculator
Objective: Calculate whether specific models fit on different GPU configurations.
Hardware:
- Config A: Single RTX 4090 (24GB VRAM)
- Config B: Dual RTX 5090 (32GB × 2 = 64GB total, TP=2)
- Config C: Single H100 SXM (80GB VRAM)
Models:
- Model 1: Yi-Coder-9B (FP16) — 9B × 2 bytes = 18GB weights
- Model 2: Qwen2.5-Coder-32B-AWQ (4-bit) — 32B × 0.5 bytes = 16GB weights
- Model 3: Llama 3.3 70B-AWQ (4-bit) — 70B × 0.5 bytes = 35GB weights
Instructions:
- For each model, calculate total VRAM needed: Weights + KV cache at 8K context (~0.5GB for 9B, ~1.3GB for 32B, ~2.8GB for 70B) + Overhead (~1.5GB)
- For TP=2 configurations, divide weights and KV cache by 2 per GPU
- Fill in the compatibility table
| Model | Total VRAM | Config A (24GB) | Config B (64GB TP=2) | Config C (80GB) |
|---|---|---|---|---|
| Yi-Coder-9B FP16 | ? | Fits / No | Fits / No | Fits / No |
| Qwen2.5-Coder-32B AWQ | ? | Fits / No | Fits / No | Fits / No |
| Llama 3.3 70B AWQ | ? | Fits / No | Fits / No | Fits / No |
Exercise 2: Model Selection Decision
Objective: Choose the right model for three real-world scenarios.
Available Models: Qwen3-Coder-Next-AWQ (~48GB, agentic coding, 262K context), Qwen2.5-Coder-32B-AWQ (~18GB, code generation, 128K), DeepSeek-R1-Distill-32B-AWQ (~18GB, reasoning, 32K), Llama 3.3 70B-AWQ (~40GB, general purpose, 128K), Yi-Coder-9B (~5GB, lightweight, 128K).
Scenario A: You need to refactor a complex 2,000-line Python file with multiple classes and async functions.
Scenario B: You are analyzing a legal contract and need to identify potential risks and contradictions between clauses.
Scenario C: You are running a Discord bot that needs to answer quick questions from 20 concurrent users about a game wiki.
Exercise 3: Multi-Model Deployment Strategy
Objective: Design a complete model-switching deployment for a dual 32GB GPU setup.
Scenario: You manage a dual-GPU server (64GB total VRAM) that needs to support three workloads:
- Morning (8am-12pm): Software development — need a coding assistant
- Afternoon (1pm-5pm): Document analysis and strategy — need deep reasoning
- Evening (6pm-10pm): Casual conversation and voice assistant — need a fast chat model
Your Task:
- Select models for each time slot. Calculate VRAM requirements and verify all three fit.
- Design the switching sequence. Write the vLLM sleep mode API calls needed to switch between modes.
- Calculate switching overhead: cold start vs sleep mode L2 vs L1.
- Document edge cases: mid-conversation switches, urgent model access, 24GB VRAM constraint.
Deliverable: A one-page deployment plan with model selections, VRAM calculations, API sequences, and timing estimates.
Key Takeaways
After completing this lesson, you understand:
- LLMs are transformer-based neural networks that predict the next token, with intelligence emerging from scale and specialization
- Self-attention allows every token to attend to every other token, enabling understanding across long contexts
- The KV cache trades VRAM for speed, and grows linearly with context length
- AWQ quantization reduces model size by 75% with minimal quality loss, enabling large models on consumer hardware
- vLLM’s PagedAttention and continuous batching make production-grade serving possible on personal GPUs
- Model selection is about matching the right specialist to the right task, not choosing the biggest model
- GPU hardware ranges from $400 consumer cards to $3M datacenter racks, with VRAM being the critical bottleneck at every tier
- Emerging technologies (Flash Attention, speculative decoding, linear attention) will continue making AI more accessible
- Self-hosting becomes cost-effective at surprisingly low usage levels on consumer hardware
- Architect your applications with OpenAI-compatible APIs to freely switch between cloud and self-hosted
Resources and Further Reading
Official Documentation:
- vLLM: docs.vllm.ai
- Hugging Face Model Hub: huggingface.co/models
- NVIDIA CUDA Toolkit: developer.nvidia.com/cuda-toolkit
Foundational Papers:
- “Attention Is All You Need” (Vaswani et al., 2017) — The transformer paper
- “Language Models are Few-Shot Learners” (Brown et al., 2020) — GPT-3 / scaling laws
- “AWQ: Activation-aware Weight Quantization” (Lin et al., 2023) — Quantization method
- “Efficient Memory Management for LLM Serving with PagedAttention” (Kwon et al., 2023) — vLLM
Community Resources:
- r/LocalLLaMA (Reddit) — Self-hosted AI community
- Hugging Face Open LLM Leaderboard — Model benchmarks
- LMSys Chatbot Arena — Blind model comparisons
Next Lesson Preview
Continue to Lesson 05: Coding LLMs & Agentic AI to dive deep into coding benchmarks, the 22 model families landscape, agentic tool calling, vLLM parser matching, censorship analysis, and how to select and deploy the right coding model for any scenario.
Sources and References
Model Cards and Specifications
- [1] Meta Llama 3.3 70B Instruct — HuggingFace. 70B params, 128K context. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [2] DeepSeek V3 (0324) — HuggingFace. 685B MoE, 37B active. SWE-bench Verified: 73.1%, Aider Polyglot: 74.2%. https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [3] GLM-5 — HuggingFace. 744B MoE, 40B active. SWE-bench: 77.8%. Also: arXiv paper (2602.15763v1). https://huggingface.co/zai-org/GLM-5 (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [4] Qwen2.5-Coder-32B-Instruct — HuggingFace. HumanEval: 92.7%, MBPP: 90.2%. https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [5] DeepSeek-R1 — HuggingFace. R1-Distill-Qwen-32B outperforms o1-mini on AIME 2024 math benchmark. https://huggingface.co/deepseek-ai/DeepSeek-R1 (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [6] Qwen3-Coder-30B-A3B-Instruct — HuggingFace. 30.5B MoE, 3.3B active, 262K context. https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [7] DeepSeek V3.2-Exp model card — HuggingFace. V3.2 variant with enhanced reasoning. https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [8] Meta Llama 4 Scout 17B-16E Instruct — HuggingFace. 109B MoE, 17B active, 10M context. https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [9] NVIDIA Nemotron 3 Family announcement. Open models for agentic AI workloads. https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models (opens in new tab) (as of 2025-12-15, verified 2026-03-06)
- [10] NVIDIA Llama Nemotron model family documentation. https://docs.nvidia.com/nemo/megatron-bridge/latest/models/llm/llama-nemotron.html (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [11] Gemma 3 27B IT model card — Google. 27B params, 128K context, 140+ languages. https://huggingface.co/google/gemma-3-27b-it (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [12] OLMo 2 14B Instruct model card — AllenAI. Apache 2.0 license, weights + training code + eval code released. https://huggingface.co/allenai/OLMo-2-1124-13B-Instruct (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [13] MiniMax M2.5 model card — Modified MIT license. SWE-bench Verified: 80.2%. https://huggingface.co/MiniMaxAI/MiniMax-M2.5 (opens in new tab) (as of 2026-02-25, verified 2026-03-06)
- [14] OpenAI GPT-4.1 model card. 1,048,576 token context window (1M tokens). https://platform.openai.com/docs/models/gpt-4.1 (opens in new tab) (as of 2026-02-28, verified 2026-03-06)
Benchmarks and Rankings
- [15] SWE-bench Verified Leaderboard. Open vs proprietary gap: MiniMax M2.5 at 80.2% vs Claude Opus 4.5 at 80.9%. Rankings are point-in-time. https://www.swebench.com/ (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [16] HumanEval benchmark — OpenAI. Individual model scores sourced from respective model cards. https://github.com/openai/human-eval (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
Pricing
- [17] OpenAI ChatGPT Plans. ChatGPT Plus: $20/month. ChatGPT Pro: $200/month. Pricing as of March 2026. https://openai.com/chatgpt/pricing/ (opens in new tab) (as of 2026-03-05, verified 2026-03-06)
- [18] OpenAI API Pricing. GPT-4o: $2.50/$10 per 1M tokens. Pricing as of March 2026. https://openai.com/api/pricing/ (opens in new tab) (as of 2026-03-05, verified 2026-03-06)
- [19] Anthropic Pricing. Claude Pro: $20/month. Claude Opus 4 API: $5/$25 per 1M tokens (200K standard, 1M beta context). Pricing as of March 2026. https://www.anthropic.com/pricing (opens in new tab) (as of 2026-03-05, verified 2026-03-06)
- [20] Google Gemini API Pricing. Gemini 2.5 Pro: $1.25/$10 per 1M tokens (≤200K context). Pricing as of March 2026. https://ai.google.dev/gemini-api/docs/pricing (opens in new tab) (as of 2026-03-05, verified 2026-03-06)
Hardware
- [21] NVIDIA A100 Tensor Core GPU. 80GB HBM2e, 400W TDP, 312 TFLOPS FP16. https://www.nvidia.com/en-us/data-center/a100/ (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [22] NVIDIA H100 Tensor Core GPU. 80GB HBM3, 700W TDP, 989 TFLOPS FP16 dense. https://www.nvidia.com/en-us/data-center/h100/ (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [23] NVIDIA H200 Tensor Core GPU. 141GB HBM3e, 700W TDP, 4.8 TB/s bandwidth. https://www.nvidia.com/en-us/data-center/h200/ (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [24] NVIDIA B200 / DGX B200. 192GB HBM3e, 1,000W TDP, 8 TB/s bandwidth. https://www.nvidia.com/en-us/data-center/dgx-b200/ (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [25] AMD Instinct MI300X. 192GB HBM3, 5,300 GB/s bandwidth. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
Software and Tools
- [26] vLLM Documentation. PagedAttention, continuous batching, sleep mode features. https://docs.vllm.ai/ (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [27] Marlin CUDA Kernel — IST-DASLab. 4-bit quantized inference acceleration. https://github.com/IST-DASLab/marlin (opens in new tab) (as of 2026-02-23, verified 2026-03-06)
- [28] llama.cpp — ggml-org. C/C++ LLM inference engine supporting CPU + GPU (NVIDIA, AMD, Apple Silicon). GGUF model format. https://github.com/ggml-org/llama.cpp (opens in new tab) (as of 2026-02-26, verified 2026-03-06)
- [29] Ollama — Local LLM runner built on llama.cpp. One-command model download and serving. https://ollama.com/ (opens in new tab) (as of 2026-02-26, verified 2026-03-06)
- [30] NVIDIA TensorRT-LLM — Optimized inference engine for NVIDIA GPUs. FP8/INT4 quantization, multi-GPU tensor parallelism. https://github.com/NVIDIA/TensorRT-LLM (opens in new tab) (as of 2026-02-26, verified 2026-03-06)
- [31] SGLang — Inference engine with structured generation and RadixAttention prefix caching. Active development. https://docs.sglang.ai/ (opens in new tab) (as of 2026-02-26, verified 2026-03-06)
- [32] Hugging Face Text Generation Inference (TGI). Production inference with HF model hub integration. https://huggingface.co/docs/text-generation-inference/main/en/index (opens in new tab) (as of 2026-02-26, verified 2026-03-06)
Industry Reports
- [33] Reuters: ChatGPT sets record for fastest-growing user base. 100 million monthly active users by January 2023, per UBS/Similarweb data. Published Feb 1, 2023. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/ (opens in new tab) (as of 2023-02-01, verified 2026-03-06)
Methodology Notes
Derived calculations (electricity costs, break-even analyses) use the following assumptions: Electricity rate: $0.12/kWh (U.S. low-cost region; national average is ~$0.17/kWh). API cost blending: ~$5 per 1M tokens (GPT-4o blended at 2:1 input:output ratio). Self-hosted monthly costs: ~$9,000/mo total (hardware amortization ~$5,500/mo for 8xH100 DGX @ ~$200K over 3yr + colocation ~$3,500/mo). Individual API token spend tiers: Light ~$15/mo, Moderate ~$75/mo, Heavy ~$150/mo, Power ~$750/mo. API tokens are typically 20-40% of total AI program cost. All volatile data in this lesson was last verified on 2026-03-06. Benchmark scores, API pricing, and model rankings change frequently. Quarterly review recommended.