Coding LLMs & Agentic AI
Explore coding-specialized LLMs, benchmarks, agentic tool calling, parser matching, and how to select and deploy the right model for your coding workflow.
Introduction — Why Coding LLMs Matter
In Lesson 04, we explored how LLMs work under the hood — from neural networks and transformers to quantization, inference engines, and GPU hardware. You now understand what these models are and how to run them.
This lesson narrows the focus to a single, high-impact domain: coding. Not all LLMs are equal at writing code. Some are general-purpose models that happen to know some Python. Others are purpose-built coding specialists, trained on trillions of code tokens, fine-tuned on instruction-following for development tasks, and optimized for the specific patterns that make code generation useful.
The difference matters. A coding-specialized 32B model can outperform a general-purpose 70B model on programming benchmarks — while using half the VRAM and running twice as fast. Choosing the right model for coding is not just a preference; it is a resource and quality decision.
Coding Benchmarks — Measuring Code Intelligence
How do you know if a model is good at coding? You cannot just ask it to write “Hello World” and declare victory. The industry uses standardized benchmarks that test increasingly realistic coding scenarios.
| Benchmark | What It Measures | Format |
|---|---|---|
| HumanEval | Function-level code generation from docstrings | 164 Python problems, pass@1 |
| SWE-bench Verified | Real-world GitHub issue resolution across full repos | Subset of 500 verified solvable issues |
| Aider Polyglot | Multi-language code editing via natural language instructions | 225 tasks across Python, JS, Java, C++, more |
| LiveCodeBench | Contamination-free coding problems from competitive programming | Continuously updated problem set |
Why Multiple Benchmarks Matter
A model that scores 92% on HumanEval (isolated function generation) might score only 40% on SWE-bench (full-repo issue resolution). These benchmarks test fundamentally different skills: writing a function from a docstring is very different from navigating a 50-file codebase, understanding the bug, and producing a correct multi-file patch.
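HumanEval's pass@1 number comes from the unbiased pass@k estimator introduced with the benchmark: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k draws succeeds. A minimal implementation of that estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: samples generated per problem
    c: samples that passed the unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 200 samples, 37 passing -> pass@1 is simply 37/200 = 0.185
print(pass_at_k(200, 37, 1))
print(pass_at_k(200, 37, 10))  # a budget of 10 attempts scores much higher
```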
Score Landscape — How Top Models Compare
The table below shows approximate performance tiers for major open-weight coding models as of early 2026. Scores shift with every release — treat this as a landscape orientation, not a leaderboard.
| Model | Parameters | Benchmark Tier | Best For |
|---|---|---|---|
| Qwen2.5-Coder-32B | 32B dense | Elite — top scores across HumanEval, Aider, and multi-language tasks | All-round self-hosted coding assistant |
| DeepSeek-Coder-V2 | 236B MoE (~21B active) | Elite — competitive with closed-source models on code generation | High-quality generation when VRAM allows |
| Codestral 22B | 22B dense | Strong — excellent fill-in-the-middle and completion | Fast code completion, IDE autocomplete |
| StarCoder2-15B | 15B dense | Solid — broad language coverage, strong for its size | Multi-language projects, resource-constrained setups |
| Code Llama 70B | 70B dense | Established — strong baseline, surpassed by newer specialists | Mature ecosystem, extensive tooling support |
Approximate performance tiers based on published model cards and benchmark repositories as of early 2026. Scores change with every model release; always verify against primary sources before deployment decisions. Sources: HumanEval, Aider leaderboards. Last verified 2026-02-27.
The Complexity Ladder
Coding benchmarks form a complexity ladder. At the base sits HumanEval: generating an isolated function from a docstring. In the middle, Aider Polyglot tests multi-language code editing via natural language instructions. At the top, SWE-bench demands full-repo navigation, bug identification, and multi-file patching to resolve real GitHub issues.
A model’s position on this ladder determines what tasks you can trust it with. A model that excels at HumanEval but struggles on SWE-bench is suitable for code completion but not for autonomous issue resolution. Understanding where each benchmark sits tells you what the score actually measures.
Coding Model Families — The Landscape
Not every LLM is trained for code. The major coding model families are purpose-built: trained on curated code datasets, fine-tuned on coding instructions, and benchmarked against programming-specific evaluations. Understanding the landscape helps you pick the right tool.
What Makes a “Coding Model”?
A coding model differs from a general-purpose model in three ways: training data (heavy code corpus), fine-tuning targets (code completion, generation, editing instructions), and evaluation focus (HumanEval, SWE-bench, not just MMLU).
| Family | Organization | Param Range | Specialty | License |
|---|---|---|---|---|
| Qwen2.5-Coder | Alibaba / Qwen | 0.5B – 32B | Full-stack coding, instruction following, multi-language | Apache 2.0 |
| DeepSeek-Coder | DeepSeek | 1.3B – 236B (MoE) | Code generation, project-level reasoning | Model License (research + commercial) |
| Code Llama | Meta | 7B – 70B | Code completion, infilling, Python specialization | Llama Community License |
| StarCoder2 | BigCode | 3B – 15B | 600+ languages, fill-in-the-middle, broad coverage | BigCode OpenRAIL-M |
| Codestral | Mistral AI | 22B | Fast completion, FIM, low-latency IDE use | Mistral AI Non-Production License |
| Yi-Coder | 01.AI | 1.5B – 9B | Lightweight coding, long context (128K) | Apache 2.0 |
Major open-weight coding model families as of early 2026. New families and versions are released frequently; this table captures the established landscape, not an exhaustive list. Source: HuggingFace model hub. Last verified 2026-02-27.
Coverage Matrix — What Each Family Does Best
Each coding model family has strengths and gaps. The matrix below maps families to common coding tasks, showing where each excels.
| Family | Completion | Generation | FIM | Multi-lang | Agentic |
|---|---|---|---|---|---|
| Qwen2.5-Coder | ✓ | ✓ | ✓ | ✓ | ✓ |
| DeepSeek-Coder | ✓ | ✓ | — | ✓ | ✓ |
| Code Llama | ✓ | ✓ | ✓ | — | — |
| StarCoder2 | ✓ | — | ✓ | ✓ | — |
| Codestral | ✓ | ✓ | ✓ | ✓ | — |
| Yi-Coder | ✓ | — | — | ✓ | — |
✓ = strong support based on published benchmarks and model documentation; — = weaker or undocumented support. Coverage evolves with each release.
The Open-Weight Coding Landscape
The open-weight ecosystem for coding models has matured rapidly. Families like Qwen-Coder, DeepSeek-Coder, and StarCoder2 offer models across a range of parameter counts, making self-hosted coding assistance accessible from consumer GPUs to datacenter deployments.
Qwen2.5-Coder
Sizes: 0.5B / 1.5B / 3B / 7B / 14B / 32B
Best all-round open-weight coding model as of early 2026. Top benchmark tier across HumanEval, Aider, and multi-language tasks.
Apache 2.0 license. The 32B variant matches cloud-tier models on coding benchmarks.
DeepSeek-Coder
Sizes: 1.3B / 6.7B / 33B / V2 236B MoE
Strong code generation and project-level reasoning. V2 uses MoE architecture for efficient inference.
V2 activates ~21B of 236B total params. Needs significant VRAM for weights but runs at ~21B speed.
Code Llama
Sizes: 7B / 13B / 34B / 70B
Mature ecosystem with extensive community tooling. Python-specialized variant available.
Based on Llama 2. Surpassed by newer specialists but remains widely supported.
StarCoder2
Sizes: 3B / 7B / 15B
Broadest language coverage (600+ languages). Strong fill-in-the-middle support.
BigCode OpenRAIL-M license. Trained on The Stack v2, the largest open code dataset.
Codestral
Sizes: 22B
Fast completion with strong FIM support. Designed for low-latency IDE integration.
Mistral AI Non-Production License. 32K context window. 80+ programming languages.
Yi-Coder
Sizes: 1.5B / 9B
Lightweight coding specialist with 128K context window. Excellent for constrained hardware.
Apache 2.0 license. 9B variant punches above its weight on code generation tasks.
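Several of these families advertise fill-in-the-middle (FIM) support, which works through special sentinel tokens rather than ordinary prompting. As a sketch, here is a FIM prompt built with StarCoder-style tokens; the exact token names differ between families (Codestral, for example, uses its own), so check the model card before reusing them:

```python
# FIM prompt sketch using StarCoder-style sentinel tokens (assumption:
# your model uses these exact tokens -- other families use different ones).
prefix = "def safe_divide(a: float, b: float) -> float:\n    "
suffix = "\n    return a / b"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Sent to a FIM-capable model, the completion fills the gap, e.g.:
#     if b == 0:
#         raise ValueError("b must be non-zero")
```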
Agentic Architecture — Tool Calling and Agent Loops
The most powerful use of coding LLMs is not generating code in isolation — it is running them as agents that can read files, write code, run tests, and iterate on failures. This is the difference between a code completion tool and an AI pair programmer.
The Agent Loop
An agentic coding workflow follows a loop: the model receives a task, decides which tools to call (read file, write file, run command), executes the tool calls, observes the results, and decides what to do next. This continues until the task is complete or the model determines it cannot proceed.
1. Receive Task: user describes the goal
2. Plan: model decides the next action
3. Tool Call: read, write, or run
4. Observe: model reads the tool result
5. Decide: done, or iterate back to step 2
Function Calling vs Tool Use
“Function calling” and “tool use” describe the same capability: the model outputs structured JSON indicating which function to call with what arguments, instead of generating plain text. The runtime (your agent framework) executes the function and feeds the result back to the model.
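On the wire, the structured output in the OpenAI-compatible schema looks like the assistant message below (names and values are illustrative). Note that arguments arrives as a JSON-encoded string that the runtime must decode before dispatching:

```json
{
  "role": "assistant",
  "tool_calls": [
    {
      "id": "call_0",
      "type": "function",
      "function": {
        "name": "read_file",
        "arguments": "{\"path\": \"src/app.py\"}"
      }
    }
  ]
}
```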
Simple Generation vs Agentic Workflow
The difference between a chatbot that generates code and an agent that writes software is the loop. Simple generation is one-shot: prompt in, code out. Agentic workflows iterate autonomously until the task is done.
| Aspect | Simple Generation | Agentic Workflow |
|---|---|---|
| Input | Single prompt with full context | High-level task description |
| Context gathering | User provides all context manually | Agent reads files, searches codebase, gathers context |
| Output | One-shot code block | Multi-step: edits, tests, iterations |
| Error handling | User fixes errors and re-prompts | Agent runs tests, reads errors, self-corrects |
| Tool use | None — text in, text out | File read/write, shell commands, search |
| Iterations | 1 (single turn) | 5–50+ (autonomous loop) |
Tool Call Parsers — Format Specs and Engine Matching
For a model to call tools, it needs to output structured data in a specific format. Different model families use different formats. The inference engine must know how to parse the model’s output into actual function calls. This is where tool call parsers come in.
Why Format Matching Matters
If you serve a model that emits Hermes-style tool calls through an engine whose parser expects a different format, the tool calls will fail silently or come through as garbage text. The parser must match the format the model was trained to emit.
Format Comparison
Each model family uses a different syntax for tool calls, and the inference engine must parse the model's raw output into structured function-call data. The table below shows the major formats.
| Format | Syntax Pattern | Used By |
|---|---|---|
| Hermes-2 | <tool_call>{...JSON...}</tool_call> | NousResearch Hermes models, many fine-tunes |
| Llama 3.1+ native | <|python_tag|> or JSON function blocks | Llama 3.1, 3.2, 3.3 Instruct models |
| Mistral tool_use | [TOOL_CALLS] {...JSON...} | Mistral, Codestral, Mixtral Instruct |
| Qwen function_call | ✿FUNCTION✿ or JSON block in assistant turn | Qwen2.5 Instruct, Qwen2.5-Coder |
| DeepSeek JSON | JSON function call in assistant response | DeepSeek-V2, DeepSeek-V3, DeepSeek-Coder |
Tool calling format specifications as of early 2026. Formats evolve with model releases; always check the model's chat template for the authoritative format. Source: vLLM tool calling documentation. Last verified 2026-02-27.
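To see why a mismatch fails silently, consider a naive parser for the Hermes format. The sketch below only illustrates the principle; production parsers (such as vLLM's hermes parser) also handle streaming, malformed JSON, and multiple calls per turn:

```python
# Naive Hermes-format tool-call extraction, for illustration only.
import json
import re

raw = ('Let me check that file. '
       '<tool_call>{"name": "read_file", "arguments": {"path": "src/app.py"}}</tool_call>')

match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
if match:
    call = json.loads(match.group(1))
    print(call["name"], call["arguments"])  # read_file {'path': 'src/app.py'}

# A Mistral-format output ("[TOOL_CALLS] {...}") never matches this pattern,
# so the tool call is silently treated as plain text -- the failure mode
# described above.
```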
vLLM Parser Configuration
vLLM supports multiple tool call parsers via the --tool-call-parser flag, enabled together with --enable-auto-tool-choice. Choosing the correct parser for your model is essential for reliable agentic workflows.
```bash
# Hermes-2 fine-tune (most community models)
vllm serve my-model \
  --enable-auto-tool-choice --tool-call-parser hermes

# Llama 3.1+ Instruct (native Meta format)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --enable-auto-tool-choice --tool-call-parser llama3_json

# Mistral / Codestral
vllm serve mistralai/Codestral-22B-v0.1 \
  --enable-auto-tool-choice --tool-call-parser mistral

# DeepSeek-V3 (dedicated parser)
vllm serve deepseek-ai/DeepSeek-V3 \
  --enable-auto-tool-choice --tool-call-parser deepseek_v3

# --enable-auto-tool-choice turns tool-call parsing on for the
# OpenAI-compatible endpoint; then call with tools=[...] in your API request
```
Parser Matching Decision Table
Use this table to look up the correct --tool-call-parser flag for your model. When in doubt, check the model’s tokenizer_config.json for its chat template.
| Model Family | vLLM Parser Flag | Notes |
|---|---|---|
| Hermes-2 fine-tunes | hermes | Most common for community fine-tunes |
| Llama 3.1+ Instruct | llama3_json | Native Llama tool calling format |
| Mistral / Codestral | mistral | Mistral-specific tool_use format |
| Qwen2.5 Instruct | hermes | Qwen2.5's chat template emits Hermes-style tool calls; Qwen3-Coder uses qwen3_xml — check chat template |
| DeepSeek-V3 | deepseek_v3 | Dedicated parser since vLLM 0.8+ |
Parser flag mappings based on vLLM documentation and model chat templates as of early 2026. Verify against the current vLLM release and your model's specific version. Source: vLLM tool calling documentation. Last verified 2026-02-27.
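A quick way to confirm the format before picking a flag is to load the tokenizer and search its chat template for the telltale markers from the format table above. A short sketch using transformers (the Qwen model ID is just an example):

```python
# Inspect a model's chat template (shipped in tokenizer_config.json)
# to infer its tool-call format.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
template = tok.chat_template or ""

# Hermes-style templates emit <tool_call> tags, Mistral-style templates
# emit [TOOL_CALLS], Llama 3.1 uses <|python_tag|> / JSON blocks.
for marker in ("<tool_call>", "[TOOL_CALLS]", "<|python_tag|>"):
    print(f"{marker!r}: {'present' if marker in template else 'absent'}")
```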
Censorship and Alignment in Coding Models
When you ask a coding model to write a penetration testing script, bypass a rate limiter, or implement a web scraper, some models will refuse. Others will comply without hesitation. This difference comes from alignment training — the safety fine-tuning applied after the model learns to code.
What Alignment Does to Coding Models
Alignment (RLHF, DPO, or constitutional AI methods) teaches models to refuse harmful requests. For general chat, this is straightforward. For coding, the boundary is murkier: security research, DevOps automation, and system administration often require code that looks like it could be misused.
Refusal Patterns in Practice
The impact of alignment on coding varies by task. Standard application development is unaffected. Security research, DevOps tooling, and system-level code trigger refusals more frequently in heavily aligned models.
| Task Type | Aligned Model Response | Uncensored Model Response |
|---|---|---|
| Write a port scanner | May refuse or add extensive warnings about legality | Generates the code directly |
| Bypass rate limiter | Often refuses, citing potential misuse | Provides implementation options |
| Web scraper for site X | May warn about Terms of Service, sometimes refuses | Generates the scraper code |
| Reverse-engineer binary | Often refuses or adds legal disclaimers | Provides analysis approach |
| Standard CRUD API | Generates normally | Generates normally |
Note that refusal behavior varies between model versions and even between quantization levels of the same model. The patterns above are generalizations — always test with your specific model and use case.
Uncensored Variants
Some open-weight models are released in “uncensored” or “abliterated” variants where safety fine-tuning has been partially removed. These models are more compliant for legitimate coding tasks but also remove guardrails entirely.
The alignment spectrum runs from most to least restricted: Heavily Aligned → Standard → Base Model → Abliterated.
Most self-hosted coding use cases are best served by standard or lightly aligned models. Moving further right increases compliance but removes all safety boundaries.
Model Selection — Choosing the Right Coding Model
With dozens of coding-capable models available, how do you choose? The answer depends on your task type, hardware, latency requirements, and quality threshold.
Task-Based Selection Framework
Different coding tasks have different model requirements. Code completion (fill-in-the-middle) needs speed above all. Code generation needs quality and instruction following. Code review needs reasoning depth. Agentic workflows need reliable tool calling.
- Code Completion (speed > quality): FIM-capable, low latency. Smaller models (7B–15B) at higher quantization for fast autocomplete.
- Code Generation (quality > speed): instruction-tuned, strong HumanEval scores. Larger models (32B+) for complex function generation.
- Code Review (reasoning depth): large context window, strong reasoning. 32B+ models that can hold entire files and explain issues.
- Agentic Workflows (tool-calling reliability): reliable function calling, iterative reasoning. Must have tool-call parser support in your inference engine.
Quick Decision Flow
1. What is your primary task?
Completion → prioritize speed. Generation/Review → prioritize quality. Agentic → prioritize tool calling.
2. How much VRAM do you have?
8–12 GB → 7B models. 16–24 GB → up to 32B (Q4). 48+ GB → 70B models.
3. Do you need tool calling?
Yes → check parser support in vLLM for your model family (Section 5).
4. Latency or quality?
Latency-sensitive → smaller model + higher quantization. Quality-sensitive → largest model that fits.
Hardware-Constrained Selection
Your GPU determines your ceiling. A consumer RTX 4090 (24 GB VRAM) runs different models than a dual-A6000 workstation (96 GB total). The model selection framework must account for what actually fits.
| Hardware Tier | VRAM | Max Dense Model (Q4) | Recommended Coding Models |
|---|---|---|---|
| Consumer GPU | 8–12 GB | ~7B–13B | Qwen2.5-Coder-7B, Yi-Coder-9B, StarCoder2-7B |
| Enthusiast GPU | 16–24 GB | ~14B–32B | Qwen2.5-Coder-32B (Q4), Codestral 22B, StarCoder2-15B |
| Workstation GPU | 48 GB | ~70B (Q4) | Code Llama 70B (Q4), Qwen2.5-Coder-32B (FP16) |
| Dual GPU | 48–96 GB | ~70B (FP16) | DeepSeek-Coder-V2 (quantized), Code Llama 70B (FP16) |
| Datacenter | 80+ GB per GPU | ~140B+ or large MoE | DeepSeek-Coder-V2 (full), enterprise-scale models |
Model size estimates assume Q4 quantization unless noted. Actual VRAM usage depends on context length, KV cache, and engine overhead. Use the VRAM calculator from Lesson 04 for precise estimates. Last verified 2026-02-27.
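For a quick sanity check without the full calculator, weights-only VRAM is roughly parameter count times bytes per weight (Q4 is about 0.5 bytes, FP16 is 2 bytes). A back-of-the-envelope sketch, remembering that KV cache and engine overhead add several GB on top:

```python
# Weights-only VRAM estimate; treat the result as a floor, not a budget.
def weight_vram_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1024**3

for name, params in [("Qwen2.5-Coder-7B", 7), ("Codestral 22B", 22),
                     ("Qwen2.5-Coder-32B", 32), ("Code Llama 70B", 70)]:
    print(f"{name}: ~{weight_vram_gb(params):.1f} GB at Q4, "
          f"~{weight_vram_gb(params, 16.0):.1f} GB at FP16")
```

A 32B model at Q4 works out to roughly 15 GB of weights, consistent with the 16–24 GB tier in the table above.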
Deployment for Coders — IDE Integration and Workflows
A coding model is only useful if it integrates into your actual development workflow. This section covers the tools that connect self-hosted models to your editor, terminal, and CI pipeline.
IDE Integration Options
The three dominant approaches to local AI coding assistance are: editor extensions (Continue.dev for VS Code/JetBrains), terminal agents (Aider for git-aware pair programming), and AI-native editors (Cursor, built from the ground up around AI interaction).
| Feature | Continue.dev | Aider | Cursor |
|---|---|---|---|
| Type | IDE extension | Terminal agent | AI-native editor |
| Editor Support | VS Code, JetBrains | Any (terminal-based) | Cursor only (VS Code fork) |
| Local Model Support | Yes (OpenAI-compatible API) | Yes (OpenAI-compatible API) | Limited (primarily cloud) |
| Agentic Mode | Basic (edit + apply) | Full (git-aware, multi-file) | Full (Composer agent) |
| Git Integration | No | Yes (auto-commit) | Basic |
| Open Source | Yes (Apache 2.0) | Yes (Apache 2.0) | No (proprietary) |
| Best For | IDE-integrated completions with local models | Terminal-first pair programming | All-in-one AI coding (cloud-first) |
Feature comparison based on official documentation as of early 2026. All three tools are under active development; verify current capabilities before choosing. Sources: Continue.dev, Aider, Cursor documentation. Last verified 2026-02-27.
The self-hosted serving stack, from hardware to editor:

1. GPU + Model: Qwen2.5-Coder-32B loaded in VRAM
2. Inference Engine: vLLM serves an OpenAI-compatible API
3. API Endpoint: localhost:8000/v1/chat/completions
4. IDE Integration: Continue.dev / Aider connects to the API
All data stays on your machine. No external API calls, no per-token costs.
Self-Hosted vs Cloud-Backed
Every IDE integration tool can connect to either cloud APIs (OpenAI, Anthropic) or local inference endpoints (vLLM, Ollama). Running your own model gives you privacy, zero per-token cost, and the ability to use uncensored or fine-tuned models.
Quick Setup: Continue.dev + vLLM
The most common self-hosted coding setup pairs Continue.dev (IDE extension) with vLLM (inference engine). Here is the minimal configuration.
```bash
# Serve Qwen2.5-Coder-32B with tool calling enabled
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

Continue.dev configuration (config.json):

```json
{
"models": [
{
"title": "Qwen2.5-Coder-32B (Local)",
"provider": "openai",
"model": "Qwen/Qwen2.5-Coder-32B-Instruct",
"apiBase": "http://localhost:8000/v1",
"apiKey": "not-needed"
}
]
}
```

After both are running, open VS Code, trigger Continue.dev (Ctrl+L or Cmd+L), and start coding with your local model. Completions and chat are served from your GPU with zero cloud dependency.
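Before opening the editor, it is worth smoke-testing the endpoint directly. A minimal check with the openai Python client (the api_key value is arbitrary for a local server):

```python
# Smoke test: confirm the local vLLM endpoint answers before wiring the IDE.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
)
print(resp.choices[0].message.content)
```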
Hands-On Exercises and Summary
Exercise 1: Benchmark Comparison
Pick two coding models from different families (for example, Qwen2.5-Coder-32B and DeepSeek-Coder-V2). Compare their HumanEval, SWE-bench, and Aider Polyglot scores. Which model wins at each benchmark? What does this tell you about their strengths?
Exercise 2: Tool Call Format Matching
For a model you want to serve with vLLM, identify the correct --tool-call-parser flag. Check the model’s documentation or chat template to determine which format it was trained on.
Exercise 3: Model Selection Decision
You have an RTX 4090 (24 GB VRAM) and need a coding assistant for daily Python development. Using the model selection framework from Section 7, determine which model and quantization level gives you the best quality within your VRAM budget.
Answers are not provided; work through these exercises against your own hardware and current benchmark data, since the right answers depend on your setup.
Lesson Summary
Coding LLMs are not just general models that happen to write code — the best ones are purpose-built specialists trained on curated code corpora and evaluated against programming-specific benchmarks. The agentic revolution adds tool calling, multi-step reasoning, and IDE integration to turn these models into genuine development partners.
- Benchmarks (HumanEval, SWE-bench, Aider) measure different coding capabilities
- Coding model families offer specialization advantages at every parameter count
- Agentic workflows use tool calling to read, write, test, and iterate on code
- Tool call parser matching is critical for reliable function calling
- Alignment affects what coding tasks a model will and will not perform
- Model selection is task-dependent, hardware-constrained, and framework-driven
- Self-hosted coding assistants offer privacy, cost, and customization advantages
Further Reading
- Aider LLM Leaderboards — Live benchmark tracking
- SWE-bench — Real-world coding evaluation
- vLLM Documentation — Tool calling and parser configuration
- Continue.dev Documentation — IDE integration setup
Sources and References
Model Cards and Specifications
- [1] Qwen2.5-Coder-32B-Instruct — Alibaba Qwen. 32B params, 128K context, open-source SOTA coding model. https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct (verified 2026-02-27)
- [2] DeepSeek-Coder-V2 — DeepSeek. Coding-specialized MoE variant. https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct (verified 2026-02-27)
- [3] Code Llama 70B — Meta. Fine-tuned from Llama 2 for code generation. https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf (verified 2026-02-27)
- [4] StarCoder2-15B — BigCode. Trained on The Stack v2, 600+ languages. https://huggingface.co/bigcode/starcoder2-15b (verified 2026-02-27)
- [5] Codestral 22B — Mistral AI. 22B params, 32K context, coding-specialized with FIM support. https://mistral.ai/news/codestral/ (verified 2026-02-27)
- [6] Yi-Coder — 01.AI. 1.5B–9B params, 128K context, lightweight coding specialist. https://huggingface.co/01-ai/Yi-Coder-9B-Chat (verified 2026-02-27)
Benchmarks and Rankings
- [7] HumanEval — OpenAI. 164 hand-written Python programming problems. https://github.com/openai/human-eval (verified 2026-02-27)
- [8] SWE-bench Verified — Princeton. Real-world GitHub issue resolution. https://www.swebench.com/ (verified 2026-02-27)
- [9] Aider Polyglot Benchmark — multi-language code editing evaluation. https://aider.chat/docs/leaderboards/ (verified 2026-02-27)
- [10] LiveCodeBench — contamination-free coding benchmark from competitive programming. https://livecodebench.github.io/ (verified 2026-02-27)
- [11] SWE-agent — Princeton. Agent framework for autonomous software engineering on SWE-bench. https://github.com/princeton-nlp/SWE-agent (verified 2026-02-27)
Software and Tools
- [12] Continue.dev — open-source AI code assistant for VS Code and JetBrains. https://continue.dev/ (verified 2026-02-27)
- [13] Aider — AI pair programming in your terminal. https://aider.chat/ (verified 2026-02-27)
- [14] Cursor — AI-first code editor built on VS Code. https://cursor.com/ (verified 2026-02-27)
- [15] vLLM Tool Calling Documentation — parser flags, format specs, and supported models. https://docs.vllm.ai/en/latest/features/tool_calling.html (verified 2026-02-27)
- [16] NousResearch Hermes-2 Function Calling Format — community standard for open-weight tool use. https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B (verified 2026-02-27)
- [17] Llama 3.1 Tool Use — Meta. Native function calling format for Llama 3.1+ models. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/ (verified 2026-02-27)
- [18] Mistral AI Function Calling — native tool_use format for Mistral and Codestral models. https://docs.mistral.ai/capabilities/function_calling/ (verified 2026-02-27)
Methodology Notes
Benchmark scores in this lesson are point-in-time measurements that change with every model release. HumanEval, SWE-bench, and Aider Polyglot scores were last verified on 2026-02-27. Model selection advice is framework-based rather than hardcoded rankings — the "best" model depends on your specific coding task, hardware, and latency requirements. Tool calling format specifications reflect the state of vLLM, llama.cpp, and related inference engines as of early 2026. Quarterly review recommended.