Lesson 05 · Intermediate

Coding LLMs & Agentic AI

Explore coding-specialized LLMs, benchmarks, agentic tool calling, parser matching, and how to select and deploy the right model for your coding workflow.

35-50 min
Updated 2026-02-27
6 Topics
Coding LLMs · Benchmarks · Agentic AI · Tool Calling · Model Selection · IDE Integration


Introduction — Why Coding LLMs Matter

In Lesson 04, we explored how LLMs work under the hood — from neural networks and transformers to quantization, inference engines, and GPU hardware. You now understand what these models are and how to run them.

This lesson narrows the focus to a single, high-impact domain: coding. Not all LLMs are equal at writing code. Some are general-purpose models that happen to know some Python. Others are purpose-built coding specialists, trained on trillions of code tokens, fine-tuned on instruction-following for development tasks, and optimized for the specific patterns that make code generation useful.

The difference matters. A coding-specialized 32B model can outperform a general-purpose 70B model on programming benchmarks — while using half the VRAM and running twice as fast. Choosing the right model for coding is not just a preference; it is a resource and quality decision.

What You Will Learn

This lesson covers the full landscape of coding LLMs: how they are benchmarked, the major model families specialized for code, how agentic tool calling works, what tool-call parsers do inside inference engines, alignment and censorship considerations, and how to select and deploy the right model for your coding workflow.

Coding Benchmarks — Measuring Code Intelligence

How do you know if a model is good at coding? You cannot just ask it to write “Hello World” and declare victory. The industry uses standardized benchmarks that test increasingly realistic coding scenarios.

| Benchmark | What It Measures | Format |
| --- | --- | --- |
| HumanEval | Function-level code generation from docstrings | 164 Python problems, pass@1 |
| SWE-bench Verified | Real-world GitHub issue resolution across full repos | Subset of 500 verified solvable issues |
| Aider Polyglot | Multi-language code editing via natural-language instructions | 225 tasks across Python, JS, Java, C++, and more |
| LiveCodeBench | Contamination-free problems from competitive programming | Continuously updated problem set |

Why Multiple Benchmarks Matter

A model that scores 92% on HumanEval (isolated function generation) might score only 40% on SWE-bench (full-repo issue resolution). These benchmarks test fundamentally different skills: writing a function from a docstring is very different from navigating a 50-file codebase, understanding the bug, and producing a correct multi-file patch.
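The pass@1 metric used by HumanEval comes from the unbiased pass@k estimator in the original HumanEval paper: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples and 4 passes, pass@1 is simply c/n:
print(round(pass_at_k(20, 4, 1), 3))    # 0.2
# Drawing 10 of the 20 samples almost always hits a passing one:
print(round(pass_at_k(20, 4, 10), 3))   # 0.957
```

A benchmark's headline pass@1 score is this value averaged over all problems in the suite.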

Score Landscape — How Top Models Compare

The table below shows approximate performance tiers for major open-weight coding models as of early 2026. Scores shift with every release — treat this as a landscape orientation, not a leaderboard.

| Model | Parameters | Benchmark Tier | Best For |
| --- | --- | --- | --- |
| Qwen2.5-Coder-32B | 32B dense | Elite — top scores across HumanEval, Aider, and multi-language tasks | All-round self-hosted coding assistant |
| DeepSeek-Coder-V2 | 236B MoE (~21B active) | Elite — competitive with closed-source models on code generation | High-quality generation when VRAM allows |
| Codestral 22B | 22B dense | Strong — excellent fill-in-the-middle and completion | Fast code completion, IDE autocomplete |
| StarCoder2-15B | 15B dense | Solid — broad language coverage, strong for its size | Multi-language projects, resource-constrained setups |
| Code Llama 70B | 70B dense | Established — strong baseline, surpassed by newer specialists | Mature ecosystem, extensive tooling support |

Approximate performance tiers based on published model cards and benchmark repositories as of early 2026. Scores change with every model release — verify against primary sources before making deployment decisions. Sources: HumanEval, Aider leaderboards. Last verified 2026-02-27.

The Complexity Ladder

Coding benchmarks form a complexity ladder. At the base sits HumanEval: generating an isolated function from a docstring. In the middle, Aider Polyglot tests multi-language code editing via natural language instructions. At the top, SWE-bench demands full-repo navigation, bug identification, and multi-file patching to resolve real GitHub issues.

A model’s position on this ladder determines what tasks you can trust it with. A model that excels at HumanEval but struggles on SWE-bench is suitable for code completion but not for autonomous issue resolution. Understanding where each benchmark sits tells you what the score actually measures.

Coding Model Families — The Landscape

Not every LLM is trained for code. The major coding model families are purpose-built: trained on curated code datasets, fine-tuned on coding instructions, and benchmarked against programming-specific evaluations. Understanding the landscape helps you pick the right tool.

What Makes a “Coding Model”?

A coding model differs from a general-purpose model in three ways: training data (heavy code corpus), fine-tuning targets (code completion, generation, editing instructions), and evaluation focus (HumanEval, SWE-bench, not just MMLU).

| Family | Organization | Param Range | Specialty | License |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder | Alibaba / Qwen | 0.5B – 32B | Full-stack coding, instruction following, multi-language | Apache 2.0 |
| DeepSeek-Coder | DeepSeek | 1.3B – 236B (MoE) | Code generation, project-level reasoning | Model License (research + commercial) |
| Code Llama | Meta | 7B – 70B | Code completion, infilling, Python specialization | Llama Community License |
| StarCoder2 | BigCode | 3B – 15B | 600+ languages, fill-in-the-middle, broad coverage | BigCode OpenRAIL-M |
| Codestral | Mistral AI | 22B | Fast completion, FIM, low-latency IDE use | Mistral AI Non-Production License |
| Yi-Coder | 01.AI | 1.5B – 9B | Lightweight coding, long context (128K) | Apache 2.0 |

Major open-weight coding model families as of early 2026. New families and versions ship frequently — this table captures the established landscape, not an exhaustive list. Source: HuggingFace model hub. Last verified 2026-02-27.

Coverage Matrix — What Each Family Does Best

Each coding model family has strengths and gaps. The matrix below maps families to common coding tasks, showing where each excels.

Family × Capability Coverage

| Family | Completion | Generation | FIM | Multi-lang | Agentic |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder | ✓ | ✓ | ✓ | ✓ | ✓ |
| DeepSeek-Coder | ✓ | ✓ | ✓ | ✓ | ✓ |
| Code Llama | ✓ | ✓ | ✓ | | |
| StarCoder2 | ✓ | | ✓ | ✓ | |
| Codestral | ✓ | ✓ | ✓ | ✓ | |
| Yi-Coder | ✓ | ✓ | | | |

✓ = strong support based on published benchmarks and model documentation. Coverage evolves with each release.

Generalists vs Specialists

Some general-purpose models (GPT-4o, Claude, DeepSeek V3) are excellent at coding despite not being “coding models.” The distinction matters most at smaller parameter counts where specialization provides the biggest quality lift per VRAM dollar.

The Open-Weight Coding Landscape

The open-weight ecosystem for coding models has matured rapidly. Families like Qwen-Coder, DeepSeek-Coder, and StarCoder2 offer models across a range of parameter counts, making self-hosted coding assistance accessible from consumer GPUs to datacenter deployments.

Family Quick Reference

Qwen2.5-Coder

Sizes: 0.5B / 1.5B / 3B / 7B / 14B / 32B

Best all-round open-weight coding model as of early 2026. Top benchmark tier across HumanEval, Aider, and multi-language tasks.

Apache 2.0 license. The 32B variant matches cloud-tier models on coding benchmarks.

DeepSeek-Coder

Sizes: 1.3B / 6.7B / 33B / V2 236B MoE

Strong code generation and project-level reasoning. V2 uses MoE architecture for efficient inference.

V2 activates ~21B of 236B total params. Needs significant VRAM for weights but runs at ~21B speed.

Code Llama

Sizes: 7B / 13B / 34B / 70B

Mature ecosystem with extensive community tooling. Python-specialized variant available.

Based on Llama 2. Surpassed by newer specialists but remains widely supported.

StarCoder2

Sizes: 3B / 7B / 15B

Broadest language coverage (600+ languages). Strong fill-in-the-middle support.

BigCode OpenRAIL-M license. Trained on The Stack v2, the largest open code dataset.

Codestral

Sizes: 22B

Fast completion with strong FIM support. Designed for low-latency IDE integration.

Mistral AI Non-Production License. 32K context window. 80+ programming languages.

Yi-Coder

Sizes: 1.5B / 9B

Lightweight coding specialist with 128K context window. Excellent for constrained hardware.

Apache 2.0 license. 9B variant punches above its weight on code generation tasks.

Agentic Architecture — Tool Calling and Agent Loops

The most powerful use of coding LLMs is not generating code in isolation — it is running them as agents that can read files, write code, run tests, and iterate on failures. This is the difference between a code completion tool and an AI pair programmer.

The Agent Loop

An agentic coding workflow follows a loop: the model receives a task, decides which tools to call (read file, write file, run command), executes the tool calls, observes the results, and decides what to do next. This continues until the task is complete or the model determines it cannot proceed.

The Agent Loop

1. Receive Task

User describes goal

2. Plan

Model decides action

3. Tool Call

Read, write, or run

4. Observe

Read tool result

5. Decide

Done or iterate?

Steps 2–5 repeat until task is complete or model stops
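The loop above can be sketched as a toy harness. Here the tool registry and the hard-coded "model" stand in for real file tools and a real LLM API call; only the control flow is the part that carries over to a production agent:

```python
# Toy agent loop. `read_file` and `fake_model` are illustrative
# stand-ins, not real framework APIs.
def read_file(path: str) -> str:
    files = {"app.py": "def login(): ..."}          # pretend repository
    return files.get(path, "<file not found>")

TOOLS = {"read_file": read_file}

def fake_model(history: list) -> dict:
    """Stand-in for the LLM call. A real loop would POST `history` to an
    OpenAI-compatible endpoint and parse the returned tool calls."""
    if not any(m["role"] == "tool" for m in history):
        return {"tool_call": {"name": "read_file",
                              "arguments": {"path": "app.py"}}}
    return {"final": "login() never returns a session token; patching."}

def agent_loop(task: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]    # 1. Receive task
    for _ in range(max_steps):
        reply = fake_model(history)                  # 2. Plan
        if "final" in reply:                         # 5. Decide: done
            return reply["final"]
        call = reply["tool_call"]                    # 3. Tool call
        result = TOOLS[call["name"]](**call["arguments"])
        history.append({"role": "tool", "content": result})  # 4. Observe
    return "<step budget exhausted>"

print(agent_loop("fix the login bug"))
```

The `max_steps` budget is what keeps an agent from looping forever when it cannot make progress.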

Function Calling vs Tool Use

“Function calling” and “tool use” describe the same capability: the model outputs structured JSON indicating which function to call with what arguments, instead of generating plain text. The runtime (your agent framework) executes the function and feeds the result back to the model.
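Concretely, the two structured shapes involved look roughly like this (OpenAI-style schemas; the `run_tests` tool is a made-up example):

```python
import json

# What the runtime advertises to the model: a JSON Schema description
# of each available function.
tool_def = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite and return its output",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

# What the model emits back instead of plain text. Note that the
# arguments arrive as a JSON *string* the runtime must decode:
tool_call = {"name": "run_tests", "arguments": '{"path": "tests/"}'}

args = json.loads(tool_call["arguments"])
print(args["path"])  # tests/
```

The runtime dispatches on `tool_call["name"]`, runs the function with the decoded arguments, and appends the result to the conversation for the model's next turn.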

Real-World Example

When you ask an AI coding agent to “fix the login bug,” it does not just generate code. It reads the relevant files, identifies the issue, writes a fix, runs the test suite, reads the test output, and iterates if tests fail. Each step is a tool call.

Simple Generation vs Agentic Workflow

The difference between a chatbot that generates code and an agent that writes software is the loop. Simple generation is one-shot: prompt in, code out. Agentic workflows iterate autonomously until the task is done.

| Aspect | Simple Generation | Agentic Workflow |
| --- | --- | --- |
| Input | Single prompt with full context | High-level task description |
| Context gathering | User provides all context manually | Agent reads files, searches the codebase, gathers context |
| Output | One-shot code block | Multi-step: edits, tests, iterations |
| Error handling | User fixes errors and re-prompts | Agent runs tests, reads errors, self-corrects |
| Tool use | None — text in, text out | File read/write, shell commands, search |
| Iterations | 1 (single turn) | 5–50+ (autonomous loop) |

Tool Call Parsers — Format Specs and Engine Matching

For a model to call tools, it needs to output structured data in a specific format. Different model families use different formats. The inference engine must know how to parse the model’s output into actual function calls. This is where tool call parsers come in.

Why Format Matching Matters

If you serve a model that outputs Hermes-style tool calls through an engine configured for ChatML format, the tool calls will fail silently or produce garbage. The parser must match the model’s training format.

Format Comparison

Each model family uses a different syntax for tool calls. The inference engine must parse the model’s raw output and extract structured function call data. If the parser expects Hermes format but the model outputs Mistral format, the call fails silently.

| Format | Syntax Pattern | Used By |
| --- | --- | --- |
| Hermes-2 | <tool_call>{...JSON...}</tool_call> | NousResearch Hermes models, Qwen2.5 Instruct/Coder, many community fine-tunes |
| Llama 3.1+ native | <\|python_tag\|> or JSON function blocks | Llama 3.1, 3.2, 3.3 Instruct models |
| Mistral tool_use | [TOOL_CALLS] {...JSON...} | Mistral, Codestral, Mixtral Instruct |
| Qwen function_call | ✿FUNCTION✿ markers in the Qwen-Agent framework; the Qwen2.5 chat template itself emits Hermes-style <tool_call> blocks | Qwen-Agent; Qwen2.5 Instruct and Qwen2.5-Coder via Hermes format |
| DeepSeek JSON | JSON function call in the assistant response | DeepSeek-V2, DeepSeek-V3, DeepSeek-Coder |

Tool-calling format specifications as of early 2026. Formats evolve with model releases — the model's chat template is always the authoritative source. Source: vLLM Tool Calling docs. Last verified 2026-02-27.
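What an engine-side parser actually does can be sketched for the Hermes format, which wraps a JSON payload in <tool_call> tags. This is a simplified illustration; production parsers also handle streaming output and malformed JSON:

```python
import json
import re

# Matches each <tool_call>...</tool_call> block; DOTALL lets the JSON
# payload span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list:
    """Extract Hermes-style tool calls from a raw model completion
    and decode each JSON payload into a dict."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

raw = ('Let me check that file.\n'
       '<tool_call>\n'
       '{"name": "read_file", "arguments": {"path": "app.py"}}\n'
       '</tool_call>')
print(parse_hermes_tool_calls(raw))
# [{'name': 'read_file', 'arguments': {'path': 'app.py'}}]
```

If the model had emitted Mistral's `[TOOL_CALLS]` syntax instead, this regex would find nothing — which is exactly the "fails silently" mode described above.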

vLLM Parser Configuration

vLLM supports multiple tool call parsers via the --tool-call-parser flag. Choosing the correct parser for your model is essential for reliable agentic workflows.

vLLM Tool Call Parser Examples (bash)
# Hermes-2 fine-tune (most community models)
vllm serve my-model --tool-call-parser hermes

# Llama 3.1+ Instruct (native Meta format)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tool-call-parser llama3_json

# Mistral / Codestral
vllm serve mistralai/Codestral-22B-v0.1 \
  --tool-call-parser mistral

# DeepSeek-V3 (dedicated parser)
vllm serve deepseek-ai/DeepSeek-V3 \
  --tool-call-parser deepseek_v3

# Tool calling also requires --enable-auto-tool-choice on the serve command;
# then call with tools=[...] in your API request
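On the client side, a request to the OpenAI-compatible endpoint carries the tool schemas in a `tools` array. A minimal sketch (the `list_dir` tool is hypothetical, and the actual send is commented out so the snippet runs without a server):

```python
import json

def build_chat_request(model: str, prompt: str, tools: list) -> dict:
    """Body for a POST to vLLM's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide when to call a tool
    }

payload = build_chat_request(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    "List the files in src/",
    [{"type": "function",
      "function": {"name": "list_dir",   # hypothetical tool for illustration
                   "parameters": {"type": "object",
                                  "properties": {"path": {"type": "string"}}}}}],
)
body = json.dumps(payload).encode()  # what you would POST

# With a vLLM server running on localhost:8000:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions", data=body,
#     headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req)))
print(payload["tool_choice"])  # auto
```

If the parser flag matches the model, the response contains structured `tool_calls` instead of free text.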

Parser Matching Decision Table

Use this table to look up the correct --tool-call-parser flag for your model. When in doubt, check the model’s tokenizer_config.json for its chat template.

| Model Family | vLLM Parser Flag | Notes |
| --- | --- | --- |
| Hermes-2 fine-tunes | hermes | Most common for community fine-tunes |
| Llama 3.1+ Instruct | llama3_json | Native Llama tool-calling format |
| Mistral / Codestral | mistral | Mistral-specific tool_use format |
| Qwen2.5 Instruct / Coder | hermes | Qwen2.5's chat template emits Hermes-style <tool_call> tags; Qwen3-Coder uses qwen3_xml — check the chat template |
| DeepSeek-V3 | deepseek_v3 | Dedicated parser since vLLM 0.8+ |

Parser flag mappings based on vLLM documentation and model chat templates as of early 2026. Verify against the current vLLM release and your model's specific version. Source: vLLM Tool Calling docs. Last verified 2026-02-27.
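Checking the chat template can be partially automated. The sketch below scans tokenizer_config.json for known tool-call markers; the marker-to-flag mapping is our own heuristic, not an official list, and some models ship the template in a separate file, so always confirm against the vLLM docs:

```python
import json

# Heuristic marker -> vLLM parser flag mapping (illustrative, not official).
MARKERS = {
    "<tool_call>": "hermes",
    "<|python_tag|>": "llama3_json",
    "[TOOL_CALLS]": "mistral",
}

def guess_parser(tokenizer_config_path: str) -> str:
    """Look for tool-call markers in the model's chat template and
    return a candidate --tool-call-parser flag, or 'unknown'."""
    with open(tokenizer_config_path) as f:
        template = json.load(f).get("chat_template") or ""
    for marker, flag in MARKERS.items():
        if marker in template:
            return flag
    return "unknown"

# Demo with a synthetic config; real files ship alongside the model weights.
import os
import tempfile
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"chat_template": "...<tool_call>{{ call }}</tool_call>..."}, f)
print(guess_parser(f.name))  # hermes
os.remove(f.name)
```

Treat the result as a starting point for the decision table above, not a final answer.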

Censorship and Alignment in Coding Models

When you ask a coding model to write a penetration testing script, bypass a rate limiter, or implement a web scraper, some models will refuse. Others will comply without hesitation. This difference comes from alignment training — the safety fine-tuning applied after the model learns to code.

What Alignment Does to Coding Models

Alignment (RLHF, DPO, or constitutional AI methods) teaches models to refuse harmful requests. For general chat, this is straightforward. For coding, the boundary is murkier: security research, DevOps automation, and system administration often require code that looks like it could be misused.

Refusal Patterns in Practice

The impact of alignment on coding varies by task. Standard application development is unaffected. Security research, DevOps tooling, and system-level code trigger refusals more frequently in heavily aligned models.

| Task Type | Aligned Model Response | Uncensored Model Response |
| --- | --- | --- |
| Write a port scanner | May refuse or add extensive warnings about legality | Generates the code directly |
| Bypass a rate limiter | Often refuses, citing potential misuse | Provides implementation options |
| Web scraper for site X | May warn about Terms of Service, sometimes refuses | Generates the scraper code |
| Reverse-engineer a binary | Often refuses or adds legal disclaimers | Provides an analysis approach |
| Standard CRUD API | Generates normally | Generates normally |

Note that refusal behavior varies between model versions and even between quantization levels of the same model. The patterns above are generalizations — always test with your specific model and use case.

Uncensored Variants

Some open-weight models are released in “uncensored” or “abliterated” variants where safety fine-tuning has been partially removed. These variants are more compliant for legitimate coding tasks, but the guardrails are removed entirely — including for genuinely harmful requests.

The Alignment Spectrum

Heavily Aligned

Standard

Base Model

Abliterated

Most self-hosted coding use cases are best served by standard or lightly aligned models. Moving further right increases compliance but removes all safety boundaries.

Model Selection — Choosing the Right Coding Model

With dozens of coding-capable models available, how do you choose? The answer depends on your task type, hardware, latency requirements, and quality threshold.

Task-Based Selection Framework

Different coding tasks have different model requirements. Code completion (fill-in-the-middle) needs speed above all. Code generation needs quality and instruction following. Code review needs reasoning depth. Agentic workflows need reliable tool calling.

Task-Based Model Selection

Code Completion

Speed > Quality

FIM-capable, low latency. Smaller models (7B–15B) at higher quantization for fast autocomplete.

Code Generation

Quality > Speed

Instruction-tuned, strong HumanEval scores. Larger models (32B+) for complex function generation.

Code Review

Reasoning depth

Large context window, strong reasoning. 32B+ models that can hold entire files and explain issues.

Agentic Workflows

Tool calling reliability

Reliable function calling, iterative reasoning. Must support tool-call parser in your inference engine.

Quick Decision Flow

Model Selection Decision Flow

1. What is your primary task?

Completion → prioritize speed. Generation/Review → prioritize quality. Agentic → prioritize tool calling.

2. How much VRAM do you have?

8–12 GB → 7B models. 16–24 GB → up to 32B (Q4). 48+ GB → 70B models.

3. Do you need tool calling?

Yes → check parser support in vLLM for your model family (Section 5).

4. Latency or quality?

Latency-sensitive → smaller model + higher quantization. Quality-sensitive → largest model that fits.
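Step 2 of the flow can be captured as a simple lookup mirroring the VRAM tiers this lesson uses (the thresholds are the lesson's rough guidance, not hard limits):

```python
def pick_model_size(vram_gb: int) -> str:
    """Map available VRAM to a rough dense-model ceiling at Q4,
    following the lesson's hardware tiers."""
    if vram_gb >= 48:
        return "up to ~70B (Q4)"
    if vram_gb >= 16:
        return "up to ~32B (Q4)"
    if vram_gb >= 8:
        return "7B-13B"
    return "below the 7B tier; consider smaller quantized models"

print(pick_model_size(24))  # up to ~32B (Q4)
print(pick_model_size(10))  # 7B-13B
```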

Hardware-Constrained Selection

Your GPU determines your ceiling. A consumer RTX 4090 (24 GB VRAM) runs different models than a dual-A6000 workstation (96 GB total). The model selection framework must account for what actually fits.

| Hardware Tier | VRAM | Max Dense Model (Q4) | Recommended Coding Models |
| --- | --- | --- | --- |
| Consumer GPU | 8–12 GB | ~7B–13B | Qwen2.5-Coder-7B, Yi-Coder-9B, StarCoder2-7B |
| Enthusiast GPU | 16–24 GB | ~14B–32B | Qwen2.5-Coder-32B (Q4), Codestral 22B, StarCoder2-15B |
| Workstation GPU | 48 GB | ~70B (Q4) | Code Llama 70B (Q4), Qwen2.5-Coder-32B (FP16) |
| Dual GPU | 48–96 GB | ~70B (FP16) | DeepSeek-Coder-V2 (quantized), Code Llama 70B (FP16) |
| Datacenter | 80+ GB per GPU | ~140B+ or large MoE | DeepSeek-Coder-V2 (full), enterprise-scale models |

Model size estimates assume Q4 quantization unless noted. Actual VRAM usage depends on context length, KV cache, and engine overhead. Use the VRAM calculator from Lesson 04 for precise estimates. Last verified 2026-02-27.
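For a quick estimate without the calculator, a back-of-envelope rule is weights at the quantized bit width plus a flat overhead allowance. The 4.5 bits/weight figure approximates a Q4_K_M-style quant and the 2 GB overhead is a rough placeholder for KV cache and runtime; treat the result as a floor, not a guarantee:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 2.0) -> float:
    """Back-of-envelope VRAM estimate: model weights at the quantized
    bit width plus a flat allowance for KV cache and engine overhead.
    Long contexts grow the KV cache well past the flat term."""
    return params_b * bits_per_weight / 8 + overhead_gb

print(round(estimate_vram_gb(32), 1))  # 20.0 -> a 32B Q4 model fits a 24 GB card
print(round(estimate_vram_gb(70), 1))  # 41.4 -> a 70B Q4 model wants a 48 GB card
```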

Framework, Not Rankings

This section provides a decision framework rather than a hardcoded “best model” list. Model rankings change monthly. The criteria for selecting them do not.

Deployment for Coders — IDE Integration and Workflows

A coding model is only useful if it integrates into your actual development workflow. This section covers the tools that connect self-hosted models to your editor, terminal, and CI pipeline.

IDE Integration Options

The three dominant approaches to local AI coding assistance are: editor extensions (Continue.dev for VS Code/JetBrains), terminal agents (Aider for git-aware pair programming), and AI-native editors (Cursor, built from the ground up around AI interaction).

| Feature | Continue.dev | Aider | Cursor |
| --- | --- | --- | --- |
| Type | IDE extension | Terminal agent | AI-native editor |
| Editor Support | VS Code, JetBrains | Any (terminal-based) | Cursor only (VS Code fork) |
| Local Model Support | Yes (OpenAI-compatible API) | Yes (OpenAI-compatible API) | Limited (primarily cloud) |
| Agentic Mode | Basic (edit + apply) | Full (git-aware, multi-file) | Full (Composer agent) |
| Git Integration | No | Yes (auto-commit) | Basic |
| Open Source | Yes (Apache 2.0) | Yes (Apache 2.0) | No (proprietary) |
| Best For | IDE-integrated completions with local models | Terminal-first pair programming | All-in-one AI coding (cloud-first) |

Feature comparison based on official documentation as of early 2026. All three tools are under active development — verify current capabilities before choosing. Sources: Continue.dev, Aider, Cursor. Last verified 2026-02-27.

Self-Hosted Coding Assistant Architecture

GPU

GPU + Model

Qwen2.5-Coder-32B loaded in VRAM

vLLM

Inference Engine

vLLM serves OpenAI-compatible API

API

API Endpoint

localhost:8000/v1/chat/completions

IDE

IDE Integration

Continue.dev / Aider connects to API

All data stays on your machine. No external API calls, no per-token costs.

Self-Hosted vs Cloud-Backed

Every IDE integration tool can connect to either cloud APIs (OpenAI, Anthropic) or local inference endpoints (vLLM, Ollama). Running your own model gives you privacy, zero per-token cost, and the ability to use uncensored or fine-tuned models.

The Self-Hosted Advantage

With a local vLLM server running a coding model, your IDE integration sends completions to localhost. No data leaves your machine. No API keys. No usage limits. The only cost is your GPU’s electricity.

Quick Setup: Continue.dev + vLLM

The most common self-hosted coding setup pairs Continue.dev (IDE extension) with vLLM (inference engine). Here is the minimal configuration.

Step 1: Start vLLM with a coding model (bash)
# Serve Qwen2.5-Coder-32B with tool calling enabled
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
Step 2: Configure Continue.dev at ~/.continue/config.json (json)
{
  "models": [
    {
      "title": "Qwen2.5-Coder-32B (Local)",
      "provider": "openai",
      "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}

After both are running, open VS Code, trigger Continue.dev (Ctrl+L or Cmd+L), and start coding with your local model. Completions and chat are served from your GPU with zero cloud dependency.

Hands-On Exercises and Summary

Exercise 1: Benchmark Comparison

Pick two coding models from different families (for example, Qwen2.5-Coder-32B and DeepSeek-Coder-V2). Compare their HumanEval, SWE-bench, and Aider Polyglot scores. Which model wins at each benchmark? What does this tell you about their strengths?

Exercise 2: Tool Call Format Matching

For a model you want to serve with vLLM, identify the correct --tool-call-parser flag. Check the model’s documentation or chat template to determine which format it was trained on.

Exercise 3: Model Selection Decision

You have an RTX 4090 (24 GB VRAM) and need a coding assistant for daily Python development. Using the model selection framework from Section 7, determine which model and quantization level gives you the best quality within your VRAM budget.

Answers Not Provided

These exercises are designed for self-directed exploration. The “right” answers depend on your hardware, task requirements, and the latest model releases. Use the decision frameworks from this lesson rather than memorized answers.

Lesson Summary

Coding LLMs are not just general models that happen to write code — the best ones are purpose-built specialists trained on curated code corpora and evaluated against programming-specific benchmarks. The agentic revolution adds tool calling, multi-step reasoning, and IDE integration to turn these models into genuine development partners.

  • Benchmarks (HumanEval, SWE-bench, Aider) measure different coding capabilities
  • Coding model families offer specialization advantages at every parameter count
  • Agentic workflows use tool calling to read, write, test, and iterate on code
  • Tool call parser matching is critical for reliable function calling
  • Alignment affects what coding tasks a model will and will not perform
  • Model selection is task-dependent, hardware-constrained, and framework-driven
  • Self-hosted coding assistants offer privacy, cost, and customization advantages

Further Reading

  • Aider LLM Leaderboards — Live benchmark tracking
  • SWE-bench — Real-world coding evaluation
  • vLLM Documentation — Tool calling and parser configuration
  • Continue.dev Documentation — IDE integration setup

Sources and References

Model Cards and Specifications

  [1] Qwen2.5-Coder-32B-Instruct — Alibaba Qwen. 32B params, 128K context, open-source SOTA coding model. https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct (verified 2026-02-27)
  [2] DeepSeek-Coder-V2 — DeepSeek. Coding-specialized MoE variant. https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct (verified 2026-02-27)
  [3] Code Llama 70B — Meta. Fine-tuned from Llama 2 for code generation. https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf (verified 2026-02-27)
  [4] StarCoder2-15B — BigCode. Trained on The Stack v2, 600+ languages. https://huggingface.co/bigcode/starcoder2-15b (verified 2026-02-27)
  [5] Codestral 22B — Mistral AI. 22B params, 32K context, coding-specialized with FIM support. https://mistral.ai/news/codestral/ (verified 2026-02-27)
  [6] Yi-Coder — 01.AI. 1.5B–9B params, 128K context, lightweight coding specialist. https://huggingface.co/01-ai/Yi-Coder-9B-Chat (verified 2026-02-27)

Benchmarks and Rankings

  [7] HumanEval — OpenAI. 164 hand-written Python programming problems. https://github.com/openai/human-eval (verified 2026-02-27)
  [8] SWE-bench Verified — Princeton. Real-world GitHub issue resolution. https://www.swebench.com/ (verified 2026-02-27)
  [9] Aider Polyglot Benchmark — multi-language code editing evaluation. https://aider.chat/docs/leaderboards/ (verified 2026-02-27)
  [10] LiveCodeBench — contamination-free coding benchmark from competitive programming. https://livecodebench.github.io/ (verified 2026-02-27)
  [11] SWE-agent — Princeton. Agent framework for autonomous software engineering on SWE-bench. https://github.com/princeton-nlp/SWE-agent (verified 2026-02-27)

Software and Tools

  [12] Continue.dev — open-source AI code assistant for VS Code and JetBrains. https://continue.dev/ (verified 2026-02-27)
  [13] Aider — AI pair programming in your terminal. https://aider.chat/ (verified 2026-02-27)
  [14] Cursor — AI-first code editor built on VS Code. https://cursor.com/ (verified 2026-02-27)
  [15] vLLM Tool Calling Documentation — parser flags, format specs, and supported models. https://docs.vllm.ai/en/latest/features/tool_calling.html (verified 2026-02-27)
  [16] NousResearch Hermes-2 Function Calling Format — community standard for open-weight tool use. https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B (verified 2026-02-27)
  [17] Llama 3.1 Tool Use — Meta's native function calling format for Llama 3.1+ models. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/ (verified 2026-02-27)
  [18] Mistral AI Function Calling — native tool_use format for Mistral and Codestral models. https://docs.mistral.ai/capabilities/function_calling/ (verified 2026-02-27)

Methodology Notes

Benchmark scores in this lesson are point-in-time measurements that change with every model release. HumanEval, SWE-bench, and Aider Polyglot scores were last verified on 2026-02-27. Model selection advice is framework-based rather than hardcoded rankings — the "best" model depends on your specific coding task, hardware, and latency requirements. Tool calling format specifications reflect the state of vLLM, llama.cpp, and related inference engines as of early 2026. Quarterly review recommended.