Lesson 05 · Intermediate

Coding LLMs & Agentic AI

Explore coding-specialized LLMs, benchmarks, agentic tool calling, parser matching, and how to select and deploy the right model for your coding workflow.

35-50 min
Updated 2026-02-27
6 Topics
Coding LLMs · Benchmarks · Agentic AI · Tool Calling · Model Selection · IDE Integration


Introduction — Why Coding LLMs Matter

In Lesson 04, we explored how LLMs work under the hood — from neural networks and transformers to quantization, inference engines, and GPU hardware. You now understand what these models are and how to run them.

This lesson narrows the focus to a single, high-impact domain: coding. Not all LLMs are equal at writing code. Some are general-purpose models that happen to know some Python. Others are purpose-built coding specialists, trained on trillions of code tokens, fine-tuned on instruction-following for development tasks, and optimized for the specific patterns that make code generation useful.

The difference matters. A coding-specialized 32B model can outperform a general-purpose 70B model on programming benchmarks — while using half the VRAM and running twice as fast. Choosing the right model for coding is not just a preference; it is a resource and quality decision.

What You Will Learn

This lesson covers the full landscape of coding LLMs: how they are benchmarked, the major model families specialized for code, how agentic tool calling works, what tool-call parsers do inside inference engines, alignment and censorship considerations, and how to select and deploy the right model for your coding workflow.

Coding Benchmarks — Measuring Code Intelligence

How do you know if a model is good at coding? You cannot just ask it to write “Hello World” and declare victory. The industry uses standardized benchmarks that test increasingly realistic coding scenarios.

| Benchmark | What It Measures | Format |
| --- | --- | --- |
| HumanEval | Function-level code generation from docstrings | 164 Python problems, pass@1 |
| SWE-bench Verified | Real-world GitHub issue resolution across full repos | Subset of 500 verified solvable issues |
| Aider Polyglot | Multi-language code editing via natural-language instructions | 225 tasks across Python, JS, Java, C++, and more |
| LiveCodeBench | Contamination-free problems from competitive programming | Continuously updated problem set |

Why Multiple Benchmarks Matter

A model that scores 92% on HumanEval (isolated function generation) might score only 40% on SWE-bench (full-repo issue resolution). These benchmarks test fundamentally different skills: writing a function from a docstring is very different from navigating a 50-file codebase, understanding the bug, and producing a correct multi-file patch.
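The pass@1 metric used by HumanEval comes from the unbiased pass@k estimator in the original HumanEval paper: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples and 4 passes, pass@1 is simply c/n:
print(round(pass_at_k(20, 4, 1), 3))    # 0.2
# Drawing 10 of the 20 samples almost always hits a passing one:
print(round(pass_at_k(20, 4, 10), 3))   # 0.957
```

A benchmark's headline pass@1 score is this value averaged over all problems in the suite.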

Score Landscape — How Top Models Compare

The table below shows approximate performance tiers for major open-weight coding models as of early 2026. Scores shift with every release — treat this as a landscape orientation, not a leaderboard.

| Model | Parameters | Benchmark Tier | Best For |
| --- | --- | --- | --- |
| Qwen2.5-Coder-32B | 32B dense | Elite — top scores across HumanEval, Aider, and multi-language tasks | All-round self-hosted coding assistant |
| DeepSeek-Coder-V2 | 236B MoE (~21B active) | Elite — competitive with closed-source models on code generation | High-quality generation when VRAM allows |
| Codestral 22B | 22B dense | Strong — excellent fill-in-the-middle and completion | Fast code completion, IDE autocomplete |
| StarCoder2-15B | 15B dense | Solid — broad language coverage, strong for its size | Multi-language projects, resource-constrained setups |
| Code Llama 70B | 70B dense | Established — strong baseline, surpassed by newer specialists | Mature ecosystem, extensive tooling support |

Approximate performance tiers based on published model cards and benchmark repositories as of early 2026. Scores change with every model release — verify against primary sources before making deployment decisions. Sources: HumanEval, Aider leaderboards. Last verified 2026-02-27.

The Complexity Ladder

Coding benchmarks form a complexity ladder. At the base sits HumanEval: generating an isolated function from a docstring. In the middle, Aider Polyglot tests multi-language code editing via natural language instructions. At the top, SWE-bench demands full-repo navigation, bug identification, and multi-file patching to resolve real GitHub issues.

A model’s position on this ladder determines what tasks you can trust it with. A model that excels at HumanEval but struggles on SWE-bench is suitable for code completion but not for autonomous issue resolution. Understanding where each benchmark sits tells you what the score actually measures.

Coding Model Families — The Landscape

Not every LLM is trained for code. The major coding model families are purpose-built: trained on curated code datasets, fine-tuned on coding instructions, and benchmarked against programming-specific evaluations. Understanding the landscape helps you pick the right tool.

What Makes a “Coding Model”?

A coding model differs from a general-purpose model in three ways: training data (heavy code corpus), fine-tuning targets (code completion, generation, editing instructions), and evaluation focus (HumanEval, SWE-bench, not just MMLU).

| Family | Organization | Param Range | Specialty | License |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder | Alibaba / Qwen | 0.5B – 32B | Full-stack coding, instruction following, multi-language | Apache 2.0 |
| DeepSeek-Coder | DeepSeek | 1.3B – 236B (MoE) | Code generation, project-level reasoning | Model License (research + commercial) |
| Code Llama | Meta | 7B – 70B | Code completion, infilling, Python specialization | Llama Community License |
| StarCoder2 | BigCode | 3B – 15B | 600+ languages, fill-in-the-middle, broad coverage | BigCode OpenRAIL-M |
| Codestral | Mistral AI | 22B | Fast completion, FIM, low-latency IDE use | Mistral AI Non-Production License |
| Yi-Coder | 01.AI | 1.5B – 9B | Lightweight coding, long context (128K) | Apache 2.0 |

Major open-weight coding model families as of early 2026. New families and versions ship frequently — this table captures the established landscape, not an exhaustive list. Source: HuggingFace model hub. Last verified 2026-02-27.

Coverage Matrix — What Each Family Does Best

Each coding model family has strengths and gaps. The matrix below maps families to common coding tasks, showing where each excels.

Family × Capability Coverage

| Family | Completion | Generation | FIM | Multi-lang | Agentic |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder | ✓ | ✓ | ✓ | ✓ | ✓ |
| DeepSeek-Coder | ✓ | ✓ | ✓ | ✓ | ✓ |
| Code Llama | ✓ | ✓ | ✓ | | |
| StarCoder2 | ✓ | | ✓ | ✓ | |
| Codestral | ✓ | ✓ | ✓ | ✓ | |
| Yi-Coder | ✓ | ✓ | | | |

✓ = strong support based on published benchmarks and model documentation. Coverage evolves with each release.

Generalists vs Specialists

Some general-purpose models (GPT-4o, Claude, DeepSeek V3) are excellent at coding despite not being “coding models.” The distinction matters most at smaller parameter counts where specialization provides the biggest quality lift per VRAM dollar.

The Open-Weight Coding Landscape

The open-weight ecosystem for coding models has matured rapidly. Families like Qwen-Coder, DeepSeek-Coder, and StarCoder2 offer models across a range of parameter counts, making self-hosted coding assistance accessible from consumer GPUs to datacenter deployments.

Family Quick Reference

Qwen2.5-Coder

Sizes: 0.5B / 1.5B / 3B / 7B / 14B / 32B

Best all-round open-weight coding model as of early 2026. Top benchmark tier across HumanEval, Aider, and multi-language tasks.

Apache 2.0 license. The 32B variant matches cloud-tier models on coding benchmarks.

DeepSeek-Coder

Sizes: 1.3B / 6.7B / 33B / V2 236B MoE

Strong code generation and project-level reasoning. V2 uses MoE architecture for efficient inference.

V2 activates ~21B of 236B total params. Needs significant VRAM for weights but runs at ~21B speed.

Code Llama

Sizes: 7B / 13B / 34B / 70B

Mature ecosystem with extensive community tooling. Python-specialized variant available.

Based on Llama 2. Surpassed by newer specialists but remains widely supported.

StarCoder2

Sizes: 3B / 7B / 15B

Broadest language coverage (600+ languages). Strong fill-in-the-middle support.

BigCode OpenRAIL-M license. Trained on The Stack v2, the largest open code dataset.

Codestral

Sizes: 22B

Fast completion with strong FIM support. Designed for low-latency IDE integration.

Mistral AI Non-Production License. 32K context window. 80+ programming languages.

Yi-Coder

Sizes: 1.5B / 9B

Lightweight coding specialist with 128K context window. Excellent for constrained hardware.

Apache 2.0 license. 9B variant punches above its weight on code generation tasks.

Agentic Architecture — Tool Calling and Agent Loops

The most powerful use of coding LLMs is not generating code in isolation — it is running them as agents that can read files, write code, run tests, and iterate on failures. This is the difference between a code completion tool and an AI pair programmer.

The Agent Loop

An agentic coding workflow follows a loop: the model receives a task, decides which tools to call (read file, write file, run command), executes the tool calls, observes the results, and decides what to do next. This continues until the task is complete or the model determines it cannot proceed.

The Agent Loop

1. Receive Task

User describes goal

2. Plan

Model decides action

3. Tool Call

Read, write, or run

4. Observe

Read tool result

5. Decide

Done or iterate?

Steps 2–5 repeat until task is complete or model stops
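The loop above can be sketched as a toy harness. Here the tool registry and the hard-coded "model" stand in for real file tools and a real LLM API call; only the control flow is the part that carries over to a production agent:

```python
# Toy agent loop. `read_file` and `fake_model` are illustrative
# stand-ins, not real framework APIs.
def read_file(path: str) -> str:
    files = {"app.py": "def login(): ..."}          # pretend repository
    return files.get(path, "<file not found>")

TOOLS = {"read_file": read_file}

def fake_model(history: list) -> dict:
    """Stand-in for the LLM call. A real loop would POST `history` to an
    OpenAI-compatible endpoint and parse the returned tool calls."""
    if not any(m["role"] == "tool" for m in history):
        return {"tool_call": {"name": "read_file",
                              "arguments": {"path": "app.py"}}}
    return {"final": "login() never returns a session token; patching."}

def agent_loop(task: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]    # 1. Receive task
    for _ in range(max_steps):
        reply = fake_model(history)                  # 2. Plan
        if "final" in reply:                         # 5. Decide: done
            return reply["final"]
        call = reply["tool_call"]                    # 3. Tool call
        result = TOOLS[call["name"]](**call["arguments"])
        history.append({"role": "tool", "content": result})  # 4. Observe
    return "<step budget exhausted>"

print(agent_loop("fix the login bug"))
```

The `max_steps` budget is what keeps an agent from looping forever when it cannot make progress.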

Function Calling vs Tool Use

“Function calling” and “tool use” describe the same capability: the model outputs structured JSON indicating which function to call with what arguments, instead of generating plain text. The runtime (your agent framework) executes the function and feeds the result back to the model.
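Concretely, the two structured shapes involved look roughly like this (OpenAI-style schemas; the `run_tests` tool is a made-up example):

```python
import json

# What the runtime advertises to the model: a JSON Schema description
# of each available function.
tool_def = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite and return its output",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

# What the model emits back instead of plain text. Note that the
# arguments arrive as a JSON *string* the runtime must decode:
tool_call = {"name": "run_tests", "arguments": '{"path": "tests/"}'}

args = json.loads(tool_call["arguments"])
print(args["path"])  # tests/
```

The runtime dispatches on `tool_call["name"]`, runs the function with the decoded arguments, and appends the result to the conversation for the model's next turn.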

Real-World Example

When you ask an AI coding agent to “fix the login bug,” it does not just generate code. It reads the relevant files, identifies the issue, writes a fix, runs the test suite, reads the test output, and iterates if tests fail. Each step is a tool call.

Simple Generation vs Agentic Workflow

The difference between a chatbot that generates code and an agent that writes software is the loop. Simple generation is one-shot: prompt in, code out. Agentic workflows iterate autonomously until the task is done.

| Aspect | Simple Generation | Agentic Workflow |
| --- | --- | --- |
| Input | Single prompt with full context | High-level task description |
| Context gathering | User provides all context manually | Agent reads files, searches the codebase, gathers context |
| Output | One-shot code block | Multi-step: edits, tests, iterations |
| Error handling | User fixes errors and re-prompts | Agent runs tests, reads errors, self-corrects |
| Tool use | None — text in, text out | File read/write, shell commands, search |
| Iterations | 1 (single turn) | 5–50+ (autonomous loop) |

Tool Call Parsers — Format Specs and Engine Matching

For a model to call tools, it needs to output structured data in a specific format. Different model families use different formats. The inference engine must know how to parse the model’s output into actual function calls. This is where tool call parsers come in.

Why Format Matching Matters

If you serve a model that outputs Hermes-style tool calls through an engine configured for ChatML format, the tool calls will fail silently or produce garbage. The parser must match the model’s training format.

Format Comparison

Each model family uses a different syntax for tool calls. The inference engine must parse the model’s raw output and extract structured function call data. If the parser expects Hermes format but the model outputs Mistral format, the call fails silently.

| Format | Syntax Pattern | Used By |
| --- | --- | --- |
| Hermes-2 | <tool_call>{...JSON...}</tool_call> | NousResearch Hermes models, Qwen2.5 Instruct/Coder, many community fine-tunes |
| Llama 3.1+ native | <\|python_tag\|> or JSON function blocks | Llama 3.1, 3.2, 3.3 Instruct models |
| Mistral tool_use | [TOOL_CALLS] {...JSON...} | Mistral, Codestral, Mixtral Instruct |
| Qwen function_call | ✿FUNCTION✿ markers in the Qwen-Agent framework; the Qwen2.5 chat template itself emits Hermes-style <tool_call> blocks | Qwen-Agent; Qwen2.5 Instruct and Qwen2.5-Coder via Hermes format |
| DeepSeek JSON | JSON function call in the assistant response | DeepSeek-V2, DeepSeek-V3, DeepSeek-Coder |

Tool-calling format specifications as of early 2026. Formats evolve with model releases — the model's chat template is always the authoritative source. Source: vLLM Tool Calling docs. Last verified 2026-02-27.
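What an engine-side parser actually does can be sketched for the Hermes format, which wraps a JSON payload in <tool_call> tags. This is a simplified illustration; production parsers also handle streaming output and malformed JSON:

```python
import json
import re

# Matches each <tool_call>...</tool_call> block; DOTALL lets the JSON
# payload span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list:
    """Extract Hermes-style tool calls from a raw model completion
    and decode each JSON payload into a dict."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

raw = ('Let me check that file.\n'
       '<tool_call>\n'
       '{"name": "read_file", "arguments": {"path": "app.py"}}\n'
       '</tool_call>')
print(parse_hermes_tool_calls(raw))
# [{'name': 'read_file', 'arguments': {'path': 'app.py'}}]
```

If the model had emitted Mistral's `[TOOL_CALLS]` syntax instead, this regex would find nothing — which is exactly the "fails silently" mode described above.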

vLLM Parser Configuration

vLLM supports multiple tool call parsers via the --tool-call-parser flag. Choosing the correct parser for your model is essential for reliable agentic workflows.

vLLM Tool Call Parser Examples (bash)
# Hermes-2 fine-tune (most community models)
vllm serve my-model --tool-call-parser hermes

# Llama 3.1+ Instruct (native Meta format)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tool-call-parser llama3_json

# Mistral / Codestral
vllm serve mistralai/Codestral-22B-v0.1 \
  --tool-call-parser mistral

# DeepSeek-V3 (dedicated parser)
vllm serve deepseek-ai/DeepSeek-V3 \
  --tool-call-parser deepseek_v3

# Tool calling also requires --enable-auto-tool-choice on the serve command;
# then call with tools=[...] in your API request
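On the client side, a request to the OpenAI-compatible endpoint carries the tool schemas in a `tools` array. A minimal sketch (the `list_dir` tool is hypothetical, and the actual send is commented out so the snippet runs without a server):

```python
import json

def build_chat_request(model: str, prompt: str, tools: list) -> dict:
    """Body for a POST to vLLM's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",  # let the model decide when to call a tool
    }

payload = build_chat_request(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    "List the files in src/",
    [{"type": "function",
      "function": {"name": "list_dir",   # hypothetical tool for illustration
                   "parameters": {"type": "object",
                                  "properties": {"path": {"type": "string"}}}}}],
)
body = json.dumps(payload).encode()  # what you would POST

# With a vLLM server running on localhost:8000:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions", data=body,
#     headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req)))
print(payload["tool_choice"])  # auto
```

If the parser flag matches the model, the response contains structured `tool_calls` instead of free text.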

Parser Matching Decision Table

Use this table to look up the correct --tool-call-parser flag for your model. When in doubt, check the model’s tokenizer_config.json for its chat template.

| Model Family | vLLM Parser Flag | Notes |
| --- | --- | --- |
| Hermes-2 fine-tunes | hermes | Most common for community fine-tunes |
| Llama 3.1+ Instruct | llama3_json | Native Llama tool-calling format |
| Mistral / Codestral | mistral | Mistral-specific tool_use format |
| Qwen2.5 Instruct / Coder | hermes | Qwen2.5's chat template emits Hermes-style <tool_call> tags; Qwen3-Coder uses qwen3_xml — check the chat template |
| DeepSeek-V3 | deepseek_v3 | Dedicated parser since vLLM 0.8+ |

Parser flag mappings based on vLLM documentation and model chat templates as of early 2026. Verify against the current vLLM release and your model's specific version. Source: vLLM Tool Calling docs. Last verified 2026-02-27.
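Checking the chat template can be partially automated. The sketch below scans tokenizer_config.json for known tool-call markers; the marker-to-flag mapping is our own heuristic, not an official list, and some models ship the template in a separate file, so always confirm against the vLLM docs:

```python
import json

# Heuristic marker -> vLLM parser flag mapping (illustrative, not official).
MARKERS = {
    "<tool_call>": "hermes",
    "<|python_tag|>": "llama3_json",
    "[TOOL_CALLS]": "mistral",
}

def guess_parser(tokenizer_config_path: str) -> str:
    """Look for tool-call markers in the model's chat template and
    return a candidate --tool-call-parser flag, or 'unknown'."""
    with open(tokenizer_config_path) as f:
        template = json.load(f).get("chat_template") or ""
    for marker, flag in MARKERS.items():
        if marker in template:
            return flag
    return "unknown"

# Demo with a synthetic config; real files ship alongside the model weights.
import os
import tempfile
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"chat_template": "...<tool_call>{{ call }}</tool_call>..."}, f)
print(guess_parser(f.name))  # hermes
os.remove(f.name)
```

Treat the result as a starting point for the decision table above, not a final answer.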

Censorship and Alignment in Coding Models

When you ask a coding model to write a penetration testing script, bypass a rate limiter, or implement a web scraper, some models will refuse. Others will comply without hesitation. This difference comes from alignment training — the safety fine-tuning applied after the model learns to code.

What Alignment Does to Coding Models

Alignment (RLHF, DPO, or constitutional AI methods) teaches models to refuse harmful requests. For general chat, this is straightforward. For coding, the boundary is murkier: security research, DevOps automation, and system administration often require code that looks like it could be misused.

Refusal Patterns in Practice

The impact of alignment on coding varies by task. Standard application development is unaffected. Security research, DevOps tooling, and system-level code trigger refusals more frequently in heavily aligned models.

| Task Type | Aligned Model Response | Uncensored Model Response |
| --- | --- | --- |
| Write a port scanner | May refuse or add extensive warnings about legality | Generates the code directly |
| Bypass a rate limiter | Often refuses, citing potential misuse | Provides implementation options |
| Web scraper for site X | May warn about Terms of Service, sometimes refuses | Generates the scraper code |
| Reverse-engineer a binary | Often refuses or adds legal disclaimers | Provides an analysis approach |
| Standard CRUD API | Generates normally | Generates normally |

Note that refusal behavior varies between model versions and even between quantization levels of the same model. The patterns above are generalizations — always test with your specific model and use case.

Uncensored Variants

Some open-weight models are released in “uncensored” or “abliterated” variants where safety fine-tuning has been partially removed. These variants are more compliant for legitimate coding tasks, but the guardrails are removed entirely — including for genuinely harmful requests.

The Alignment Spectrum

Heavily Aligned

Standard

Base Model

Abliterated

Most self-hosted coding use cases are best served by standard or lightly aligned models. Moving further right increases compliance but removes all safety boundaries.

Model Selection — Choosing the Right Coding Model

With dozens of coding-capable models available, how do you choose? The answer depends on your task type, hardware, latency requirements, and quality threshold.

Task-Based Selection Framework

Different coding tasks have different model requirements. Code completion (fill-in-the-middle) needs speed above all. Code generation needs quality and instruction following. Code review needs reasoning depth. Agentic workflows need reliable tool calling.

Task-Based Model Selection

Code Completion

Speed > Quality

FIM-capable, low latency. Smaller models (7B–15B) at higher quantization for fast autocomplete.

Code Generation

Quality > Speed

Instruction-tuned, strong HumanEval scores. Larger models (32B+) for complex function generation.

Code Review

Reasoning depth

Large context window, strong reasoning. 32B+ models that can hold entire files and explain issues.

Agentic Workflows

Tool calling reliability

Reliable function calling, iterative reasoning. Must support tool-call parser in your inference engine.

Quick Decision Flow

Model Selection Decision Flow

1. What is your primary task?

Completion → prioritize speed. Generation/Review → prioritize quality. Agentic → prioritize tool calling.

2. How much VRAM do you have?

8–12 GB → 7B models. 16–24 GB → up to 32B (Q4). 48+ GB → 70B models.

3. Do you need tool calling?

Yes → check parser support in vLLM for your model family (Section 5).

4. Latency or quality?

Latency-sensitive → smaller model + higher quantization. Quality-sensitive → largest model that fits.
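Step 2 of the flow can be captured as a simple lookup mirroring the VRAM tiers this lesson uses (the thresholds are the lesson's rough guidance, not hard limits):

```python
def pick_model_size(vram_gb: int) -> str:
    """Map available VRAM to a rough dense-model ceiling at Q4,
    following the lesson's hardware tiers."""
    if vram_gb >= 48:
        return "up to ~70B (Q4)"
    if vram_gb >= 16:
        return "up to ~32B (Q4)"
    if vram_gb >= 8:
        return "7B-13B"
    return "below the 7B tier; consider smaller quantized models"

print(pick_model_size(24))  # up to ~32B (Q4)
print(pick_model_size(10))  # 7B-13B
```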

Hardware-Constrained Selection

Your GPU determines your ceiling. A consumer RTX 4090 (24 GB VRAM) runs different models than a dual-A6000 workstation (96 GB total). The model selection framework must account for what actually fits.

| Hardware Tier | VRAM | Max Dense Model (Q4) | Recommended Coding Models |
| --- | --- | --- | --- |
| Consumer GPU | 8–12 GB | ~7B–13B | Qwen2.5-Coder-7B, Yi-Coder-9B, StarCoder2-7B |
| Enthusiast GPU | 16–24 GB | ~14B–32B | Qwen2.5-Coder-32B (Q4), Codestral 22B, StarCoder2-15B |
| Workstation GPU | 48 GB | ~70B (Q4) | Code Llama 70B (Q4), Qwen2.5-Coder-32B (FP16) |
| Dual GPU | 48–96 GB | ~70B (FP16) | DeepSeek-Coder-V2 (quantized), Code Llama 70B (FP16) |
| Datacenter | 80+ GB per GPU | ~140B+ or large MoE | DeepSeek-Coder-V2 (full), enterprise-scale models |

Model size estimates assume Q4 quantization unless noted. Actual VRAM usage depends on context length, KV cache, and engine overhead. Use the VRAM calculator from Lesson 04 for precise estimates. Last verified 2026-02-27.
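For a quick estimate without the calculator, a back-of-envelope rule is weights at the quantized bit width plus a flat overhead allowance. The 4.5 bits/weight figure approximates a Q4_K_M-style quant and the 2 GB overhead is a rough placeholder for KV cache and runtime; treat the result as a floor, not a guarantee:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 2.0) -> float:
    """Back-of-envelope VRAM estimate: model weights at the quantized
    bit width plus a flat allowance for KV cache and engine overhead.
    Long contexts grow the KV cache well past the flat term."""
    return params_b * bits_per_weight / 8 + overhead_gb

print(round(estimate_vram_gb(32), 1))  # 20.0 -> a 32B Q4 model fits a 24 GB card
print(round(estimate_vram_gb(70), 1))  # 41.4 -> a 70B Q4 model wants a 48 GB card
```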

Framework, Not Rankings

This section provides a decision framework rather than a hardcoded “best model” list. Model rankings change monthly. The criteria for selecting them do not.

Deployment for Coders — IDE Integration and Workflows

A coding model is only useful if it integrates into your actual development workflow. This section covers the tools that connect self-hosted models to your editor, terminal, and CI pipeline.

IDE Integration Options

The three dominant approaches to local AI coding assistance are: editor extensions (Continue.dev for VS Code/JetBrains), terminal agents (Aider for git-aware pair programming), and AI-native editors (Cursor, built from the ground up around AI interaction).

| Feature | Continue.dev | Aider | Cursor |
| --- | --- | --- | --- |
| Type | IDE extension | Terminal agent | AI-native editor |
| Editor Support | VS Code, JetBrains | Any (terminal-based) | Cursor only (VS Code fork) |
| Local Model Support | Yes (OpenAI-compatible API) | Yes (OpenAI-compatible API) | Limited (primarily cloud) |
| Agentic Mode | Basic (edit + apply) | Full (git-aware, multi-file) | Full (Composer agent) |
| Git Integration | No | Yes (auto-commit) | Basic |
| Open Source | Yes (Apache 2.0) | Yes (Apache 2.0) | No (proprietary) |
| Best For | IDE-integrated completions with local models | Terminal-first pair programming | All-in-one AI coding (cloud-first) |

Feature comparison based on official documentation as of early 2026. All three tools are under active development — verify current capabilities before choosing. Sources: Continue.dev, Aider, Cursor. Last verified 2026-02-27.

Self-Hosted Coding Assistant Architecture

GPU

GPU + Model

Qwen2.5-Coder-32B loaded in VRAM

vLLM

Inference Engine

vLLM serves OpenAI-compatible API

API

API Endpoint

localhost:8000/v1/chat/completions

IDE

IDE Integration

Continue.dev / Aider connects to API

All data stays on your machine. No external API calls, no per-token costs.

Self-Hosted vs Cloud-Backed

Every IDE integration tool can connect to either cloud APIs (OpenAI, Anthropic) or local inference endpoints (vLLM, Ollama). Running your own model gives you privacy, zero per-token cost, and the ability to use uncensored or fine-tuned models.

The Self-Hosted Advantage

With a local vLLM server running a coding model, your IDE integration sends completions to localhost. No data leaves your machine. No API keys. No usage limits. The only cost is your GPU’s electricity.

Quick Setup: Continue.dev + vLLM

The most common self-hosted coding setup pairs Continue.dev (IDE extension) with vLLM (inference engine). Here is the minimal configuration.

Step 1: Start vLLM with a coding model (bash)
# Serve Qwen2.5-Coder-32B with tool calling enabled
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
Step 2: Configure Continue.dev at ~/.continue/config.json (json)
{
  "models": [
    {
      "title": "Qwen2.5-Coder-32B (Local)",
      "provider": "openai",
      "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}

After both are running, open VS Code, trigger Continue.dev (Ctrl+L or Cmd+L), and start coding with your local model. Completions and chat are served from your GPU with zero cloud dependency.

Hands-On Exercises and Summary

Exercise 1: Benchmark Comparison

Pick two coding models from different families (for example, Qwen2.5-Coder-32B and DeepSeek-Coder-V2). Compare their HumanEval, SWE-bench, and Aider Polyglot scores. Which model wins at each benchmark? What does this tell you about their strengths?

Exercise 2: Tool Call Format Matching

For a model you want to serve with vLLM, identify the correct --tool-call-parser flag. Check the model’s documentation or chat template to determine which format it was trained on.

Exercise 3: Model Selection Decision

You have an RTX 4090 (24 GB VRAM) and need a coding assistant for daily Python development. Using the model selection framework from Section 7, determine which model and quantization level gives you the best quality within your VRAM budget.

Answers Not Provided

These exercises are designed for self-directed exploration. The “right” answers depend on your hardware, task requirements, and the latest model releases. Use the decision frameworks from this lesson rather than memorized answers.

Lesson Summary

Coding LLMs are not just general models that happen to write code — the best ones are purpose-built specialists trained on curated code corpora and evaluated against programming-specific benchmarks. The agentic revolution adds tool calling, multi-step reasoning, and IDE integration to turn these models into genuine development partners.

  • Benchmarks (HumanEval, SWE-bench, Aider) measure different coding capabilities
  • Coding model families offer specialization advantages at every parameter count
  • Agentic workflows use tool calling to read, write, test, and iterate on code
  • Tool call parser matching is critical for reliable function calling
  • Alignment affects what coding tasks a model will and will not perform
  • Model selection is task-dependent, hardware-constrained, and framework-driven
  • Self-hosted coding assistants offer privacy, cost, and customization advantages

Further Reading

  • Aider LLM Leaderboards — Live benchmark tracking
  • SWE-bench — Real-world coding evaluation
  • vLLM Documentation — Tool calling and parser configuration
  • Continue.dev Documentation — IDE integration setup

Sources and References

Model Cards and Specifications

  [1] Qwen2.5-Coder-32B-Instruct — Alibaba Qwen. 32B params, 128K context, open-source SOTA coding model. https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct (verified 2026-02-27)
  [2] DeepSeek-Coder-V2 — DeepSeek. Coding-specialized MoE variant. https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct (verified 2026-02-27)
  [3] Code Llama 70B — Meta. Fine-tuned from Llama 2 for code generation. https://huggingface.co/meta-llama/CodeLlama-70b-Instruct-hf (verified 2026-02-27)
  [4] StarCoder2-15B — BigCode. Trained on The Stack v2, 600+ languages. https://huggingface.co/bigcode/starcoder2-15b (verified 2026-02-27)
  [5] Codestral 22B — Mistral AI. 22B params, 32K context, coding-specialized with FIM support. https://mistral.ai/news/codestral/ (verified 2026-02-27)
  [6] Yi-Coder — 01.AI. 1.5B–9B params, 128K context, lightweight coding specialist. https://huggingface.co/01-ai/Yi-Coder-9B-Chat (verified 2026-02-27)

Benchmarks and Rankings

  [7] HumanEval — OpenAI. 164 hand-written Python programming problems. https://github.com/openai/human-eval (verified 2026-02-27)
  [8] SWE-bench Verified — Princeton. Real-world GitHub issue resolution. https://www.swebench.com/ (verified 2026-02-27)
  [9] Aider Polyglot Benchmark — multi-language code editing evaluation. https://aider.chat/docs/leaderboards/ (verified 2026-02-27)
  [10] LiveCodeBench — contamination-free coding benchmark from competitive programming. https://livecodebench.github.io/ (verified 2026-02-27)
  [11] SWE-agent — Princeton. Agent framework for autonomous software engineering on SWE-bench. https://github.com/princeton-nlp/SWE-agent (verified 2026-02-27)

Software and Tools

  [12] Continue.dev — open-source AI code assistant for VS Code and JetBrains. https://continue.dev/ (verified 2026-02-27)
  [13] Aider — AI pair programming in your terminal. https://aider.chat/ (verified 2026-02-27)
  [14] Cursor — AI-first code editor built on VS Code. https://cursor.com/ (verified 2026-02-27)
  [15] vLLM Tool Calling Documentation — parser flags, format specs, and supported models. https://docs.vllm.ai/en/latest/features/tool_calling.html (verified 2026-02-27)
  [16] NousResearch Hermes-2 Function Calling Format — community standard for open-weight tool use. https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B (verified 2026-02-27)
  [17] Llama 3.1 Tool Use — Meta's native function calling format for Llama 3.1+ models. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/ (verified 2026-02-27)
  [18] Mistral AI Function Calling — native tool_use format for Mistral and Codestral models. https://docs.mistral.ai/capabilities/function_calling/ (verified 2026-02-27)

Methodology Notes

Benchmark scores in this lesson are point-in-time measurements that change with every model release. HumanEval, SWE-bench, and Aider Polyglot scores were last verified on 2026-02-27. Model selection advice is framework-based rather than hardcoded rankings — the "best" model depends on your specific coding task, hardware, and latency requirements. Tool calling format specifications reflect the state of vLLM, llama.cpp, and related inference engines as of early 2026. Quarterly review recommended.