Hugging FaceGitHubgithub.com/evalstate
Agentic AI on CPU · London · 2026
Upskilling Local Models and Agents
Shaun Smith · Adrian Lepers
May 2026
Hugging FaceGitHubgithub.com/evalstate

Shaun Smith @evalstate

  • Open Source @ Hugging Face
  • MCP Maintainer [Transports] / Community Moderator
  • huggingface/hf-mcp-server
  • huggingface/upskill
  • huggingface/skills
  • Maintainer of fast-agent
Hugging Face
MCP
Hugging FaceGitHubgithub.com/evalstate

Hugging FaceGitHubgithub.com/evalstate

The evolution of Tool Calling and Model Inference....

Hugging FaceGitHubgithub.com/evalstate

Things we didn't have 18 Months ago...

Model Context Protocol

AGENTS.MD and Agent Skills

MoE Models and Efficient Quantizations

Agent Client Protocol

Long Running Tool Loops (and reasoning models)

Hugging FaceGitHubgithub.com/evalstate

Reinforcement Learning

Models are placed in an environment, given a task and scored with a reward function:

  • discover
  • self-correct
  • problem solve
  • keep driving the loop without constant human steering

mini-SWE-Agent: A single 100 line python and single freeform (non JSON) tool can score 76.0% on SWE-Bench!

It's hard to compete against that efficiency.

alt text

Reinforcement learning environment diagram SWE-Bench bash tool benchmark result
Hugging FaceGitHubgithub.com/evalstate

Smaller and Simpler Harnesses

  • General-purpose agent harnesses are given direct(*) shell access
  • Fewer pre/post tool and LLM stop checks to keep models on track
  • API surface and Custom Workflows replaced by Model capabilities
  • Snapshot and checkpointing techniques
  • Movable runtime environments
  • Scripting (code generation) allows immediate specialization
Hugging FaceGitHubgithub.com/evalstate

Self Directing Models

Task flows to Navigate, Ingest, Act, then loops back to Task
Hugging FaceGitHubgithub.com/evalstate

Dynamic Tool Calling

Dynamic Space Tool: 45 tokens

MCP provides an inference gateway to thousands of specialized and custom models covering Audio, Video, Text, 3D Models, Environments and more.

MCP provides Authentication and Multimodal support.

Qwen 3.5-35B-A3B
Flux.1-Krea-Dev
Qwen-Edit-2509-Multiple-angles-LoRA
Wan2.2 First/Last Frame

Hugging FaceGitHubgithub.com/evalstate

Why This Enabled Skills

  • Simple to navigate native content hierarchy
  • Unsurprising Token Dense format (bash!)
  • Reusable procedures become scaffolding for capable models
  • Script access requires fewer mid-context tool tricks
  • Between deterministic program and documentation
Hugging FaceGitHubgithub.com/evalstate

Training Models

LLM Trainer Skill

Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots

Handles:

  • Dataset Construction
  • Dataset Selection and Validation
  • Hardware Selection
  • Training Scripts
  • Job Submission and Monitoring
  • Trackio Supervision
  • GGUF Conversion

Recently added Vision Training!

Hugging FaceGitHubgithub.com/evalstate

https://github.com/huggingface/upskill

Hugging FaceGitHubgithub.com/evalstate

Upskill

  • Run in Sandboxes, View Traces, Optimise and Benchmark
Upskill trace chart
Upskill benchmark output
  • Tutor and Select best Price/Performance Models
Hugging FaceGitHubgithub.com/evalstate

Code Execution Tools

A model with access to general purposes tools has crossed into a very real form of code mode.

Bash provides a general purpose, token dense-execution language.

Task-specific tools generated on demand. Example: HF Tool Builder navigates OpenAPI spec to build composable CLI tools.

Some models are trained to use code tools natively, and are bundled with interpreters.

Hugging FaceGitHubgithub.com/evalstate
Local model capability is compounding
Between May 2024 and May 2026, the most expensive MacBook Pro you could buy stayed at 128 GB of unified memory. The hardware ceiling barely moved. But the smartest open-weight model you could actually run on it went from a score of 10 — Llama 3 70B — to 47 — DeepSeek V4 Flash on antirez's mixed-Q2 GGUF — on the Artificial Analysis Intelligence Index.
Artificial Analysis local model intelligence chart
That is 4.7× in 24 months or a doubling of intelligence every 10.7 months
Source: “Local Moore’s Law” by Mishig Davaadorj · huggingface.co/blog/mishig/local-moores-law
Hugging FaceGitHubgithub.com/evalstate

Hugging FaceGitHubgithub.com/evalstate

OpenAI Privacy Filter

Privacy filter screenshot
can you help me.

my name is "shaun smith"
my credit card is "4929 1003 4422 4042"
the API key I have been using is
"sk-proj-rr3399393922220202".

You can reach me at shaun.smith@private-email.com
or +44 7700 900123.

My home address is 221B Baker Street,
London NW1 6XE.

My AWS access key is AKIAIOSFODNN7EXAMPLE
and the secret is
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY.

My national insurance number is QQ 12 34 56 C.

Hugging FaceGitHubgithub.com/evalstate

Some thoughts

  • Owning and usefully customising and improving your own models is accessible
  • Frontier Models are overused: Price/Performance
  • Inference and Execution environments are blending
  • Self Improvement is here if you want it
Hugging FaceGitHubgithub.com/evalstate
The takeaway
Open source doesn't mean ungoverned.
Own the weights.
Set the policy.
Don’t only trust someone else’s black box.
Hugging Face Hugging Face
Hugging FaceGitHubgithub.com/evalstate

Thank You!

Hugging FaceGitHubgithub.com/evalstate

Agent Client Protocol

File and Shell Tools

Client provided tools, enables "follow along" in editors

Session Based

Listing, Resumption and Rehydration of Agent sessions

Streaming Results and Observability

Agent Results and Tool Status stream, are cancellable

MCP Native Support

Uses MCP Data Model. Client sends MCP Sever Configurations


Hugging FaceGitHubgithub.com/evalstate

Open Responses

Open standard extending OpenAI's Responses API. Provides a consistent, provider neutral way to interact with modern LLMs. Repairs Chat Completion API drift.

It defines a shared schema, and tooling layer that enable a unified experience for calling language models, streaming results, and composing agentic workflows—independent of provider.


Usage as a Provider / Router allows creation of rich Agent Environments

Internal Tools - (Model or Provider)
  • shell and local_shell
  • code_interpreter
  • apply_patch
  • web_search
  • etc..
External Tools (Client Supplied)
  • MCP Servers
  • Standard JSON function calls
  • Free-Form Tools
  • Grammar constrained Tools
Hugging FaceGitHubgithub.com/evalstate

It was close....! PMF for MCP

MCP is a Commodity Standard

Supports Consumer, Enterprise and Developer use-cases.

Single URL to install authenticated JSON tools across thousands of clients

MCP's "fit" features weren't present at launch!

URI/Resources based extensions deliver innovation and extensibility...

...Which enabled rapid MCP Apps distribution on a solid support base.

Model/Host Changes and STDIO

Host applications with Shell tool reduce the need for STDIO Servers.

In many cases for local running tools such as Apify mcp-cli or Pete Steinberger's MCPorter offer a better experience for MCP usage.

Distribution via MCPB is one potential advantage

Simple one-shot server design meant that distribution of ideas was more important than code.

Hugging FaceGitHubgithub.com/evalstate

Generation and Execution Environments

Style 1 - Main Model owns Code Generation

Main model
Generates Search Function
Execution Tool
Uses Search Function to return API definitions
Main model
Generates code from that API surface
Execution tool
Runs the code and returns output
Main model
Reads result and writes final answer
Code Generation: Main Model
Code Execution: Tool Environment

Style 2 - Delegated Code Generation

Main model
Sends a natural-language task to the tool
Execution tool
System Prompt contains API definitions
Execution tool
Returns the result
Main model
Packages it as the final answer
Code Generation: Tool Model
Code Generation: Tool Environment
API Definitions Cacheable

MCP makes it easy to transfer generation and execution between models and environments!
(and who pays for inference)

Hugging FaceGitHubgithub.com/evalstate

LLMs for Navigating: GenUI, Apps SDK (Prefect Prefab)

A common pattern:

  1. user asks for navigation or retrieval
  2. tools fetch the answer
  3. the model then spends expensive output tokens reprocessing a result that was already good enough
  4. The MCP Apps pattern fixes this by letting the result become final for the user.
Hugging FaceGitHubgithub.com/evalstate