Hugging FaceGitHubgithub.com/evalstate
AI Native Dev · London · 2026
Upskilling Models, Agents and the ML Pipeline
Shaun Smith · @evalstate
April 2026
Hugging FaceGitHubgithub.com/evalstate

Shaun Smith @evalstate

  • Open Source @ Hugging Face
  • MCP Maintainer [Transports] / Community Moderator
  • huggingface/hf-mcp-server
  • huggingface/upskill
  • huggingface/skills
  • Maintainer of fast-agent
Hugging Face
MCP
Hugging FaceGitHubgithub.com/evalstate

Hugging FaceGitHubgithub.com/evalstate

The evolution of Tool Calling....

Hugging FaceGitHubgithub.com/evalstate

Things we didn't have 18 Months ago...

MCP Streamable HTTP Transport and OAuth

AGENTS.MD and Agent Skills

Internal Tools in Inference APIs

Agent Client Protocol
Responses API

Long Running Tool Loops (and reasoning models)

Hugging FaceGitHubgithub.com/evalstate

Reinforcement Learning

Models are placed in an environment, given a task and scored with a reward function:

  • discover
  • self-correct
  • problem solve
  • keep driving the loop without constant human steering

mini-SWE-Agent: A single 100 line python and single freeform (non JSON) tool can score 76.0% on SWE-Bench!

It's hard to compete against that efficiency.

alt text

Reinforcement learning environment diagram SWE-Bench bash tool benchmark result
Hugging FaceGitHubgithub.com/evalstate

Smaller and Simpler Harnesses

  • General-purpose agent harnesses are given direct(*) shell access
  • Fewer pre/post tool and LLM stop checks to keep models on track
  • API surface and Custom Workflows replaced by Model capabilities
  • Snapshot and checkpointing techniques
  • Movable runtime environments
  • Scripting (code generation) allows immediate specialization
Hugging FaceGitHubgithub.com/evalstate

Self Directing Models

Task flows to Navigate, Ingest, Act, then loops back to Task
Hugging FaceGitHubgithub.com/evalstate

Dynamic Tool Calling

Dynamic Space Tool: 45 tokens

MCP provides an inference gateway to thousands of specialized and custom models covering Audio, Video, Text, 3D Models, Environments and more.

MCP provides Authentication and Multimodal support.

Qwen 3.5-35B-A3B
Flux.1-Krea-Dev
Qwen-Edit-2509-Multiple-angles-LoRA
Wan2.2 First/Last Frame

Hugging FaceGitHubgithub.com/evalstate

Why This Enabled Skills

  • Simple to navigate native content hierarchy
  • Unsurprising Token Dense format (bash!)
  • Reusable procedures become scaffolding for capable models
  • Script access requires fewer mid-context tool tricks
  • Between deterministic program and documentation
Hugging FaceGitHubgithub.com/evalstate

Training Models

LLM Trainer Skill

Fine-tune Qwen3-0.6B on the dataset open-r1/codeforces-cots

Handles:

  • Dataset Construction
  • Dataset Selection and Validation
  • Hardware Selection
  • Training Scripts
  • Job Submission and Monitoring
  • Trackio Supervision
  • GGUF Conversion

Recently added Vision Training!

Hugging FaceGitHubgithub.com/evalstate

Building Kernels

Build a vectorized RMSNorm kernel for H100 targeting Qwen3-8B

Skill Distribution via CLI

Handles:

  • GPU architecture targeting
  • Kernel source generation
  • PyTorch C++ bindings
  • build.toml project setup
  • Micro-benchmark scripts
  • End-to-end model/pipeline benchmarks
  • Kernel Hub publication

Kernels are first-class on the Hub!

Hugging FaceGitHubgithub.com/evalstate

https://github.com/huggingface/upskill

Hugging FaceGitHubgithub.com/evalstate

Upskill

  • Run in Sandboxes, View Traces, Optimise and Benchmark
Upskill trace chart
Upskill benchmark output
  • Tutor and Select best Price/Performance Models
Hugging FaceGitHubgithub.com/evalstate

Code Execution Tools

A model with access to general purposes tools has crossed into a very real form of code mode.

Bash provides a general purpose, token dense-execution language.

Task-specific tools generated on demand. Example: HF Tool Builder navigates OpenAPI spec to build composable CLI tools.

Some models are trained to use code tools natively, and are bundled with interpreters.

Hugging FaceGitHubgithub.com/evalstate

LLMs for Navigating: GenUI, Apps SDK (Prefect Prefab)

A common pattern:

  1. user asks for navigation or retrieval
  2. tools fetch the answer
  3. the model then spends expensive output tokens reprocessing a result that was already good enough
  4. The MCP Apps pattern fixes this by letting the result become final for the user.
Hugging FaceGitHubgithub.com/evalstate

Closing Thoughts

  • Owning and usefully customising and improving your own models is accessible
  • Frontier Models are overused: Price/Performance
  • Inference and Execution environments are blending
  • Self Improvement is here if you want it
Hugging FaceGitHubgithub.com/evalstate

Thank You!

Hugging FaceGitHubgithub.com/evalstate
Hugging FaceGitHubgithub.com/evalstate

Agent Client Protocol

File and Shell Tools

Client provided tools, enables "follow along" in editors

Session Based

Listing, Resumption and Rehydration of Agent sessions

Streaming Results and Observability

Agent Results and Tool Status stream, are cancellable

MCP Native Support

Uses MCP Data Model. Client sends MCP Sever Configurations


Hugging FaceGitHubgithub.com/evalstate

Open Responses

Open standard extending OpenAI's Responses API. Provides a consistent, provider neutral way to interact with modern LLMs. Repairs Chat Completion API drift.

It defines a shared schema, and tooling layer that enable a unified experience for calling language models, streaming results, and composing agentic workflows—independent of provider.


Usage as a Provider / Router allows creation of rich Agent Environments

Internal Tools - (Model or Provider)
  • shell and local_shell
  • code_interpreter
  • apply_patch
  • web_search
  • etc..
External Tools (Client Supplied)
  • MCP Servers
  • Standard JSON function calls
  • Free-Form Tools
  • Grammar constrained Tools
Hugging FaceGitHubgithub.com/evalstate

It was close....! PMF for MCP

MCP is a Commodity Standard

Supports Consumer, Enterprise and Developer use-cases.

Single URL to install authenticated JSON tools across thousands of clients

MCP's "fit" features weren't present at launch!

URI/Resources based extensions deliver innovation and extensibility...

...Which enabled rapid MCP Apps distribution on a solid support base.

Model/Host Changes and STDIO

Host applications with Shell tool reduce the need for STDIO Servers.

In many cases for local running tools such as Apify mcp-cli or Pete Steinberger's MCPorter offer a better experience for MCP usage.

Distribution via MCPB is one potential advantage

Simple one-shot server design meant that distribution of ideas was more important than code.

Hugging FaceGitHubgithub.com/evalstate

Generation and Execution Environments

Style 1 - Main Model owns Code Generation

Main model
Generates Search Function
Execution Tool
Uses Search Function to return API definitions
Main model
Generates code from that API surface
Execution tool
Runs the code and returns output
Main model
Reads result and writes final answer
Code Generation: Main Model
Code Execution: Tool Environment

Style 2 - Delegated Code Generation

Main model
Sends a natural-language task to the tool
Execution tool
System Prompt contains API definitions
Execution tool
Returns the result
Main model
Packages it as the final answer
Code Generation: Tool Model
Code Generation: Tool Environment
API Definitions Cacheable

MCP makes it easy to transfer generation and execution between models and environments!
(and who pays for inference)

Hugging FaceGitHubgithub.com/evalstate