A frozen model with trained markdown. We are turning every SKILL.md on skillmake into a trainable parameter — bounded edits, held-out validation, A/B routed installs. This page explains what we are building, what data we already collect, and how to evaluate an A/B test once it is live.
SkillOpt (Wang et al., arXiv 2605.23904) reframes a SKILL.md file as the trainable parameter of an agent system. The base model stays frozen — we never fine-tune Claude or Cursor — but the markdown the agent reads gets optimized over time the same way a neural net's weights would: bounded edits per step, held-out validation gate, strict-improvement promotion, protected sections that the edit machinery refuses to touch.
That framing matters because "self-improving agents" without those guardrails mostly produce slop. SkillOpt is the discipline that turns a freeform "edit this skill" button into a stable optimization loop.
The frozen model plus a trained context window is the practical adaptation. We are not training Claude. We are training the document Claude reads.
The paper's Figure 1 walks through the gradient-descent analogy. Here is the exact translation we use on skillmake:
| Paper concept | skillmake equivalent |
|---|---|
| Parameter θ | A published SKILL.md (MarketplaceEntry.markdown in KV) |
| Frozen model f | Whatever agent the visitor runs (Claude Code, Cursor, etc.); we never touch it |
| Gradient direction | Trajectory-derived edits from the optimizer (4-8 atomic SkillEdit ops, derived from telemetry + failure signal) |
| Learning rate | Bounded edit budget — max 8 ops per candidate; replace_apiReference.signature is forbidden |
| Validation check | Held-out conversion gate: install_hit / marketplace_view strictly improves on A/B-routed traffic, p < 0.05 |
| Batch / minibatch | N-install threshold before the gate fires (default 200 installs per arm) |
| "Suboptimal skill" trajectory | Ad-hoc curator edit that bypasses the gate (deprecated path) |
| Rejected side updates | Candidates that fail the static or dynamic gate — logged as negative feedback for the next optimizer pass |
| Protected section invariant | protectedSections + <!-- @protected:<id> --> markers, byte-identical between candidate and parent |
Versioned KV storage, bounded edits, static gate, protected sections.
- skill:<id> stays as the current pointer; new versions are append-only at skill:<id>:v<n>, plus at most one skill:<id>:candidate at a time. Old reads keep working because the migration is additive.
- Slugs are tagged name@hash8 when writing to Analytics Engine, so a single dashboard can split conversion by version without a schema change. Existing dashboards keep working via SPLIT(blob2,'@')[0].
- Edits are a SkillEdit[] payload: 9 op kinds, max 8 per candidate. Forbidden ops (would break router behavior): replace_name, replace_apiReference.signature, replace_category, replace_audience.
- The static gate checks description ↔ whenToUse coherence, protected-section byte-identity, schema shape, and edit budget. Curator can override with a logged reason, matching the paper's "researcher override" pattern.
- A protectedSections: string[] field wraps named sections in <!-- @protected:<id> --> markers. The static gate refuses any byte difference inside those markers.

Deterministic split at /i/<name>:
const candidate = await getCandidateSkill(id);
const useCandidate = candidate && shouldRouteCandidate(req, candidate.abTrafficShare);
const served = useCandidate ? candidate : current;
shouldRouteCandidate hashes (cf-connecting-ip + name), so the same visitor sees the same arm across reloads. Default share: 20% candidate / 80% current. While a candidate exists for a skill, cache headers flip to private, no-store to prevent Cloudflare or upstream proxies from poisoning the split.
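A minimal sketch of how such a deterministic split could work. The text only says that (cf-connecting-ip + name) is hashed, so the FNV-1a hash, the `|` separator, and the helper names `bucketFor`, `routeToCandidate`, and `cacheControlFor` are all assumptions made for illustration.

```typescript
// Sketch only: the hash function and helper names are assumptions,
// not skillmake's actual implementation.

/** 32-bit FNV-1a style hash over UTF-16 code units, as an unsigned integer. */
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

/** Maps (ip, name) to a stable bucket in [0, 1). */
function bucketFor(ip: string, name: string): number {
  return fnv1a(`${ip}|${name}`) / 0x100000000;
}

/** True when this visitor should be served the candidate arm. */
function routeToCandidate(ip: string, name: string, share: number): boolean {
  return bucketFor(ip, name) < share;
}

/** Cache-Control for /i/<name>, using the two values quoted in the text. */
function cacheControlFor(hasCandidate: boolean): string {
  return hasCandidate
    ? "private, no-store"
    : "public, max-age=300, must-revalidate";
}
```

Because the bucket depends only on the IP and the skill name, raising the share keeps every visitor who was already on the candidate arm there; only visitors whose bucket falls between the old and new share move.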
An hourly Cloudflare Cron at /api/cron/validate-candidates reads Analytics Engine for the last 7 days. For each candidate:
- marketplace_view count for the slug (route-agnostic, so the same on both arms).
- install_hit count for name@candidate_hash and name@current_hash.
- On a win, promoteCandidate(id) atomically flips the pointer. On a regression, the candidate is retired and logged to a negative-feedback store so the optimizer doesn't re-propose the same diff.

Long-tail fallback. Any candidate that fails to cross 200 installs in 14 days falls back to an LLM-judge call against the skill-judge skill we already seeded from softaworks/agent-toolkit. The judge has to score strictly higher than the current version. This is our answer to the paper's open verifier problem: we don't solve it generally; we punt to a peer skill trained on official examples.
Weekly cron /api/cron/propose-edits:
- Sends SKILL.md + telemetry + rejected-edits as a narrow prompt to a frontier model.
- Expects back a SkillEdit[] of length 4-8.

A SKILLOPT_ENABLED=true|false env var gates every cron, so the kill switch is one config flip. Budget at current corpus size (94 skills, top quartile under target ≈ 24 weekly calls × $0.40 avg) is roughly $10/week, or about $40/month.
Yes — and we have been for weeks. Every event lands in the skillmake_metrics Cloudflare Workers Analytics Engine dataset. No personally identifying info is stored; the daily visitor id is a SHA-256 hash of ip + ua + day truncated to 8 bytes, so the same person on the same day collapses into one id but cannot be tracked across days. Search queries are also hashed before write.
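A minimal sketch of the daily visitor id described above. The text specifies only SHA-256 over ip + ua + day truncated to 8 bytes; the `|` separator, the UTC `YYYY-MM-DD` day format, and the function name `dailyVisitorId` are assumptions. It uses `node:crypto` for brevity; a Worker without Node compatibility would use Web Crypto instead.

```typescript
import { createHash } from "node:crypto";

// Sketch only: separator and day format are assumptions.
function dailyVisitorId(ip: string, ua: string, now: Date): string {
  const day = now.toISOString().slice(0, 10); // UTC calendar day
  const digest = createHash("sha256").update(`${ip}|${ua}|${day}`).digest();
  return digest.subarray(0, 8).toString("hex"); // 8 bytes -> 16 hex chars
}
```

Including the day in the hashed input is what makes the id rotate: the same ip and ua produce an unrelated value the next day.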
Here is every MetricEvent type the codebase currently fires, and what each one tells us:
| Event | What it tells us |
|---|---|
| install_hit | Someone hit /i/<name> and got the markdown. After Phase 1, the slug includes @hash8 so we can split by version. This is the gradient signal. |
| marketplace_view | The marketplace listing page loaded for a given skill; the denominator for conversion. |
| home_view | The homepage loaded; blob2 carries the audience filter slug if one is active (engineers/creators/etc.). |
| tricks_view / powerhouse_view | Cross-page interest: which subsection of the site people drift toward. |
| search_submitted | A search query was submitted (hashed). Volume + funnel signal, not literal text. |
| github_click | Outbound click to the repo. Paired with install_hit, this catches the "intent leak": people who read the listing and bounced to GitHub instead of installing. |
| page_dwell | Bucketed dwell time: 0-5s / 5-15s / 15-30s / 30-60s / 60-300s / 300s+. |
| scroll_depth | Bucketed reading depth: 0 / 25 / 50 / 75 / 100. |
| submit_started / submit_completed | Submit funnel: top and bottom of the conversion flow for new skill submissions. |
| convert_success / convert_error | The /api/convert extraction pipeline succeeded or failed. blob2 carries error code; double2 carries HTTP status. |
| api_error | Any 4xx/5xx from a server route. Logged via observability.ts as a structured JSON line + Analytics Engine row. |
And the blob columns are stable across every event:
| Column | Meaning |
|---|---|
| index1 / blob1 | Event name (mirrored for SQL convenience) |
| blob2 | Slug / bucket / route fragment / hashed search query (event-dependent) |
| blob3 | Country (cf-ipcountry; ?? if unknown) |
| blob4 | Referer host (empty for direct hits) |
| blob5 | UA category: curl / browser / bot / other |
| blob6 | Daily visitor id: SHA-256(ip+ua+day), 8 bytes |
| double1 | Always 1 (sample interval anchor) |
| double2 | HTTP status (api_error, convert_error only) |
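One way to keep that column layout consistent is to build every data point through a single function. The sketch below is illustrative: `MetricEventInput`, `DataPoint`, and `toDataPoint` are hypothetical names, and writing 0 into double2 for events that carry no HTTP status is an assumption.

```typescript
// Sketch only: type and function names are hypothetical; column order
// follows the table above.
interface MetricEventInput {
  event: string; // index1 / blob1, e.g. "install_hit"
  slug?: string; // blob2
  country?: string; // blob3, from cf-ipcountry
  refererHost?: string; // blob4
  uaCategory: "curl" | "browser" | "bot" | "other"; // blob5
  visitorId: string; // blob6
  httpStatus?: number; // double2
}

interface DataPoint {
  indexes: string[];
  blobs: string[];
  doubles: number[];
}

function toDataPoint(e: MetricEventInput): DataPoint {
  return {
    indexes: [e.event],
    blobs: [
      e.event,
      e.slug ?? "",
      e.country ?? "??", // "??" marks an unknown country
      e.refererHost ?? "", // empty for direct hits
      e.uaCategory,
      e.visitorId,
    ],
    doubles: [1, e.httpStatus ?? 0], // assumption: 0 when no status applies
  };
}
```

The result has the `indexes` / `blobs` / `doubles` shape that an Analytics Engine binding's `writeDataPoint` accepts.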
The Grafana dashboard (docs/grafana/skillmake-dashboard.json) already renders 30+ panels off this dataset — total installs, agent vs human split, hourly install pulse, top countries, GitHub-click leaderboard, dwell/scroll distributions, audience demand from homepage filter clicks, errors per hour, the home → marketplace → install funnel, and the per-skill leaderboards we will repurpose for per-version A/B comparison in Phase 1f.
This is what one entry looks like after Phase 1 lands. The five new fields, marked by the comment lines inside the entry, are the entire surface area of versioning:
{
  "id": "superpowers-a1b2c3d4",
  "name": "superpowers",
  "contentHash": "a1b2c3d4e5f6g7h8",
  // --- new in Phase 1 ---
  "versionId": 7,
  "parentContentHash": "f9e8d7c6b5a4",
  "edits": [
    { "op": "add_gotcha", "value": "Don't forget to..." },
    { "op": "replace_whenToUse_item", "index": 1, "value": "Use when..." }
  ],
  "status": "candidate", // "candidate" | "current" | "retired"
  "abTrafficShare": 0.2, // only meaningful when status === "candidate"
  // -----------------------
  "description": "...",
  "category": "tool",
  "audience": "engineers",
  "protectedSections": ["apiReference"],
  "markdown": "# superpowers\n\n<!-- @protected:apiReference -->\n...",
  "createdAt": "2026-05-27T12:00:00Z"
}
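The protectedSections field and the markers in markdown suggest a byte-identity check along the following lines. This is a sketch under a stated guess: the text does not spell out how a protected section is closed, so the code assumes the same marker appears twice, once to open and once to close. The function names are hypothetical.

```typescript
// Sketch only: assumes a protected section sits between two occurrences
// of the same marker. The closing convention is a guess.
function protectedBody(markdown: string, id: string): string | null {
  const marker = `<!-- @protected:${id} -->`;
  const start = markdown.indexOf(marker);
  if (start === -1) return null;
  const bodyStart = start + marker.length;
  const end = markdown.indexOf(marker, bodyStart);
  return end === -1 ? null : markdown.slice(bodyStart, end);
}

// True only when every listed section exists in the parent and is
// unchanged in the candidate. Equal strings encode to equal bytes.
function protectedSectionsIntact(
  parent: string,
  candidate: string,
  ids: string[],
): boolean {
  return ids.every((id) => {
    const before = protectedBody(parent, id);
    return before !== null && before === protectedBody(candidate, id);
  });
}
```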
Once Phase 2 lands, here is exactly what running an experiment looks like from the operator side.
The candidate goes through skill-validator on insert. Token count must be under 1500 (the paper's compactness lesson — best skills land around 920). Description must share at least one keyword with each whenToUse item — the cheap proxy for router/body alignment. Protected sections must be byte-identical to the parent. Edit budget ≤ 8 ops. If any check fails, the candidate doesn't enter the A/B at all; the curator sees the rejection reasons and can either revise or force-pass with a logged override.
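A rough sketch of those static checks, assuming hypothetical names (`CandidateDraft`, `staticGate`). Whitespace splitting stands in for a real tokenizer, and "keyword" is taken to mean any shared word of four or more characters; both are simplifications. The protected-section comparison is left out to keep the example short.

```typescript
// Sketch only: names, the token estimate, and the keyword rule are assumptions.
interface CandidateDraft {
  description: string;
  whenToUse: string[];
  markdown: string;
  edits: { op: string }[];
}

const FORBIDDEN_OPS = new Set([
  "replace_name",
  "replace_apiReference.signature",
  "replace_category",
  "replace_audience",
]);
const MAX_EDITS = 8;
const MAX_TOKENS = 1500;

function keywords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]{4,}/g) ?? []);
}

/** Returns rejection reasons; an empty array means the candidate passes. */
function staticGate(c: CandidateDraft): string[] {
  const reasons: string[] = [];
  if (c.edits.length > MAX_EDITS) {
    reasons.push(`edit budget exceeded: ${c.edits.length} > ${MAX_EDITS}`);
  }
  for (const e of c.edits) {
    if (FORBIDDEN_OPS.has(e.op)) reasons.push(`forbidden op: ${e.op}`);
  }
  const tokens = c.markdown.split(/\s+/).filter(Boolean).length;
  if (tokens >= MAX_TOKENS) reasons.push(`too long: ~${tokens} tokens`);
  const descWords = keywords(c.description);
  c.whenToUse.forEach((item, i) => {
    const shared = Array.from(keywords(item)).some((w) => descWords.has(w));
    if (!shared) reasons.push(`whenToUse[${i}] shares no keyword with description`);
  });
  return reasons;
}
```

Returning every reason rather than stopping at the first failure matches the curator flow described above, where the rejection reasons are shown together.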
Default is abTrafficShare: 0.2. shouldRouteCandidate(req, share) hashes (cf-connecting-ip + name) so the same visitor never flips arms between reloads. The cache header on /i/<name> drops to private, no-store for the duration — otherwise Cloudflare's edge cache would poison the split.
The hourly cron checks both arm counts each tick. Below 200 per arm it reports "insufficient data, keep collecting." If 14 days pass without either arm hitting 200 installs, it routes to the skill-judge LLM-judge fallback and decides on judge score instead of conversion. Curator can manually nudge the share higher (50% / 80%) for low-traffic skills if they want to accelerate.
The Grafana dashboard exposes everything an operator needs to sanity-check the test before the cron decides for you:
- Per-arm install_hit / marketplace_view ratio with explicit delta.
- Is the candidate winning on agent traffic (curl) but losing on browsers? That's a router-vs-body drift signal.
- If github_click goes up while installs go down, the candidate is making people bounce to the repo. Intent leak.

The hourly cron computes conversion ratios per arm, runs a Welch's t-test (unequal variance is the realistic assumption for differently-sized arms), and requires p < 0.05 AND strict improvement. Ties reject. Regressions retire the candidate, and the diff goes into the negative-feedback store keyed at skill:<id>:rejected:v<hash>, with the telemetry snapshot that triggered rejection. The next optimizer pass reads those rejected diffs and is prompted to avoid them.
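The decision rule can be sketched as follows. Several things here are assumptions rather than the cron's actual logic: each exposure is treated as a 0/1 trial, per-arm exposure counts are taken as given (for example, views scaled by traffic share), and the t distribution is approximated by the normal distribution, which is reasonable at 200 or more installs per arm. The names `ArmCounts` and `conversionGate` are hypothetical, and collapsing "not significant" into "reject" is a simplification.

```typescript
// Sketch only: see the assumptions listed in the lead-in.
interface ArmCounts {
  exposures: number;
  installs: number;
}
type GateDecision = "insufficient" | "promote" | "reject";

const MIN_INSTALLS_PER_ARM = 200;
const ALPHA = 0.05;

// Abramowitz-Stegun 7.1.26 approximation of erf (abs. error < 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t -
      0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
function twoSidedP(z: number): number {
  return 1 - erf(Math.abs(z) / Math.SQRT2);
}

// Unbiased sample variance of n 0/1 outcomes with k successes.
function bernoulliVariance(k: number, n: number): number {
  const p = k / n;
  return (p * (1 - p) * n) / (n - 1);
}

function conversionGate(candidate: ArmCounts, current: ArmCounts): GateDecision {
  if (
    candidate.installs < MIN_INSTALLS_PER_ARM ||
    current.installs < MIN_INSTALLS_PER_ARM
  ) {
    return "insufficient";
  }
  const pc = candidate.installs / candidate.exposures;
  const pb = current.installs / current.exposures;
  if (pc <= pb) return "reject"; // ties and regressions never promote
  // Welch standard error: per-arm variances are not pooled.
  const se = Math.sqrt(
    bernoulliVariance(candidate.installs, candidate.exposures) / candidate.exposures +
      bernoulliVariance(current.installs, current.exposures) / current.exposures,
  );
  if (se === 0) return "reject";
  return twoSidedP((pc - pb) / se) < ALPHA ? "promote" : "reject";
}
```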
Promoted? The pointer flips atomically. Cache header restores to public, max-age=300, must-revalidate. The previous current moves to retired but stays queryable for 30 days from /admin/audit — every auto-promote is reversible. Operator should pull up the per-version panels for the next 24 hours and confirm the conversion lift holds at 100% traffic, not just at 20%.
This is the test that matters. The paper's central insight: aggregate accuracy is the wrong unit. You can ship a candidate that raises the corpus mean by 2pp while making three high-traffic skills measurably worse. The cron decides per-skill; the operator's job is to spot-check that the effect size on this skill exceeds the corpus average over the last 14 days. If it doesn't, the promotion is technically valid but probably a wash.
- Low-traffic skills that cannot reach 200 installs per arm within 14 days are decided by skill-judge.
- Confirm the cache header is no-store before running. A cached /i/<name> response at the CDN edge will silently break the split.
- Audit records are keyed audit:<timestamp>:<id>. Retired versions are kept for 30 days; reverting a bad auto-promote is a one-click operation in /admin/audit.

These are the bullets that shaped every architectural choice above:
The validation gate is the only thing that matters. The best skills in the paper landed with just 1-4 accepted edits total — everything else was rejected by the gate.
Bounded edits beat full rewrites. 4-8 atomic ops per step is the sweet spot. Bigger steps regress more than they help.
Compactness wins. The median final skill in SkillOpt was ~920 tokens. Our static gate targets that number; warns at 1200; rejects at 1500.
The description is the router; the body is the agent. They are two surfaces and they can drift apart. Only end-to-end tests catch it. On skillmake we proxy with description ↔ whenToUse keyword coherence and then let the conversion gate make the final call.
Aggregate accuracy is the wrong unit. Per-skill effect size is where the action is. Hence the cron decides per-skill and the operator review checks per-skill effect against corpus average.
Frozen model + trained context is the practical adaptation. We are not training Claude. We are training the document Claude reads.
These ten are the ones with enough traffic to drive a real A/B test against the conversion gate without falling back to skill-judge. At ~50-100 installs/day each and a 20% candidate share, the candidate arm collects about 10-20 installs/day, so the top entries clear the 200-installs-per-arm threshold in roughly 10-20 days; entries at the low end need a higher candidate share to finish inside the 14-day window. The long tail will lean on the LLM-judge fallback.
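A quick way to sanity-check threshold timing for a given skill is to divide the per-arm threshold by the installs the slower arm receives per day. The helper below is hypothetical and assumes installs split between arms in proportion to the traffic share.

```typescript
// Sketch only: daysToThreshold is a hypothetical helper.
function daysToThreshold(
  installsPerDay: number,
  candidateShare: number,
  perArmThreshold = 200,
): number {
  // The arm with the smaller share fills up last, so it sets the pace.
  const slowerArmPerDay =
    installsPerDay * Math.min(candidateShare, 1 - candidateShare);
  return Math.ceil(perArmThreshold / slowerArmPerDay);
}
```

Comparing the result against the 14-day fallback window shows whether a skill needs a higher candidate share before the test starts.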
Both queries below run against the Cloudflare Analytics Engine SQL API. Full recipe collection lives in docs/analytics-queries.md.
SELECT
  blob2 AS slug,
  sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit' AND blob2 != ''
GROUP BY slug
ORDER BY installs DESC
LIMIT 50

WITH
  home AS (
    SELECT blob6 AS visitor, toStartOfDay(timestamp) AS day
    FROM skillmake_metrics
    WHERE index1 = 'home_view'
      AND timestamp >= NOW() - INTERVAL '30' DAY
    GROUP BY visitor, day
  ),
  installs AS (
    SELECT blob6 AS visitor, toStartOfDay(timestamp) AS day
    FROM skillmake_metrics
    WHERE index1 = 'install_hit'
      AND timestamp >= NOW() - INTERVAL '30' DAY
    GROUP BY visitor, day
  )
SELECT
  home.day AS day,
  count(DISTINCT home.visitor) AS home_visitors,
  count(DISTINCT installs.visitor) AS installers,
  round(count(DISTINCT installs.visitor) / count(DISTINCT home.visitor) * 100, 2) AS pct
FROM home
LEFT JOIN installs ON home.visitor = installs.visitor AND home.day = installs.day
GROUP BY day
ORDER BY day ASC

After Phase 1 lands, every per-version query is the same shape with SPLIT(blob2,'@')[1] AS hash tacked on for the split.
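On the write side, the same name@hash8 convention can be captured in a pair of small helpers. These are hypothetical names, sketched on the assumption that hash8 is the first 8 characters of contentHash.

```typescript
// Sketch only: helper names are hypothetical; taking the first 8
// characters of contentHash as hash8 is an assumption.
function versionedSlug(name: string, contentHash: string): string {
  return `${name}@${contentHash.slice(0, 8)}`;
}

function parseSlug(slug: string): { name: string; hash: string | null } {
  const at = slug.lastIndexOf("@");
  return at === -1
    ? { name: slug, hash: null } // pre-Phase-1 slug with no version suffix
    : { name: slug.slice(0, at), hash: slug.slice(at + 1) };
}
```

Treating a slug without `@` as a valid unversioned name keeps rows written before Phase 1 readable alongside versioned ones.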