Status report · 2026-05-27

SkillOpt on skillmake.xyz

A frozen model with trained markdown. We are turning every SKILL.md on skillmake into a trainable parameter — bounded edits, held-out validation, A/B routed installs. This page explains what we are building, what data we already collect, and how to evaluate an A/B test once it is live.

Phase 1: building · Phase 2: planned · Phase 3-4: cron-driven

01 What SkillOpt is

SkillOpt (Wang et al., arXiv 2605.23904) reframes a SKILL.md file as the trainable parameter of an agent system. The base model stays frozen — we never fine-tune Claude or Cursor — but the markdown the agent reads gets optimized over time the same way a neural net's weights would: bounded edits per step, held-out validation gate, strict-improvement promotion, protected sections that the edit machinery refuses to touch.

That framing matters because "self-improving agents" without those guardrails mostly produce slop. SkillOpt is the discipline that turns a freeform "edit this skill" button into a stable optimization loop.

The frozen model plus a trained context window is the practical adaptation. We are not training Claude. We are training the document Claude reads.

02 SkillOpt figure mapping

The paper's Figure 1 walks through the gradient-descent analogy. Here is the exact translation we use on skillmake:

Paper concept	skillmake equivalent
Parameter `θ`	A published `SKILL.md` (`MarketplaceEntry.markdown` in KV)
Frozen model `f`	Whatever agent the visitor runs (Claude Code, Cursor, etc.) — we never touch it
Gradient direction	Trajectory-derived edits from the optimizer (4-8 atomic `SkillEdit` ops, derived from telemetry + failure signal)
Learning rate	Bounded edit budget — max 8 ops per candidate; `replace_apiReference.signature` is forbidden
Validation check	Held-out conversion gate: `install_hit / marketplace_view` strictly improves on A/B-routed traffic, p < 0.05
Batch / minibatch	N-install threshold before the gate fires (default 200 installs per arm)
"Suboptimal skill" trajectory	Ad-hoc curator edit that bypasses the gate (deprecated path)
Rejected side updates	Candidates that fail the static or dynamic gate — logged as negative feedback for the next optimizer pass
Protected section invariant	`protectedSections` + `<!-- @protected:<id> -->` markers, byte-identical between candidate and parent

03 What we are building (4 phases, ~1 week)

Phase 1: Foundation (Day 1-3)

Versioned KV storage, bounded edits, static gate, protected sections.

Versioned storage. skill:<id> stays as the current pointer; each new version is written append-only to skill:<id>:v<n>, and at most one skill:<id>:candidate exists at a time. Old reads keep working because the migration is additive.
Version binding on install. The install route encodes the content hash in the slug as name@hash8 when writing to Analytics Engine, so a single dashboard can split conversion by version without a schema change. Existing dashboards keep working via SPLIT(blob2,'@')[0].
Bounded-edit endpoint. The freeform admin update is replaced by a structured SkillEdit[] payload — 9 op kinds, max 8 per candidate. Forbidden ops (would break router behavior): replace_name, replace_apiReference.signature, replace_category, replace_audience.
Static gate. Runs synchronously on every candidate. Checks token count (≤ 1500, target 920 per the paper), description ↔ whenToUse coherence, protected-section byte-identity, schema shape, and edit budget. Curator can override with a logged reason — matching the paper's "researcher override" pattern.
Protected sections. A new protectedSections: string[] field wraps named sections in <!-- @protected:<id> --> markers. The static gate refuses any byte difference inside those markers.
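The static-gate checks listed above (edit budget, forbidden ops, protected-section byte-identity) can be sketched as a pure function. This is a minimal illustration, not the production gate: the `SkillEdit` shape, the function names, and the assumption that a protected section sits between two identical `<!-- @protected:<id> -->` markers are all hypothetical. Token counting, schema shape, and description coherence are left out.

```typescript
// Hypothetical edit shape: only the `op` names quoted in this report are known.
interface SkillEdit {
  op: string;
  index?: number;
  value?: string;
}

const MAX_EDITS = 8;
const FORBIDDEN_OPS = new Set([
  "replace_name",
  "replace_apiReference.signature",
  "replace_category",
  "replace_audience",
]);

// Assumption: a protected section is the text between two identical markers.
// Returns null when the marker pair is missing or unclosed.
function extractProtected(markdown: string, id: string): string | null {
  const marker = `<!-- @protected:${id} -->`;
  const start = markdown.indexOf(marker);
  if (start === -1) return null;
  const bodyStart = start + marker.length;
  const end = markdown.indexOf(marker, bodyStart);
  return end === -1 ? null : markdown.slice(bodyStart, end);
}

// Returns a list of rejection reasons; an empty list means the candidate passes
// the three checks implemented here.
function staticGate(
  parentMarkdown: string,
  candidateMarkdown: string,
  edits: SkillEdit[],
  protectedSections: string[],
): string[] {
  const reasons: string[] = [];
  if (edits.length > MAX_EDITS) {
    reasons.push(`edit budget exceeded: ${edits.length} > ${MAX_EDITS}`);
  }
  for (const edit of edits) {
    if (FORBIDDEN_OPS.has(edit.op)) reasons.push(`forbidden op: ${edit.op}`);
  }
  for (const id of protectedSections) {
    const before = extractProtected(parentMarkdown, id);
    const after = extractProtected(candidateMarkdown, id);
    if (before === null || after === null) {
      reasons.push(`protected section missing or unclosed: ${id}`);
    } else if (before !== after) {
      // Equal strings encode to identical UTF-8 bytes, so this stands in for a byte compare.
      reasons.push(`protected section changed: ${id}`);
    }
  }
  return reasons;
}
```

Returning reasons instead of a boolean keeps the curator override path simple: the same list can be shown in the rejection message and stored alongside a logged override.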

Phase 2: A/B routing (Day 4)

Deterministic split at /i/<name>:

const candidate = await getCandidateSkill(id);
const useCandidate = candidate && shouldRouteCandidate(req, candidate.abTrafficShare);
const served = useCandidate ? candidate : current;

shouldRouteCandidate hashes (cf-connecting-ip + name), so the same visitor sees the same arm across reloads. Default share: 20% candidate / 80% current. While a candidate exists for a skill, cache headers flip to private, no-store to prevent Cloudflare or upstream proxies from poisoning the split.
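One way the deterministic split and the cache-header flip could look is sketched below. The report does not say which hash function is used, so the FNV-1a variant here is an assumption, as are the helper names and the simplified signature (the real `shouldRouteCandidate` takes the request object).

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. Any stable hash would do;
// this particular choice is an assumption, not the site's actual function.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps (ip, name) to a stable bucket in [0, 1).
function bucketFor(ip: string, name: string): number {
  return fnv1a(`${ip}|${name}`) / 0x100000000;
}

// Hypothetical stand-in for shouldRouteCandidate(req, share): the caller is
// expected to pass the value of the cf-connecting-ip header as `ip`.
function shouldRouteCandidateSketch(ip: string, name: string, share: number): boolean {
  return bucketFor(ip, name) < share;
}

// Cache policy: no shared caching while a candidate exists, otherwise the
// normal public policy.
function cacheControlFor(hasCandidate: boolean): string {
  return hasCandidate ? "private, no-store" : "public, max-age=300, must-revalidate";
}
```

Because the bucket depends only on the IP and the skill name, a visitor keeps the same arm for one skill across reloads but can land in different arms for different skills.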

Phase 3: Dynamic validation gate (hourly cron, Day 5)

An hourly Cloudflare Cron at /api/cron/validate-candidates reads Analytics Engine for the last 7 days. For each candidate:

Query marketplace_view count for the slug (route-agnostic — same on both arms).
Query install_hit count for name@candidate_hash and name@current_hash.
Require ≥ 200 installs per arm; otherwise "insufficient data, keep collecting."
Welch's t-test on the conversion ratios; require p < 0.05 AND strict improvement. Ties reject.
Pass → promoteCandidate(id) atomically flips the pointer. Regress → retire the candidate and log to a negative-feedback store so the optimizer doesn't re-propose the same diff.
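The decision rule in the steps above can be sketched as follows. Several things here are assumptions rather than the cron's actual code: each arm is given its own view count (the report describes `marketplace_view` as route-agnostic, so a real implementation would have to apportion views by traffic share), installs are assumed not to exceed views, the p-value uses a normal approximation to the t distribution (reasonable when each arm has hundreds of observations), and any non-passing result with sufficient data is treated as a reject.

```typescript
interface ArmStats {
  views: number;    // assumed per-arm denominator
  installs: number; // install_hit count for this arm
}

type Verdict = "insufficient_data" | "promote" | "reject";

const MIN_INSTALLS_PER_ARM = 200;
const ALPHA = 0.05;

// Complementary error function for x >= 0, Abramowitz and Stegun 7.1.26
// (absolute error around 1.5e-7).
function erfc(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

const rate = (arm: ArmStats): number => arm.installs / arm.views;

// Unbiased sample variance of a 0/1 outcome: p(1 - p) * n / (n - 1).
const sampleVariance = (arm: ArmStats): number => {
  const p = rate(arm);
  return (p * (1 - p) * arm.views) / (arm.views - 1);
};

function welchDecision(candidate: ArmStats, current: ArmStats): { verdict: Verdict; t: number; p: number } {
  if (candidate.installs < MIN_INSTALLS_PER_ARM || current.installs < MIN_INSTALLS_PER_ARM) {
    return { verdict: "insufficient_data", t: NaN, p: NaN };
  }
  // Welch statistic: difference in means over the unpooled standard error.
  const se = Math.sqrt(sampleVariance(candidate) / candidate.views + sampleVariance(current) / current.views);
  const t = (rate(candidate) - rate(current)) / se;
  // Two-sided p-value under the normal approximation: 2 * (1 - Phi(|t|)).
  const p = erfc(Math.abs(t) / Math.SQRT2);
  const pass = p < ALPHA && rate(candidate) > rate(current);
  return { verdict: pass ? "promote" : "reject", t, p };
}
```

A degenerate input (zero variance in both arms) yields a NaN statistic, which fails the `p < ALPHA` comparison and therefore rejects, matching the "ties reject" rule.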

Long-tail fallback. Any candidate that fails to cross 200 installs in 14 days falls back to an LLM-judge call against the skill-judge skill we already seeded from softaworks/agent-toolkit. The judge has to score strictly higher than the current version. This is our answer to the paper's open verifier problem — we don't solve it generally; we punt to a peer skill trained on official examples.

Phase 4: Optimizer loop (weekly cron, Day 6-7)

Weekly cron /api/cron/propose-edits:

Find every approved skill below the 25th percentile in conversion with no active candidate.
Pull 14 days of telemetry — install volume, conversion ratio, github_click rate, top countries, dwell distribution.
Pull the last N rejected candidates (avoids loops).
Render current SKILL.md + telemetry + rejected-edits as a narrow prompt to a frontier model.
Model must respond with a Zod-validated SkillEdit[] of length 4-8.
Apply edits → static gate → save as candidate → enters A/B at 20% share.
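The first step of the loop above, picking which skills get an optimizer call, might look like the sketch below. The `SkillConversion` shape and function names are hypothetical. The percentile uses the nearest-rank definition and includes skills exactly at the cutoff, which is one of several reasonable readings of "below the 25th percentile"; with 94 skills and no ties it selects at most 24.

```typescript
interface SkillConversion {
  id: string;
  conversion: number;          // install_hit / marketplace_view over the window
  hasActiveCandidate: boolean; // skills already under test are skipped
}

// Nearest-rank percentile on an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("percentile of empty input");
  const rank = Math.max(1, Math.ceil((p / 100) * sortedAsc.length));
  return sortedAsc[rank - 1];
}

// Returns ids of skills in the bottom quartile by conversion that have no
// candidate in flight.
function pickForOptimization(skills: SkillConversion[]): string[] {
  if (skills.length === 0) return [];
  const sorted = skills.map((s) => s.conversion).sort((a, b) => a - b);
  const cutoff = percentile(sorted, 25);
  return skills
    .filter((s) => s.conversion <= cutoff && !s.hasActiveCandidate)
    .map((s) => s.id);
}
```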

A SKILLOPT_ENABLED=true|false env var gates every cron, so the kill switch is one config flip. Budget at current corpus size (94 skills, so the bottom quartile by conversion is ≈ 24 calls per weekly run × $0.40 avg) is roughly $10 per week, or about $40/month.

04 Are we collecting usage data?

Yes — and we have been for weeks. Every event lands in the skillmake_metrics Cloudflare Workers Analytics Engine dataset. No personally identifying info is stored; the daily visitor id is a SHA-256 hash of ip + ua + day truncated to 8 bytes, so the same person on the same day collapses into one id but cannot be tracked across days. Search queries are also hashed before write.
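A minimal sketch of the daily visitor id described above, using Node's `crypto` module. The function name, the `|` separator, and the `YYYY-MM-DD` day format are assumptions; inside a Worker the same digest could be computed with Web Crypto instead.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over ip + ua + day, truncated to the first 8 bytes (16 hex chars).
// The separator and the day format are illustrative choices.
function dailyVisitorId(ip: string, userAgent: string, day: string): string {
  return createHash("sha256")
    .update(`${ip}|${userAgent}|${day}`)
    .digest()
    .subarray(0, 8)
    .toString("hex");
}
```

Because the day is part of the hashed input, the same visitor gets an unrelated id the next day, which is what prevents cross-day tracking.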

Here is every MetricEvent type the codebase currently fires, and what each one tells us:

Event	What it tells us
`install_hit`	Someone hit `/i/<name>` and got the markdown. After Phase 1, the slug includes `@hash8` so we can split by version. This is the gradient signal.
`marketplace_view`	The marketplace listing page loaded for a given skill — denominator for conversion.
`home_view`	The homepage loaded; `blob2` carries the audience filter slug if one is active (engineers/creators/etc.).
`tricks_view` / `powerhouse_view`	Cross-page interest — which subsection of the site people drift toward.
`search_submitted`	A search query was submitted (hashed). Volume + funnel signal, not literal text.
`github_click`	Outbound click to the repo. Paired with `install_hit`, this catches the "intent leak" — people who read the listing and bounced to GitHub instead of installing.
`page_dwell`	Bucketed dwell time: `0-5s / 5-15s / 15-30s / 30-60s / 60-300s / 300s+`.
`scroll_depth`	Bucketed reading depth: `0 / 25 / 50 / 75 / 100`.
`submit_started` / `submit_completed`	Submit funnel — top and bottom of the conversion flow for new skill submissions.
`convert_success` / `convert_error`	The `/api/convert` extraction pipeline succeeded or failed. `blob2` carries error code; `double2` carries HTTP status.
`api_error`	Any 4xx/5xx from a server route. Logged via `observability.ts` as a structured JSON line + Analytics Engine row.

The column layout is the same for every event:

Column	Meaning
`index1` / `blob1`	Event name (mirrored for SQL convenience)
`blob2`	Slug / bucket / route fragment / hashed search query (event-dependent)
`blob3`	Country (`cf-ipcountry`; `??` if unknown)
`blob4`	Referer host (empty for direct hits)
`blob5`	UA category — `curl` / `browser` / `bot` / `other`
`blob6`	Daily visitor id — SHA-256(ip+ua+day), 8 bytes
`double1`	Always `1` (sample interval anchor)
`double2`	HTTP status (`api_error`, `convert_error` only)
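The column layout in the table can be expressed as a small mapping function. The `MetricEventInput` shape and the function name are hypothetical, and writing `0` into `double2` for events that carry no HTTP status is an assumption. The returned object follows the `indexes` / `blobs` / `doubles` shape that Analytics Engine's `writeDataPoint` accepts.

```typescript
// Hypothetical input shape for one event.
interface MetricEventInput {
  name: string;                 // e.g. "install_hit"
  slug?: string;                // blob2: slug, bucket, route fragment, or hashed query
  country?: string;             // blob3: cf-ipcountry
  refererHost?: string;         // blob4
  uaCategory: "curl" | "browser" | "bot" | "other"; // blob5
  visitorId: string;            // blob6: daily visitor id
  httpStatus?: number;          // double2: only for api_error / convert_error
}

interface DataPoint {
  indexes: string[];
  blobs: string[];
  doubles: number[];
}

// Builds one data point in the column order documented in the table.
function toDataPoint(event: MetricEventInput): DataPoint {
  return {
    indexes: [event.name],
    blobs: [
      event.name,               // blob1 mirrors index1
      event.slug ?? "",
      event.country ?? "??",    // "??" when the country is unknown
      event.refererHost ?? "",  // empty for direct hits
      event.uaCategory,
      event.visitorId,
    ],
    doubles: [1, event.httpStatus ?? 0],
  };
}
```

In a Worker, the result would be passed to the dataset binding's `writeDataPoint` method; keeping the mapping pure makes the column order easy to unit-test.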

The Grafana dashboard (docs/grafana/skillmake-dashboard.json) already renders 30+ panels off this dataset — total installs, agent vs human split, hourly install pulse, top countries, GitHub-click leaderboard, dwell/scroll distributions, audience demand from homepage filter clicks, errors per hour, the home → marketplace → install funnel, and the per-skill leaderboards we will repurpose for per-version A/B comparison in Phase 1f.

05 MarketplaceEntry shape: new fields

This is what one entry looks like after Phase 1 lands. The five fields grouped under the "new in Phase 1" comment are the entire surface area of versioning; protectedSections, further down, is also new:

{
  "id": "superpowers-a1b2c3d4",
  "name": "superpowers",
  "contentHash": "a1b2c3d4e5f6a7b8",

  // --- new in Phase 1 ---
  "versionId": 7,
  "parentContentHash": "f9e8d7c6b5a4",
  "edits": [
    { "op": "add_gotcha", "value": "Don't forget to..." },
    { "op": "replace_whenToUse_item", "index": 1, "value": "Use when..." }
  ],
  "status": "candidate",        // "candidate" | "current" | "retired"
  "abTrafficShare": 0.2,        // only meaningful when status === "candidate"
  // -----------------------

  "description": "...",
  "category": "tool",
  "audience": "engineers",
  "protectedSections": ["apiReference"],
  "markdown": "# superpowers\n\n<!-- @protected:apiReference -->\n...",
  "createdAt": "2026-05-27T12:00:00Z"
}
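The `name@hash8` slug that the install route writes to Analytics Engine can be derived from an entry's `name` and `contentHash`. The helper names below are hypothetical, and taking the first eight characters of the hash is an assumption based on the `hash8` label.

```typescript
// Builds the versioned slug written to blob2, e.g. "superpowers@a1b2c3d4".
function versionedSlug(name: string, contentHash: string): string {
  return `${name}@${contentHash.slice(0, 8)}`;
}

// Splits a slug back into name and hash. Slugs written before versioning
// carry no "@", so hash8 is null for them.
function parseSlug(slug: string): { name: string; hash8: string | null } {
  const at = slug.lastIndexOf("@");
  if (at === -1) return { name: slug, hash8: null };
  return { name: slug.slice(0, at), hash8: slug.slice(at + 1) };
}
```

Handling the unversioned case explicitly is what lets old and new rows share one dashboard query path.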

06 How an A/B test gets evaluated: operator playbook, steps 0-7

Once Phase 2 lands, here is exactly what running an experiment looks like from the operator side.

STEP 0

Pre-flight static gate

The candidate goes through skill-validator on insert. Token count must be under 1500 (the paper's compactness lesson — best skills land around 920). Description must share at least one keyword with each whenToUse item — the cheap proxy for router/body alignment. Protected sections must be byte-identical to the parent. Edit budget ≤ 8 ops. If any check fails, the candidate doesn't enter the A/B at all; the curator sees the rejection reasons and can either revise or force-pass with a logged override.
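The description ↔ whenToUse coherence check can be approximated with a simple keyword overlap. This is a sketch under assumptions: the stopword list, the three-character minimum, and the function names are illustrative, and there is no stemming, so "tests" and "testing" do not match.

```typescript
// Small illustrative stopword list; a real gate would likely use a longer one.
const STOPWORDS = new Set([
  "a", "an", "the", "and", "or", "to", "of", "for", "in", "on", "with",
  "when", "use", "you", "is", "are", "it", "this", "that",
]);

// Lowercases, splits on non-alphanumerics, and drops short words and stopwords.
function keywords(text: string): Set<string> {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => w.length > 2 && !STOPWORDS.has(w)));
}

// Returns the indexes of whenToUse items that share no keyword with the
// description. An empty array means the check passes.
function incoherentItems(description: string, whenToUse: string[]): number[] {
  const descriptionKeywords = keywords(description);
  const failing: number[] = [];
  whenToUse.forEach((item, index) => {
    const shared = Array.from(keywords(item)).some((k) => descriptionKeywords.has(k));
    if (!shared) failing.push(index);
  });
  return failing;
}
```

Returning the failing indexes, not just a boolean, gives the curator a precise list of items to revise.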

STEP 1

Start at 20% candidate share

Default is abTrafficShare: 0.2. shouldRouteCandidate(req, share) hashes (cf-connecting-ip + name) so the same visitor never flips arms between reloads. The cache header on /i/<name> drops to private, no-store for the duration — otherwise Cloudflare's edge cache would poison the split.

STEP 2

Wait for ≥ 200 installs per arm — or 14 days

The hourly cron checks both arm counts each tick. Below 200 per arm it reports "insufficient data, keep collecting." If 14 days pass without both arms reaching 200 installs, it falls back to the skill-judge LLM judge and decides on judge score instead of conversion. For low-traffic skills, the curator can manually raise the candidate share (50% / 80%) to accelerate data collection.
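A back-of-envelope helper for this step: given a skill's total installs per day and the candidate share, estimate how many days each arm needs to reach the threshold. The function name and return shape are hypothetical, and the estimate assumes a steady daily install rate.

```typescript
interface ThresholdEstimate {
  candidate: number; // days for the candidate arm to reach the threshold
  current: number;   // days for the current arm to reach the threshold
  both: number;      // days until both arms have reached it
}

// Requires installsPerDay > 0 and 0 < candidateShare < 1.
function daysToThreshold(
  installsPerDay: number,
  candidateShare: number,
  threshold: number = 200,
): ThresholdEstimate {
  // Subtract a tiny epsilon before rounding up to guard against float noise.
  const days = (perDay: number): number => Math.ceil(threshold / perDay - 1e-9);
  const candidate = days(installsPerDay * candidateShare);
  const current = days(installsPerDay * (1 - candidateShare));
  return { candidate, current, both: Math.max(candidate, current) };
}
```

For example, `daysToThreshold(100, 0.5)` gives 4 days for both arms. At lower shares the candidate arm is the bottleneck, so comparing `both` against the 14-day fallback window shows whether a share nudge is needed.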

STEP 3

Read the dashboard panels

The Grafana dashboard exposes everything an operator needs to sanity-check the test before the cron decides for you:

Top skill versions by install rate (14d) — per-version bargauge, candidate next to current
Conversion delta per skill version — install_hit / marketplace_view ratio with explicit delta
UA split per version — is the candidate winning on agents (curl) but losing on browsers? That's a router-vs-body drift signal
Top countries per version — geographic skew often hides an audience-fit problem
GitHub-click ratio per version — if github_click goes up while installs go down, the candidate is making people bounce to the repo. Intent leak.
Errors per hour — a candidate that breaks downstream agent flow shows up here first

STEP 4

Cron decides — Welch's t-test, p < 0.05, strict improvement

The hourly cron computes conversion ratios per arm, runs a Welch's t-test (unequal variance is the realistic assumption for differently-sized arms), and requires p < 0.05 AND strict improvement. Ties reject. Regressions retire the candidate and the diff goes into the negative-feedback store keyed at skill:<id>:rejected:v<hash>, with the telemetry snapshot that triggered rejection. The next optimizer pass reads those rejected diffs and is prompted to avoid them.

STEP 5

Post-promotion sanity check

Promoted? The pointer flips atomically. Cache header restores to public, max-age=300, must-revalidate. The previous current moves to retired but stays queryable for 30 days from /admin/audit — every auto-promote is reversible. Operator should pull up the per-version panels for the next 24 hours and confirm the conversion lift holds at 100% traffic, not just at 20%.

STEP 6

Per-skill effect size > corpus average

This is the test that matters. The paper's central insight: aggregate accuracy is the wrong unit. You can ship a candidate that raises the corpus mean by 2pp while making three high-traffic skills measurably worse. The cron decides per-skill; the operator's job is to spot-check that the effect size on this skill exceeds the corpus average over the last 14 days. If it doesn't, the promotion is technically valid but probably a wash.

STEP 7

Failure modes to watch

Low traffic + judge fallback drift. If a skill never crosses 200 installs/arm and the judge keeps promoting candidates that don't show up in conversion later, the judge prompt itself has drifted. Re-seed skill-judge.
Optimizer loops. If the same skill goes through five rejected candidates in a row, the model is stuck in a local minimum. The rejected-edits store should prevent literal repetition; if it's still happening, the telemetry input is too narrow. Widen the context.
Cache poisoning. Always confirm the candidate cache header is no-store before running. A cached /i/<name> response at the CDN edge will silently break the split.
Curator surprise. Every auto-promote writes to audit:<timestamp>:<id>. Retired versions are kept for 30 days; reverting a bad auto-promote is a one-click operation in /admin/audit.

07 Key insights from the paper

These are the bullets that shaped every architectural choice above:

The validation gate is the only thing that matters. The best skills in the paper landed with just 1-4 accepted edits total — everything else was rejected by the gate.

Bounded edits beat full rewrites. 4-8 atomic ops per step is the sweet spot. Bigger steps regress more than they help.

Compactness wins. The median final skill in SkillOpt was ~920 tokens. Our static gate targets that number; warns at 1200; rejects at 1500.

The description is the router; the body is the agent. They are two surfaces and they can drift apart. Only end-to-end tests catch it. On skillmake we proxy with description ↔ whenToUse keyword coherence and then let the conversion gate make the final call.

Aggregate accuracy is the wrong unit. Per-skill effect size is where the action is. Hence the cron decides per-skill and the operator review checks per-skill effect against corpus average.

Frozen model + trained context is the practical adaptation. We are not training Claude. We are training the document Claude reads.

08 Current state snapshot

Live as of 2026-05-27. Marketplace data from GET /api/marketplace; install counts from Cloudflare Analytics Engine skillmake_metrics dataset (all-time).

Audience	Skills
Engineers	43
Creators	23
DevOps	8
Design	6
Marketing	6
AI	4
General	4
Total	94

Top 10 skills by install count (all-time)

Sourced live from Cloudflare Analytics Engine via the leaderboard query in docs/analytics-queries.md.

Skill	Installs
superpowers	773
anthropic-webapp-testing	689
karpathy-claude-md	665
ui-ux-pro-max	664
podcast-show-notes	654
anthropic-skill-creator	641
video-transcript-to-blog	637
shadcn-ui-skill	625
mp-triage	521
mp-zoom-out	492

These ten are the ones with enough traffic to drive a real A/B test against the conversion gate. At ~50-100 installs/day each and a 20% candidate share, the candidate arm collects about 10-20 installs/day, so it needs roughly 10-20 days to reach 200 installs; raising the share to 50% cuts that to about 4-8 days and keeps the test inside the 14-day window before the skill-judge fallback. The long tail will lean on the LLM-judge fallback.

09 Sample SQL queries

Both queries below run against the Cloudflare Analytics Engine SQL API. Full recipe collection lives in docs/analytics-queries.md.

Install leaderboard (all-time)

SELECT
  blob2 AS slug,
  sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit' AND blob2 != ''
GROUP BY slug
ORDER BY installs DESC
LIMIT 50

Funnel — same-day home view → install

WITH
  home AS (
    SELECT blob6 AS visitor, toStartOfDay(timestamp) AS day
    FROM skillmake_metrics
    WHERE index1 = 'home_view'
      AND timestamp >= NOW() - INTERVAL '30' DAY
    GROUP BY visitor, day
  ),
  installs AS (
    SELECT blob6 AS visitor, toStartOfDay(timestamp) AS day
    FROM skillmake_metrics
    WHERE index1 = 'install_hit'
      AND timestamp >= NOW() - INTERVAL '30' DAY
    GROUP BY visitor, day
  )
SELECT
  home.day AS day,
  count(DISTINCT home.visitor) AS home_visitors,
  count(DISTINCT installs.visitor) AS installers,
  round(count(DISTINCT installs.visitor) / count(DISTINCT home.visitor) * 100, 2) AS pct
FROM home
LEFT JOIN installs ON home.visitor = installs.visitor AND home.day = installs.day
GROUP BY day
ORDER BY day ASC

After Phase 1 lands, every per-version query is the same shape with SPLIT(blob2,'@')[1] AS hash tacked on for the split.