Status report · 2026-05-27

SkillOpt on skillmake.xyz

A frozen model with trained markdown. We are turning every SKILL.md on skillmake into a trainable parameter — bounded edits, held-out validation, A/B routed installs. This page explains what we are building, what data we already collect, and how to evaluate an A/B test once it is live.

Phase 1: building · Phase 2: planned · Phase 3-4: cron-driven

01 What SkillOpt is

SkillOpt (Wang et al., arXiv 2605.23904) reframes a SKILL.md file as the trainable parameter of an agent system. The base model stays frozen — we never fine-tune Claude or Cursor — but the markdown the agent reads gets optimized over time the same way a neural net's weights would: bounded edits per step, held-out validation gate, strict-improvement promotion, protected sections that the edit machinery refuses to touch.

That framing matters because "self-improving agents" without those guardrails mostly produce slop. SkillOpt is the discipline that turns a freeform "edit this skill" button into a stable optimization loop.

The frozen model plus a trained context window is the practical adaptation. We are not training Claude. We are training the document Claude reads.

02 SkillOpt figure mapping

The paper's Figure 1 walks through the gradient-descent analogy. Here is the exact translation we use on skillmake:

| Paper concept | skillmake equivalent |
| --- | --- |
| Parameter θ | A published SKILL.md (MarketplaceEntry.markdown in KV) |
| Frozen model f | Whatever agent the visitor runs (Claude Code, Cursor, etc.); we never touch it |
| Gradient direction | Trajectory-derived edits from the optimizer (4-8 atomic SkillEdit ops, derived from telemetry + failure signal) |
| Learning rate | Bounded edit budget: max 8 ops per candidate; replace_apiReference.signature is forbidden |
| Validation check | Held-out conversion gate: install_hit / marketplace_view strictly improves on A/B-routed traffic, p < 0.05 |
| Batch / minibatch | N-install threshold before the gate fires (default 200 installs per arm) |
| "Suboptimal skill" trajectory | Ad-hoc curator edit that bypasses the gate (deprecated path) |
| Rejected side updates | Candidates that fail the static or dynamic gate, logged as negative feedback for the next optimizer pass |
| Protected section invariant | protectedSections + <!-- @protected:<id> --> markers, byte-identical between candidate and parent |

03 What we are building (4 phases, ~1 week)

Phase 1: Foundation · Day 1-3

Versioned KV storage, bounded edits, static gate, protected sections.

Phase 2: A/B routing · Day 4

Deterministic split at /i/<name>:

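// `current` is the live entry for this skill; `candidate` is falsy when no experiment is running.
const candidate = await getCandidateSkill(id);
// Route to the candidate only when one exists and this visitor hashes into its traffic share.
const useCandidate = candidate && shouldRouteCandidate(req, candidate.abTrafficShare);
const served = useCandidate ? candidate : current;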

shouldRouteCandidate hashes (cf-connecting-ip + name), so the same visitor sees the same arm across reloads. Default share: 20% candidate / 80% current. While a candidate exists for a skill, cache headers flip to private, no-store to prevent Cloudflare or upstream proxies from poisoning the split.
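The deterministic split described above can be sketched as follows. This is a minimal illustration, not the production code: `bucketFor` and `routeToCandidate` are hypothetical names, the `|` separator is an assumption, and Node's `node:crypto` is used for brevity (a Worker would more likely use the async Web Crypto digest).

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: map (ip, skill name) to a stable bucket in [0, 1).
// The same inputs always hash to the same bucket, so a visitor stays in one arm.
function bucketFor(ip: string, name: string): number {
  const digest = createHash("sha256").update(`${ip}|${name}`).digest();
  // First 4 bytes as an unsigned 32-bit integer, scaled into [0, 1).
  return digest.readUInt32BE(0) / 2 ** 32;
}

// Hypothetical stand-in for shouldRouteCandidate: serve the candidate when the
// visitor's bucket falls below the candidate traffic share (0.2 by default).
function routeToCandidate(ip: string, name: string, share: number): boolean {
  return bucketFor(ip, name) < share;
}
```

Because the bucket depends only on the IP and the skill name, raising the share from 0.2 to 0.5 keeps every visitor who was already on the candidate arm there and only moves additional visitors over.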

Phase 3: Dynamic validation gate · cron, hourly · Day 5

An hourly Cloudflare Cron at /api/cron/validate-candidates reads Analytics Engine for the last 7 days. For each candidate:

  1. Query marketplace_view count for the slug (route-agnostic — same on both arms).
  2. Query install_hit count for name@candidate_hash and name@current_hash.
  3. Require ≥ 200 installs per arm; otherwise "insufficient data, keep collecting."
  4. Welch's t-test on the conversion ratios; require p < 0.05 AND strict improvement. Ties reject.
  5. Pass → promoteCandidate(id) atomically flips the pointer. Regress → retire the candidate and log to a negative-feedback store so the optimizer doesn't re-propose the same diff.
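Steps 3-5 can be sketched as a single decision function. Several things here are assumptions rather than the shipped implementation: `welchGate` and `Arm` are hypothetical names; each arm's `views` must be attributed by the caller (for example by scaling the route-agnostic marketplace_view count by the traffic share); installs are treated as 0/1 outcomes per view; and the p-value uses a normal approximation to the t distribution, which is reasonable only at the sample sizes the 200-install threshold implies.

```typescript
interface Arm {
  views: number;    // trials attributed to this arm (attribution is an assumption)
  installs: number; // install_hit count for this arm
}

// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

function welchGate(current: Arm, candidate: Arm, alpha = 0.05) {
  if (current.views < 2 || candidate.views < 2) throw new Error("need at least 2 views per arm");
  const pCur = current.installs / current.views;
  const pCand = candidate.installs / candidate.views;
  // Unbiased sample variance of a 0/1 outcome: p(1 - p) * n / (n - 1).
  const varCur = (pCur * (1 - pCur) * current.views) / (current.views - 1);
  const varCand = (pCand * (1 - pCand) * candidate.views) / (candidate.views - 1);
  // Welch statistic: separate variance terms per arm, no pooling.
  const se = Math.sqrt(varCur / current.views + varCand / candidate.views);
  const t = se === 0 ? 0 : (pCand - pCur) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(t))); // two-sided, normal approximation
  // Strict improvement AND significance; ties and regressions do not promote.
  return { t, pValue, promote: pCand > pCur && pValue < alpha };
}
```

A caller would first enforce the 200-installs-per-arm minimum, then promote on `promote === true` and retire the candidate otherwise.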

Long-tail fallback. Any candidate that fails to cross 200 installs in 14 days falls back to an LLM-judge call against the skill-judge skill we already seeded from softaworks/agent-toolkit. The judge has to score strictly higher than the current version. This is our answer to the paper's open verifier problem — we don't solve it generally; we punt to a peer skill trained on official examples.

Phase 4: Optimizer loop · cron, weekly · Day 6-7

Weekly cron /api/cron/propose-edits:

  1. Find every approved skill below the 25th percentile in conversion with no active candidate.
  2. Pull 14 days of telemetry — install volume, conversion ratio, github_click rate, top countries, dwell distribution.
  3. Pull the last N rejected candidates (avoids loops).
  4. Render current SKILL.md + telemetry + rejected-edits as a narrow prompt to a frontier model.
  5. Model must respond with a Zod-validated SkillEdit[] of length 4-8.
  6. Apply edits → static gate → save as candidate → enters A/B at 20% share.
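Step 5 is the point where a malformed model response has to be rejected before anything touches a skill. The sketch below hand-rolls that check in plain TypeScript so it stays self-contained; it stands in for the Zod schema the pipeline uses and is not a copy of it. The `SkillEdit` field set (`op`, `value`, optional `index`) is inferred from the sample entry on this page, and `parseSkillEdits` is a hypothetical name.

```typescript
interface SkillEdit {
  op: string;
  value: string;
  index?: number;
}

// The only forbidden op named on this page; the real list may be longer.
const FORBIDDEN_OPS = new Set(["replace_apiReference.signature"]);

function parseSkillEdits(raw: unknown): SkillEdit[] {
  if (!Array.isArray(raw)) throw new Error("expected an array of edits");
  if (raw.length < 4 || raw.length > 8) {
    throw new Error(`expected 4-8 edits, got ${raw.length}`);
  }
  return raw.map((item: unknown, i: number): SkillEdit => {
    if (typeof item !== "object" || item === null) throw new Error(`edit ${i}: not an object`);
    const { op, value, index } = item as Record<string, unknown>;
    if (typeof op !== "string" || op.length === 0) throw new Error(`edit ${i}: missing op`);
    if (FORBIDDEN_OPS.has(op)) throw new Error(`edit ${i}: op "${op}" is forbidden`);
    if (typeof value !== "string") throw new Error(`edit ${i}: value must be a string`);
    let idx: number | undefined;
    if (index !== undefined) {
      if (typeof index !== "number" || !Number.isInteger(index) || index < 0) {
        throw new Error(`edit ${i}: index must be a non-negative integer`);
      }
      idx = index;
    }
    return idx === undefined ? { op, value } : { op, value, index: idx };
  });
}
```

Throwing on the first violation keeps the cron simple: a response that fails to parse is discarded and never reaches the static gate.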

A SKILLOPT_ENABLED=true|false env var gates every cron, so the kill switch is one config flip. Budget at current corpus size (94 skills; the bottom quartile targeted in step 1 ≈ 24 weekly calls × $0.40 avg) is roughly $10/week, or about $40/month.

04 Are we collecting usage data?

Yes — and we have been for weeks. Every event lands in the skillmake_metrics Cloudflare Workers Analytics Engine dataset. No personally identifying info is stored; the daily visitor id is a SHA-256 hash of ip + ua + day truncated to 8 bytes, so the same person on the same day collapses into one id but cannot be tracked across days. Search queries are also hashed before write.
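A minimal sketch of the daily visitor id described above, assuming the three inputs are joined with a `|` separator and the day is a UTC date string (neither detail is stated on this page). `dailyVisitorId` is a hypothetical name, and Node's `node:crypto` stands in for whatever hashing the Worker uses.

```typescript
import { createHash } from "node:crypto";

// SHA-256 over ip + ua + day, truncated to the first 8 bytes (16 hex characters).
// Including the day means the id changes at the day boundary, so visits cannot be
// linked across days.
function dailyVisitorId(ip: string, ua: string, day: string): string {
  return createHash("sha256")
    .update(`${ip}|${ua}|${day}`)
    .digest()
    .subarray(0, 8)
    .toString("hex");
}
```

Two hits from the same IP and user agent on 2026-05-27 collapse to one id; the same pair on 2026-05-28 produces an unrelated id.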

Here is every MetricEvent type the codebase currently fires, and what each one tells us:

| Event | What it tells us |
| --- | --- |
| install_hit | Someone hit /i/<name> and got the markdown. After Phase 1, the slug includes @hash8 so we can split by version. This is the gradient signal. |
| marketplace_view | The marketplace listing page loaded for a given skill; this is the denominator for conversion. |
| home_view | The homepage loaded; blob2 carries the audience filter slug if one is active (engineers/creators/etc.). |
| tricks_view / powerhouse_view | Cross-page interest: which subsection of the site people drift toward. |
| search_submitted | A search query was submitted (hashed). Volume + funnel signal, not literal text. |
| github_click | Outbound click to the repo. Paired with install_hit, this catches the "intent leak": people who read the listing and bounced to GitHub instead of installing. |
| page_dwell | Bucketed dwell time: 0-5s / 5-15s / 15-30s / 30-60s / 60-300s / 300s+. |
| scroll_depth | Bucketed reading depth (%): 0 / 25 / 50 / 75 / 100. |
| submit_started / submit_completed | Submit funnel: top and bottom of the conversion flow for new skill submissions. |
| convert_success / convert_error | The /api/convert extraction pipeline succeeded or failed. blob2 carries error code; double2 carries HTTP status. |
| api_error | Any 4xx/5xx from a server route. Logged via observability.ts as a structured JSON line + Analytics Engine row. |

And the blob columns are stable across every event:

| Column | Meaning |
| --- | --- |
| index1 / blob1 | Event name (mirrored for SQL convenience) |
| blob2 | Slug / bucket / route fragment / hashed search query (event-dependent) |
| blob3 | Country (cf-ipcountry; ?? if unknown) |
| blob4 | Referer host (empty for direct hits) |
| blob5 | UA category: curl / browser / bot / other |
| blob6 | Daily visitor id: SHA-256(ip+ua+day), truncated to 8 bytes |
| double1 | Always 1 (sample interval anchor) |
| double2 | HTTP status (api_error, convert_error only) |
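The column layout above maps naturally onto the `indexes` / `blobs` / `doubles` arrays that Analytics Engine's `writeDataPoint` accepts. The sketch below builds that payload as a pure function; the `MetricEvent` field names and `toDataPoint` are hypothetical, and writing 0 into double2 for events without an HTTP status is an assumption.

```typescript
interface MetricEvent {
  name: string;                                   // e.g. "install_hit"
  detail?: string;                                // slug, bucket, route fragment, or hashed query
  country?: string;                               // cf-ipcountry value, if present
  refererHost?: string;
  uaCategory: "curl" | "browser" | "bot" | "other";
  visitorId: string;                              // daily visitor id
  httpStatus?: number;                            // only for api_error / convert_error
}

interface DataPoint {
  indexes: string[];
  blobs: string[];
  doubles: number[];
}

function toDataPoint(e: MetricEvent): DataPoint {
  return {
    indexes: [e.name],           // index1
    blobs: [
      e.name,                    // blob1: event name, mirrored for SQL convenience
      e.detail ?? "",            // blob2
      e.country ?? "??",         // blob3: "??" when the country is unknown
      e.refererHost ?? "",       // blob4: empty for direct hits
      e.uaCategory,              // blob5
      e.visitorId,               // blob6
    ],
    doubles: [1, e.httpStatus ?? 0], // double1 is always 1; double2 defaults to 0 (assumption)
  };
}
```

Keeping the mapping in one function means every event type fills the same positions, which is what lets the SQL queries rely on fixed column meanings.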

The Grafana dashboard (docs/grafana/skillmake-dashboard.json) already renders 30+ panels off this dataset — total installs, agent vs human split, hourly install pulse, top countries, GitHub-click leaderboard, dwell/scroll distributions, audience demand from homepage filter clicks, errors per hour, the home → marketplace → install funnel, and the per-skill leaderboards we will repurpose for per-version A/B comparison in Phase 1f.

05 MarketplaceEntry shape: new fields

This is what one entry looks like after Phase 1 lands. The five new fields, grouped under the "new in Phase 1" marker, are the entire surface area of versioning:

{
  "id": "superpowers-a1b2c3d4",
  "name": "superpowers",
  "contentHash": "a1b2c3d4e5f6a7b8",

  // --- new in Phase 1 ---
  "versionId": 7,
  "parentContentHash": "f9e8d7c6b5a4",
  "edits": [
    { "op": "add_gotcha", "value": "Don't forget to..." },
    { "op": "replace_whenToUse_item", "index": 1, "value": "Use when..." }
  ],
  "status": "candidate",        // "candidate" | "current" | "retired"
  "abTrafficShare": 0.2,        // only meaningful when status === "candidate"
  // -----------------------

  "description": "...",
  "category": "tool",
  "audience": "engineers",
  "protectedSections": ["apiReference"],
  "markdown": "# superpowers\n\n<!-- @protected:apiReference -->\n...",
  "createdAt": "2026-05-27T12:00:00Z"
}

06 How an A/B test gets evaluated: 7-step operator playbook

Once Phase 2 lands, here is exactly what running an experiment looks like from the operator side.

STEP 0

Pre-flight static gate

The candidate goes through skill-validator on insert. Token count must be under 1500 (the paper's compactness lesson — best skills land around 920). Description must share at least one keyword with each whenToUse item — the cheap proxy for router/body alignment. Protected sections must be byte-identical to the parent. Edit budget ≤ 8 ops. If any check fails, the candidate doesn't enter the A/B at all; the curator sees the rejection reasons and can either revise or force-pass with a logged override.
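The four checks in this step can be sketched as one pure function. Everything here is illustrative: `staticGate`, `Candidate`, and `GateResult` are hypothetical names, the token count is a crude whitespace-word estimate rather than a real tokenizer, and the keyword match (lowercase words of 4+ characters, no stopword list) is a stand-in for whatever skill-validator actually does. The 1200-token warning comes from the compactness targets quoted later on this page.

```typescript
interface Candidate {
  description: string;
  whenToUse: string[];
  markdown: string;
  editCount: number;
  protectedBodies: Record<string, string>; // section id -> exact section text
}

interface GateResult {
  ok: boolean;
  warnings: string[];
  errors: string[];
}

// Assumption: approximate tokens by whitespace-separated words.
function estimateTokens(markdown: string): number {
  const trimmed = markdown.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function keywords(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]{4,}/g) ?? []);
}

function staticGate(candidate: Candidate, parentProtected: Record<string, string>): GateResult {
  const errors: string[] = [];
  const warnings: string[] = [];

  const tokens = estimateTokens(candidate.markdown);
  if (tokens >= 1500) errors.push(`token estimate ${tokens} is not under 1500`);
  else if (tokens >= 1200) warnings.push(`token estimate ${tokens} is above 1200`);

  if (candidate.editCount > 8) errors.push(`edit budget exceeded: ${candidate.editCount} ops`);

  // Router/body alignment proxy: each whenToUse item shares a keyword with the description.
  const descWords = keywords(candidate.description);
  candidate.whenToUse.forEach((item, i) => {
    const shared = Array.from(keywords(item)).some((w) => descWords.has(w));
    if (!shared) errors.push(`whenToUse[${i}] shares no keyword with the description`);
  });

  // Protected sections must match the parent exactly.
  for (const [id, body] of Object.entries(parentProtected)) {
    if (candidate.protectedBodies[id] !== body) errors.push(`protected section "${id}" changed`);
  }

  return { ok: errors.length === 0, warnings, errors };
}
```

Returning every failure reason, rather than stopping at the first, matches the curator flow described above: the rejection list is what they revise against or override.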

STEP 1

Start at 20% candidate share

Default is abTrafficShare: 0.2. shouldRouteCandidate(req, share) hashes (cf-connecting-ip + name) so the same visitor never flips arms between reloads. The cache header on /i/<name> drops to private, no-store for the duration — otherwise Cloudflare's edge cache would poison the split.

STEP 2

Wait for ≥ 200 installs per arm — or 14 days

The hourly cron checks both arm counts each tick. While either arm is below 200 installs it reports "insufficient data, keep collecting." If 14 days pass without both arms reaching 200 installs, it routes to the skill-judge LLM-judge fallback and decides on judge score instead of conversion. For low-traffic skills, the curator can manually raise the candidate share (to 50% or 80%) to accelerate data collection.
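The waiting rule reduces to a three-way decision per tick. A minimal sketch, reading the threshold as a per-arm requirement (both arms need 200 installs); `gatePath` and the `GatePath` labels are hypothetical names.

```typescript
type GatePath = "collecting" | "conversion-gate" | "judge-fallback";

function gatePath(
  currentInstalls: number,
  candidateInstalls: number,
  ageDays: number,
  minPerArm = 200,
  maxDays = 14,
): GatePath {
  // Enough data on both arms: the statistical gate can run.
  if (currentInstalls >= minPerArm && candidateInstalls >= minPerArm) {
    return "conversion-gate";
  }
  // Out of time without enough data: hand the decision to the LLM judge.
  return ageDays >= maxDays ? "judge-fallback" : "collecting";
}
```

At a 20% share the candidate arm is almost always the one that lags, so it is usually the candidate count that determines which path a skill takes.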

STEP 3

Read the dashboard panels

The Grafana dashboard exposes everything an operator needs to sanity-check the test before the cron decides for you.

STEP 4

Cron decides — Welch's t-test, p < 0.05, strict improvement

The hourly cron computes conversion ratios per arm, runs a Welch's t-test (unequal variance is the realistic assumption for differently-sized arms), and requires p < 0.05 AND strict improvement. Ties reject. Regressions retire the candidate and the diff goes into the negative-feedback store keyed at skill:<id>:rejected:v<hash>, with the telemetry snapshot that triggered rejection. The next optimizer pass reads those rejected diffs and is prompted to avoid them.

STEP 5

Post-promotion sanity check

Promoted? The pointer flips atomically. Cache header restores to public, max-age=300, must-revalidate. The previous current moves to retired but stays queryable for 30 days from /admin/audit — every auto-promote is reversible. Operator should pull up the per-version panels for the next 24 hours and confirm the conversion lift holds at 100% traffic, not just at 20%.
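The promotion and the cache-header switch can be sketched as two pure functions over the version list. This is an illustration under assumptions: `VersionRef`, `promote`, and `cacheControlFor` are hypothetical names, and "atomic" is modeled here as computing the whole next state in one step so it can be written in a single store operation.

```typescript
type Status = "candidate" | "current" | "retired";

interface VersionRef {
  hash: string;
  status: Status;
  retiredAt?: string; // ISO timestamp, used to enforce the 30-day audit window
}

// Compute the next state in one pass: candidate becomes current,
// the old current becomes retired (and stays available for rollback).
function promote(versions: VersionRef[], candidateHash: string, now: Date): VersionRef[] {
  const exists = versions.some((v) => v.hash === candidateHash && v.status === "candidate");
  if (!exists) throw new Error(`no candidate with hash ${candidateHash}`);
  return versions.map((v): VersionRef => {
    if (v.hash === candidateHash) return { hash: v.hash, status: "current" };
    if (v.status === "current") return { ...v, status: "retired", retiredAt: now.toISOString() };
    return v;
  });
}

// While a candidate is in flight the response must not be shared by caches.
function cacheControlFor(versions: VersionRef[]): string {
  return versions.some((v) => v.status === "candidate")
    ? "private, no-store"
    : "public, max-age=300, must-revalidate";
}
```

Because the retired entry keeps its hash, reversing an auto-promote is the same operation run in the other direction.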

STEP 6

Per-skill effect size > corpus average

This is the test that matters. The paper's central insight: aggregate accuracy is the wrong unit. You can ship a candidate that raises the corpus mean by 2pp while making three high-traffic skills measurably worse. The cron decides per-skill; the operator's job is to spot-check that the effect size on this skill exceeds the corpus average over the last 14 days. If it doesn't, the promotion is technically valid but probably a wash.
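One way to make this spot-check concrete is to compare the skill's lift, in percentage points, against the mean lift of recent promotions. That reading of "corpus average" is an interpretation, not something the page pins down, and `liftPp` and `exceedsCorpusAverage` are hypothetical names.

```typescript
interface Promotion {
  skill: string;
  before: number; // conversion ratio before promotion, e.g. 0.10
  after: number;  // conversion ratio after promotion, e.g. 0.15
}

// Effect size in percentage points.
function liftPp(p: Promotion): number {
  return (p.after - p.before) * 100;
}

// True when this skill's lift beats the mean lift of recent promotions
// (the caller is assumed to pass the last 14 days of promotions as the corpus).
function exceedsCorpusAverage(target: Promotion, corpus: Promotion[]): boolean {
  if (corpus.length === 0) return false; // nothing to compare against
  const mean = corpus.reduce((sum, p) => sum + liftPp(p), 0) / corpus.length;
  return liftPp(target) > mean;
}
```

A promotion that passes the cron but returns false here is the "technically valid but probably a wash" case described above.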

STEP 7

Failure modes to watch

07 Key insights from the paper

These are the bullets that shaped every architectural choice above:

The validation gate is the only thing that matters. The best skills in the paper landed with just 1-4 accepted edits total — everything else was rejected by the gate.

Bounded edits beat full rewrites. 4-8 atomic ops per step is the sweet spot. Bigger steps regress more than they help.

Compactness wins. The median final skill in SkillOpt was ~920 tokens. Our static gate targets that number; warns at 1200; rejects at 1500.

The description is the router; the body is the agent. They are two surfaces and they can drift apart. Only end-to-end tests catch it. On skillmake we proxy with description ↔ whenToUse keyword coherence and then let the conversion gate make the final call.

Aggregate accuracy is the wrong unit. Per-skill effect size is where the action is. Hence the cron decides per-skill and the operator review checks per-skill effect against corpus average.

Frozen model + trained context is the practical adaptation. We are not training Claude. We are training the document Claude reads.

08 Current state snapshot

Live as of 2026-05-27. Marketplace data from GET /api/marketplace; install counts from Cloudflare Analytics Engine skillmake_metrics dataset (all-time).

Total skills: 94
Engineers: 43
Creators: 23
DevOps: 8
Design: 6
Marketing: 6
AI: 4
General: 4

Top 10 skills by install count (all-time)

Sourced live from Cloudflare Analytics Engine via the leaderboard query in docs/analytics-queries.md.

  1. superpowers: 773
  2. anthropic-webapp-testing: 689
  3. karpathy-claude-md: 665
  4. ui-ux-pro-max: 664
  5. podcast-show-notes: 654
  6. anthropic-skill-creator: 641
  7. video-transcript-to-blog: 637
  8. shadcn-ui-skill: 625
  9. mp-triage: 521
  10. mp-zoom-out: 492

These ten are the ones with enough traffic to drive a real A/B test against the conversion gate without falling back to skill-judge. At ~50-100 installs/day each, the candidate arm sees only 10-20 installs/day at 20% share, so the top entries clear the 200-installs-per-arm threshold in roughly 10-20 days. Entries at the slow end of that range need a higher candidate share to finish inside the 14-day window. The long tail will lean on the LLM-judge fallback.
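The time-to-threshold arithmetic is easy to check with a small helper. `daysToThreshold` is a hypothetical name, and the sketch assumes installs split between arms in proportion to the traffic share, so the smaller arm is the bottleneck.

```typescript
// Days until the slower arm reaches minPerArm installs.
function daysToThreshold(installsPerDay: number, candidateShare: number, minPerArm = 200): number {
  // The arm with the smaller share of traffic fills up last.
  const smallerShare = Math.min(candidateShare, 1 - candidateShare);
  if (installsPerDay <= 0 || smallerShare <= 0) return Infinity;
  return Math.ceil(minPerArm / (installsPerDay * smallerShare));
}
```

With these assumptions, a skill at 100 installs/day and a 20% candidate share needs 10 days, one at 50 installs/day needs 20, and raising the share to 50% at 100 installs/day brings it down to 4.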

09 Sample SQL queries

Both queries below run against the Cloudflare Analytics Engine SQL API. Full recipe collection lives in docs/analytics-queries.md.

Install leaderboard (all-time)

SELECT
  blob2 AS slug,
  sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit' AND blob2 != ''
GROUP BY slug
ORDER BY installs DESC
LIMIT 50

Funnel — same-day home view → install

WITH
  home AS (
    SELECT blob6 AS visitor, toStartOfDay(timestamp) AS day
    FROM skillmake_metrics
    WHERE index1 = 'home_view'
      AND timestamp >= NOW() - INTERVAL '30' DAY
    GROUP BY visitor, day
  ),
  installs AS (
    SELECT blob6 AS visitor, toStartOfDay(timestamp) AS day
    FROM skillmake_metrics
    WHERE index1 = 'install_hit'
      AND timestamp >= NOW() - INTERVAL '30' DAY
    GROUP BY visitor, day
  )
SELECT
  home.day AS day,
  count(DISTINCT home.visitor) AS home_visitors,
  count(DISTINCT installs.visitor) AS installers,
  round(count(DISTINCT installs.visitor) / count(DISTINCT home.visitor) * 100, 2) AS pct
FROM home
LEFT JOIN installs ON home.visitor = installs.visitor AND home.day = installs.day
GROUP BY day
ORDER BY day ASC

After Phase 1 lands, every per-version query is the same shape with SPLIT(blob2,'@')[1] AS hash tacked on for the split.