Why the average Claude Code skill scores 57 out of 100
The mid-corpus average from SkillCheck’s v3.20.0 ecosystem scan was 57. At 20 points per critical and 5 per warning, that means the average community Claude Code skill has either one critical issue and four or five warnings, or no criticals and eight or nine warnings. Both shapes show up in the data. I wrote about the scan itself last week in “How does skills ecosystem look like right now”: 2,139 SKILL.md files, twelve community repos, and what the corpus looks like at v3.20.0.
This piece answers the natural follow-up: what is being scored, and how.
57 is not low because the rubric is harsh. It’s what falls out when you score a skill against twenty-five check categories assembled over five months from fourteen named sources. The methodology page documents all of it. This is the narrative version.
Phase I: foundation checks
I started building SkillCheck in December 2025. Ten check categories. Four trace directly to the agentskills.io specification: structure, body, naming, and token budget. The remaining six categories are mostly my own work, with narrow anchors in external standards where specific rules fit existing patterns.
Visual is half WCAG. Two of its four rules come from WCAG 2.2 (color contrast and alt-text requirements). The other two rules are mine: detecting AI-default purple gradients and generic-font usage. Technical (the category that catches secrets, injection risks, PII, version pinning, and a few other production hazards) borrows one rule from MCP best practices, specifically the protocol-version requirement. The other six technical rules are mine.
Semantic consistency, anti-slop pattern detection, quality patterns, and enterprise readiness are entirely mine. Together with the parts of visual and technical I wrote myself, that’s most of Phase I.
This was the version that caught the lowest-hanging fruit. A SKILL.md without a description field. A markdown body that ran past the recommended 5,000-token mark. A naming collision with a built-in skill. The published spec said these were problems, and v1.0 said so deterministically.
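To make the three examples above concrete, here is a minimal sketch of what deterministic foundation checks could look like. It is not SkillCheck's code: the function names, the placeholder built-in name, the naive `key: value` frontmatter parsing, and the four-characters-per-token estimate are all assumptions made for illustration.

```python
# Hypothetical placeholder; the real list of built-in skill names is not given here.
BUILT_IN_NAMES = {"example-builtin"}
TOKEN_LIMIT = 5000  # the recommended body budget mentioned above


def split_frontmatter(text):
    """Split a SKILL.md into naive `key: value` frontmatter fields and the body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            fields = {}
            for line in lines[1:i]:
                key, sep, value = line.partition(":")
                if sep:
                    fields[key.strip()] = value.strip()
            return fields, "\n".join(lines[i + 1:])
    return {}, text  # opening fence without a closing fence


def foundation_findings(text):
    """Return the foundation problems found in one SKILL.md, as plain strings."""
    fields, body = split_frontmatter(text)
    findings = []
    if not fields.get("description"):
        findings.append("missing description")
    if len(body) / 4 > TOKEN_LIMIT:  # assumption: roughly 4 characters per token
        findings.append("body over token budget")
    if fields.get("name") in BUILT_IN_NAMES:
        findings.append("name collides with built-in")
    return findings


clean = "---\nname: invoice-tables\ndescription: Extracts tables from invoices.\n---\nShort body."
print(foundation_findings(clean))
```

The point of the sketch is the shape: each check is a yes-or-no question about the file, so the same input always produces the same findings.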
That version did its job. What it didn’t do was catch the things I actually wanted it to catch.
Phase II: what production teaches you
Three months later, in March 2026, I added eight more categories based on what I was learning from running checks across my own portfolio. Quality Pro covered the deeper structural patterns that the spec didn’t name. Workflow covered what the skill actually does in production, which often differs from what its description claims. Reference integrity. Eval readiness. Orchestration safety. Autonomy boundaries. Composability. Observability.
These are the categories that nobody writes a spec for, because they only become visible after you’ve shipped a few skills and watched them collide. A skill that passes every Phase I check can still fail in production because its trigger words overlap with another skill, or because it claims to be deterministic and isn’t, or because it loads three reference files when one would do. Phase II is what showed up in audit logs after Phase I shipped clean skills into the wild.
Phase III: the field reports
In April 2026 I added three more categories from observed failures and two pieces of cross-lab methodology. Design pattern classification, because the same skills kept being miscategorized between “reviewer” and “generator,” and the wrong category leads to the wrong eval. The classification itself (Reviewer, Generator, Inversion, Pipeline, Tool Wrapper) is adapted from Google’s ADK agent-type hierarchy. Trigger collision detection, because skill libraries grow into namespace clashes the way directories grow into deep nesting. Eval kit completeness, because a skill without an eval is a vibe with confidence. The eval-kit category encodes OpenAI’s approach to systematic agent skill testing, which auto-generates should-trigger and should-NOT-trigger prompt pairs.
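One way to picture trigger collision detection is as a pairwise overlap test across a skill library. The sketch below uses Jaccard similarity on sets of trigger words. That technique, the 0.5 threshold, and the skill names are assumptions chosen for illustration, not a description of how SkillCheck implements the category.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of trigger words two skills have in common, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_collisions(triggers, threshold=0.5):
    """Return (skill_a, skill_b, overlap) for every pair at or above the threshold."""
    hits = []
    for name_a, name_b in combinations(sorted(triggers), 2):
        overlap = jaccard(triggers[name_a], triggers[name_b])
        if overlap >= threshold:
            hits.append((name_a, name_b, overlap))
    return hits


# Hypothetical library; the skill names and trigger words are made up.
library = {
    "pdf-review": {"review", "pdf", "check"},
    "doc-review": {"review", "doc", "check"},
    "changelog-writer": {"changelog", "release", "write"},
}
print(find_collisions(library))
```

Here the two review skills share two of four distinct trigger words, so they are flagged as a pair, while the changelog skill shares nothing with either.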
Each Phase III category was added because a specific failure mode showed up in a specific skill audit and I wanted to never see that failure again. The categories are field reports compressed into deterministic checks, anchored to lab-published methodology where I could find it.
Phase IV: the empirical anchor
Phase IV is one category, knowledge density, and it’s the most empirically grounded check in the whole tool. The principle behind it is simple: when an agent has a list of things to choose from, the words you wrote to describe each thing are what it reads to decide. Vague descriptions get skipped. Precise descriptions get picked. This holds whether the things are MCP tools or skills or anything else with a description field.
Two arXiv papers measured the effect at scale on MCP tools, which is the domain where thousands of measurable examples already exist.
Hasan et al. (arXiv:2602.14878) scanned 856 MCP tool descriptions across 103 servers and found that 97% of them contained at least one description smell, which the authors define as a pattern that makes a tool harder for an agent to select correctly. The paper’s findings shaped several of the rules in SkillCheck’s knowledge density check.
Wang et al. (arXiv:2602.18914) ran a different experiment across 10,831 MCP servers. They measured how often an agent picks the right tool when faced with a list of MCP tools to choose from. Descriptions that followed a standard schema (purpose, when-to-use, when-not-to-use, examples) got picked 72% of the time. Descriptions that didn’t got picked 20% of the time. The fifty-two-point gap is the entire reason knowledge density exists as a category of its own and not a subsection of the body check.
The same mechanism applies to skill descriptions. A SKILL.md whose description doesn’t tell the agent what the skill does and when to use it gets skipped, even when it’s the right skill for the job. Knowledge density is the check category that catches this.
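As a rough illustration of checking a description against the schema elements mentioned above (when-to-use, when-not-to-use, examples), here is a keyword-based sketch. The cue phrases and function name are invented for this example, purpose is deliberately left unchecked because keywords cannot judge it, and a real knowledge density rule would need to be far less naive.

```python
# Hypothetical cue phrases for three of the schema elements; illustrative only.
CUES = {
    "when_to_use": ("use when", "use this when", "use for"),
    "when_not_to_use": ("do not use", "don't use", "not for"),
    "examples": ("e.g.", "for example", "example:"),
}


def missing_elements(description):
    """List the schema elements for which no cue phrase appears in the description."""
    text = description.lower()
    return [name for name, cues in CUES.items() if not any(c in text for c in cues)]


vague = "Helps with PDFs."
precise = (
    "Extracts tables from PDF invoices. Use when the user uploads an invoice. "
    "Do not use for scanned images. Example: 'pull line items from invoice.pdf'."
)
print(missing_elements(vague))
print(missing_elements(precise))
```

The vague description is missing all three elements; the precise one is missing none, which is the difference an agent sees when it reads a list of descriptions to choose from.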
I cite these papers in the methodology page and in the rule descriptions inside the tool itself, because the alternative is to claim I came up with knowledge density on my own. I did not. Hasan and Wang did the work; I encoded their findings.
Phase V: the marketplace year
Phase V arrived in April 2026 with three categories. Agent integration readiness, marketplace governance, memory governance. These are the categories that became relevant when Anthropic shipped plugin marketplaces and built-in memory in late April. Until those features existed, the categories were theoretical. The day they shipped, a thousand skills suddenly needed to declare things about themselves that they hadn’t previously needed to.
Each Phase V category leans on multiple sources. Agent integration readiness prices MCP servers along Sam Morrow’s four axes (token efficiency, security, unique unlocks, execution environment), populated with Anthropic’s six production patterns. Marketplace governance prices plugin marketplaces against Anthropic’s reference schema, with governance requirements (named maintainers, change-gate evals, deprecation paths) the reference itself doesn’t yet ship. Memory governance is design-locked against Anthropic’s managed-agents memory primitives.
Marketplace governance and memory governance will keep evolving for the rest of 2026. Phase V is the phase that will keep moving.
The scoring math
Every check fires at one of three severities. Critical issues subtract 20 points. Warnings subtract 5. Suggestions subtract 1. Every skill starts at 100.
The reason for the asymmetric weighting is calibration against a known reference. A skill with one critical and no warnings should score 80. A skill with four warnings and no criticals should score 80. The math says those skills are equally broken, and in practice that turns out to be approximately true.
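The arithmetic above fits in a few lines of Python. The three penalties come straight from the text. The function name and the floor at zero are assumptions, not SkillCheck's actual implementation.

```python
# Point deductions per severity, as stated above.
PENALTIES = {"critical": 20, "warning": 5, "suggestion": 1}


def score(criticals=0, warnings=0, suggestions=0):
    """Start at 100 and subtract per finding. The floor at 0 is an assumption."""
    deduction = (
        criticals * PENALTIES["critical"]
        + warnings * PENALTIES["warning"]
        + suggestions * PENALTIES["suggestion"]
    )
    return max(0, 100 - deduction)


print(score(criticals=1))                            # 80
print(score(warnings=4))                             # 80
print(score(criticals=1, warnings=4, suggestions=3)) # 57
print(score(warnings=8, suggestions=3))              # 57
```

The last two calls show two different mixes of findings that both land on 57 under these weights.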
The 57 from the opening is what falls out when you run this scoring across the 2,139 community skills SkillCheck scanned in May 2026. The corpus has three rough population shapes. Skills that don’t validate as agent-readable at all score low (multiple criticals, often missing the description field or the frontmatter that the spec requires). Skills that pass Phase I cleanly but break on Phase II categories score in the middle (many warnings about workflow, eval readiness, orchestration safety). Skills that nail the foundation and most of production score high but still drop points on knowledge density or reference integrity. Most of the corpus sits in the middle shape.
What v3.20 actually means in calendar time
The versioning isn’t decorative. Every minor bump corresponds to a category being added or a rule being refined. v3.18.0 was the version that started scanning the public Claude Code ecosystem. v3.20.0 added marketplace governance and memory governance because the marketplace and memory features had just shipped. By the time you read this, there will probably be a v3.21 or v3.22, because the spec landscape is moving fast and a tool that doesn’t move with it is just a snapshot.
The five months between December 2025 and April 2026 are not a long time. They are also not nothing. The categories that exist at v3.20 are the categories that survived being run against real skills, written by real people, in real codebases.
The methodology page is the audit trail
Every category traces back to something concrete. The agentskills.io spec for structure, body, naming, and token budget. WCAG 2.2 for accessibility rules inside the visual category. MCP best practices for protocol versioning inside the technical category. Hasan and Wang on description quality, measured across thousands of MCP tools. OpenAI’s approach to systematic agent skill evaluation. Google’s ADK agent-type hierarchy. Glama’s industry quality benchmark. Sam Morrow’s framework for pricing MCP servers. Anthropic’s production writeups and the memory and marketplace specs they shipped in late April. The audit logs from my own skills that broke in specific ways.
Fourteen named sources informing twenty-five categories. The categories that don’t have a public source (anti-slop, semantic consistency, quality patterns, enterprise readiness, and most of the visual and technical rules) come from my own work. The methodology page lists all of it, mapped to the categories it informed.
The full version is at getskillcheck.com/methodology.
