Claude Code just got a massive upgrade, and it's changing how developers build AI skills forever.
Chase H. explains that until now, creating custom skills in Claude Code came with three persistent problems: no systematic way to test them, unreliable skill triggering, and no method to measure improvement as models evolve. Anthropic's new Skill Creator skill solves all three simultaneously.
What Are Claude Skills
Skills are text prompts that tell Claude Code how to perform specific tasks in specific ways. They're not plugins or integrations—they're instructions that shape how the AI behaves. Think of them as teaching Claude Code a new habit, not adding a new tool.
The key distinction matters because it determines how you'll evaluate them.
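Concretely, a skill is just a markdown file with a short frontmatter header. A minimal sketch of what one might look like (the SKILL.md layout with `name` and `description` fields follows Anthropic's published convention; the skill itself is hypothetical):

```markdown
---
name: commit-message-style
description: Use when writing git commit messages. Enforces this team's
  conventional-commit format and 72-character subject limit.
---

# Commit Message Style

When asked to write a commit message:
1. Use the format `type(scope): subject`.
2. Keep the subject under 72 characters.
3. Explain the "why" in the body, not the "what".
```

Nothing here adds a capability in the plugin sense; it only shapes behavior Claude Code already has, which is the "new habit, not new tool" point.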
The Two Types of Skills
The first type is capability uplift—a skill that makes Claude Code better at something it already struggles with. A good example is the front-end design skill. Without it, Claude Code produces generic AI-slop websites. With it, the outputs improve significantly. Most official Anthropic skills fall into this category: PDF creation, PowerPoint (.pptx) building, and Word (.docx) generation.
The second type is encoded preference—a workflow skill that tells Claude Code to do things it already knows how to do, but in a specific order or format. The YouTube pipeline skill demonstrates this: searching YouTube, uploading to Notebook LM, analyzing the results, then creating deliverables like infographics or slide decks. Claude Code can do each step individually, but this skill chains them together in a precise sequence.
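Hypothetically, the body of such a workflow skill is little more than an ordered checklist. A sketch (step names paraphrased from the example above, not the actual skill file):

```markdown
# YouTube Research Pipeline

Follow these steps in order; do not skip or reorder them:
1. Search YouTube for the requested topic.
2. Upload the selected videos to NotebookLM.
3. Analyze the combined sources.
4. Produce the requested deliverable (infographic, slide deck, etc.).
```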
The distinction matters because capability uplift skills may become obsolete as models improve. An encoded preference skill is more durable, but only as valuable as its fidelity to your actual workflow—and that's what evaluation verifies.
How Testing Works Now
Skill Creator introduces three new testing capabilities that didn't exist before.
First, catching regressions: when a model improves, the skill might suddenly make outputs worse because the underlying capability has shifted. The benchmark mode compares performance with and without the skill, showing exactly where quality changes.
Second, multi-agent support: developers can run multiple tests simultaneously—five, six, seven, eight versions at once—without lengthy manual processes.
Third, trigger optimization: skills aren't preloaded into the system prompt. Claude Code maintains a list of all available skills with roughly 100-word descriptions. As your skill library grows, those descriptions become critical. Too broad causes false triggers; too narrow means the skill never fires. The new tool helps tune these descriptions precisely so triggering becomes reliable.
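To see why the wording matters, compare two hypothetical frontmatter descriptions for the same skill. The first over-triggers on anything document-related; the second scopes the trigger to the actual task:

```yaml
# Too broad: fires on almost any documentation request
description: Helps with writing documents.

# Tuned: fires only when the request matches the skill's real job
description: Use when generating release notes from a git changelog.
  Formats entries by category and links each item to its PR.
```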
The PDF form example shows a dramatic difference in outcomes: tokens used, pass rate, and total time are all now measurable.
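The arithmetic behind a with-versus-without comparison is simple. A minimal Python sketch, not the tool's actual implementation (the `RunResult` fields mirror the three metrics above):

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool     # did the output meet the eval's criteria?
    tokens: int      # tokens consumed by the run
    seconds: float   # wall-clock time for the run

def summarize(results: list[RunResult]) -> dict[str, float]:
    """Aggregate pass rate, mean token use, and mean time over a batch of runs."""
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "avg_tokens": sum(r.tokens for r in results) / n,
        "avg_seconds": sum(r.seconds for r in results) / n,
    }

def skill_delta(with_skill: list[RunResult], without: list[RunResult]) -> dict[str, float]:
    """Difference on each metric (with minus without). A pass_rate delta near
    zero suggests the base model has caught up and the skill may be obsolete."""
    a, b = summarize(with_skill), summarize(without)
    return {k: round(a[k] - b[k], 3) for k in a}
```

A benchmark like this is what catches the regression case described earlier: rerun it after a model upgrade and watch whether the deltas shrink or flip sign.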
Critics might note that for non-developers, the evaluation process still requires understanding test design principles. The tool removes code writing but doesn't remove the need to know what constitutes a good test.
Loading and Using Skill Creator
To install Skill Creator, users run the /plugin command in Claude Code and search for "skill creator." After installation, they exit and restart Claude Code. Once it's ready, asking "what can the skill creator skill do for me" returns four capabilities: creating new skills from scratch, modifying existing ones, running evals and benchmarks, and optimizing trigger descriptions.
The YouTube pipeline example demonstrates how it works in practice—creating a new skill that chains multiple steps into one cohesive workflow. In plan mode, users see exactly what the skill will build before it builds it. Then evaluation runs to verify fidelity: is Claude Code actually following all nine steps in the correct order? The test passed.
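One way to frame that fidelity check: did every expected step appear in the execution trace, in the right order? A hypothetical sketch (the checking logic is illustrative, not how Skill Creator works internally):

```python
def follows_order(expected: list[str], observed: list[str]) -> bool:
    """True if every expected step appears in observed, in the same relative
    order (extra intermediate steps are allowed)."""
    trace = iter(observed)
    # `step in trace` scans the iterator forward, so each match must occur
    # after the previous one -- exactly the ordering constraint we want.
    return all(step in trace for step in expected)
```

For the nine-step pipeline, `expected` would hold the nine step names and `observed` the steps Claude Code actually executed during the eval run.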
For capability uplift skills like front-end design, the testing shows AB comparison—skill versus no skill—which reveals whether the skill still provides value or has been superseded by model improvements.
The more we can get away from the black box of AI and actually see what's happening under the hood, the better. You want control. You want consistency.
Bottom Line
The strongest argument: systematic testing removes the guesswork from skill development—you now know whether a skill actually works and whether it still matters as models evolve. The vulnerability: evaluation requires understanding what constitutes a meaningful test, which shifts some burden to the user even while removing code requirements. Watch for Anthropic's official documentation on eval design—it's coming soon—and watch for community benchmarks emerging that will standardize how skills are measured.