
Claude Code Skills Just Got a MASSIVE Upgrade

Claude Code just got a massive upgrade, and it's changing how developers build AI skills forever.

Chase H. explains that until now, creating custom skills in Claude Code came with three persistent problems: no systematic way to test them, unreliable skill triggering, and no method to measure improvement as models evolve. Anthropic's new Skill Creator skill solves all three simultaneously.

What Are Claude Skills

Skills are text prompts that tell Claude Code how to perform specific tasks in specific ways. They're not plugins or integrations—they're instructions that shape how the AI behaves. Think of them as teaching Claude Code a new habit, not adding a new tool.

The key distinction matters because it determines how you'll evaluate them.
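Concretely, a skill lives in a SKILL.md file: YAML frontmatter carrying the skill's name and trigger description, followed by plain-language instructions. Here is a minimal sketch; the skill itself is invented for illustration, not one of Anthropic's official skills.

```markdown
---
name: weekly-report
description: >
  Use when the user asks for a weekly status report. Gathers the week's
  changes and formats them into the team's standard report template.
---

When asked for a weekly report:
1. Summarize the week's changes in plain language.
2. Group items under "Shipped", "In progress", and "Blocked".
3. Keep the whole report under 300 words.
```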

The Two Types of Skills

The first type is capability uplift: a skill that makes Claude Code better at something it already struggles with. A good example is the front-end design skill. Without it, Claude Code produces generic AI-slop websites. With it, the outputs improve significantly. Most official Anthropic skills fall into this category: PDF creation, the PowerPoint builder, and DOCX creation.

The second type is encoded preference: a workflow skill that tells Claude Code to do things it already knows how to do, but in a specific order or format. The YouTube pipeline skill demonstrates this: searching YouTube, uploading to NotebookLM, analyzing the results, then creating deliverables like infographics or slide decks. Claude Code can do each step individually, but this skill chains them together in a precise sequence.

The distinction matters because capability uplift skills may become obsolete as models improve. An encoded preference skill is more durable, but only as valuable as its fidelity to your actual workflow—and that's what evaluation verifies.
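To make the contrast concrete, the body of an encoded-preference skill like that pipeline might read something like the sketch below. The step wording and skill name are hypothetical, not the actual skill.

```markdown
---
name: youtube-research-pipeline
description: >
  Use when the user wants YouTube research turned into a deliverable
  such as an infographic or slide deck.
---

Always perform these steps in order; never skip or reorder them:
1. Search YouTube for videos matching the user's topic.
2. Upload the selected videos to NotebookLM as sources.
3. Analyze the sources and extract the key claims.
4. Create the requested deliverable (infographic or slide deck).
```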

How Testing Works Now

Skill Creator introduces three new testing capabilities that didn't exist before.

First, catching regressions: when a model improves, the skill might suddenly make outputs worse because the underlying capability has shifted. The benchmark mode compares performance with and without the skill, showing exactly where quality changes.

Second, multi-agent support: developers can run multiple tests simultaneously—five, six, seven, eight versions at once—without lengthy manual processes.

Third, trigger optimization: skills aren't preloaded into the system prompt. Claude Code maintains a list of all available skills with roughly 100-word descriptions. As your skill library grows, those descriptions become critical. Too broad causes false triggers; too narrow means the skill never fires. The new tool helps tune these descriptions precisely so triggering becomes reliable.
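For example, the gap between an over-broad description and a well-scoped one can be small. Both descriptions below are invented for illustration:

```yaml
# Too broad: this fires on almost any document-related request.
description: Helps with documents.
---
# Tuned: scoped so it triggers reliably, and only when intended.
description: >
  Use when the user asks to fill out, flatten, or extract fields from a
  PDF form. Not for creating new PDFs from scratch.
```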

The PDF form example shows a dramatic difference in outcomes: tokens used, pass rate, and total time are all now measurable.
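Here is a minimal sketch of the kind of A/B harness this implies. Everything in it is a stand-in, not Skill Creator's actual implementation: run_task() simulates a Claude Code run, and the numbers are invented, but the metrics mirror the ones in the demo.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    tokens: int
    seconds: float

def run_task(case: str, use_skill: bool) -> Result:
    """Stand-in for one Claude Code run; replace with a real harness call."""
    start = time.time()
    # Simulated outcome: the skill raises pass rate and lowers token use.
    passed = random.random() < (0.9 if use_skill else 0.6)
    tokens = random.randint(800, 1200) if use_skill else random.randint(1500, 2500)
    return Result(passed, tokens, time.time() - start)

def benchmark(cases: list[str], use_skill: bool, workers: int = 8) -> dict:
    """Run cases in parallel (the multi-agent part) and aggregate metrics."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: run_task(c, use_skill), cases))
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "avg_tokens": sum(r.tokens for r in results) / len(results),
        "total_seconds": sum(r.seconds for r in results),
    }

cases = [f"fill PDF form #{i}" for i in range(20)]
print("with skill:   ", benchmark(cases, use_skill=True))
print("without skill:", benchmark(cases, use_skill=False))
```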

Critics might note that for non-developers, the evaluation process still requires understanding test design principles. The tool removes code writing but doesn't remove the need to know what constitutes a good test.

Loading and Using Skill Creator

To install Skill Creator, users run /plugin in Claude Code and search for "skill creator." After installation, they exit and restart Claude Code. Once ready, asking "what can the skill creator skill do for me" returns four capabilities: creating new skills from scratch, modifying existing ones, running evals and benchmarks, and optimizing trigger descriptions.
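The whole install-and-verify loop looks roughly like this; exact menus and wording may differ by version:

```
# Inside a Claude Code session:
/plugin                # open the plugin manager, search "skill creator", install
# exit and restart Claude Code, then ask:
what can the skill creator skill do for me
```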

The YouTube pipeline example demonstrates how it works in practice: creating a new skill that chains multiple steps into one cohesive workflow. In plan mode, users see exactly what the skill will build before anything is built. Then evaluation runs to verify fidelity: is Claude Code actually following all nine steps in the correct order? In the demo, the test passed.
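One way to picture that fidelity check is as an ordered-subsequence test over a log of actions. The action names and log format below are hypothetical, not Skill Creator's internals:

```python
# Expected pipeline steps, in the order the skill prescribes.
EXPECTED = [
    "search_youtube",
    "upload_to_notebooklm",
    "analyze_sources",
    "create_deliverable",
]

def follows_in_order(actions: list[str], expected: list[str]) -> bool:
    """True if 'expected' appears as an ordered subsequence of 'actions'."""
    it = iter(actions)  # 'step in it' consumes the iterator up to each match
    return all(step in it for step in expected)

# Extra actions are fine, but skipped or reordered steps fail.
assert follows_in_order(
    ["plan", "search_youtube", "upload_to_notebooklm",
     "analyze_sources", "create_deliverable"], EXPECTED)
assert not follows_in_order(
    ["search_youtube", "analyze_sources", "upload_to_notebooklm"], EXPECTED)
print("fidelity checks passed")
```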

For capability uplift skills like front-end design, testing runs an A/B comparison (skill versus no skill), which reveals whether the skill still provides value or has been superseded by model improvements.

Anytime we can get away from the black box of AI and actually see what's happening under the hood is a win. You want control. You want consistency.

Bottom Line

The strongest argument: systematic testing removes the guesswork from skill development; you now know whether a skill actually works and whether it still matters as models evolve. The vulnerability: evaluation requires understanding what constitutes a meaningful test, which shifts some burden to the user even while removing code requirements. Watch for Anthropic's official documentation on eval design, which is coming soon, and for community benchmarks that will standardize how skills are measured.

Transcript

If you have spent any time at all inside of Claude Code, then you know that skills are one of its most powerful features by far. But if you start creating your own custom skills, you realize there are still a few drawbacks. Namely, there is no systematic way to test or improve these skills. And we still have the issue where Claude Code doesn't even trigger the right skill half the time, if at all.

But yesterday, Anthropic dropped something that fixes all three of those problems and gives us a simple way to improve all of our current skills and all the ones we're going to build in the future. And that solution is the new and improved Skill Creator skill. What this does is help us write evals, run benchmarks, and keep our skills working in the future as these models' capabilities evolve. Now, this is a big deal because up until now, we had, as I said before, no systematic way to do this.

And what Anthropic is buying us with this new Skill Creator is the ability to apply some of the rigor of software development, primarily testing, benchmarking, and iterative improvement, without you having to write any code. You're no longer in the dark when it comes to asking yourself: is my skill actually doing what it's supposed to do in the most optimal manner possible? We can now test all of that. And before we go into the details of how this is all working under the hood, let's look at some of the outcomes this is going to give us.

Here's an example of the sort of benchmarking you can now do. We can run A/B tests with the skill and without the skill. We can run parallel agents, so we have multiple tests running at the same time.

And so right here we're taking a look at the PDF creation skill, which is one of Claude Code's official plugins. And you can see how it performs with the skill versus without the skill. We can see the tokens, the pass rate, the total time. We have a bunch of insights now to make real decisions.

Here's another example showing some of the optimizations we can do with this new skill creator tool. Like I alluded to before, you've probably had issues where you tell Claude Code or assume Claude ...