Writing Skills for AI Coding Tools (And Why Most Shouldn't Exist)

2026-02-18 · by Matthew van Bird

The premise

I use three AI tools in a normal week. Claude in the terminal as a daily driver. Copilot inside JetBrains for line-by-line autocomplete. ChatGPT for one-shot questions where I can't be bothered to open anything else. Each of them has its own version of skills - markdown files that teach the tool how you want a thing done, and that fire automatically when they're relevant.

The names are different. The shape is the same. And almost all of the ones I see people write shouldn't exist.

This is the bit nobody tells you when they show you the shiny demo. The mechanism is easy. Writing one that earns its place in your toolchain is much harder. Most of what gets called a "skill" or a "rule" or a "custom instruction" is somebody's pet prompt in a markdown costume.

What follows is what I've worked out from getting it wrong a few times across all of them, then getting annoyed enough to read the actual contracts.

What a skill actually is, across tools

Strip the marketing off each one and you're looking at the same thing in slightly different wrappers:

Claude Code - .claude/skills/<name>/SKILL.md with YAML frontmatter: name, description, optional allowed-tools. Claude reads the description and decides whether to invoke the skill.
Cursor - .cursor/rules/<name>.mdc with frontmatter: description, globs, alwaysApply. The model auto-loads rules whose globs match files in scope, or whose description Cursor decides is relevant.
GitHub Copilot - .github/copilot-instructions.md for repo-wide rules, plus .github/instructions/<name>.instructions.md with an applyTo glob for path-scoped ones.
ChatGPT / Custom GPTs - the Instructions field on a custom GPT, or per-project instructions. No glob - it fires whenever you talk to that GPT.

Three out of four are: a markdown file, with metadata that tells the tool when to fire, and a body that tells the tool what to do when it does. ChatGPT is the simplest because it ditches the routing - you pick the GPT, the rules fire. The others all do their own version of "scan the available skills, decide which (if any) apply right now".

The important word is decides. You aren't loading these manually. The tool decides whether your skill is relevant, based on the trigger you wrote - glob, description, or both. So the trigger is the most important thing in the file, by a wide margin.

A skill with a vague trigger is a ghost: present, never fired. Or worse, it fires when it shouldn't, dragging context into every unrelated request and quietly making the model slower, dumber and more expensive.

What they're useful for

The honest answer is: not as much as you'd hope, but the things they're good at, they're very good at.

Codifying a repeatable workflow you keep re-explaining. For example, why a missing [Authorize] attribute is not actually an issue, despite everything telling you it is, because our infrastructure deals with it.

Local conventions that aren't in the code itself. "Use this logger, not the framework one." "Don't add new exceptions to that namespace." "All IQueryable extensions go through this helper." Things that would normally live in a CONTRIBUTING.md no one reads, but actually take effect because the tool reads it every time it's relevant.

Domain mappings. Internal acronyms, the difference between the three things called Account in your codebase, which DB column is the canonical user ID. I built a domain-glossary rule for a client and watched a junior dev's onboarding drop from a fortnight to a long afternoon, because the model could answer the dumb questions and they could focus on the real ones.

Tool wrappers. A skill that says "when the user wants to query staging Postgres, here's the SSO dance and here's the read-only role to assume". Bounded, predictable, and on tools that support it you can control which other tools the skill is allowed to call.

Path-scoped rules where the tool supports them. Cursor's globs and Copilot's applyTo are useful. A rule that only fires inside infrastructure/**/*.tf can be brutally specific because it doesn't have to disclaim everything else.
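To make path scoping concrete, here is a minimal sketch of how a glob-scoped rule could decide whether it applies to a file. It is an approximation only: the helper names (glob_to_regex, rule_applies) are hypothetical, it handles just `**/` and `*`, and each tool's real matcher may differ in the details.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex.

    Assumed semantics (not any specific tool's): '**/' matches zero or more
    directories, '*' matches any run of characters except '/'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def rule_applies(path: str, globs: list) -> bool:
    """True if the file path matches at least one of the rule's globs."""
    return any(glob_to_regex(g).match(path) for g in globs)


TERRAFORM_RULE = ["infrastructure/**/*.tf"]
print(rule_applies("infrastructure/prod/vpc/main.tf", TERRAFORM_RULE))  # True
print(rule_applies("src/Api/Startup.cs", TERRAFORM_RULE))               # False
```

The point of the sketch is the shape of the decision: the rule body never has to say "ignore this if you're not in Terraform", because the glob already said it.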

What they're not useful for

This is the longer list, and the one most people skip.

Personality and style preferences. "Be concise." "Don't apologise." "Use British English." These belong in your top-level user instructions, not a per-skill file. Skills cost context every time they fire. A skill that exists to remind the model not to say "Certainly!" is not a skill, it is a regret.

General programming knowledge. "How to write good Python." "What dependency injection is." "Use SOLID principles." The model knows. You writing a skill that re-explains it is just noise that displaces useful tokens.

Things that change often. Skills are static markdown. If your "skill" needs to know the current state of the database, the open incidents, or what's in JIRA right now, you don't want a skill - you want a tool, an MCP server, or you want the model to just go and look.

Wikis. This is the failure mode I see most often. Someone writes a 2,000-line SKILL.md or .cursorrules titled engineering-standards with sixteen H1s and the entire internal API reference inlined. Then they're surprised that it makes the model slower, more expensive, and worse at the actual task. Long skills are a smell. If yours is over ~150 lines, it almost certainly wants to be split.

Things that should be in code or tests. If the rule is "never use Newtonsoft.Json", you don't need a skill - you need a Roslyn analyzer, or a Cursor rule with a glob scoped so that every .cs file, and nothing else, gets it. Better still, a build break. Skills aren't a substitute for enforcement; they're a substitute for reminders.
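The ~150-line smell test for wiki-style skills is itself something you can enforce rather than remember. A minimal sketch, assuming you want one ceiling for every skill-like file; the function names and the list of filename patterns are my own choices, not part of any tool:

```python
from pathlib import Path

MAX_LINES = 150  # the rough ceiling suggested above; tune to taste


def oversized(files: dict, max_lines: int = MAX_LINES) -> list:
    """Return (name, line_count) for every file body longer than max_lines."""
    counts = ((name, len(text.splitlines())) for name, text in files.items())
    return [(name, n) for name, n in counts if n > max_lines]


def load_skill_files(root: Path) -> dict:
    """Collect skill-like files under root (patterns are an assumption)."""
    patterns = ("SKILL.md", "*.mdc", "*.instructions.md")
    return {
        str(p): p.read_text(encoding="utf-8")
        for pattern in patterns
        for p in root.rglob(pattern)
    }


# In-memory demo; in CI you would pass load_skill_files(Path(".")) instead.
demo = {"engineering-standards/SKILL.md": "line\n" * 2000,
        "release-cut/SKILL.md": "line\n" * 60}
print(oversized(demo))  # [('engineering-standards/SKILL.md', 2000)]
```

Run in CI and fail the build on a non-empty result, and the "split it" conversation happens before the file reaches 2,000 lines.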

How to actually write one

There's a discipline to this that took me a few tries to internalise. It's the same on every tool I've used.

Start from the trigger, not the content. Before you write a single line of the body, write the description and (where the tool supports them) the globs. If you can't write a trigger that crisply names when this skill should fire and when it shouldn't, you don't have a skill yet. You have a feeling.

Bad description:

description: Helps with .NET stuff.

Good description:

description: |
  Use when migrating a .NET project between major versions (e.g. .NET 8 → 10).
  Handles target framework moniker, SDK pin, deprecated API rewrites, and
  CI workflow updates. Run this BEFORE editing csproj files manually.

One of those is doing real work. The other is a coin flip.
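If you want a tripwire for coin-flip descriptions, a crude lint goes a long way. The sketch below is a heuristic of my own, not anything the tools check: the word-count floor, the cue phrases and the vague-word list are all assumptions you should tune against your own skills.

```python
import re

MIN_WORDS = 12                                         # assumed floor
TRIGGER_CUES = ("use when", "when the user", "run this", "only for")
VAGUE_WORDS = ("stuff", "things", "helps with", "various", "general")


def description_problems(description: str) -> list:
    """Return human-readable complaints about a skill description."""
    text = " ".join(description.lower().split())  # normalise whitespace
    problems = []
    if len(text.split()) < MIN_WORDS:
        problems.append(f"shorter than {MIN_WORDS} words")
    if not any(cue in text for cue in TRIGGER_CUES):
        problems.append("never says when to fire")
    vague = [w for w in VAGUE_WORDS if re.search(rf"\b{re.escape(w)}\b", text)]
    if vague:
        problems.append("vague wording: " + ", ".join(vague))
    return problems


print(description_problems("Helps with .NET stuff."))
```

A clean result does not prove the trigger is good; it only means the description is not obviously a feeling.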

Write the body for a competent contractor, not a beginner. The model isn't your junior. It knows what an HTTP request is. It does not know that your service expects a custom X-Tenant header and will 500 with a misleading error if you forget. Skip the basics, hammer the specifics.

Use progressive disclosure where the tool lets you. This is the trick almost nobody uses and it changes everything. On Claude and on Cursor (with .mdc references), your top-level file should be lean - under 100 lines is a healthy target - and link out to sibling files for the heavy material:

.claude/skills/release-cut/
├── SKILL.md          (the brain - 60 lines)
├── reference.md      (full changelog format, label semantics)
├── examples/
│   ├── happy-path.md
│   └── hotfix.md
└── scripts/
    └── tag-and-push.sh

Inside SKILL.md:

> For the canonical changelog format and a worked example, read reference.md.
> For hotfix scenarios (out-of-band tags, no-changelog commits), read examples/hotfix.md.

The model will pull those in only when they're relevant to the current task. You get the depth without paying for it every time the skill fires. Copilot and ChatGPT don't really do progressive disclosure - for those, ruthless trimming is the only lever.
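Progressive disclosure only works if the links resolve, and a renamed sibling file fails silently. A minimal sketch of a check for that, assuming references are written as plain relative paths ending in .md or .sh; the regex and function names are hypothetical:

```python
import re
from pathlib import Path

# Relative paths such as reference.md or examples/hotfix.md (assumed style).
REFERENCE = re.compile(r"(?<![\w/.-])([\w-]+(?:/[\w.-]+)*\.(?:md|sh))\b")


def missing_references(skill_text: str, existing: set) -> list:
    """Return referenced sibling paths that are not in the existing set."""
    referenced = set(REFERENCE.findall(skill_text))
    return sorted(referenced - existing)


def files_under(skill_dir: Path) -> set:
    """Relative POSIX paths of every file inside a skill directory."""
    return {p.relative_to(skill_dir).as_posix()
            for p in skill_dir.rglob("*") if p.is_file()}


body = ("For the canonical changelog format, read reference.md. "
        "For hotfix scenarios, read examples/hotfix.md.")
on_disk = {"SKILL.md", "reference.md", "examples/happy-path.md"}
print(missing_references(body, on_disk))  # ['examples/hotfix.md']
```

On a real skill you would pass files_under(Path(".claude/skills/release-cut")) as the second argument.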

Be explicit about ordering and stopping conditions. "Do A, then B, then C. If C fails, do not proceed to D - surface the error and stop." Models will keep ploughing forward unless you tell them not to. A good skill has spine.

Narrow the blast radius where you can. On Claude, allowed-tools is the cheapest safety improvement available. On Cursor, narrow globs. On Copilot, narrow applyTo. If a skill should never run shell commands, say so in metadata, not just in prose.

A worked example

Below is the smallest skill that earns its keep, ripped from one I actually use. Shown in Claude format, but the same content maps cleanly to a Cursor .mdc or a Copilot applyTo-scoped instructions file.

---
name: csharp-pr-prep
version: 1.4.0
owner: "@matthewxbird"
description: |
  Use when the user asks to prepare a PR for a C#/.NET change. Verifies
  the diff respects our conventions (no exceptions for control flow, no
  Newtonsoft, public APIs documented), runs `dotnet format`, and writes
  a PR body in our standard template.
allowed-tools:
  - Read
  - Edit
  - Grep
  - Bash(dotnet --version)
  - Bash(dotnet format:*)
  - Bash(git diff:*)
  - Bash(git status)
  - Bash(git add:*)
disallowed-tools:
  - Bash(git push:*)
  - Bash(rm:*)
---

<!--
Changelog
1.4.0 - Add Newtonsoft.Json blocker, narrow Bash allowlist.
1.3.0 - Stop on first failure (was running all checks then summarising).
1.2.0 - Initial path-scoped version.
-->

## Non-goals
- Does **not** push to remote.
- Does **not** touch files outside `src/` or `tests/`.
- Does **not** edit CI workflows. If CI needs changing, surface and stop.

## Preconditions
- Branch is not main/master.
- `dotnet --version` resolves and matches `global.json`.

## Checks (in order, stop on first failure)
1. Run `dotnet format --verify-no-changes`. If it fails, run `dotnet format`
   and stage the result.
2. Grep the diff for `throw new ` inside `try`/`catch` pairs that aren't
   in `*Exception.cs` files. Surface any matches - these usually mean
   control-flow-by-exception. See `reference.md` for our rule.
3. Grep the diff for `Newtonsoft.Json`. We use `System.Text.Json`. Any
   match is a blocker.
4. For every new `public` symbol in `src/`, confirm there's an XML doc
   comment. Missing docs are a blocker.

## Dry-run
Before any `Edit` or `git add`, list the files you will modify and stop
for confirmation. Continue only after the user replies `yes`.

## PR body
Use the template in `templates/pr.md`. Title format: `<area>: <verb-phrase>`,
no Conventional Commit prefixes.

Sixty lines, three of which are non-goals and three of which are version metadata. Routes precisely. Refuses to do work it shouldn't. Cannot push to remote even if asked nicely. Tells the model where to read more if needed. That's the shape.

The same idea in Cursor format. Note version isn't part of Cursor's spec, but the file is still under git - tag it in the body's changelog comment:

---
description: Prepare a C#/.NET PR per our conventions.
globs:
  - "src/**/*.cs"
  - "tests/**/*.cs"
alwaysApply: false
---

<!-- version: 1.4.0 - owner: @matthewxbird -->

(same body)

And in Copilot, where the only metadata is the path scope:

---
applyTo: "src/**/*.cs,tests/**/*.cs"
---

<!-- csharp-pr-prep · version: 1.4.0 · owner: @matthewxbird -->

(same body)

The metadata schema changes. The discipline doesn't.

Other best practices

A handful of habits that took me a while to settle on, all earned the hard way.

Version the file. A version: x.y.z field in frontmatter (or a changelog comment for tools that ignore unknown keys) does two things: it lets your test harness pin against a known-good version, and it forces you to think about whether a change is meaningful before bumping. Bump on every change that could shift trigger or behaviour. Don't bump on a typo fix.

Declare an owner. When a skill misfires at 11pm during a release, whoever's on call needs to know who to wake up. owner: "@team-platform" (or a person, or a Slack handle) saves trees of "does anyone know who wrote this?" messages. It also nudges shared ownership - anonymous skills rot.

Be granular with tool access. allowed-tools: Bash is barely better than no restriction at all. Claude Code accepts patterns like Bash(dotnet format:*) - use them. Default to deny, then explicitly list the exact commands the skill needs. Add a disallowed-tools block for the things you want to guarantee never happen - Bash(git push:*), Bash(rm:*) - even when the model gets creative. Cursor and Copilot can't do this in metadata; on those, the body has to do the policing, which is weaker but better than nothing.
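The same allow and deny lists are useful on the testing side, where you can replay a recorded trace against them. The sketch below is an approximation for a test harness, not Claude Code's matcher: I am assuming a trailing `:*` means "this command, with any arguments" and that a pattern without it must match exactly. The function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_rule(rule: str) -> Tuple[str, Optional[str]]:
    """Split 'Bash(git push:*)' into ('Bash', 'git push:*'); 'Read' has no spec."""
    if rule.endswith(")") and "(" in rule:
        tool, spec = rule[:-1].split("(", 1)
        return tool, spec
    return rule, None


def rule_matches(rule: str, tool: str, command: str = "") -> bool:
    """Assumed semantics: 'prefix:*' is a prefix match, anything else is exact."""
    rule_tool, spec = parse_rule(rule)
    if rule_tool != tool:
        return False
    if spec is None:
        return True
    if spec.endswith(":*"):
        prefix = spec[:-2]
        return command == prefix or command.startswith(prefix + " ")
    return command == spec


def violations(calls: list, allowed: list, disallowed: list) -> list:
    """Return the (tool, command) calls that are denied or not explicitly allowed."""
    bad = []
    for tool, command in calls:
        denied = any(rule_matches(r, tool, command) for r in disallowed)
        permitted = any(rule_matches(r, tool, command) for r in allowed)
        if denied or not permitted:
            bad.append((tool, command))
    return bad


ALLOWED = ["Read", "Bash(dotnet format:*)", "Bash(git status)"]
DISALLOWED = ["Bash(git push:*)", "Bash(rm:*)"]
trace = [("Read", ""), ("Bash", "dotnet format --verify-no-changes"),
         ("Bash", "git push origin main")]
print(violations(trace, ALLOWED, DISALLOWED))  # [('Bash', 'git push origin main')]
```

Default-deny falls out of the `not permitted` branch: anything you did not list counts as a violation, which is the behaviour you want a test to shout about.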

Write the non-goals down. A ## Non-goals heading is one of the highest-leverage additions you can make to any skill. Models will helpfully expand scope unless told not to. "Does not push to remote. Does not touch CI." is two lines that prevent three classes of incident.

Add a dry-run phase for anything that writes. "Before any Edit or destructive Bash, list what you will change and stop." Costs nothing, prevents the most embarrassing failures. Skip it only for skills that are read-only by construction.

Cap context. "Read at most three sibling files; do not list directories." Without a ceiling, models will read every file they're allowed to and burn your budget. Stick a number on it.

Make steps idempotent where you can, and label them where you can't. Re-running shouldn't duplicate commits, double-add sections, or re-tag releases. When a step can't be idempotent - "run database migration 0042" - say so loudly and gate it behind the dry-run.
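For the "double-add sections" case, idempotence usually comes down to checking before writing. A minimal sketch, assuming markdown-style `## ` headings; ensure_section is a hypothetical helper of the kind a skill's script might call:

```python
def ensure_section(document: str, heading: str, body: str) -> str:
    """Append '## <heading>' and its body unless that heading already exists."""
    marker = f"## {heading}"
    if any(line.strip() == marker for line in document.splitlines()):
        return document  # already present: re-running changes nothing
    separator = "" if not document or document.endswith("\n") else "\n"
    return f"{document}{separator}\n{marker}\n{body.rstrip()}\n"


changelog = "# Changelog\n"
once = ensure_section(changelog, "1.4.0", "- Add Newtonsoft.Json blocker.")
twice = ensure_section(once, "1.4.0", "- Add Newtonsoft.Json blocker.")
print(once == twice)  # True
```

The second call is a no-op, which is the property you want from any step the model might repeat after a retry.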

Keep a changelog comment at the top of the body. Three lines beats git log for someone reading the skill cold:

<!--
1.4.0 - Add Newtonsoft.Json blocker, narrow Bash allowlist.
1.3.0 - Stop on first failure.
-->

Name skills by trigger, not by content. release-cut is good. our-release-process-and-changelog-format-v2 is a cry for help. The filename is the first thing a teammate sees in a directory listing; make it earn its space.

Co-locate fixtures and tests. If a skill has a test suite (and serious ones should - see below), put it next to the skill, not in a far-off tests/ tree. skills/csharp-pr-prep/SKILL.md, skills/csharp-pr-prep/tests/cases.json. When you edit the skill, the tests are right there. When you delete the skill, the tests go with it.

Testing them (and accepting they'll never be fully deterministic)

This is the part nobody does, and it's the part that separates a skill from a wish.

You cannot make an LLM call deterministic. Same prompt, same model, same temperature - different output. What you can do is squeeze the variance down to a point where the skill is reliable enough to depend on, and then verify that it stays there as you edit the skill, swap models, or change the surrounding tools. The honest target is something like: fires when it should ≥95% of runs, follows the prescribed steps ≥90% of runs, never calls a tool it isn't allowed to call. Pick your numbers; what matters is that you have numbers at all.
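Having numbers also means knowing how much a handful of runs can actually tell you. The sketch below uses the Wilson score interval, a standard statistical tool that I am bringing in here (it is not part of any skill tooling), to put a range around an observed pass rate:

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple:
    """Wilson score interval (about 95% for z=1.96) for a pass rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


low, high = wilson_interval(96, 100)
print(f"{low:.2f} to {high:.2f}")   # 0.90 to 0.98
low, high = wilson_interval(10, 10)
print(f"{low:.2f} to {high:.2f}")   # 0.72 to 1.00
```

Ten clean runs out of ten are consistent with a true rate as low as roughly 72%, so a ≥95% target needs on the order of a hundred runs before the evidence supports it.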

There are three categories of test, and you need all three:

1. Trigger tests. Does the skill fire when it should? Does it stay quiet when it shouldn't? This is the most important kind and the one almost everyone skips. Construct ~10 prompts that should activate the skill, and ~10 that look superficially similar but shouldn't. Run them. Measure the hit rate and the false-positive rate. If the hit rate is below 90% or the false-positive rate is above 10%, your description is wrong - fix the trigger, not the body.

2. Behaviour tests. When the skill fires, does it do the right things in the right order? For each canonical scenario (happy path, hotfix, malformed input, missing precondition), run the skill ~10 times and check the trace. What you're asserting on is mostly tool calls: "did it run dotnet format before checking for Newtonsoft.Json? Did it call Edit on any file outside src/? Did it stop when check 2 failed?" Tool-call assertions are far more stable than asserting on prose.

3. Negative tests. Give it a prompt that looks like the trigger but isn't, or an environment where the precondition fails. Assert that it refuses cleanly. "Surface the error and stop" needs to be a thing you actually verify.
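The two trigger numbers are easy to mix up, so it helps to compute them separately: hit rate over the prompts that should fire, false-positive rate over the ones that shouldn't. A minimal sketch, assuming each run is recorded as a (should_fire, did_fire) pair; the function name is hypothetical:

```python
def trigger_rates(runs: list) -> tuple:
    """runs: (should_fire, did_fire) pairs. Returns (hit_rate, false_positive_rate)."""
    positives = [fired for should, fired in runs if should]
    negatives = [fired for should, fired in runs if not should]
    if not positives or not negatives:
        raise ValueError("need both should-fire and should-not-fire runs")
    hit_rate = sum(positives) / len(positives)
    false_positive_rate = sum(negatives) / len(negatives)
    return hit_rate, false_positive_rate


runs = ([(True, True)] * 9 + [(True, False)]        # 9 of 10 expected fires hit
        + [(False, False)] * 8 + [(False, True)] * 2)  # 2 of 10 lookalikes fired
print(trigger_rates(runs))  # (0.9, 0.2)
```

You want the first number high and the second low; a skill can look fine on one and be a disaster on the other.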

The harness can be embarrassingly simple. For Claude, the Agent SDK lets you script a session and inspect the tool calls it made. A 40-line script is enough:

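import json
from collections import Counter
from dataclasses import dataclass, field

# Each case: {prompt, expect_skill, expect_tools, must_not_call}
with open("tests/csharp-pr-prep.json") as f:
    CASES = json.load(f)

@dataclass
class Trace:
    skills: list = field(default_factory=list)  # names of skills that fired
    calls: list = field(default_factory=list)   # [{"tool": "Bash", ...}, ...]

def run_claude(prompt: str) -> Trace:
    # Placeholder: script one session and parse the tool calls it made.
    raise NotImplementedError("wire this up to your own session runner")

results = Counter()
total = 0
for case in CASES:
    for _ in range(10):  # 10 runs per case
        trace = run_claude(case["prompt"])
        total += 1
        fired = "csharp-pr-prep" in trace.skills
        tools = [c["tool"] for c in trace.calls]

        if fired != case["expect_skill"]:
            results["trigger_miss"] += 1
        if any(t in case.get("must_not_call", []) for t in tools):
            results["forbidden_tool"] += 1
        if not set(case.get("expect_tools", [])).issubset(tools):
            results["missing_step"] += 1

print(results)
assert total > 0, "No test cases ran"
assert results["forbidden_tool"] == 0, "Skill called a tool it shouldn't"
assert results["trigger_miss"] / total < 0.05
assert results["missing_step"] / total < 0.10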

Crude, but it's the difference between "the skill works on my laptop" and "the skill works 9 times out of 10 across 30 scenarios, and CI will tell me when that changes".

A few rules I've internalised:

Pin the model version. "Latest Sonnet" is not a thing you can test against. Pin it in the harness, bump it deliberately, re-run the suite.
Run on every skill edit. Tiny wording changes in the description can swing trigger rates by 20%. If you don't notice, you don't fix it.
Assert on tool calls, not prose. "Did it write 'Done.' in the response?" is a flaky test. "Did it call Bash with dotnet format?" is not.
Use a smaller model as judge for fuzzy bits. When you need to score the quality of an output ("is this PR body coherent?"), give the prompt and the output to a smaller model and ask it to grade against a rubric. Cheap, surprisingly reliable, and tells you when quality drifts.
Keep the test suite small. 30 well-chosen scenarios will catch more regressions than 300 lazy ones, and they'll actually run in CI.
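To show what asserting on tool calls can look like for ordering and scope, here is a minimal sketch. The trace shape (a list of dicts with "tool", "input" and "path" keys) is an assumption for illustration; adapt it to whatever your session runner actually records.

```python
def first_index(calls: list, tool: str, contains: str = ""):
    """Index of the first call to `tool` whose input contains `contains`, or None."""
    for i, call in enumerate(calls):
        if call["tool"] == tool and contains in call.get("input", ""):
            return i
    return None


def ran_before(calls: list, first: tuple, second: tuple) -> bool:
    """True if a call matching `first` happened before one matching `second`."""
    a = first_index(calls, *first)
    b = first_index(calls, *second)
    return a is not None and b is not None and a < b


def edits_outside(calls: list, allowed_prefixes: tuple = ("src/", "tests/")) -> list:
    """Paths touched by Edit calls that fall outside the allowed directories."""
    return [c.get("path", "") for c in calls
            if c["tool"] == "Edit"
            and not c.get("path", "").startswith(allowed_prefixes)]


trace = [
    {"tool": "Bash", "input": "dotnet format --verify-no-changes"},
    {"tool": "Grep", "input": "Newtonsoft.Json"},
    {"tool": "Edit", "path": "src/Api/Client.cs"},
    {"tool": "Edit", "path": ".github/workflows/ci.yml"},
]
print(ran_before(trace, ("Bash", "dotnet format"), ("Grep", "Newtonsoft.Json")))  # True
print(edits_outside(trace))  # ['.github/workflows/ci.yml']
```

Neither check cares what the model said, only what it did, which is why they stay stable from run to run.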

Cursor and Copilot don't have a first-class SDK for this kind of automation yet, which is why my serious skills live in Claude. For Cursor specifically, the best I've managed is a shell script that opens a temp project, triggers a rule via the CLI, and greps the chat log. Ugly, but it caught a regression last month where a "tiny wording change" cut the trigger rate from 96% to 41%. You'll do worse without a harness; you won't do better.

Where I've seen them fail

A team I worked with last year wrote a single .cursorrules file that was effectively the company's entire engineering handbook in markdown. Six thousand lines. Triggered on every file. Their token bill tripled and the actual code suggestions got worse, because the model was now drowning in context and couldn't tell what was load-bearing.

We deleted it and replaced it with seven path-scoped rules, each under 80 lines, each with a tight trigger: csharp-review.mdc globbing **/*.cs, sql-review.mdc globbing **/*.sql, terraform-review.mdc globbing infra/**/*.tf, and so on. Same content, organised by trigger. Bill came down, suggestions got sharper.

The lesson: a skill isn't a place to put knowledge, it's a place to put knowledge that should travel with a specific trigger. If two skills want the same rule, that rule is probably the codebase's problem, not the skill's.

When to not write the skill at all

A non-exhaustive list of things I've seen people make skills of, that should have been something else:

One-shot tasks → just ask the model directly.
"Things I always forget" → that's a personal note, not a skill.
"Things the whole team always forgets" → fix the codebase, the docs, or the CI.
Style preferences → top-level user instructions.
"Reminders" → if it has to fire every time, it's a prompt, not a skill.
Anything that needs live data → MCP server, or just let the model go and look.
Anything that's actually a lint rule → make it a lint rule.

The test I keep coming back to: would I be annoyed if this fired half the time it shouldn't, and didn't fire half the time it should? If the answer is no, the skill doesn't earn its place, because that's exactly the failure mode of a vague trigger and nobody would notice the difference.

The tool-by-tool reality check

A short, opinionated take on each one I use.

Claude Code has the cleanest model. Frontmatter does real routing, progressive disclosure works, allowed-tools does real work. The bar to write a good one is the lowest.
Cursor is fine when you use globs properly and resist the urge to write a giant alwaysApply: true rule. The legacy single-file .cursorrules is a trap - migrate to the .cursor/rules/ directory and break things up.
Copilot's repo-wide copilot-instructions.md earns its keep for team-wide conventions, and it works inside both the web Copilot and the editor extensions. The per-path instructions/*.instructions.md is newer and less mature, but it's the right shape.
ChatGPT custom GPTs are the bluntest of the bunch - no routing, no globs, just a single Instructions field. Treat them as one giant skill with no off-switch. They're good for narrow personas ("you are my SQL explainer"), bad for anything that needs to fire conditionally.

If you find yourself writing the same rule across every tool you've got open, you almost certainly want it in your code - a lint rule, an analyzer, a pre-commit hook - and a one-line reminder in the AI tools that points at it. The AI tools are for the things you can't enforce in code.

Closing

Skills are a small feature pretending to be a small feature. The marketing didn't oversell them and the docs didn't undersell them. They're a sharp tool for a narrow job: codifying the when of how your team works, so the model can do it without being asked. Used right, they save real time across whichever assistant you've got open. Used as wikis, they cost real money in every tool you've installed them in.

If you take one thing from this: write the trigger first. If you can't write a good one, you don't have a skill yet - you have a thought, and a thought belongs in a file you're allowed to delete.