Pattern 01

Code-First Deterministic Tools

CLI tools that do the work LLMs shouldn't. Write a test, run it, get the same result tomorrow.

Problem

An issue has 5 checkboxes, references 3 file paths, and includes acceptance criteria. That's a well-specified issue. The planning approach should be lightweight.

But if you ask the LLM to figure that out from the raw issue body, you're spending tokens on pattern matching that regex can do in milliseconds. Worse, the LLM might miss a checkbox, miscount the signals, or change its assessment depending on how the issue is worded.

Every time you ask the LLM to do deterministic work, you're paying for unpredictability. You can't write a test for "the LLM usually counts checkboxes correctly." You can't debug why it called the same issue "complex" yesterday and "simple" today.

I ran into this building an issue triage agent. The same issue got different complexity ratings on different runs. Moving the signal counting to a script made it deterministic overnight.

Solution

Move deterministic work into CLI tools with a standard contract. The tool takes named parameters, processes them with regular code (regex, scoring, file I/O, validation), and outputs JSON to stdout. The LLM calls the tool and consumes the result.

No API. No server. Just scripts that run locally and print JSON.

The Tool Contract

The contract is simple:

Input: Named CLI parameters.
Processing: Regular, deterministic code. No LLM calls inside the tool.
Output: JSON to stdout.
Execution: Run with bun (TypeScript) or python3. The skill calls the tool via bash and reads stdout.

$ bun tools/get-issue-signals.ts --owner "acme" --repo "app" --issue 42
{"checkboxes":3,"filePaths":2,"codeBlocks":1,"wordCount":450}

A minimal tool looks like this:

#!/usr/bin/env bun
// tools/get-issue-signals.ts
import { parseArgs } from "util";

const { values } = parseArgs({
  args: Bun.argv.slice(2),
  options: {
    owner: { type: "string" },
    repo: { type: "string" },
    issue: { type: "string" },
  },
});

// Fetch issue body from GitHub
const res = await fetch(
  `https://api.github.com/repos/${values.owner}/${values.repo}/issues/${values.issue}`,
  { headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` } },
);
const { body } = await res.json();

// Count structural signals with regex
const checkboxes = (body.match(/- \[[ x]\]/g) ?? []).length;
const filePaths = (body.match(/[\w./]+\.\w{1,4}/g) ?? []).length;
const codeBlocks = (body.match(/```/g) ?? []).length / 2;

console.log(
  JSON.stringify({
    checkboxes,
    filePaths,
    codeBlocks,
    wordCount: body.split(/\s+/).length,
  }),
);

That's it. Named params in, JSON out. Testable with any test runner. Debuggable with bun tools/get-issue-signals.ts --owner acme --repo app --issue 42.
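What "testable" means in practice: because the counting is plain code, it can be exercised without the network or the CLI. A minimal sketch, assuming the regex counting is factored into a pure `countSignals` helper (a hypothetical refactor, not shown in the tool above):

```typescript
// Hypothetical refactor for testability: the signal counting from the tool
// above, extracted into a pure function with no network or CLI dependency.
function countSignals(body: string) {
  return {
    checkboxes: (body.match(/- \[[ x]\]/g) ?? []).length,
    filePaths: (body.match(/[\w./]+\.\w{1,4}/g) ?? []).length,
    wordCount: body.split(/\s+/).length,
  };
}

// Deterministic: the same input always yields the same counts.
const sample = "- [ ] update src/app.ts\n- [x] add tests";
console.log(countSignals(sample).checkboxes); // 2
console.log(countSignals(sample).filePaths); // 1
```

The same function runs inside the CLI entry point and inside the test file, so the behavior under test is exactly the behavior the LLM sees.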

Self-Describing Tools

There's a gap in the pattern so far: how does the LLM know what the tool's output looks like?

You can describe it in the skill. You can hardcode field names. But the moment the tool changes, the skill drifts and you find out at runtime, not at commit time. I added this after a tool changed its output and three skills broke silently.

The fix is to let the tool describe itself. Every deterministic tool supports a --schema flag that prints its output contract:

$ bun tools/analyze-issue.ts --schema
{"type":"object","properties":{"complexity":{"type":"string","enum":["lean","standard","full"]}, ...}}

The trick is where the schema comes from. It's not a separate file. It's not a doc comment. It's the same object the tool uses to validate its own output before printing.

import { parseArgs } from "util";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const Output = z.object({
  complexity: z.enum(["lean", "standard", "full"]),
  score: z.number().int().min(0).max(10),
  instructions: z.string(),
});

const { values } = parseArgs({
  args: Bun.argv.slice(2),
  options: { schema: { type: "boolean" } },
  strict: false,
});

// --schema prints the contract and exits before doing any work
if (values.schema) {
  console.log(JSON.stringify(zodToJsonSchema(Output)));
  process.exit(0);
}

// Normal path: validate the result against the same schema before printing
const result = await analyze(values);
console.log(JSON.stringify(Output.parse(result)));

One definition, three uses: it validates the output at runtime, it generates the schema on demand, and it types the code. The tool can't lie about its output shape because the shape IS the validator. If someone changes the output without updating the schema, the unit test that runs the tool against its own schema fails in CI.
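What that CI check amounts to: feed the tool's actual output back through its published schema. A toy sketch of the idea, with `checkAgainstSchema` as a hypothetical mini-validator covering only the JSON Schema subset used here (in the real setup, the Zod schema itself does this job):

```typescript
// Toy drift check: does a sample tool output conform to the schema the
// tool publishes? A real test would parse the output of `--schema` and
// use a full JSON Schema validator instead of this hand-rolled subset.
type Schema = {
  type: string;
  properties?: Record<string, Schema>;
  enum?: string[];
  required?: string[];
};

function checkAgainstSchema(value: any, schema: Schema): boolean {
  if (schema.type === "object") {
    if (typeof value !== "object" || value === null) return false;
    for (const key of schema.required ?? []) {
      if (!(key in value)) return false;
    }
    return Object.entries(schema.properties ?? {}).every(
      ([k, s]) => !(k in value) || checkAgainstSchema(value[k], s),
    );
  }
  if (schema.enum) return schema.enum.includes(value);
  return typeof value === schema.type;
}

const schema: Schema = {
  type: "object",
  required: ["complexity", "score"],
  properties: {
    complexity: { type: "string", enum: ["lean", "standard", "full"] },
    score: { type: "number" },
  },
};

console.log(checkAgainstSchema({ complexity: "lean", score: 8 }, schema)); // true
console.log(checkAgainstSchema({ complexity: "huge", score: 8 }, schema)); // false
```

If someone adds a field to the output without updating the schema, or renames `complexity`, the second kind of check fails in CI rather than silently breaking a skill at runtime.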

Python tools do the same with Pydantic. The pattern is framework-agnostic: any library that lets you define a schema once and use it for validation plus JSON Schema generation works.

You pay a dependency (Zod or Pydantic). You get a contract the LLM can discover, a validator that enforces it, and a CI check that catches drift.

The Spectrum

The same problem (deciding how to plan a GitHub issue) can be solved at three levels of sophistication. Each level moves more decision-making from the LLM to code.

Level 1: Data

The tool returns raw signals from the issue body. The LLM interprets them.

#!/usr/bin/env bun
// tools/get-issue-signals.ts
import { parseArgs } from "util";

const { values } = parseArgs({
  args: Bun.argv.slice(2),
  options: {
    owner: { type: "string" },
    repo: { type: "string" },
    issue: { type: "string" },
  },
});

const res = await fetch(
  `https://api.github.com/repos/${values.owner}/${values.repo}/issues/${values.issue}`,
  { headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` } },
);
const { body, title } = await res.json();
const text = `${title} ${body}`;

console.log(
  JSON.stringify({
    checkboxes: (body.match(/- \[[ x]\]/g) ?? []).length,
    filePaths: (text.match(/[\w./]+\.\w{1,4}/g) ?? []).length,
    codeBlocks: (body.match(/```/g) ?? []).length / 2,
    acceptanceCriteria: /acceptance|criteria|must|should/i.test(body),
    wordCount: body.split(/\s+/).length,
  }),
);
$ bun tools/get-issue-signals.ts --owner acme --repo app --issue 42
{
  "checkboxes": 5,
  "filePaths": 3,
  "codeBlocks": 1,
  "acceptanceCriteria": true,
  "wordCount": 450
}

The LLM gets raw signals and decides what to do. It has full discretion. Good for cases where the data needs interpretation in context.

Level 2: Classification

The tool counts signals and classifies issue complexity. It returns a complexity field that the skill can branch on.

#!/usr/bin/env bun
// tools/classify-issue.ts
import { parseArgs } from "util";

// ... same fetch + signal detection as Level 1 ...
const signals = {
  checkboxes: (body.match(/- \[[ x]\]/g) ?? []).length,
  filePaths: (text.match(/[\w./]+\.\w{1,4}/g) ?? []).length,
  codeBlocks: (body.match(/```/g) ?? []).length / 2,
  acceptanceCriteria: /acceptance|criteria|must|should/i.test(body),
  wordCount: body.split(/\s+/).length,
};

// Deterministic scoring
let score = 0;
if (signals.checkboxes >= 2) score += 3;
if (signals.acceptanceCriteria) score += 2;
if (signals.filePaths >= 1) score += 2;
if (signals.codeBlocks >= 1) score += 1;
if (signals.wordCount >= 200) score += 1;

const complexity = score >= 7 ? "lean" : score >= 4 ? "standard" : "full";

console.log(JSON.stringify({ complexity, score, signals }));
$ bun tools/classify-issue.ts --owner acme --repo app --issue 42
{ "complexity": "lean", "score": 8, "signals": { "checkboxes": 5, ... } }

The classification is deterministic and testable. An issue with 5 checkboxes and acceptance criteria always scores 8+, always routes to "lean." The skill reads complexity and follows the matching procedure. The LLM doesn't decide the complexity level.
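That claim is directly testable. A sketch of the kind of unit test Level 2 enables, assuming the scoring is extracted into a pure `classify` function (hypothetical; the tool above inlines it):

```typescript
// Hypothetical pure extraction of the Level 2 scoring, so the routing
// logic can be unit-tested with hand-built signal objects.
type Signals = {
  checkboxes: number;
  filePaths: number;
  codeBlocks: number;
  acceptanceCriteria: boolean;
  wordCount: number;
};

function classify(s: Signals) {
  let score = 0;
  if (s.checkboxes >= 2) score += 3;
  if (s.acceptanceCriteria) score += 2;
  if (s.filePaths >= 1) score += 2;
  if (s.codeBlocks >= 1) score += 1;
  if (s.wordCount >= 200) score += 1;
  const complexity = score >= 7 ? "lean" : score >= 4 ? "standard" : "full";
  return { complexity, score };
}

// A well-specified issue always routes the same way, run after run.
const wellSpecified = classify({
  checkboxes: 5,
  filePaths: 3,
  codeBlocks: 1,
  acceptanceCriteria: true,
  wordCount: 450,
});
console.log(wellSpecified.complexity); // "lean"
console.log(wellSpecified.score); // 9
```

An LLM asked to make this call can drift between runs; the function cannot.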

Level 3: Instructions

The tool scores, classifies, and builds the complete planning procedure. It returns an instructions field with literal steps the LLM follows verbatim.

#!/usr/bin/env bun
// tools/analyze-issue.ts
import { parseArgs } from "util";

// ... same fetch + signal detection + scoring as Level 2 ...
const complexity = score >= 7 ? "lean" : score >= 4 ? "standard" : "full";

const procedures: Record<string, string> = {
  lean: `## Lean Plan
1. The issue is well-specified. Skip deep analysis.
2. List the files to modify based on the file paths in the issue.
3. Write a 3-5 bullet implementation plan.
4. Start coding immediately.`,
  standard: `## Standard Plan
1. Read the full issue and identify acceptance criteria.
2. Search the codebase for related code.
3. Write a plan covering: files to modify, approach, edge cases.
4. Ask the user to approve the plan before coding.`,
  full: `## Full Plan
1. The issue is underspecified. Gather more context before planning.
2. List what information is missing (acceptance criteria, affected files, scope).
3. Search the codebase for related patterns.
4. Write a detailed plan with alternatives and trade-offs.
5. Ask the user to approve the plan before coding.`,
};

console.log(
  JSON.stringify({
    complexity,
    score,
    instructions: procedures[complexity],
  }),
);
$ bun tools/analyze-issue.ts --owner acme --repo app --issue 42
{
  "complexity": "lean",
  "score": 8,
  "instructions": "## Lean Plan\n1. The issue is well-specified. Skip deep analysis..."
}

At Level 3, the LLM does zero branching. It calls the tool, reads instructions, and follows them. All decision logic, all branching, all procedure selection is in deterministic, testable code. The LLM is a pure executor.

These examples are simplified for illustration. Real tools handle edge cases, validation, and richer output structures.

When to Use Each Level

Data when the LLM needs facts to make a judgment call. The situation is ambiguous, the data is one input among many, and you want the LLM's ability to synthesize.

Classification when you want testable, deterministic routing but the procedures are simple enough to live in the skill file. You get consistent categorization without building full instruction sets.

Instructions when there are 3+ paths with materially different multi-step procedures, or when invisible failures are unacceptable. This is the highest investment but also the highest reliability.

Trade-offs

The upside is obvious: you can write unit tests for your routing logic, run the tool manually to debug, and get the same output tomorrow that you got today. No tokens wasted on work a function handles better.

The downside: you're writing and maintaining code. When procedures change, you update a script, not just a prompt. And the tool only handles what you coded for. Novel inputs might need a fallback path.
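One hypothetical shape for that fallback path (an assumption, not something the pattern prescribes): the tool reports when its heuristics don't apply, and the skill falls back to LLM judgment instead of trusting a forced classification.

```typescript
// Hypothetical fallback: when the deterministic heuristics can't apply
// (e.g. an empty issue body), say so explicitly instead of guessing.
// The scoring here is deliberately simplified for the sketch.
function classifyOrFallback(body: string | null) {
  if (!body || body.trim().length === 0) {
    return { complexity: null, fallback: "llm-judgment", reason: "empty issue body" };
  }
  const checkboxes = (body.match(/- \[[ x]\]/g) ?? []).length;
  const score = checkboxes >= 2 ? 3 : 0; // simplified scoring for the sketch
  return { complexity: score >= 3 ? "standard" : "full", fallback: null };
}

console.log(classifyOrFallback("").fallback); // "llm-judgment"
```

The skill then branches on `fallback`: a non-null value means "the tool couldn't decide," which is a deterministic, testable signal in its own right.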

Related patterns

Skill Orchestration covers the consumer side: how skills call these tools, read their output, and orchestrate the workflow.

All the tools from these examples are in the repo, ready to run: code-first-agents on GitHub