Natural-language harness directory
Agent Workspace
Browse verified harness packs with source links, evaluation guidance, and starter instructions for Claude Code, Codex, Cursor, and Convex.
Directory
20 packs
Operator Chat Rail
Shared chat plus traceable assistant rail for high-context workflows.
A proven interface pattern for apps that need collaborative chat in the center and a persistent agent rail for plan trace, tool execution, telemetry, sources, and quality checks.
Not yet measured
~0 installs
Answer Review and Quality Checks
Persist the answer as a reviewable artifact, not just a message string.
A runtime pattern that persists final answers, quality checks, evaluation metadata, and downstream review state as first-class records instead of burying them in message text.
Not yet measured
~0 installs
Four Design Questions
The 2-minute orientation: every coding agent answers the same 4 questions.
A short reference pack that frames every coding-agent architecture as an answer to four recurring questions: where reasoning lives, how many execution engines there are, what the default safety posture is, and which resource constraint is binding. Sourced from VILA-Lab's Dive into Claude Code (arXiv 2604.14228, CC-BY-NC-SA-4.0). Read this first: every other dive-sourced pack in the catalog is a zoom-in on one row of this table.
Not yet measured
~0 installs·~0 tokens saved
Golden Eval Harness
Golden set + LLM judge + traces + CI gates. Ships evals that stay green.
A complete evaluator-optimizer harness: curated golden set, rubric-based LLM-as-judge, trace observability (Langfuse/Braintrust), pass-rate thresholds, and CI integration. Enforces a contract so every candidate run produces a comparable results.json; the judge re-runs deterministically. Matches the Anthropic eval cookbook and OpenAI Evals conventions.
pass 91%·3.9k tok·$0.02
~0 installs·~0 tokens saved
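For the Golden Eval Harness pack, the pass-rate threshold behind a CI gate can be sketched as follows. This is a minimal illustration, not the pack's implementation: the record shape (a list of objects with a boolean `passed` field) and the 0.9 threshold are assumptions, since the listing does not specify the `results.json` schema.

```python
def pass_rate(results):
    """Fraction of golden-set cases the judge marked as passed."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def ci_gate(results, threshold=0.9):
    """Return (ok, rate); a CI job would fail the build when ok is False.

    The `passed` field and the default threshold are hypothetical.
    """
    rate = pass_rate(results)
    return rate >= threshold, rate


# Example: 9 passing cases and 1 failing case.
sample_results = [{"id": f"case-{i}", "passed": i != 0} for i in range(10)]
gate_ok, gate_rate = ci_gate(sample_results, threshold=0.9)
```

A CI step could load the candidate run's `results.json`, call `ci_gate`, and exit non-zero when the gate is not met.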
Claude Code Guide
How the Claude Code harness actually works — AGENTS.md, skills, hooks, subagents.
Reference guide to the Claude Code harness in production use. Covers the AGENTS.md convention (now adopted in 60k+ repos), the .claude/skills directory structure, subagent delegation via the Task tool, PostToolUse hooks for automation, session memory layout, context-window tactics, and the decision tree between Edit and Write. Written for teams onboarding Claude Code into a real codebase, not for demos.
Not yet measured
~0 installs·~0 tokens saved
Claude Advisor Pattern v2
Sonnet executes, Opus advises. Route intelligence by confidence, pay Opus only when it matters.
Evaluator-optimizer harness built on the Anthropic Advisor Pattern. Claude Sonnet 4.6 runs the task; on low-confidence decisions (threshold = 0.7), after two consecutive tool failures, or when the executor explicitly escalates, it calls Claude Opus 4.6 as an advisor. The advisor returns a structured recommendation; Sonnet applies it and continues. Cost split is measured — typical deployments spend 8-12% of tokens on Opus while retaining ~93% of an Opus-only pass-rate.
Not yet measured
~0 installs·~0 tokens saved
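The three escalation triggers named in the Claude Advisor Pattern v2 description can be sketched as a single predicate. The 0.7 threshold and the two-failure rule come from the description; the `ExecutorState` type, its field names, and the choice of a strict "below threshold" comparison are assumptions made for illustration.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7          # from the pack description
MAX_CONSECUTIVE_TOOL_FAILURES = 2   # "two consecutive tool failures"


@dataclass
class ExecutorState:
    """Hypothetical snapshot of the executor at a decision point."""
    confidence: float
    consecutive_tool_failures: int = 0
    explicit_escalation: bool = False


def should_escalate(state):
    """True when the executor should consult the advisor model."""
    return (
        state.confidence < CONFIDENCE_THRESHOLD  # assumption: strictly below
        or state.consecutive_tool_failures >= MAX_CONSECUTIVE_TOOL_FAILURES
        or state.explicit_escalation
    )
```

Keeping the rule in one pure function makes it easy to log why each advisor call happened when measuring the cost split.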
Durable Streaming
Stream plan and execution state through stored events, not transient sockets alone.
A transport and persistence pattern for agents that need progressive rendering, reconnect safety, and post-run replay without losing the benefits of token or step streaming.
Not yet measured
~0 installs
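The idea behind the Durable Streaming pack, streaming through stored events so a client can reconnect and replay, can be sketched with an in-memory log. This is an assumption-laden toy: the `EventLog` class, the `seq` numbering, and the event fields are hypothetical, and a real deployment would back the log with a database.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log of plan/execution events with monotonic sequence numbers."""
    events: list = field(default_factory=list)

    def append(self, kind, payload):
        seq = len(self.events) + 1
        self.events.append({"seq": seq, "kind": kind, "payload": payload})
        return seq

    def replay(self, after_seq=0):
        """Events a client has not seen yet, in order."""
        return [e for e in self.events if e["seq"] > after_seq]


log = EventLog()
log.append("plan", "search docs")
log.append("step", "tool call started")
log.append("step", "tool call finished")

# A client that disconnected after seq 1 resumes from its last seen event.
missed = log.replay(after_seq=1)
```

Because the client only tracks the last sequence number it rendered, the same `replay` call serves live reconnects and post-run playback.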
Linear-style Command Palette
Cmd+K done right: grouped, keyboard-first, fully accessible.
A production-grade Cmd+K command palette matching the Linear implementation: debounced fuzzy search, grouped sections (Issues, Projects, Actions), recent items, empty state, and WAI-ARIA 1.2 combobox semantics. Drops onto any React app via the cmdk primitive.
Not yet measured
~0 installs·~0 tokens saved
shadcn + TanStack Data Table
The canonical admin-UI table, wired correctly.
A feature-complete data table built on TanStack Table v8 and shadcn/ui primitives. Ships with sortable columns, server-side pagination, column visibility toggles, row selection, per-column filters, sticky header, skeleton loading, a11y-correct header semantics, and an empty state. Replaces the 45-minute stub that every project writes twice.
Not yet measured
~0 installs·~0 tokens saved
Hybrid RAG: BM25 + Vector + Rerank
Sparse + dense + cross-encoder. The 2025 retrieval default.
Production-grade hybrid retrieval: BM25 for lexical recall, dense embeddings for semantic recall, Reciprocal Rank Fusion to merge, and a BGE / Cohere cross-encoder rerank stage for precision. Emits cited retrieved_docs ready to pass to a grounded-answer generator. Matches the retrieval stack Anthropic, Pinecone, and Weaviate recommend for >100k-doc corpora.
Not yet measured
~0 installs·~0 tokens saved
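The merge step in the Hybrid RAG pack, Reciprocal Rank Fusion, scores each document as the sum of 1 / (k + rank) over every ranking in which it appears. A minimal sketch follows; the constant k = 60 is the commonly used default rather than a value stated in the listing, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    rankings: iterable of lists, each ordered best-to-worst.
    k: smoothing constant (assumed default of 60).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]    # lexical ranking (hypothetical)
dense_hits = ["b", "c", "d"]   # embedding ranking (hypothetical)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The fused list would then be truncated and passed to the cross-encoder rerank stage, which is not shown here.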
Pattern Decision Tree
Which canonical pattern? A constraint-first tree.
A reference pack that helps you pick a canonical agent pattern given your hard constraint. Starts from cost / latency / accuracy / complexity tolerance and walks to prompt-chaining, routing, parallelization, orchestrator-workers, or evaluator-optimizer. Grounded in Anthropic's 'Building effective agents', the Tongyi NLA paper, and Stanford's meta-harness work.
Not yet measured
~0 installs·~0 tokens saved
Injection Surface Audit
Every agent product ships injection surfaces. Audit them before an attacker does.
A systematic prompt-injection surface audit for agent-backed products. Stanford's March 2026 meta-harness study found roughly 1-in-4 community-authored skills contained a practical injection vector. This pack translates that finding into an actionable checklist: tool allow-lists, URL validation and SSRF guards, sanitization of content fetched from external sources, signed skill manifests, scoped permissions, and a review cadence. Target auditor: a senior engineer who owns an agent in production and has one afternoon to harden it.
Not yet measured
~0 installs·~0 tokens saved
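One item on the Injection Surface Audit checklist, URL validation with an SSRF guard, can be sketched as an allow-list check. This is a partial illustration under stated assumptions: the allowed host is hypothetical, raw IP literals are refused outright, and the sketch does not cover DNS rebinding or redirects, which a production guard would also need to handle.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"docs.example.com"}  # hypothetical allow-list


def is_url_allowed(url):
    """Return True only for https URLs whose host is on the allow-list."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = parts.hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: fall through to the hostname allow-list.
        return host in ALLOWED_HOSTS
    # IP literals (e.g. link-local metadata endpoints) are never allowed here.
    return False
```

An allow-list is stricter than a block-list of private ranges, which is usually the safer default for an agent's fetch tool.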
Turn Execution Pipeline
The 9-step loop, 5 context shapers, 3 recovery paths. The 98.4% infrastructure under every turn.
Canonical harness reference for Claude Code's turn-execution pipeline, derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228). The paper's anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure; this pack documents the infrastructure. It names the 9 steps (settings resolution, state initialization, context assembly, five pre-model shapers, model call, tool dispatch, permission gate, tool execution, stop-condition check), the 5 pre-model context shapers in cheapest-first order (Budget Reduction, Snip, Microcompact, Context Collapse, Auto-Compact) with their triggers, and the 3 recovery mechanisms (max output token escalation up to 3 retries, reactive compaction firing at most once per turn, prompt-too-long overflow chain: context-collapse → reactive compaction → terminate). The pack maps directly onto the paper's three recurring design commitments: graduated layering over monolithic mechanisms, append-only designs favoring auditability over query power, and model judgment within a deterministic harness. Use it to implement, clone, or critique any production agent loop. Target: a staff engineer rebuilding the loop in a different language (Rust, Python, Bun) or auditing an existing one for compaction-race and stop-condition bugs.
Not yet measured
~0 installs·~0 tokens saved
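The prompt-too-long overflow chain described in the Turn Execution Pipeline pack (context-collapse, then reactive compaction, then terminate) can be sketched as a small state machine. The `TurnState` type and the function name are hypothetical, and the sketch assumes each recovery stage is tried at most once per turn; the description only states that limit for reactive compaction.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    """Hypothetical per-turn recovery bookkeeping."""
    context_collapse_tried: bool = False
    reactive_compaction_fired: bool = False  # fires at most once per turn


def next_overflow_action(state):
    """Pick the next recovery step after a prompt-too-long error."""
    if not state.context_collapse_tried:
        state.context_collapse_tried = True
        return "context-collapse"
    if not state.reactive_compaction_fired:
        state.reactive_compaction_fired = True
        return "reactive-compaction"
    return "terminate"


turn = TurnState()
actions = [next_overflow_action(turn) for _ in range(4)]
```

Tracking the flags on the turn rather than globally is what keeps a compaction retry from looping, which is the class of bug the pack's audit target mentions.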
Seven Safety Layers
Defense-in-depth for tool execution. Deny > ask > allow. All 7 layers, in order, with honest failure modes.
Security reference for Claude Code's 7-layer permission architecture, derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228). The paper's anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure; the safety layers are a large slice of that infrastructure. The pack enumerates the 7 independent layers in order — (1) tool pre-filtering, (2) deny-first rule evaluation, (3) permission mode constraints across 7 modes (plan, default, acceptEdits, auto, dontAsk, bypassPermissions, bubble), (4) auto-mode ML classifier (yoloClassifier.ts two-stage fast-filter + chain-of-thought with a timeout race against a pre-computed classification), (5) shell sandboxing for filesystem and network isolation, (6) non-restoration on resume (permissions re-established each session; trust is not persistent), (7) PreToolUse hooks with permissionDecision return values. It also names the 4-stage authorization pipeline (pre-filter → hooks → rules → handler with 4 branches). The pack carries the paper's three recurring design commitments — graduated layering, append-only auditability, model judgment inside a deterministic harness — and documents the honest failure modes the paper flags: ~50 subcommands bypass shell-layer analysis, the classifier's timeout race can leave decisions ambiguous, and the pre-trust window allowed 4 published CVEs where extensions executed before the trust dialog appeared. Target: a staff engineer or security lead reviewing an agent's tool-execution surface before shipping.
Not yet measured
~0 installs·~0 tokens saved
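The "deny > ask > allow" ordering from the Seven Safety Layers pack can be illustrated with a tiny rule evaluator. The rule format (glob patterns grouped by decision) and the fallback to "ask" for unmatched calls are assumptions for this sketch, not the format Claude Code uses.

```python
from fnmatch import fnmatchcase

PRECEDENCE = ("deny", "ask", "allow")  # deny-first evaluation order


def decide(command, rules):
    """Return the first decision, in precedence order, with a matching pattern.

    rules: dict mapping "deny" / "ask" / "allow" to lists of glob patterns
    (hypothetical format).
    """
    for decision in PRECEDENCE:
        if any(fnmatchcase(command, pattern) for pattern in rules.get(decision, [])):
            return decision
    return "ask"  # assumption: unmatched commands require confirmation


example_rules = {
    "allow": ["git *"],
    "ask": ["npm install*"],
    "deny": ["git push*"],
}
```

With these rules, `git push origin main` is denied even though the broader `git *` pattern would allow it, because deny rules are checked first.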
Nine Context Sources
CLAUDE.md is user context, not system prompt. 9 sources, 4 hierarchy levels, zero embeddings.
Reference for the 9 ordered context sources Claude Code assembles before every model call, and the 4-level CLAUDE.md hierarchy that feeds one of them. Derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228), whose anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure. The pack's entire frame is the paper's explicitly labeled 'Critical design choice' from §Context Construction: CLAUDE.md is user context (probabilistic compliance by the model), NOT system prompt (deterministic enforcement by the runtime). Permission rules are the deterministic layer; CLAUDE.md instructs but does not enforce. The pack enumerates the 9 sources in order (system prompt → environment info → CLAUDE.md hierarchy → path-scoped rules → auto-memory → tool metadata → conversation history → tool results → compact summaries), breaks the CLAUDE.md hierarchy into its 4 precedence levels (/etc managed, ~/.claude user, project, .local gitignored), and documents the file-based memory model that replaces vector DB / embedding approaches with an LLM-based scan of up to 5 memory-file headers on demand. That memory is fully inspectable, editable, and version-controllable by the user, which makes one of the paper's three recurring design commitments (append-only auditability over query power) concrete. Target: an agent engineer who keeps hearing 'just put it in CLAUDE.md' as a fix and needs to understand why that is wrong for safety-critical rules.
Not yet measured
~0 installs·~0 tokens saved
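The ordered assembly described in the Nine Context Sources pack can be sketched as a fixed ordering applied to whatever sources are present. The source names follow the description; the function, the dictionary input, and the behaviour of skipping absent sources are assumptions for illustration.

```python
# Order taken from the pack description.
CONTEXT_SOURCE_ORDER = [
    "system prompt",
    "environment info",
    "CLAUDE.md hierarchy",
    "path-scoped rules",
    "auto-memory",
    "tool metadata",
    "conversation history",
    "tool results",
    "compact summaries",
]


def assemble_context(available):
    """Return (source, content) pairs in canonical order.

    available: dict of source name -> content (hypothetical input shape).
    Sources that are absent are skipped.
    """
    return [
        (name, available[name])
        for name in CONTEXT_SOURCE_ORDER
        if name in available
    ]


assembled = assemble_context({
    "tool results": "ls output",
    "CLAUDE.md hierarchy": "project conventions",
    "system prompt": "base instructions",
})
```

Note that CLAUDE.md content lands after the system prompt as ordinary context, which is why the description treats it as guidance rather than enforcement.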
Subagent Delegation — Three Isolation Modes
SkillTool vs AgentTool, 6 built-in types, 3 isolation modes. The knob every harness engineer gets wrong first.
Harness pack covering Claude Code's subagent architecture, derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Frames the SkillTool vs AgentTool trade-off as the central decision: SkillTool injects instructions into the current context (cheap, same window); AgentTool spawns a fresh isolated conversation (~7x tokens, context-safe). Documents the 6 built-in subagent types (Explore, Plan, General-purpose, Claude Code Guide, Verification, Statusline-setup), custom `.claude/agents/*.md` files with YAML frontmatter (tools, model, permissionMode, hooks, skills, memory scope), the 3 isolation modes (worktree, remote, and in-process, which is the default), sidechain transcripts as separate JSONL files, and multi-instance coordination via POSIX flock() with zero external dependencies.
Not yet measured
~0 installs·~0 tokens saved
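The flock()-based coordination mentioned in the Subagent Delegation pack can be sketched with Python's `fcntl.flock`, which wraps the flock(2) system call on Unix-like systems. The lock-file path and helper name are hypothetical; this shows the general advisory-lock technique, not Claude Code's own code.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Try to take an exclusive advisory lock without blocking.

    Returns an open file descriptor holding the lock, or None if another
    holder already has it. Closing the descriptor releases the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


# Hypothetical lock file shared by several agent instances.
lock_path = os.path.join(tempfile.mkdtemp(), "agent.lock")
first = try_exclusive_lock(lock_path)
second = try_exclusive_lock(lock_path)  # refused while `first` is held
os.close(first)                         # releases the lock
third = try_exclusive_lock(lock_path)
os.close(third)
```

Because the lock lives in the kernel and is tied to the open file, it disappears automatically if the holding process crashes, which is part of why no external coordinator is needed.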
Extensibility — Four Mechanisms
Hooks → Skills → Plugins → MCP. Graduated context cost, three injection points, one decision tree.
Reference pack covering Claude Code's four extension mechanisms, derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Frames the graduated context-cost hierarchy: Hooks (zero context, 27 events, 4 exec types) → Skills (low, SKILL.md with 15+ YAML fields, injected via SkillTool) → Plugins (medium, 10 component types) → MCP Servers (high, 7 transport types). Documents the 5-step tool pool assembly (base enumeration up to 54 tools → mode filtering → deny pre-filter → MCP integration → dedup) and the three injection points in the loop: assemble() (what the model sees), model() (what it can reach), execute() (whether/how it runs). The pack is a decision tree: which mechanism should you reach for, and why reaching for MCP first is the most common mistake.
Not yet measured
~0 installs·~0 tokens saved
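The 5-step tool pool assembly listed in the Extensibility pack can be sketched as a straight pipeline. The step order follows the description; the function signature, the tool names, and the choice to keep the first occurrence when deduplicating are assumptions for illustration.

```python
def assemble_tool_pool(base_tools, mode_allowed, denied, mcp_tools):
    """Build the tool list in the five steps named in the description."""
    pool = list(base_tools)                          # 1. base enumeration
    pool = [t for t in pool if t in mode_allowed]    # 2. mode filtering
    pool = [t for t in pool if t not in denied]      # 3. deny pre-filter
    pool.extend(mcp_tools)                           # 4. MCP integration
    return list(dict.fromkeys(pool))                 # 5. dedup (first wins, assumed)


tool_pool = assemble_tool_pool(
    base_tools=["Read", "Write", "Bash"],   # hypothetical built-ins
    mode_allowed={"Read", "Bash"},          # e.g. a read-mostly mode
    denied={"Bash"},
    mcp_tools=["search", "Read"],           # hypothetical MCP-provided tools
)
```

Running the deny pre-filter before the model call means a denied tool is never advertised to the model at all, rather than being rejected at execution time.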
Session Persistence — Three Channels
Append-only JSONL across 3 channels. Permissions never restore on resume — the friction IS the safety.
Harness pack covering Claude Code's session persistence design, derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Documents the three persistence channels: append-only session JSONL transcripts (full conversation with chain-patched compaction boundaries), global `history.jsonl` for cross-session prompt recall (reverse-read for Up-arrow), and subagent sidechains as separate JSONL per subagent. Frames the critical deliberate non-feature: permissions are never restored on resume; trust is always re-established in the current session. The paper presents this as a design choice, not a UX bug: the user friction is the cost of maintaining the safety invariant. The pack also captures the append-only / chain-patching trade-off: auditability and simplicity over query power.
Not yet measured
~0 installs·~0 tokens saved
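The append-only JSONL channel and the reverse read used for prompt recall in the Session Persistence pack can be sketched with the standard library. The `prompt` field name and the helper functions are hypothetical, and reading the whole file before reversing is a simplification; a real implementation would likely seek from the end of the file.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one JSON object as a single line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def recent_prompts(path, limit):
    """Return up to `limit` prompts, newest first."""
    if limit <= 0:
        return []
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    return [json.loads(line)["prompt"] for line in reversed(lines[-limit:])]


history_path = os.path.join(tempfile.mkdtemp(), "history.jsonl")
for text in ("first", "second", "third"):
    append_record(history_path, {"prompt": text})

latest = recent_prompts(history_path, 2)
```

Since records are only ever appended, the file doubles as an audit trail, at the cost of needing a scan for anything other than "most recent" queries.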
Agent Design Space — Six Decisions
The architect's framework: answer these six before picking a pattern.
A reference pack that walks builders through the six recurring design decisions every production coding agent must answer: reasoning placement, safety posture, context management, extensibility, subagent architecture, and session persistence. Each decision pairs Claude Code's answer with realistic alternatives from LangGraph, Devin, SWE-Agent, OpenHands, and Aider, then closes with a Meta-Pattern of three commitments (graduated layering, append-only, model-judgment + deterministic harness). Think of this as 'the pack about picking packs': answer all six first, then let the catalog serve specific patterns that match your posture. Sourced from VILA-Lab/Dive-into-Claude-Code build-your-own-agent.md (arXiv 2604.14228, CC-BY-NC-SA-4.0).
Not yet measured
~0 installs·~0 tokens saved
CVE: The Pre-Trust Execution Window
4 CVEs share one root cause: extensions execute before the trust dialog renders.
A security pack targeting the pre-trust execution window documented in VILA-Lab's Dive into Claude Code (arXiv 2604.14228). Extensions — plugins, MCP servers, hooks, skill manifests — can execute code during CLI initialization *before* the trust dialog appears, creating a structurally privileged attack window that lives outside the deny-first permission pipeline. The paper's README and Safety section tie four patched CVEs to this root cause. This is NOT prompt injection at runtime — it is code execution at load time, with the user's full OS privileges, before the user has been shown what they are trusting. Pair with `injection-surface-audit` for runtime content attacks; use this pack when you are building or auditing a harness that loads third-party extensions. CC-BY-NC-SA-4.0 source terms apply.
Not yet measured
~0 installs·~0 tokens saved
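The mitigation implied by the Pre-Trust Execution Window pack, running no extension code until the user has accepted the trust prompt, can be sketched as a host that holds loaders unexecuted. The `ExtensionHost` class and its methods are hypothetical and only illustrate the ordering constraint; they do not describe how any CVE was actually patched.

```python
class ExtensionHost:
    """Holds extension loaders without running them until trust is confirmed."""

    def __init__(self, loaders):
        self._loaders = list(loaders)  # callables; nothing executes here
        self.trusted = False
        self.loaded = []

    def confirm_trust(self):
        """Called only after the user accepts the trust dialog."""
        self.trusted = True

    def load_extensions(self):
        if not self.trusted:
            raise PermissionError("trust dialog has not been accepted")
        self.loaded = [load() for load in self._loaders]
        return self.loaded


host = ExtensionHost([lambda: "plugin-a", lambda: "mcp-server-b"])
try:
    host.load_extensions()
    blocked = False
except PermissionError:
    blocked = True

host.confirm_trust()
extensions = host.load_extensions()
```

The key property is structural: initialization code has no path to execute a loader that bypasses the `trusted` check.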
Recent change traces · Cross-linked to packs
What changed, and why
Every coding session captured as rows — Scenario / Files / Changes / Why. Ctrl+F your own history.