Agent Workspace

Natural-language harness directory

Browse verified harness packs with source links, evaluation guidance, and starter instructions for Claude Code, Codex, Cursor, and Convex.

Packs · 23 | Traces · 2 | Publishers · 3

Verified · 5

Featured · 7

Tags: dive-into-claude-code · 9 | harness · 6 | reference · 6 | claude-code · 4 | architecture · 3 | claude-code-internals · 3 | security · 3 | a11y · 2

Directory

18 packs (18 shown), sorted featured first

Verified packs include starter instructions, sources, and evaluation guidance.

Community · reference

Four Design Questions

The 2-minute orientation: every coding agent answers the same 4 questions.

A short reference pack that frames every coding-agent architecture as an answer to four recurring questions: where reasoning lives, how many execution engines there are, what the default safety posture is, and which resource constraint binds. Sourced from VILA-Lab's Dive into Claude Code (arXiv 2604.14228, CC-BY-NC-SA-4.0). Read this first: every other dive-sourced pack in the catalog is a zoom-in on one row of this table.

Not yet measured

~0 installs · ~0 tokens saved

reference · architecture · entry-level

Agent Workspace · 2026-04-19

Production-ready · evaluator-optimizer

Golden Eval Harness

Golden set + LLM judge + traces + CI gates. Ships evals that stay green.

A complete evaluator-optimizer harness: curated golden set, rubric-based LLM-as-judge, trace observability (Langfuse/Braintrust), pass-rate thresholds, and CI integration. Enforces a contract so every candidate run produces a comparable results.json; the judge re-runs deterministically. Matches the Anthropic eval cookbook and OpenAI Evals conventions.
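The pass-rate threshold and CI gate described above can be pictured as a small check over a results file. This is a minimal sketch, not the pack's implementation: the `results.json` layout (a list of objects with a boolean `passed` field), the function names, and the 0.9 threshold are all assumptions for illustration.

```python
import json
from pathlib import Path


def pass_rate(results: list[dict]) -> float:
    """Fraction of cases whose (hypothetical) `passed` field is true."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("passed")) / len(results)


def gate(results_path: str, threshold: float = 0.9) -> bool:
    """Return True when the run meets the threshold; CI would fail otherwise."""
    results = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return pass_rate(results) >= threshold
```

In CI, a wrapper script could exit non-zero whenever `gate(...)` returns False, which is what keeps a regression from merging.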

pass 91% · 3.9k tok · $0.02

~0 installs · ~0 tokens saved

eval · llm-as-judge · golden-set

Agent Workspace · 2026-04-16

Community · reference

Claude Code Guide

How the Claude Code harness actually works — AGENTS.md, skills, hooks, subagents.

Reference guide to the Claude Code harness in production use. Covers the AGENTS.md convention (now adopted in 60k+ repos), the .claude/skills directory structure, subagent delegation via the Task tool, PostToolUse hooks for automation, session memory layout, context-window tactics, and the decision tree between Edit and Write. Written for teams onboarding Claude Code into a real codebase, not for demos.

Not yet measured

~0 installs · ~0 tokens saved

reference · claude-code · agents-md

Agent Workspace · 2026-04-17

Community · harness

Production-ready · evaluator-optimizer

Claude Advisor Pattern v2

Sonnet executes, Opus advises. Route intelligence by confidence, pay Opus only when it matters.

Evaluator-optimizer harness built on the Anthropic Advisor Pattern. Claude Sonnet 4.6 runs the task; on low-confidence decisions (threshold = 0.7), after two consecutive tool failures, or when the executor explicitly escalates, it calls Claude Opus 4.6 as an advisor. The advisor returns a structured recommendation; Sonnet applies it and continues. Cost split is measured — typical deployments spend 8-12% of tokens on Opus while retaining ~93% of an Opus-only pass-rate.
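The three escalation triggers named above (confidence below 0.7, two consecutive tool failures, explicit escalation) reduce to a single predicate. The sketch below assumes a hypothetical `ExecutorState` record and treats "low confidence" as strictly below the threshold; how the executor actually reports confidence is not stated in the description.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7    # threshold quoted in the pack description
MAX_CONSECUTIVE_FAILURES = 2  # "two consecutive tool failures"


@dataclass
class ExecutorState:
    """Hypothetical snapshot of the executor between steps."""
    confidence: float               # assumed self-reported score in [0, 1]
    consecutive_tool_failures: int
    escalation_requested: bool = False


def should_call_advisor(state: ExecutorState) -> bool:
    """True when any of the three escalation triggers fires."""
    return (
        state.confidence < CONFIDENCE_THRESHOLD
        or state.consecutive_tool_failures >= MAX_CONSECUTIVE_FAILURES
        or state.escalation_requested
    )
```

Keeping the triggers in one pure function makes the routing decision easy to log and to replay against recorded runs.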

Not yet measured

~0 installs · ~0 tokens saved

harness · evaluator-optimizer · advisor

Agent Workspace · 2026-04-17

Linear-style Command Palette

Cmd+K done right: grouped, keyboard-first, fully accessible.

A production-grade Cmd+K command palette matching the Linear implementation: debounced fuzzy search, grouped sections (Issues, Projects, Actions), recent items, empty state, and WAI-ARIA 1.2 combobox semantics. Drops onto any React app via the cmdk primitive.

Not yet measured

~0 installs · ~0 tokens saved

ui · command-palette · cmdk

Agent Workspace · 2026-04-16

shadcn + TanStack Data Table

The canonical admin-UI table, wired correctly.

A content-complete data table built on TanStack Table v8 and shadcn/ui primitives. Ships with sortable columns, server-side pagination, column visibility toggles, row selection, per-column filters, sticky header, skeleton loading, a11y-correct header semantics, and an empty state. Replaces the 45-minute stub that every project writes twice.

Not yet measured

~0 installs · ~0 tokens saved

ui · data-table · tanstack

Agent Workspace · 2026-04-16

Recommended · hybrid

Hybrid RAG: BM25 + Vector + Rerank

Sparse + dense + cross-encoder. The 2025 retrieval default.

Production-grade hybrid retrieval: BM25 for lexical recall, dense embeddings for semantic recall, Reciprocal Rank Fusion to merge, and a BGE / Cohere cross-encoder rerank stage for precision. Emits cited retrieved_docs ready to pass to a grounded-answer generator. Matches the retrieval stack Anthropic, Pinecone, and Weaviate recommend for >100k-doc corpora.
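The merge step, Reciprocal Rank Fusion, scores each document by summing 1 / (k + rank) over every ranking in which it appears. A minimal sketch follows; the function name is illustrative, k = 60 is a commonly used default rather than a value taken from the pack, and ties are broken by document id only to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc id as a deterministic tie-breaker.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["a", "b", "c"]   # lexical ranking (illustrative ids)
dense_hits = ["b", "c", "a"]  # embedding ranking (illustrative ids)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The fused list would then go to the cross-encoder rerank stage, which only needs to score the top few candidates.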

Not yet measured

~0 installs · ~0 tokens saved

rag · retrieval · bm25

Agent Workspace · 2026-04-16

Community · reference

Pattern Decision Tree

Which canonical pattern? A constraint-first tree.

A reference pack that helps you pick a canonical agent pattern given your hard constraint. Starts from cost / latency / accuracy / complexity tolerance and walks to prompt-chaining, routing, parallelization, orchestrator-workers, or evaluator-optimizer. Grounded in Anthropic's 'Building effective agents', the Tongyi NLA paper, and Stanford's meta-harness work.

Not yet measured

~0 installs · ~0 tokens saved

reference · decision-tree · canonical-patterns

Agent Workspace · 2026-04-16

Community · security

Injection Surface Audit

Every agent product ships injection surfaces. Audit them before an attacker does.

A systematic prompt-injection surface audit for agent-backed products. Stanford's March 2026 meta-harness study found roughly 1-in-4 community-authored skills contained a practical injection vector. This pack translates that finding into an actionable checklist: tool allow-lists, URL validation and SSRF guards, sanitization of content fetched from external sources, signed skill manifests, scoped permissions, and a review cadence. Target auditor: a senior engineer who owns an agent in production and has one afternoon to harden it.
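One item on that checklist, URL validation with an SSRF guard, might look like the sketch below. It is an illustration under stated assumptions, not the pack's checklist code: the function name, the HTTPS-only policy, and the host allow-list are hypothetical. Hostnames must be allow-listed, and IP literals must not point at private, loopback, link-local, or otherwise non-public ranges.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}  # assumed policy: HTTPS only


def is_fetch_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """Return True if an agent tool may fetch `url` under this sketch's policy."""
    try:
        parts = urlsplit(url)
    except ValueError:  # malformed URL, e.g. an unterminated IPv6 literal
        return False
    host = parts.hostname
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: only explicitly allow-listed hostnames pass.
        return host in allowed_hosts
    # IP literal: reject anything that is not a public unicast address.
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )
```

A string check like this is only a first layer. An allow-listed hostname can still resolve to an internal address, so a production guard would also verify the resolved IP at connection time.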

Not yet measured

~0 installs · ~0 tokens saved

security · prompt-injection · ssrf

Agent Workspace · 2026-04-17

Community · harness

Recommended · orchestrator-workers

Turn Execution Pipeline

The 9-step loop, 5 context shapers, 3 recovery paths. The 98.4% infrastructure under every turn.

Canonical harness reference for Claude Code's turn-execution pipeline, derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228). The paper's anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure; this pack documents the infrastructure. It names the 9 steps (settings resolution, state initialization, context assembly, five pre-model shapers, model call, tool dispatch, permission gate, tool execution, stop-condition check), the 5 pre-model context shapers in cheapest-first order (Budget Reduction, Snip, Microcompact, Context Collapse, Auto-Compact) with their triggers, and the 3 recovery mechanisms (max output token escalation up to 3 retries, reactive compaction firing at most once per turn, prompt-too-long overflow chain: context-collapse → reactive compaction → terminate). The pack maps directly onto the paper's three recurring design commitments: graduated layering over monolithic mechanisms, append-only designs favoring auditability over query power, and model judgment within a deterministic harness. Use it to implement, clone, or critique any production agent loop. Target: a staff engineer rebuilding the loop in a different language (Rust, Python, Bun) or auditing an existing one for compaction-race and stop-condition bugs.
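The prompt-too-long overflow chain (context-collapse, then reactive compaction, then terminate) and the rule that reactive compaction fires at most once per turn can be sketched as follows. Everything here is a hypothetical stand-in: the `Context` type, the `TurnState` record, and the `fits`, `collapse`, and `compact` callables are assumptions, not Claude Code's real internals.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Context = list[str]  # hypothetical: a context is a list of message strings


@dataclass
class TurnState:
    reactive_compaction_used: bool = False  # enforces "at most once per turn"


def recover_from_overflow(
    context: Context,
    fits: Callable[[Context], bool],
    collapse: Callable[[Context], Context],
    compact: Callable[[Context], Context],
    state: TurnState,
) -> Optional[Context]:
    """Try context-collapse, then reactive compaction; None means terminate."""
    context = collapse(context)
    if fits(context):
        return context
    if not state.reactive_compaction_used:
        state.reactive_compaction_used = True
        context = compact(context)
        if fits(context):
            return context
    return None  # both recovery paths exhausted: end the turn
```

Tracking the once-per-turn flag in explicit state, rather than inside the compaction routine, is one way to make a compaction race visible in an audit.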

Not yet measured

~0 installs · ~0 tokens saved

harness · turn-loop · context-compaction

Agent Workspace · 2026-04-19

Community · security

Seven Safety Layers

Defense-in-depth for tool execution. Deny > ask > allow. All 7 layers, in order, with honest failure modes.

Security reference for Claude Code's 7-layer permission architecture, derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228). The paper's anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure; the safety layers are a large slice of that infrastructure. The pack enumerates the 7 independent layers in order — (1) tool pre-filtering, (2) deny-first rule evaluation, (3) permission mode constraints across 7 modes (plan, default, acceptEdits, auto, dontAsk, bypassPermissions, bubble), (4) auto-mode ML classifier (yoloClassifier.ts two-stage fast-filter + chain-of-thought with a timeout race against a pre-computed classification), (5) shell sandboxing for filesystem and network isolation, (6) non-restoration on resume (permissions re-established each session; trust is not persistent), (7) PreToolUse hooks with permissionDecision return values. It also names the 4-stage authorization pipeline (pre-filter → hooks → rules → handler with 4 branches). The pack carries the paper's three recurring design commitments — graduated layering, append-only auditability, model judgment inside a deterministic harness — and documents the honest failure modes the paper flags: ~50 subcommands bypass shell-layer analysis, the classifier's timeout race can leave decisions ambiguous, and the pre-trust window allowed 4 published CVEs where extensions executed before the trust dialog appeared. Target: a staff engineer or security lead reviewing an agent's tool-execution surface before shipping.
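The "deny > ask > allow" ordering of layer 2 can be illustrated with a small evaluator in which the strongest matching decision wins. This is a sketch under assumptions: the glob-style rule patterns, the `evaluate` function, and the fallback to `ask` when nothing matches are hypothetical and do not reproduce Claude Code's actual rule syntax.

```python
from fnmatch import fnmatchcase

PRECEDENCE = ("deny", "ask", "allow")  # strongest first


def evaluate(tool_call: str, rules: list[tuple[str, str]], default: str = "ask") -> str:
    """Return the strongest decision among all rules matching `tool_call`.

    `rules` holds (glob pattern, decision) pairs; the pattern syntax is illustrative.
    """
    matched = {decision for pattern, decision in rules if fnmatchcase(tool_call, pattern)}
    for decision in PRECEDENCE:
        if decision in matched:
            return decision
    return default  # assumed fallback when no rule matches


rules = [
    ("Bash(*)", "allow"),
    ("Bash(rm *)", "deny"),
    ("Write(*)", "ask"),
]
```

Because precedence is decided by decision strength rather than rule order, adding a broad allow rule later cannot silently override an existing deny.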

Not yet measured

~0 installs · ~0 tokens saved

security · permissions · defense-in-depth

Agent Workspace · 2026-04-19

Community · reference

Nine Context Sources

CLAUDE.md is user context, not system prompt. 9 sources, 4 hierarchy levels, zero embeddings.

Reference for the 9 ordered context sources Claude Code assembles before every model call, and the 4-level CLAUDE.md hierarchy that feeds one of them. Derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228), whose anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure. The pack's entire frame is the paper's explicitly-labeled 'Critical design choice' from §Context Construction: CLAUDE.md is user context (probabilistic compliance by the model), NOT system prompt (deterministic enforcement by the runtime). Permission rules are the deterministic layer; CLAUDE.md instructs but does not enforce. The pack enumerates the 9 sources in order (system prompt → environment info → CLAUDE.md hierarchy → path-scoped rules → auto-memory → tool metadata → conversation history → tool results → compact summaries), breaks the CLAUDE.md hierarchy into its 4 precedence levels (/etc managed, ~/.claude user, project, .local gitignored), and documents the file-based memory model that replaces vector DB / embedding approaches with an LLM-based scan of up to 5 memory-file headers on demand. That memory is fully inspectable, editable, and version-controllable by the user, which makes one of the paper's three recurring design commitments (append-only auditability over query power) concrete. Target: an agent engineer who keeps hearing 'just put it in CLAUDE.md' as a fix and needs to understand why that is wrong for safety-critical rules.
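The fixed ordering of the nine sources can be shown as a small assembly function. This sketch only demonstrates ordered assembly: the snake_case source keys and the `assemble_context` function are hypothetical, and it makes no claim about how Claude Code formats or merges each source.

```python
# The nine sources in the order listed above; key names are illustrative.
CONTEXT_SOURCES = (
    "system_prompt",
    "environment_info",
    "claude_md_hierarchy",
    "path_scoped_rules",
    "auto_memory",
    "tool_metadata",
    "conversation_history",
    "tool_results",
    "compact_summaries",
)


def assemble_context(available: dict[str, str]) -> list[tuple[str, str]]:
    """Return (source, text) pairs in canonical order, skipping empty sources."""
    return [(name, available[name]) for name in CONTEXT_SOURCES if available.get(name)]
```

Note that `claude_md_hierarchy` sits in the middle of the list as ordinary context, which matches the pack's point that CLAUDE.md instructs the model but does not enforce anything.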

Not yet measured

~0 installs · ~0 tokens saved

reference · context · claude-md

Agent Workspace · 2026-04-19

Community · harness

Recommended · parallelization

Subagent Delegation — Three Isolation Modes

SkillTool vs AgentTool, 6 built-in types, 3 isolation modes. The knob every harness engineer gets wrong first.

Harness pack covering Claude Code's subagent architecture, derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Frames the SkillTool vs AgentTool trade-off as the central decision: SkillTool injects instructions into the current context (cheap, same window); AgentTool spawns a fresh isolated conversation (~7x tokens, context-safe). Documents the 6 built-in subagent types (Explore, Plan, General-purpose, Claude Code Guide, Verification, Statusline-setup), custom `.claude/agents/*.md` files with YAML frontmatter (tools, model, permissionMode, hooks, skills, memory scope), the 3 isolation modes (worktree, remote, and in-process, which is the default), sidechain transcripts as separate JSONL files, and multi-instance coordination via POSIX flock() with zero external dependencies.
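Coordination through `flock()` needs nothing beyond the operating system's advisory file locks. The sketch below uses Python's standard `fcntl.flock` on a Unix system; the `exclusive_lock` helper and the lock-file path are hypothetical and only illustrate the mechanism.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path: str):
    """Hold a non-blocking exclusive advisory lock on `path` (Unix only)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Raises BlockingIOError if another open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield fd
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A second instance that opens the same lock file while the lock is held gets `BlockingIOError` and can back off or wait. The lock is advisory, so it protects only cooperating processes that also call `flock()`.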

Not yet measured

~0 installs · ~0 tokens saved

harness · parallelization · subagents

Agent Workspace · 2026-04-19

Community · reference

Extensibility — Four Mechanisms

Hooks → Skills → Plugins → MCP. Graduated context cost, three injection points, one decision tree.

Reference pack covering Claude Code's four extension mechanisms derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Frames the graduated context-cost hierarchy — Hooks (zero context, 27 events, 4 exec types) → Skills (low, SKILL.md with 15+ YAML fields, injected via SkillTool) → Plugins (medium, 10 component types) → MCP Servers (high, 7 transport types). Documents the 5-step tool pool assembly (base enumeration up to 54 tools → mode filtering → deny pre-filter → MCP integration → dedup) and the three injection points in the loop: assemble() (what the model sees), model() (what it can reach), execute() (whether/how it runs). The pack is a decision tree: which mechanism should you reach for, and why reaching for MCP first is the most common mistake.
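The 5-step tool pool assembly can be read as a short pipeline. The sketch follows the stated order literally, so the deny pre-filter runs before MCP tools are merged in. The function signature, the glob-style deny patterns, and the tool names are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def assemble_tool_pool(
    base_tools: list[str],
    mode_allowed: set[str],
    deny_patterns: list[str],
    mcp_tools: list[str],
) -> list[str]:
    """Build the tool list the model will see, following the five steps in order."""
    pool = list(base_tools)                            # 1. base enumeration
    pool = [t for t in pool if t in mode_allowed]      # 2. mode filtering
    pool = [t for t in pool                            # 3. deny pre-filter
            if not any(fnmatchcase(t, p) for p in deny_patterns)]
    pool += mcp_tools                                  # 4. MCP integration
    return list(dict.fromkeys(pool))                   # 5. dedup, first occurrence wins


pool = assemble_tool_pool(
    base_tools=["Read", "Write", "Bash"],
    mode_allowed={"Read", "Bash"},
    deny_patterns=["Bash"],
    mcp_tools=["mcp__search", "Read"],
)
```

Here `Write` is dropped by the mode filter, `Bash` by the deny pre-filter, and the duplicate `Read` contributed by the MCP step is removed in the final pass.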

Not yet measured

~0 installs · ~0 tokens saved

reference · extensibility · hooks

Agent Workspace · 2026-04-19

Community · harness

Session Persistence — Three Channels

Append-only JSONL across 3 channels. Permissions never restore on resume — the friction IS the safety.

Harness pack covering Claude Code's session persistence design derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Documents the three persistence channels: append-only session JSONL transcripts (full conversation with chain-patched compaction boundaries), global `history.jsonl` for cross-session prompt recall (reverse-read for Up-arrow), and subagent sidechains as separate JSONL per subagent. Frames the critical deliberate non-feature: permissions are never restored on resume — trust is always re-established in the current session. The paper presents this as a design choice, not a UX bug: the user friction is the cost of maintaining the safety invariant. The pack also captures the append-only / chain-patching trade-off — auditability and simplicity over query power.
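An append-only JSONL channel with reverse reads, as described for `history.jsonl`, takes only a few lines. This sketch assumes a hypothetical record shape with a `prompt` field and reads the whole file for simplicity; a real implementation would more likely seek backwards from the end of the file.

```python
import json
from pathlib import Path


def append_event(path: Path, event: dict) -> None:
    """Append one JSON record as a single line; existing lines are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def recent_prompts(path: Path, limit: int = 10) -> list[str]:
    """Return up to `limit` prompts, newest first (the Up-arrow order)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    prompts = [json.loads(line)["prompt"] for line in reversed(lines) if line.strip()]
    return prompts[:limit]
```

The trade-off named in the pack shows up directly: writes are trivial and the file is a complete audit trail, but any query beyond "most recent first" means scanning lines.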

Not yet measured

~0 installs · ~0 tokens saved

harness · persistence · session

Agent Workspace · 2026-04-19

Community · reference

Agent Design Space — Six Decisions

The architect's framework: answer these six before picking a pattern.

A reference pack that walks builders through the six recurring design decisions every production coding agent must answer: reasoning placement, safety posture, context management, extensibility, subagent architecture, and session persistence. Each decision pairs Claude Code's answer with realistic alternatives from LangGraph, Devin, SWE-Agent, OpenHands, and Aider, then closes with a Meta-Pattern of three commitments (graduated layering, append-only, model-judgment + deterministic harness). Frame this as 'the pack about picking packs' — answer all six first, then let the catalog serve specific patterns that match your posture. Sourced from VILA-Lab/Dive-into-Claude-Code build-your-own-agent.md (arXiv 2604.14228, CC-BY-NC-SA-4.0).

Not yet measured

~0 installs · ~0 tokens saved

reference · architecture · design-space

Agent Workspace · 2026-04-19

Community · security

CVE: The Pre-Trust Execution Window

4 CVEs share one root cause: extensions execute before the trust dialog renders.

A security pack targeting the pre-trust execution window documented in VILA-Lab's Dive into Claude Code (arXiv 2604.14228). Extensions — plugins, MCP servers, hooks, skill manifests — can execute code during CLI initialization *before* the trust dialog appears, creating a structurally privileged attack window that lives outside the deny-first permission pipeline. The paper's README and Safety section tie four patched CVEs to this root cause. This is NOT prompt injection at runtime — it is code execution at load time, with the user's full OS privileges, before the user has been shown what they are trusting. Pair with `injection-surface-audit` for runtime content attacks; use this pack when you are building or auditing a harness that loads third-party extensions. CC-BY-NC-SA-4.0 source terms apply.

Not yet measured

~0 installs · ~0 tokens saved

security · cve · pre-trust-window

Agent Workspace · 2026-04-19

Community · harness

Experimental · hybrid

Workflow Elicitation

Capture operator judgment before you automate it.

A future-facing pattern for teams that need to externalize tacit expertise into reusable operating instructions before they try to scale an agent across a domain.

Not yet measured

~0 installs

elicitation · operating-system · knowledge-capture

Open Workflow Lab · 2026-04-15

Recent change traces · Cross-linked to packs

What changed, and why

Every coding session captured as rows — Scenario / Files / Changes / Why. Ctrl+F your own history.