Agent Workspace

Natural-language harness directory

Agent Workspace

Browse verified harness packs with source links, evaluation guidance, and starter instructions for Claude Code, Codex, Cursor, and Convex.

Submit a pack View repo

PacksPacks · 23TracesTraces · 2PublishersPublishers · 3

Verified5

Featured7

TagAll dive-into-claude-code · 9 harness · 6 reference · 6 claude-code · 4 architecture · 3 claude-code-internals · 3 security · 3 a11y · 2

Featured

Start here

Communityreference

Four Design Questions

The 2-minute orientation: every coding agent answers the same 4 questions.

A short reference pack that frames every coding-agent architecture as an answer to four recurring questions: where reasoning lives, how many execution engines, the default safety posture, and the binding resource constraint. Sourced from VILA-Lab's Dive into Claude Code (arXiv 2604.14228, CC-BY-NC-SA-4.0). Read this first — every other dive-sourced pack in the catalog is a zoom-in on one row of this table.

Not yet measured

~0 installs·~0 tokens saved

claude-codecursorcodex

Agent Workspace2026-04-19

Production-readyevaluator-optimizer

Golden Eval Harness

Golden set + LLM judge + traces + CI gates. Ships evals that stay green.

A complete evaluator-optimizer harness: curated golden set, rubric-based LLM-as-judge, trace observability (Langfuse/Braintrust), pass-rate thresholds, and CI integration. Enforces a contract so every candidate run produces a comparable results.json; the judge re-runs deterministically. Matches the Anthropic eval cookbook and OpenAI Evals conventions.

pass 91%·3.9k tok·$0.02

~0 installs·~0 tokens saved

claude-codecursorpython-3.11

Agent Workspace2026-04-16

Communityreference

Claude Code Guide

How the Claude Code harness actually works — AGENTS.md, skills, hooks, subagents.

Reference guide to the Claude Code harness in production use. Covers the AGENTS.md convention (now adopted in 60k+ repos), the .claude/skills directory structure, subagent delegation via the Task tool, PostToolUse hooks for automation, session memory layout, context-window tactics, and the decision tree between Edit and Write. Written for teams onboarding Claude Code into a real codebase, not for demos.

Not yet measured

~0 installs·~0 tokens saved

claude-codecursorcodex-cli

Agent Workspace2026-04-17

Communityharness

Production-readyevaluator-optimizer

Claude Advisor Pattern v2

Sonnet executes, Opus advises. Route intelligence by confidence, pay Opus only when it matters.

Evaluator-optimizer harness built on the Anthropic Advisor Pattern. Claude Sonnet 4.6 runs the task; on low-confidence decisions (threshold = 0.7), after two consecutive tool failures, or when the executor explicitly escalates, it calls Claude Opus 4.6 as an advisor. The advisor returns a structured recommendation; Sonnet applies it and continues. Cost split is measured — typical deployments spend 8-12% of tokens on Opus while retaining ~93% of an Opus-only pass-rate.

Not yet measured

~0 installs·~0 tokens saved

claude-codecursorpython-3.11

Agent Workspace2026-04-17

Verifiedharness

Production-readyhybrid

Operator Chat Rail

Shared chat plus traceable assistant rail for high-context workflows.

A proven interface pattern for apps that need collaborative chat in the center and a persistent agent rail for plan trace, tool execution, telemetry, sources, and quality checks.

Not yet measured

~0 installs

Claude CodeCodexCursor

Agent Workspace2026-04-15

Verifiedharness

Recommendedhybrid

Planning and Worker Flow

Tiered orchestration for questions that are too broad for a single call.

A multi-step harness pattern that uses a narrow planning pass, typed worker calls, and a constrained synthesis pass instead of letting one prompt attempt everything at once.

Not yet measured

~0 installs

Claude CodeCodexLangGraph

Agent Workspace Labs2026-04-15

Verifiedharness

Production-readyhybrid

Answer Review and Quality Checks

Persist the answer as a reviewable artifact, not just a message string.

A runtime pattern that persists final answers, quality checks, evaluation metadata, and downstream review state as first-class records instead of burying them in message text.

Not yet measured

~0 installs

ConvexClaude CodeCodex

Agent Workspace2026-04-15

Directory

23 packs

23 shownVerified packs include starter instructions, sources, and evaluation guidance.

Sort: Featured first

No active filters

Verifiedharness

Production-readyhybrid

Operator Chat Rail

Shared chat plus traceable assistant rail for high-context workflows.

A proven interface pattern for apps that need collaborative chat in the center and a persistent agent rail for plan trace, tool execution, telemetry, sources, and quality checks.

Not yet measured

~0 installs

chattracesources

Agent Workspace2026-04-15

Verifiedharness

Recommendedhybrid

Planning and Worker Flow

Tiered orchestration for questions that are too broad for a single call.

A multi-step harness pattern that uses a narrow planning pass, typed worker calls, and a constrained synthesis pass instead of letting one prompt attempt everything at once.

Not yet measured

~0 installs

planningworkersparallelism

Agent Workspace Labs2026-04-15

Verifiedharness

Production-readyhybrid

Answer Review and Quality Checks

Persist the answer as a reviewable artifact, not just a message string.

A runtime pattern that persists final answers, quality checks, evaluation metadata, and downstream review state as first-class records instead of burying them in message text.

Not yet measured

~0 installs

qualitypacketseval

Agent Workspace2026-04-15

Communityreference

Four Design Questions

The 2-minute orientation: every coding agent answers the same 4 questions.

A short reference pack that frames every coding-agent architecture as an answer to four recurring questions: where reasoning lives, how many execution engines, the default safety posture, and the binding resource constraint. Sourced from VILA-Lab's Dive into Claude Code (arXiv 2604.14228, CC-BY-NC-SA-4.0). Read this first — every other dive-sourced pack in the catalog is a zoom-in on one row of this table.

Not yet measured

~0 installs·~0 tokens saved

referencearchitectureentry-level

Agent Workspace2026-04-19

Production-readyevaluator-optimizer

Golden Eval Harness

Golden set + LLM judge + traces + CI gates. Ships evals that stay green.

A complete evaluator-optimizer harness: curated golden set, rubric-based LLM-as-judge, trace observability (Langfuse/Braintrust), pass-rate thresholds, and CI integration. Enforces a contract so every candidate run produces a comparable results.json; the judge re-runs deterministically. Matches the Anthropic eval cookbook and OpenAI Evals conventions.

pass 91%·3.9k tok·$0.02

~0 installs·~0 tokens saved

evalllm-as-judgegolden-set

Agent Workspace2026-04-16

Communityreference

Claude Code Guide

How the Claude Code harness actually works — AGENTS.md, skills, hooks, subagents.

Reference guide to the Claude Code harness in production use. Covers the AGENTS.md convention (now adopted in 60k+ repos), the .claude/skills directory structure, subagent delegation via the Task tool, PostToolUse hooks for automation, session memory layout, context-window tactics, and the decision tree between Edit and Write. Written for teams onboarding Claude Code into a real codebase, not for demos.

Not yet measured

~0 installs·~0 tokens saved

referenceclaude-codeagents-md

Agent Workspace2026-04-17

Communityharness

Production-readyevaluator-optimizer

Claude Advisor Pattern v2

Sonnet executes, Opus advises. Route intelligence by confidence, pay Opus only when it matters.

Evaluator-optimizer harness built on the Anthropic Advisor Pattern. Claude Sonnet 4.6 runs the task; on low-confidence decisions (threshold = 0.7), after two consecutive tool failures, or when the executor explicitly escalates, it calls Claude Opus 4.6 as an advisor. The advisor returns a structured recommendation; Sonnet applies it and continues. Cost split is measured — typical deployments spend 8-12% of tokens on Opus while retaining ~93% of an Opus-only pass-rate.

Not yet measured

~0 installs·~0 tokens saved

harnessevaluator-optimizeradvisor

Agent Workspace2026-04-17

Verifiedharness

Recommendedhybrid

Hybrid Runtime

Use rules for known packets, use synthesis only when ambiguity is real.

A disciplined runtime pattern that reserves deterministic rendering for tightly scoped known cases and uses an LLM only for bounded synthesis, not as the default execution path for every question.

Not yet measured

~0 installs

deterministichybridrouting

Agent Workspace Labs2026-04-15

Verifiedharness

Production-readyhybrid

Durable Streaming

Stream plan and execution state through stored events, not transient sockets alone.

A transport and persistence pattern for agents that need progressive rendering, reconnect safety, and post-run replay without losing the benefits of token or step streaming.

Not yet measured

~0 installs

streamingeventsdurability

Agent Workspace2026-04-15

Linear-style Command Palette

Cmd+K done right: grouped, keyboard-first, fully accessible.

A production-grade Cmd+K command palette matching the Linear implementation: debounced fuzzy search, grouped sections (Issues, Projects, Actions), recent items, empty state, and WAI-ARIA 1.2 combobox semantics. Drops onto any React app via the cmdk primitive.

Not yet measured

~0 installs·~0 tokens saved

uicommand-palettecmdk

Agent Workspace2026-04-16

shadcn + TanStack Data Table

The canonical admin-UI table, wired correctly.

A content-complete data table built on TanStack Table v8 and shadcn/ui primitives. Ships with sortable columns, server-side pagination, column visibility toggles, row selection, per-column filters, sticky header, skeleton loading, a11y-correct header semantics, and an empty state. Replaces the 45-minute stub that every project writes twice.

Not yet measured

~0 installs·~0 tokens saved

uidata-tabletanstack

Agent Workspace2026-04-16

Recommendedhybrid

Hybrid RAG: BM25 + Vector + Rerank

Sparse + dense + cross-encoder. The 2025 retrieval default.

Production-grade hybrid retrieval: BM25 for lexical recall, dense embeddings for semantic recall, Reciprocal Rank Fusion to merge, and a BGE / Cohere cross-encoder rerank stage for precision. Emits cited retrieved_docs ready to pass to a grounded-answer generator. Matches the retrieval stack Anthropic, Pinecone, and Weaviate recommend for >100k-doc corpora.

Not yet measured

~0 installs·~0 tokens saved

ragretrievalbm25

Agent Workspace2026-04-16

Communityreference

Pattern Decision Tree

Which canonical pattern? A constraint-first tree.

A reference pack that helps you pick a canonical agent pattern given your hard constraint. Starts from cost / latency / accuracy / complexity tolerance and walks to prompt-chaining, routing, parallelization, orchestrator-workers, or evaluator-optimizer. Grounded in Anthropic's 'Building effective agents', the Tongyi NLA paper, and Stanford's meta-harness work.

Not yet measured

~0 installs·~0 tokens saved

referencedecision-treecanonical-patterns

Agent Workspace2026-04-16

Communitysecurity

Injection Surface Audit

Every agent product ships injection surfaces. Audit them before an attacker does.

A systematic prompt-injection surface audit for agent-backed products. Stanford's March 2026 meta-harness study found roughly 1-in-4 community-authored skills contained a practical injection vector. This pack translates that finding into an actionable checklist: tool allow-lists, URL validation and SSRF guards, sanitization of content fetched from external sources, signed skill manifests, scoped permissions, and a review cadence. Target auditor: a senior engineer who owns an agent in production and has one afternoon to harden it.

Not yet measured

~0 installs·~0 tokens saved

securityprompt-injectionssrf

Agent Workspace2026-04-17

Communityharness

Recommendedorchestrator-workers

Turn Execution Pipeline

The 9-step loop, 5 context shapers, 3 recovery paths. The 98.4% infrastructure under every turn.

Canonical harness reference for Claude Code's turn-execution pipeline, derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228). The paper's anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure; this pack documents the infrastructure. It names the 9 steps (settings resolution, state initialization, context assembly, five pre-model shapers, model call, tool dispatch, permission gate, tool execution, stop-condition check), the 5 pre-model context shapers in cheapest-first order (Budget Reduction, Snip, Microcompact, Context Collapse, Auto-Compact) with their triggers, and the 3 recovery mechanisms (max output token escalation up to 3 retries, reactive compaction firing at most once per turn, prompt-too-long overflow chain: context-collapse → reactive compaction → terminate). The pack maps directly onto the paper's three recurring design commitments: graduated layering over monolithic mechanisms, append-only designs favoring auditability over query power, and model judgment within a deterministic harness. Use it to implement, clone, or critique any production agent loop. Target: a staff engineer rebuilding the loop in a different language (Rust, Python, Bun) or auditing an existing one for compaction-race and stop-condition bugs.

Not yet measured

~0 installs·~0 tokens saved

harnessturn-loopcontext-compaction

Agent Workspace2026-04-19

Communitysecurity

Seven Safety Layers

Defense-in-depth for tool execution. Deny > ask > allow. All 7 layers, in order, with honest failure modes.

Security reference for Claude Code's 7-layer permission architecture, derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228). The paper's anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure; the safety layers are a large slice of that infrastructure. The pack enumerates the 7 independent layers in order — (1) tool pre-filtering, (2) deny-first rule evaluation, (3) permission mode constraints across 7 modes (plan, default, acceptEdits, auto, dontAsk, bypassPermissions, bubble), (4) auto-mode ML classifier (yoloClassifier.ts two-stage fast-filter + chain-of-thought with a timeout race against a pre-computed classification), (5) shell sandboxing for filesystem and network isolation, (6) non-restoration on resume (permissions re-established each session; trust is not persistent), (7) PreToolUse hooks with permissionDecision return values. It also names the 4-stage authorization pipeline (pre-filter → hooks → rules → handler with 4 branches). The pack carries the paper's three recurring design commitments — graduated layering, append-only auditability, model judgment inside a deterministic harness — and documents the honest failure modes the paper flags: ~50 subcommands bypass shell-layer analysis, the classifier's timeout race can leave decisions ambiguous, and the pre-trust window allowed 4 published CVEs where extensions executed before the trust dialog appeared. Target: a staff engineer or security lead reviewing an agent's tool-execution surface before shipping.

Not yet measured

~0 installs·~0 tokens saved

securitypermissionsdefense-in-depth

Agent Workspace2026-04-19

Communityreference

Nine Context Sources

CLAUDE.md is user context, not system prompt. 9 sources, 4 hierarchy levels, zero embeddings.

Reference for the 9 ordered context sources Claude Code assembles before every model call, and the 4-level CLAUDE.md hierarchy that feeds one of them. Derived from the VILA-Lab Dive-into-Claude-Code paper (arXiv 2604.14228), whose anchor finding is that Claude Code is ~1.6% AI decision logic and ~98.4% deterministic infrastructure. The pack's entire frame is the paper's explicitly-labeled 'Critical design choice' from §Context Construction: CLAUDE.md is user context (probabilistic compliance by the model), NOT system prompt (deterministic enforcement by the runtime). Permission rules are the deterministic layer; CLAUDE.md instructs but does not enforce. The pack enumerates the 9 sources in order (system prompt → environment info → CLAUDE.md hierarchy → path-scoped rules → auto-memory → tool metadata → conversation history → tool results → compact summaries), breaks the CLAUDE.md hierarchy into its 4 precedence levels (/etc managed, ~/.claude user, project, .local gitignored), and documents the file-based memory model that replaces vector DB / embedding approaches with an LLM-based scan of up to 5 memory-file headers on demand. Fully inspectable, editable, and version-controllable by the user — and one of the paper's three recurring design commitments (append-only auditability over query power) made concrete. Target: an agent engineer who keeps hearing 'just put it in CLAUDE.md' as a fix and needs to understand why that is wrong for safety-critical rules.

Not yet measured

~0 installs·~0 tokens saved

referencecontextclaude-md

Agent Workspace2026-04-19

Communityharness

Recommendedparallelization

Subagent Delegation — Three Isolation Modes

SkillTool vs AgentTool, 6 built-in types, 3 isolation modes. The knob every harness engineer gets wrong first.

Harness pack covering Claude Code's subagent architecture derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Frames the SkillTool vs AgentTool trade-off as the central decision: SkillTool injects instructions into the current context (cheap, same window); AgentTool spawns a fresh isolated conversation (~7x tokens, context-safe). Documents the 6 built-in subagent types (Explore, Plan, General-purpose, Claude Code Guide, Verification, Statusline-setup), custom `.claude/agents/*.md` with YAML frontmatter (tools, model, permissionMode, hooks, skills, memory scope), the 3 isolation modes (worktree / remote / in-process — default), sidechain transcripts as separate JSONL files, and multi-instance coordination via POSIX flock() with zero external dependencies.

Not yet measured

~0 installs·~0 tokens saved

harnessparallelizationsubagents

Agent Workspace2026-04-19

Communityreference

Extensibility — Four Mechanisms

Hooks → Skills → Plugins → MCP. Graduated context cost, three injection points, one decision tree.

Reference pack covering Claude Code's four extension mechanisms derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Frames the graduated context-cost hierarchy — Hooks (zero context, 27 events, 4 exec types) → Skills (low, SKILL.md with 15+ YAML fields, injected via SkillTool) → Plugins (medium, 10 component types) → MCP Servers (high, 7 transport types). Documents the 5-step tool pool assembly (base enumeration up to 54 tools → mode filtering → deny pre-filter → MCP integration → dedup) and the three injection points in the loop: assemble() (what the model sees), model() (what it can reach), execute() (whether/how it runs). The pack is a decision tree: which mechanism should you reach for, and why reaching for MCP first is the most common mistake.

Not yet measured

~0 installs·~0 tokens saved

referenceextensibilityhooks

Agent Workspace2026-04-19

Communityharness

Session Persistence — Three Channels

Append-only JSONL across 3 channels. Permissions never restore on resume — the friction IS the safety.

Harness pack covering Claude Code's session persistence design derived from the VILA-Lab architectural analysis (arXiv 2604.14228). Documents the three persistence channels: append-only session JSONL transcripts (full conversation with chain-patched compaction boundaries), global `history.jsonl` for cross-session prompt recall (reverse-read for Up-arrow), and subagent sidechains as separate JSONL per subagent. Frames the critical deliberate non-feature: permissions are never restored on resume — trust is always re-established in the current session. The paper presents this as a design choice, not a UX bug: the user friction is the cost of maintaining the safety invariant. The pack also captures the append-only / chain-patching trade-off — auditability and simplicity over query power.

Not yet measured

~0 installs·~0 tokens saved

harnesspersistencesession

Agent Workspace2026-04-19

Communityreference

Agent Design Space — Six Decisions

The architect's framework: answer these six before picking a pattern.

A reference pack that walks builders through the six recurring design decisions every production coding agent must answer: reasoning placement, safety posture, context management, extensibility, subagent architecture, and session persistence. Each decision pairs Claude Code's answer with realistic alternatives from LangGraph, Devin, SWE-Agent, OpenHands, and Aider, then closes with a Meta-Pattern of three commitments (graduated layering, append-only, model-judgment + deterministic harness). Frame this as 'the pack about picking packs' — answer all six first, then let the catalog serve specific patterns that match your posture. Sourced from VILA-Lab/Dive-into-Claude-Code build-your-own-agent.md (arXiv 2604.14228, CC-BY-NC-SA-4.0).

Not yet measured

~0 installs·~0 tokens saved

referencearchitecturedesign-space

Agent Workspace2026-04-19

Communitysecurity

CVE: The Pre-Trust Execution Window

4 CVEs share one root cause: extensions execute before the trust dialog renders.

A security pack targeting the pre-trust execution window documented in VILA-Lab's Dive into Claude Code (arXiv 2604.14228). Extensions — plugins, MCP servers, hooks, skill manifests — can execute code during CLI initialization *before* the trust dialog appears, creating a structurally privileged attack window that lives outside the deny-first permission pipeline. The paper's README and Safety section tie four patched CVEs to this root cause. This is NOT prompt injection at runtime — it is code execution at load time, with the user's full OS privileges, before the user has been shown what they are trusting. Pair with `injection-surface-audit` for runtime content attacks; use this pack when you are building or auditing a harness that loads third-party extensions. CC-BY-NC-SA-4.0 source terms apply.

Not yet measured

~0 installs·~0 tokens saved

securitycvepre-trust-window

Agent Workspace2026-04-19

Communityharness

Experimentalhybrid

Workflow Elicitation

Capture operator judgment before you automate it.

A future-facing pattern for teams that need to externalize tacit expertise into reusable operating instructions before they try to scale an agent across a domain.

Not yet measured

~0 installs

elicitationoperating-systemknowledge-capture

Open Workflow Lab2026-04-15

Recent change traces · Cross-linked to packs

What changed, and why

Every coding session captured as rows — Scenario / Files / Changes / Why. Ctrl+F your own history.

View all traces