Autonomous Work — Supervisor Core Checklist

Read fully every session. This is the always-context operating checklist for overnight / unsupervised Claude Code runs. It holds the rules; verbatim templates, schemas, and rationale live in side files:

Launch detail (before-launch checklist, prompt template, 4-artifact agreement, goal-tree ASCII, run-state schema) → docs/reference/autonomous/LAUNCH.md
Finish / cleanup (gated on Seva acceptance) → docs/reference/autonomous/FINISH.md
Autonomy economics, failure-mode postmortems, sources → docs/reference/autonomous/WHY.md

Gate definitions are owned by OPERATING-PRINCIPLES.md §0a/§0b + docs/specs/*; artifact-lifecycle/budget mechanics by ARCHITECTURE.md §13 + OPERATING-PRINCIPLES.md §0 rule 8. This file points at them; it does not restate them.

Last updated: 2026-05-31.

1. Three Laws

State lives in files, not context. Context compacts; files don't. Every piece of state the next session needs must be in a file before this one ends.
Scope is pre-defined and closed. The launch prompt lists which goals are in scope and names what is out of scope. Unlisted features don't happen; unfinished in-scope work is not a "follow-up" — finish it.
Progress is machine-verifiable. "It looks done" is not done. Every goal has an acceptance gate a script, grep, browser click, or deployed proof can check.

The alternative — an agent improvising in one long context window — produced 42 runs in a night, a production overwrite, an empty heartbeat, and no checkpoint. It worked by luck, not design. (Story → WHY.md.)

2. On wake / post-compaction — recovery sequence ★

The compacted chat summary is hints only, never truth. Canonical docs are normally injected fresh-from-disk by the OpenClaw host plugin; if any required doc is not present in context, read it from disk. Before deciding anything, re-read IN ORDER:

AGENTS.md including §0 Current Autonomous Run / Scoreboard.
The §1.1 read-fully set: GOALS-PIPELINE.md, OPERATING-PRINCIPLES.md, ARCHITECTURE.md, this file, docs/specs/operator-journeys.md.
Workspace HEARTBEAT.md (absolute path /Users/sevaustinovassistant/.openclaw/workspace/HEARTBEAT.md — outside this repo).
The active data/run-state/<run_id>.json.
The latest direct worker output (/tmp/cc-*.txt) the wake references.
Run-reports the checkpoint cites (data/run-reports/*) — only those.

Then decide: accept / relaunch / escalate / stop. Compare the durable artifacts against the original goal and the stop-time contract, not against compacted memory. If a required file won't read → say BLOCKED and name it. Crash recovery follows the same order: read last checkpoint, check git log, resume from the next uncompleted task, never re-run verified-idempotent work for free.
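
The read order above can be sketched as a small preflight helper. This is a minimal Python sketch under stated assumptions, not the supervisor's actual code: `recovery_read_order` and `first_unreadable` are hypothetical names, the canonical docs are assumed to sit at the repo root, and the path of this checklist, the worker output, and the cited run-reports are passed in because they vary.

```python
from pathlib import Path
from typing import Optional, Sequence

# Absolute path quoted in the checklist; it lives outside the repo.
HEARTBEAT = Path("/Users/sevaustinovassistant/.openclaw/workspace/HEARTBEAT.md")


def recovery_read_order(repo: Path, this_checklist: Path, run_id: str,
                        worker_output: Path,
                        cited_reports: Sequence[Path] = ()) -> list:
    """Return the recovery files in the order the checklist prescribes."""
    return [
        repo / "AGENTS.md",
        # Assumption: the canonical docs sit at the repo root.
        repo / "GOALS-PIPELINE.md",
        repo / "OPERATING-PRINCIPLES.md",
        repo / "ARCHITECTURE.md",
        this_checklist,
        repo / "docs/specs/operator-journeys.md",
        HEARTBEAT,
        repo / "data/run-state" / f"{run_id}.json",
        worker_output,
        *cited_reports,  # only the run-reports the checkpoint cites
    ]


def first_unreadable(paths: Sequence[Path]) -> Optional[Path]:
    """Return the first file that cannot be read, or None if all are readable."""
    for path in paths:
        try:
            path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            return path
    return None
```

If `first_unreadable` returns a path, the supervisor reports BLOCKED and names that file instead of deciding from compacted memory.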

3. The wake decision turn ★

Compare now vs stop_time FIRST. If now < stop_time and no worker is active → the default is to launch the next bounded slice, not to summarize success.
Re-derive the highest-level user goal before any stop/DONE decision. Ask whether the current artifacts solve the business/operator decision the user actually needed, not merely whether the last task produced files. If fit is uncertain, mark the relevant goal PARTIAL and launch a reframing / strategic-fit / research-quality slice.
Reject non-transferable evidence before calling research or strategy DONE. If the result relies on incumbent brands, prior fame, non-replicable startup-metrics spikes, wrong ICPs, luck catalysts, or otherwise non-applicable examples, treat that as evidence failure and launch a relevance-audit slice rather than stopping. For non-research runs, apply the analogous transferability test: does the evidence prove the real target workflow/environment/user, not a convenient proxy?
Quantitative target miss is not impossibility. If the goal was "find/build/prove N" and a worker returns fewer than N, the default supervisor status is PARTIAL, not DONE. A worker's "not found", "not possible", "not reachable", or "not worth continuing" is only a method-exhaustion claim, never evidence of impossibility by itself. Before accepting such a claim, require: (a) the original target restated exactly; (b) achieved count vs target; (c) methods tried with evidence/output for each; (d) search/solution-space coverage estimate; (e) major untried method classes; (f) the next method that changes source class, query strategy, seed set, data provider, language/community, tool, or evidence standard. Unless at least five genuinely different method classes have failed, or a hard blocker prevents all safe methods, the next action is mark PARTIAL + launch a different-method slice. A strategic pivot may be added as a useful branch, but it must not silently replace the original target.
Before DONE-self-assessed, launch an independent fresh Claude Code verifier when strategic fit matters. The verifier's job is to challenge whether the result truly solves the highest-level goal, is fully applicable to Seva's case, and works end-to-end; if it fails, the next branch is repair/reframe, not summary.
Write the VISIBLE GOAL TREE before launch (template → LAUNCH.md): four goal rows with NOT_STARTED/PARTIAL/BLOCKED/DONE + evidence, exactly one active branch, the acceptance gate that branch advances, and the expected evidence. It must match the §0 scoreboard. Silent relaunch is a supervisor defect — record it in the run report if it happens.
A green checklist or empty task list is NOT a stop reason (see §5).
Forbidden pre-stop phrasing: "mission complete", "all stages done", "full cycle closed", "checklist_complete" — none is a valid stop reason before stop_time.
Valid stop reasons ONLY: explicit Seva stop · safety boundary · missing credential/access that blocks all branches · a required live main-account X action · repeated documented non-progress (quantified in §10).
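
The "quantitative target miss" rule in this section can be sketched as a small review function. This is an illustrative Python sketch only: `review_shortfall` and the field names in `REQUIRED_CLAIM_FIELDS` are hypothetical labels for items (a) to (f), and judging whether method classes are "genuinely different" remains a supervisor judgment that the code does not make.

```python
# Hypothetical field names standing in for items (a)-(f) of the rule.
REQUIRED_CLAIM_FIELDS = (
    "target_restated",         # (a) original target restated exactly
    "achieved_vs_target",      # (b) achieved count vs target
    "methods_tried",           # (c) methods tried, with evidence for each
    "coverage_estimate",       # (d) search/solution-space coverage estimate
    "untried_method_classes",  # (e) major untried method classes
    "next_method",             # (f) the next, different method
)


def review_shortfall(target: int, achieved: int, claim: dict,
                     failed_method_classes: set,
                     hard_blocker: bool = False) -> str:
    """Turn a worker's 'not found / not possible' claim into a next action."""
    if achieved >= target:
        return "target met: continue with the remaining wake checks"
    missing = [f for f in REQUIRED_CLAIM_FIELDS if not claim.get(f)]
    if missing:
        return "PARTIAL: reject claim, missing " + ", ".join(missing)
    if hard_blocker or len(failed_method_classes) >= 5:
        return "PARTIAL: document attempts, then escalate or change strategy"
    return "PARTIAL: launch a different-method slice"
```

Note that no branch returns DONE for a shortfall: a missed count stays PARTIAL, and only the follow-up action changes.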

4. Slice rule

Every slice names the acceptance gate it advances before it starts. "Small and reversible" is necessary, not sufficient. A slice that advances no gate must say why it is still needed and name the next gate-closing slice. No pointless tiny busywork; an observability/guardrail-only phase cannot be the final phase while any goal is NOT_STARTED/PARTIAL. Plans are written in DoD / acceptance-gate terms, not descriptive phases (mirrors AGENTS §0 R3).

5. When the task list empties before stop_time

The task list is the current workfront, not the ceiling of useful work. If it empties before stop_time:

Return to the goal and DoD. Ask what is still missing in the implementation, not in the checklist.
Pick the next safe bounded slice from: deployed-regression audit · old-vs-new surface diff · operator-path E2E for the changed surface · repeated generation testing (§9) · fresh reviewer/gap review · smallest rollback-safe repair · doc/run-state reconciliation.
Gap-review escalation: launch a fresh Claude Code (CC) reviewer with no inherited context to find gaps the worker missed. Run up to three clean independent passes, then move one level up inside the same goal (reliability / quality / cleanup / observability). A verified-green checkpoint mid-run is a baseline, not a terminator.
Stay in scope. Infinite improvement goals are not a license for unrelated features.

6. Delegated product authority — don't manufacture blockers ★

GOALS-PIPELINE.md, ARCHITECTURE.md, and docs/specs/operator-journeys.md are Seva-approved product direction. If they define the goal, target architecture, taxonomy, operator journey, or acceptance evidence, the supervisor must not ask Seva to re-decide it — execute it.

Default to autonomous resolution. Small/medium reversible product & implementation choices inside the documented scope are yours: choose the safest reversible option that best serves the canonical goal, document the default, test it, continue.
Escalate ONLY a required live public X action on Seva's main account, or an intentional change to the documented North Star / scope. Destructive-looking, schema-related, or product-significant work that is still inside scope proceeds with a rollback plan — it is not a blocker.
When uncertain, launch analysis (a fresh reviewer comparing options against the docs) instead of asking.
One-question rule: if escalation is genuinely necessary, ask exactly one concise blocking question and state the default you'll take if Seva delegates it back. Never rest on a half-disassembled product (§14).

7. Stop-time supervisor contract

A worker can finish; the supervisor is responsible for noticing there is still time left. Maintain an explicit relaunch contract in run-state (schema → LAUNCH.md):

data/run-state/<run_id>.json ⇄ AGENTS.md §0 ⇄ workspace HEARTBEAT.md must agree — reconcile before launching another worker.
launches_until_stop: true means no idle resting state before stop_time; if no worker is active, launch the next bounded slice.
stop_time limits NEW launches, not active workers. Let an already-launched worker finish, write its artifacts, and deliver its result.
Do not encode stop_time as a worker --timeout. It is a supervisor relaunch policy. Use worker timeouts only for genuine runaway protection, comfortably longer than the useful work window.
Do NOT switch off the heartbeat, remove relaunch crons, or flip §0 active:false while the run is live or DONE is only self-assessed. Teardown happens only after Seva acceptance (§14) — see FINISH.md.
Use a wake safety net: for long windows, at least one periodic cron/system wake near the next decision point. Heartbeat alone is useful but not a durable contract. On every wake, run §2 then §3.
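
The relaunch contract above, combined with the valid stop reasons from §3, can be sketched as one decision function. This is a hedged Python sketch: `next_supervisor_action`, the returned action labels, and the stop-reason labels are hypothetical names, not fields of the real run-state schema (which lives in LAUNCH.md).

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical labels for the five valid stop reasons listed in §3.
VALID_STOP_REASONS = {
    "explicit_seva_stop",
    "safety_boundary",
    "missing_credential_blocks_all_branches",
    "required_live_main_account_x_action",
    "repeated_documented_non_progress",
}


def next_supervisor_action(now: datetime, stop_time: datetime,
                           worker_active: bool,
                           stop_reason: Optional[str] = None) -> str:
    """Decide the supervisor's next move on a wake."""
    if stop_reason in VALID_STOP_REASONS:
        return "stop"
    if worker_active:
        # stop_time limits new launches; it never cuts off an active worker.
        return "let_worker_finish"
    if now < stop_time:
        # No idle resting state before stop_time.
        return "launch_next_slice"
    # Past stop_time: no new launches, scaffolding stays up until acceptance.
    return "no_new_launches"
```

A phrase such as "checklist_complete" is not in the valid set, so passing it as `stop_reason` before `stop_time` still yields `launch_next_slice`.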

8. Execution cadence & checkpoints

One task at a time: complete → verify → record → next, no interleaving.
Write a checkpoint after each task to data/run-state/<date>-<slug>.json (tasks_completed, tasks_remaining, cc_runs, failures, decisions, artifacts_created, summary; full key list → LAUNCH.md).
Every worker writes a final summary to a known path; the supervisor reads the checkpoint and does not depend on a notification firing (silent-completion mitigation).
Append a one-line event to data/run-state/run-log.jsonl at each launch/wake/decision/stop/cleanup (schema → §17).
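
A checkpoint write along these lines could look like the following minimal Python sketch. Assumptions: `write_checkpoint` is a hypothetical helper, only the keys named in this section are checked (the full key list is in LAUNCH.md), and the value types are not specified here. The write goes through a temporary file so a crash mid-write does not leave a truncated checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path

# Only the keys named in this section; the full list lives in LAUNCH.md.
CHECKPOINT_KEYS = (
    "tasks_completed", "tasks_remaining", "cc_runs", "failures",
    "decisions", "artifacts_created", "summary",
)


def write_checkpoint(path: Path, checkpoint: dict) -> None:
    """Validate the named keys, then write the checkpoint atomically."""
    missing = [k for k in CHECKPOINT_KEYS if k not in checkpoint]
    if missing:
        raise ValueError("checkpoint missing keys: " + ", ".join(missing))
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(checkpoint, fh, indent=2, ensure_ascii=False)
    os.replace(tmp_name, path)
```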

9. Testing doctrine — test, test, test ★

Green local smokes are not acceptance. A pipeline is reliable only when proven by repeated runs, adversarial cases, and deployed evidence. Before any goal is DONE, run the proofs that match what changed:

Generation changed (prompts, recommender, post/reply/quote/rework writers, selectors): run generation multiple times — default target 10 representative runs when feasible — across realistic inputs, and inspect output quality and failure patterns, not just exit codes. One happy-path generation proves nothing. Look for the banned patterns Seva has flagged (generic broad-people claims, agree+question formula, confabulated stack claims) and for empty/degenerate output.
UI changed: click through all main operator scenarios in a browser against the deployed / production-shaped surface, then the important edge / non-main scenarios. The bar is: after the run, Seva can open the site and it actually works — not that a render smoke passed. No "Force Reload" or hidden refresh as a substitute for a real operator gesture.
Product goal says it should work for Seva ("open the interface and it works", "deploy", "live", "I should see it in the UI"): source-only checks cannot close it. At least one deploy must contain the slice, and the operator-path gate must be re-run against that deployment URL (local builds do not transfer). Classify proof level per docs/reference/testing/TEST-DEPLOY-GATE.md (L0–L5); a PASS without a named level is incomplete.
Eval gate, minimum: at least one machine-checkable gate before any task is "done" — script exit code · file existence/content · grep · dry-run shape · build success. "I reviewed it and it looks correct" is not an eval gate.

The pre-DONE verification gate pointers are in §13; the generation reliability campaign target (10 analyzed full-cycle runs) is in GOALS-PIPELINE.md.
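
As an illustration of the "eval gate, minimum" rule, two machine-checkable gates can be sketched in Python. The helper names `gate_exit_code` and `gate_file_matches` are hypothetical; any real gate would run the project's own scripts and check the paths named by the task.

```python
import re
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def gate_exit_code(cmd: Sequence[str]) -> bool:
    """Pass only if the command exits with status 0."""
    return subprocess.run(list(cmd), capture_output=True).returncode == 0


def gate_file_matches(path: Path, pattern: str) -> bool:
    """Pass only if the file exists and its content matches the regex."""
    if not path.is_file():
        return False
    return re.search(pattern, path.read_text(encoding="utf-8")) is not None
```

For example, `gate_exit_code([sys.executable, "-c", "raise SystemExit(3)"])` fails the gate because the exit status is non-zero. Both helpers return a boolean a script can assert on, which is what separates them from "I reviewed it and it looks correct".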

10. Watchdog — when "repeated non-progress" is real ★

Repeated non-progress is a high bar, not an early exit:

Fewer than 5 principled, meaningfully different solution attempts generally cannot prove non-progress. Retrying the same command, or making cosmetic variations, does not count. Same-command repeated failure is a signal to change approach, not to give up.
After 5 failed, genuinely different approaches to the same gate, stop only to document / diagnose / escalate / choose a new strategy — do not silently idle, and do not abandon the goal. Write what was tried, why each failed, and the next strategy or the one blocking question (§6).
Token/staleness hygiene (operational, not stop triggers): ≥60% context → force a checkpoint; ≥80% → consider compaction or handoff; last checkpoint >30 min old with no new commits → likely stuck, investigate.

This is the only numeric definition of the "repeated non-progress" hard-stop referenced in §3.
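
The thresholds in this section can be written down as a short Python sketch. The function names and action labels are hypothetical. Approaches are compared by label only, so exact repeats collapse into one attempt; deciding whether two differently labelled approaches are meaningfully different still needs review.

```python
from typing import List


def hygiene_actions(context_used: float, minutes_since_checkpoint: float,
                    new_commits: bool) -> List[str]:
    """Map the token/staleness thresholds to operational actions.

    context_used is a fraction between 0.0 and 1.0.
    """
    actions = []
    if context_used >= 0.60:
        actions.append("force_checkpoint")
    if context_used >= 0.80:
        actions.append("consider_compaction_or_handoff")
    if minutes_since_checkpoint > 30 and not new_commits:
        actions.append("investigate_possible_stall")
    return actions


def non_progress_threshold_met(failed_approaches: List[str]) -> bool:
    """True once five distinct approaches to the same gate have failed."""
    return len(set(failed_approaches)) >= 5
```

None of the hygiene actions is a stop; even when `non_progress_threshold_met` returns True, the outcome is to document, diagnose, escalate, or pick a new strategy.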

11. Budget guards & safety defaults — guardrails, not stop permission

Max 10 tasks / 15 CC runs per session, 1 retry per failed task, plus token + wall-clock ceilings. These exist to prevent runaway cost and chaos — not as a reason to stop early. If the system is still making real progress toward the goal, continue inside the boundaries; a limit reached mid-progress is a prompt to checkpoint, compact, and (if needed) hand to a fresh session, not to declare done. (Cost-runaway story → WHY.md.)

Safety defaults: --dry-run is the default; external writes are fail-closed; never overwrite a production file without an explicit task instruction. For runs > ~4 h or > ~10 CC runs, insert compaction phases (cadence detail → ARCHITECTURE.md §13).
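
The session limits above can be sketched as a guard that never returns "done". This is a minimal Python sketch; `budget_action`, `can_retry`, and the action labels are hypothetical, and the token and wall-clock ceilings are left out because this section gives no numbers for them.

```python
MAX_TASKS = 10    # tasks per session
MAX_CC_RUNS = 15  # Claude Code runs per session
MAX_RETRIES = 1   # retries per failed task


def can_retry(retries_used: int) -> bool:
    """A failed task may be retried once."""
    return retries_used < MAX_RETRIES


def budget_action(tasks_done: int, cc_runs: int) -> str:
    """Reaching a limit triggers a checkpoint and handoff, not a stop."""
    if tasks_done >= MAX_TASKS or cc_runs >= MAX_CC_RUNS:
        return "checkpoint_compact_and_hand_off"
    return "continue"
```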

12. Parallel-run safety

No shared mutable state. Parallel tasks must not write the same file; if two need the same checkpoint/doc, run them sequentially.
Separate output paths per task; independent checkpoints (data/run-state/<date>-<task-label>.json).
Never kill a sibling worker. Run ps aux | grep cc- before any process management; a worker must never terminate another.
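
Before launching tasks in parallel, the "no shared mutable state" rule can be checked mechanically. The following Python sketch assumes each task declares the repo-relative paths it will write; `write_conflicts` is a hypothetical helper name.

```python
import posixpath
from collections import defaultdict
from typing import Dict, List


def write_conflicts(task_outputs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every path that more than one task intends to write."""
    writers = defaultdict(set)
    for task, paths in task_outputs.items():
        for path in paths:
            # Normalise so "a/./b.json" and "a/b.json" count as the same file.
            writers[posixpath.normpath(path)].add(task)
    return {p: sorted(tasks) for p, tasks in writers.items() if len(tasks) > 1}
```

An empty result means the tasks can run in parallel; any entry names a file whose writers should run sequentially.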

13. Pre-DONE verification gate (pointer-shaped)

Before reporting DONE, run — or launch a fresh verifier for — each gate that applies. Gate definitions are owned elsewhere; this is the index:

Highest-level-goal fit gate: independently re-check that the result solves the user's real business/operator objective, not only the artifact/task description. When strategic fit, applicability, or research quality matters, launch a fresh Claude Code verifier to challenge relevance, transferability, completeness, and end-to-end usefulness before DONE-self-assessed.
Operator-path gate (no PASS without the real operator gesture) → OPERATING-PRINCIPLES.md (OP) §0a item 1.
Real browser clicks on the DEPLOYED site (not helper/render smokes) → OP §0a item 3 + §9 above.
Deployed-shape proof on the deploy that contains the slice → docs/reference/testing/TEST-DEPLOY-GATE.md.
Concept-duplication scan (no parallel implementation of an ARCHITECTURE.md §11 concept) → OP §0b + ARCHITECTURE.md §11.
Spec-coverage (feature/primitive has an updated docs/specs/<feature>.md + smokes) → GOALS-PIPELINE.md §3 + ARCHITECTURE.md §18.
Supabase load-budget (node scripts/smoke-supabase-egress-budget.mjs exit 0) for any new hot-path query → OP §0a item 5 + docs/specs/supabase-load-budget.md.
Live-E2E test-account residue check (paired undo rows, no production writes, sentinels not co-armed) → OP §0c + ARCHITECTURE.md §12.
Fresh-CC documentation bootstrap probe if the run names doc reliability: a clean session reads only AGENTS/read-list, performs a task, leaves no stray docs; G3 PASS needs a trailing clean probe.

"Looks right / appears to work / verified by smoke / Claude says done" are forbidden as substitutes for an executed, asserted gate.

14. Definition of "done / it works" ("готово / работает") for an autonomous run

DONE only when all hold:

(a) Every §0 goal is DONE:<evidence> or BLOCKED:<named hard-stop>.
(b) Operator-visible goals have a deployed surface + a passing browser/operator-path proof (§9, §13).
(c) The product is not left half-disassembled: no broken old path with an unproven/undeployed new path.
(d) Artifacts exist at expected paths and a final checkpoint summary is written.
(e) Seva has ACCEPTED.

Until acceptance the run is DONE-self-assessed / awaiting acceptance and all scaffolding stays up (heartbeat, crons, §0 active:true) per §7. Self-DONE ≠ accepted; cleanup is gated on acceptance (FINISH.md).
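
The five conditions can be expressed as a single predicate. This is an illustrative Python sketch: `goal_closed`, `run_status`, and the boolean inputs are hypothetical names, and each boolean stands for a gate that must have been executed elsewhere, not merely asserted.

```python
from typing import Dict


def goal_closed(status: str) -> bool:
    """True for 'DONE:<evidence>' or 'BLOCKED:<named hard-stop>'."""
    prefix, _, rest = status.partition(":")
    return prefix in ("DONE", "BLOCKED") and bool(rest.strip())


def run_status(goals: Dict[str, str], operator_proof_ok: bool,
               half_disassembled: bool, artifacts_ok: bool,
               seva_accepted: bool) -> str:
    """Combine conditions (a)-(e) into one run status."""
    self_done = (
        bool(goals)
        and all(goal_closed(s) for s in goals.values())  # (a)
        and operator_proof_ok                            # (b)
        and not half_disassembled                        # (c)
        and artifacts_ok                                 # (d)
    )
    if not self_done:
        return "NOT_DONE"
    # (e) Only acceptance turns a self-assessed DONE into DONE.
    return "DONE" if seva_accepted else "DONE-self-assessed"
```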

15. Artifact & read-tier discipline — one consolidated plan

Per run, exactly: one launch prompt (prompts/autonomous-YYYY-MM-DD-<slug>.md), one run-state checkpoint + summary, exact-path Evidence reports only when a task produces durable proof, local commits after verified units, and the append-only run-log.jsonl.

FORBIDDEN (status sprawl): root PROGRESS.md / ROADMAP.md / HANDOFF.md / NEXT.md or timestamped root summaries; new permanent docs/*PLAN.md / docs/*STATUS.md; multiple competing "current" prompts/checkpoints; committed /tmp/cc-* dumps, raw logs, HTML, or huge JSON. Keep one consolidated plan, not multiple status/progress truth files. The only live status surfaces are the active run-state JSON + AGENTS §0 + HEARTBEAT.
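
Part of the status-sprawl rule can be checked mechanically against the list of newly added repo-relative paths. The Python sketch below is an assumption-laden illustration: `status_sprawl` is a hypothetical helper, it covers only the file names and globs quoted above, and it does not attempt to detect timestamped root summaries or oversized dumps.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

FORBIDDEN_ROOT_FILES = {"PROGRESS.md", "ROADMAP.md", "HANDOFF.md", "NEXT.md"}
# fnmatch's "*" also matches "/", so nested docs paths are flagged as well.
FORBIDDEN_GLOBS = ("docs/*PLAN.md", "docs/*STATUS.md")


def status_sprawl(added_paths: Iterable[str]) -> List[str]:
    """Return the newly added paths that violate the status-sprawl rule."""
    return [
        p for p in added_paths
        if p in FORBIDDEN_ROOT_FILES
        or any(fnmatchcase(p, g) for g in FORBIDDEN_GLOBS)
    ]
```

The input should contain only files added by the run, since the rule targets new permanent plan/status docs.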

Read-tiers (context budget at wake): read fully = canonical docs + active launch prompt + latest checkpoint summary; read selectively = the task's source/tests + latest worker output + the cited evidence artifact; reference only = old reports, old prompts, raw logs (paths + one-line relevance, not loaded). Detail → ARCHITECTURE.md §13.

16. Status discipline

Mark implemented-vs-not in the canonical docs with the legend [ENFORCED] / [PARTIAL] / [TARGET] / [DEBT]. Do not invent a new status doc to hold it. → OP §0, AGENTS §2.

17. Launching a worker — intention chain & run-log

Every launch passes the full intention chain, not a ticket-sized step: the product goal, Seva's risk/time posture, prior decisions, the current DoD, safety boundaries, out-of-scope areas, and how this slice feeds the next. Omit secrets and unrelated transcript noise — preserve the chain needed to reason like an owner.

For Claude Code workers, the worker prompt must say FIRST: read the canonical docs. cd into the repo, read AGENTS.md incl. §0, run node scripts/preflight-context-check.mjs <task-tags>, then read the §1.1 canonical set + task-specific docs before editing or testing. The plugin usually injects them; if not present, the worker reads them from disk. Completion reports name files read / changed, checks run, commits, deploy URL, and whether any live X actions occurred.

Run-log (data/run-state/run-log.jsonl, append-only, never force-loaded): one JSON object per line, { "ts", "run_id", "event", "detail" } where event ∈ launch | wake | decision | stop | cleanup. It is event evidence, not another current-status source — current status stays in the run-state JSON (§15).
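
An append to the run-log could be sketched as follows. The four keys and the five event names come from the description above; `append_run_log` is a hypothetical helper, and the ISO 8601 UTC timestamp and the string type of `detail` are assumptions, since the text does not fix either.

```python
import json
from datetime import datetime, timezone
from typing import Optional

RUN_LOG_EVENTS = {"launch", "wake", "decision", "stop", "cleanup"}


def append_run_log(log_path: str, run_id: str, event: str, detail: str,
                   ts: Optional[str] = None) -> dict:
    """Append one JSON object as a single line; never rewrite earlier lines."""
    if event not in RUN_LOG_EVENTS:
        raise ValueError("unknown run-log event: " + event)
    record = {
        # Assumption: timestamps are ISO 8601 in UTC.
        "ts": ts or datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "event": event,
        "detail": detail,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Opening the file in append mode keeps the log append-only, and rejecting unknown events keeps it from drifting into a second status surface.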

18. Pointers

docs/reference/autonomous/LAUNCH.md — before-launch checklist, 4-artifact agreement, prompt template, goal-tree ASCII, full run-state schema.
docs/reference/autonomous/FINISH.md — post-completion cleanup, gated on Seva acceptance.
docs/reference/autonomous/WHY.md — autonomy economics, the 7 failure-mode postmortems, sources.
prompts/autonomous-run-template.md — concrete launch-prompt template.
OPERATING-PRINCIPLES.md §2 (bounded-session theses), §0a/§0b (gate definitions) · ARCHITECTURE.md §13 (artifact lifecycle/budget) · GOALS-PIPELINE.md (Daily Run Reliability Campaign).
